Data cleaning

From LemonWiki共筆
Jump to navigation Jump to search

check if field value was not fulfilled

by purpose

purpose method (MySQL query syntax) value1:
fulfilled value what I want
value2:
fulfilled value NOT I want
value3:
0
value4:
NULL value
value5:
empty or spaces value
find not fulfilled or empty values
(not contains 0)
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 V V
find not fulfilled or empty values
(contains 0)
V V V
find fulfilled and non-empty values
(not contains 0)
V V
find fulfilled and non-empty values
(contains 0)
WHERE LENGTH(TRIM( column_name )) > 0 V V V
find (1) not fulfilled or empty values (2) values NOT I want
(not contains 0)
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 OR column_name LIKE 'values NOT I want' V V V
find (1) not fulfilled or empty values (2) values NOT I want
(contains 0)
V V V V

by datatype

VARCHAR and allows NULL value

data type of column possible column values method1:
find not fulfilled or empty values
method2:
find fulfilled and non-empty values
method3:
find NULL values
method4:
find not NULL values
VARCHAR and allows NULL fulfilled value ex:123 V V
NULL type:null V V
'NULL' type:string
0 V V
EMPTY ex: '' or space(s) ' ' V V

symbol V: means the column value will be able to find by means of the method


  • method1:
    • SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''[1]
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''[2]
  • method2:
    • SELECT * FROM `my_table` WHERE column_name > ''
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
  • method3: SELECT * FROM `my_table` WHERE column_name IS NULL
  • method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL

VARCHAR or numeric

data type of column possible column values method5:
find values within the range
method6:
find values out of the range, empty & NULL values
VARCHAR or numeric values within the range ex: min ≤ value ≤ max V
values out of range V
NULL V
EMPTY ex: '' or space ' ' V
  • method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
  • method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )

datetime and allows NULL value

possible column values

  1. 2024-03-29
  2. 00-00-00 00:00:00
  3. NULL

is null

Fill 0 if the value is NA or NULL

  • MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
    • (1) Using COALESCE() function to replace the NULL value with 0.
    • (2) The case: 0/0 = null should be handled.
  • MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
  • python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"

Find whether a variable is NULL. online demo

  • PHP is_null to find type:null null NOT type:string 'null' Icon_exclaim.gif
  • Google spreadsheet / Excel:
    • ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
    • If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]

Find whether a variable is NOT NULL

  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;

Find whether a variable is NOT #N/A

  • Excel: =NOT(ISERROR(cell_value))

javascript

check if field value was not fulfilled: NULL, empty value

Icon_exclaim.gif NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)

  1. Good.gif quick solution: find records with NULL value OR empty, space value
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
  2. find records with NULL value: (note: not #NULL!)
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
    • EXCEL: =EXACT(A2, "NULL")
  3. find records with empty value: (not contains NULL value)
    • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; Icon_exclaim.gif SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
    • MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [4]
  4. Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
    • solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
    • trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) Icon_exclaim.gif this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
  5. Using PHP
    • empty() function to find 0, null, false, empty string, empty array values.
    • if(empty($var) && $var !== 0 && $var !== "0"){ .. } to find null, false, empty string, empty array values BUT not 0.
  6. check if field value was NULL & not equal to some value

check if field value was fulfilled

length of string > 0

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0; demo[1]

column value is not null or 0

  • Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[5]

find if number or cell value is positive integer

  • EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[6] online demo

check numeric range

  • MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )

find NOT empty records means records without NULL or empty value:

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
  • MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;

Verify the format of field value

related page: Regular expression

Email contains @ symbol

  • EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
    • result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
  • EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
    • syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
    • result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
  • PHP: PHP FILTER_VALIDATE_EMAIL Filter
    • "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net

Number precision in Excel

Number precision: 15 digits (Excel中最多的有效位數為15位)[7][8]

raw data: 1234567890123456 ->

  • (numeric format 數值格式) 1234567890123450.00 Icon_exclaim.gif losing precision
  • (general format 通用格式) 1.23457E+15 Icon_exclaim.gif losing precision
  • (text format 文字格式) 1234567890123456

large numbers

  • If the data was imported from Excel, you should notice the 15 digit precision issue.

Numeric only

  • PHP: is_numeric
  • MySQL:
    • find numeric values SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'[9]
    • find NOT numeric values SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'
  • Excel: ISNUMBER Function

Abnormal values of time data

Definition of abnormal values of the time data (time series) if they

  • were generated 10 years before or
  • newer than today

List of the possible abnormal values:

  • 0001-01 00:00:00 occurred in MySQL datetime type
  • 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ... occurred in MS Excel
  • future data: the date after today

Find the normal values:

  • MySQL: Assume the data was generated in recent 10 years & not newer than today
    • SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
      • Icon_exclaim.gif NOT `my_time_column` < CURDATE()。 ex: CURDATE() is 2024-03-29. Which is the same with 2024-03-29 00:00:00
    • SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
  • MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
    • :SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
      • You need to check the SELECT CURRENT_TIMESTAMP); if correct or not before you delete the abnormal data (timezone issue)

Check if the date valid

duplicate data

find duplicate data

  • MySQL:

Find the duplicated data for one column[11]

-- Generate test data.
CREATE TABLE `table_name` (
  `id` int(11) NOT NULL,
  `content` varchar(5) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `table_name` (`id`, `content`) VALUES
(1, 'apple'),
(2, 'lemon'),
(3, 'apple');

ALTER TABLE `table_name`
  ADD PRIMARY KEY (`id`);

-- Find duplicated data
SELECT `content`, COUNT(*) count FROM `table_name` GROUP BY `content` HAVING count > 1;

SELECT tmp.* FROM 
( 
  SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` 
) tmp 
WHERE tmp.count >1;

Using CONCAT for multiple columns ex: column_1, column_2

SELECT tmp.key FROM
(
	SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
	FROM `table_name`
	GROUP BY CONCAT(  `column_1`, `column_2`  )
) tmp
WHERE tmp.count >=2

For counting purpose: find the count of repeated id (type: int) between table_a and table_b

SELECT count(DISTINCT(id)) FROM table_a WHERE id IN
(
   SELECT DISTINCT(id) FROM table_b
)
  • Power Tools for Google Spreadsheet [Last visited: 2016-02-26]
    • Menu: Data -> Remove duplicates

deduplicate

counting number of duplicate occurrence


other

  • symbol e.g. data-mining or data_mining

outlier

  • Median
  • Range Checks

unique number of data values

If the data values were generated by different users, the unique number of data values should be larger than ____

Data handling

Remove first, last or certain characters from text

  • Excel: using RIGHT[14] + LEN[15] functions [16]
  • Excel: if the length of text was fixed after removed, you may try to use REPLACE[17] + LEN functions (demo)

Remove leading and trailing spaces from text

UPDATE `table` 
SET `column` = TRIM( `column` ) 
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );

other text look like whitespace

Whitespace character

  1. IDEOGRAPHIC SPACE (全形空白, U+3000)[18]:
    • diaplay:
      <?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
    • replace with space:
      <?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
  2. ASCII Vertical Tab \v
  3. ASCII Horizontal Tab (TAB) \t
  4. ASCII Backspace \b

related pages

references