Data cleaning: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
mNo edit summary
Line 282: Line 282:
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)


=== Remove leading and trailing spaces from text ===
* PHP: [http://php.net/manual/en/function.trim.php trim function]
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet]
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ]
<pre>
UPDATE `table`
SET `column` = TRIM( `column` )
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
</pre>


== duplicate data ==
== duplicate data ==
Line 342: Line 333:
If the data values were generated by different users, the unique number of data values should be larger than ____
If the data values were generated by different users, the unique number of data values should be larger than ____


== data handling ==
== Data handling ==
=== remove first, last or certain characters from text ===
=== Remove first, last or certain characters from text ===
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
* Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
=== Remove leading and trailing spaces from text ===
* PHP: [http://php.net/manual/en/function.trim.php trim function]
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet]
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ]
<pre>
UPDATE `table`
SET `column` = TRIM( `column` )
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
</pre>


== related pages ==
== related pages ==

Revision as of 11:11, 6 September 2016

check if field value was not fulfilled

by purpose

purpose method (MySQL query syntax) value1:
fulfilled value what I want
value2:
fulfilled value NOT I want
value3:
0 value
value4:
NULL value
value5:
empty or spaces value
find not fulfilled or empty values
(not contains 0)
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 V V
find not fulfilled or empty values
(contains 0)
V V V
find fulfilled and non-empty values
(not contains 0)
V V
find fulfilled and non-empty values
(contains 0)
WHERE LENGTH(TRIM( column_name )) > 0 V V V
find (1) not fulfilled or empty values (2) values NOT I want
(not contains 0)
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 OR column_name LIKE 'values NOT I want' V V V
find (1) not fulfilled or empty values (2) values NOT I want
(contains 0)
V V V V

by datatype

VARCHAR and allows NULL value

data type of column possible column values method1:
find not fulfilled or empty values
method2:
find fulfilled and non-empty values
method3:
find NULL values
method4:
find not NULL values
VARCHAR and allows NULL fulfilled value ex:123 V V
NULL type:null V V
'NULL' type:string
0 V V
EMPTY ex: '' or space(s) ' ' V V

symbol V: means the column value will be able to find by means of the method


  • method1:
    • SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''[1]
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''[2]
  • method2:
    • SELECT * FROM `my_table` WHERE column_name > ''
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
  • method3: SELECT * FROM `my_table` WHERE column_name IS NULL
  • method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL

VARCHAR or numeric

data type of column possible column values method5:
find values within the range
method6:
find values out of the range, empty & NULL values
VARCHAR or numeric values within the range ex: min ≤ value ≤ max V
values out of range V
NULL V
EMPTY ex: '' or space ' ' V
  • method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
  • method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )

datetime and allows NULL value

possible column values

  1. 2026-04-29
  2. 00-00-00 00:00:00
  3. NULL

is null

Fill 0 if the value is NA or NULL

  • MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
    • (1) Using COALESCE() function to replace the NULL value with 0.
    • (2) The case: 0/0 = null should be handled.
  • MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
  • python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"

Find whether a variable is NULL. online demo

  • PHP is_null to find type:null null NOT type:string 'null' Icon_exclaim.gif
  • Google spreadsheet / Excel:
    • ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
    • If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]

Find whether a variable is NOT NULL

  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;

Find whether a variable is NOT #N/A

  • Excel: =NOT(ISERROR(cell_value))

javascript

check if field value was not fulfilled: NULL, empty value

Icon_exclaim.gif NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)

  1. Good.gif quick solution: find records with NULL value OR empty, space value
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
  2. find records with NULL value: (note: not #NULL!)
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
    • EXCEL: =EXACT(A2, "NULL")
  3. find records with empty value: (not contains NULL value)
    • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; Icon_exclaim.gif SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
    • MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [4]
  4. Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
    • solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
    • trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) Icon_exclaim.gif this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
  5. Using PHP
    • empty() function to find 0, null, false, empty string, empty array values.
    • if(empty($var) && $var !== 0 && $var !== "0"){ .. } to find null, false, empty string, empty array values BUT not 0.
  6. check if field value was NULL & not equal to some value

check if field value was fulfilled

length of string > 0

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0; demo[1]

column value is not null or 0

  • Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[5]

find if number or cell value is positive integer

  • EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[6] online demo

check numeric range

  • MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )

find NOT empty records means records without NULL or empty value:

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
  • MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;

Verify the format of field value

related page: Regular expression

Email contains @ symbol

  • EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
    • result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
  • EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
    • syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
    • result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
  • PHP: PHP FILTER_VALIDATE_EMAIL Filter
    • "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net

Number precision in Excel

Number precision: 15 digits (Excel中最多的有效位數為15位)[7][8]

raw data: 1234567890123456 ->

  • (numeric format 數值格式) 1234567890123450.00 Icon_exclaim.gif losing precision
  • (general format 通用格式) 1.23457E+15 Icon_exclaim.gif losing precision
  • (text format 文字格式) 1234567890123456

large numbers

  • If the data was imported from Excel, you should notice the 15 digit precision issue.

Numeric only

  • PHP: is_numeric
  • MySQL:
    • find numeric values SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'[9]
    • find NOT numeric values SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'
  • Excel: ISNUMBER Function

Abnormal values of time data

Definition of abnormal values of the time data (time series) if they

  • were generated 10 years before or
  • newer than today

List of the possible abnormal values:

  • 0001-01 00:00:00 occurred in MySQL datetime type
  • 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ... occurred in MS Excel
  • future data: the date after today

Find the normal values:

  • MySQL: Assume the data was generated in recent 10 years & not newer than today
    • SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
      • Icon_exclaim.gif NOT `my_time_column` < CURDATE()。 ex: CURDATE() is 2026-04-29. Which is the same with 2026-04-29 00:00:00
    • SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
  • MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
    • :SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
      • You need to check the SELECT CURRENT_TIMESTAMP); if correct or not before you delete the abnormal data (timezone issue)


duplicate data

find duplicate data

  • MySQL:
    • Using CONCAT for multiple column ex: column_1, column_2
SELECT tmp.key FROM
(
	SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
	FROM `table_name`
	GROUP BY CONCAT(  `column_1`, `column_2`  )
) tmp
WHERE tmp.count >=2

For counting purpose: find the count of repeated id (type: int) between table_a and table_b

SELECT count(DISTINCT(id)) FROM table_a WHERE id IN
(
   SELECT DISTINCT(id) FROM table_b
)
  • Power Tools for Google Spreadsheet [Last visited: 2016-02-26]
    • Menu: Data -> Remove duplicates

deduplicate

counting number of duplicate occurrence

outlier

  • Median
  • Range Checks

unique number of data values

If the data values were generated by different users, the unique number of data values should be larger than ____

Data handling

Remove first, last or certain characters from text

  • Excel: using RIGHT[12] + LEN[13] functions [14]
  • Excel: if the length of text was fixed after removed, you may try to use REPLACE[15] + LEN functions (demo)

Remove leading and trailing spaces from text

UPDATE `table` 
SET `column` = TRIM( `column` ) 
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );

related pages

references