Data cleaning: Difference between revisions

Jump to navigation Jump to search
2,204 bytes added ,  25 November 2025
 
(15 intermediate revisions by the same user not shown)
Line 2: Line 2:
== Check list ==
== Check list ==


* row count
* Row count: The number of data entries is a fundamental item for data verification and is easy to observe and check. For instance, one can compare the number of entries displayed on a webpage to the number of entries after exporting to a CSV file.
* duplicate data
* Duplicate data


== Check if field value was not fulfilled  ==
== Check if field value was not fulfilled  ==
Line 73: Line 73:
</table>
</table>


=== by datatype ===
=== By datatype ===
==== VARCHAR and NOT allows NULL value ====
==== VARCHAR and NOT allows NULL value ====
Using NULLIF() function<ref>[https://www.w3schools.com/sql/func_mysql_nullif.asp MySQL NULLIF() Function]</ref>
Using NULLIF() function<ref>[https://www.w3schools.com/sql/func_mysql_nullif.asp MySQL NULLIF() Function]</ref>
Line 268: Line 268:
Validate the format of field value. Related page: [[Regular expression]]
Validate the format of field value. Related page: [[Regular expression]]


=== Email contains @ symbol ===
=== Verify the strings are in valid email format ===
Rule: Email contains @ symbol
 
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
Line 276: Line 278:
* PHP: [http://www.w3schools.com/php/filter_validate_email.asp PHP FILTER_VALIDATE_EMAIL Filter]
* PHP: [http://www.w3schools.com/php/filter_validate_email.asp PHP FILTER_VALIDATE_EMAIL Filter]
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]
=== Verify the strings are in valid url format ===
Rule: Begin with http or https
* Google spreadsheet {{kbd | key =<nowiki>=REGEXMATCH(A1, "^http(s?)")</nowiki>}}


=== Number precision in Excel ===
=== Number precision in Excel ===
Line 289: Line 296:
* If the data was imported from Excel, you should notice the 15 digit precision issue.
* If the data was imported from Excel, you should notice the 15 digit precision issue.


=== Check if the column value is numeric ===
=== Verify the column values are numeric ===
Possibile values
 
Possible values


<pre>
<pre>
Line 302: Line 310:
</pre>
</pre>


==== Verify if value is number in MySQL ====
MySQL:  
MySQL:  


* Check if a value is integer e.g. 1234567
* Check if a value is integer e.g. 1234567
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref>
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref>
* Find the records which the value of `my_column` is not exactly 8 digits {{code | code = SELECT * FROM my_table WHERE LENGTH(my_column) != 8 OR my_column NOT REGEXP '^[0-9]{8}$'}}
** The `LENGTH()` function checks if the string length is not 8 characters
** The `REGEXP '^[0-9]{8}$'` pattern validates that the value contains exactly 8 digits from start (^) to end ($)
** Using both conditions ensures catching values with correct length but non-numeric characters, as well as incorrect lengths


* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
Line 317: Line 331:
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}}
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}}


==== Verify if value is number in PHP ====


PHP:
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function
* [https://www.php.net/manual/en/function.is-int.php is_int] function
* [https://www.php.net/manual/en/function.is-int.php is_int] function


 
==== Verify if value is number in Excel or Google sheet ====
Excel & [https://www.google.com/sheets/about/ Google Sheets]:  
Excel & [https://www.google.com/sheets/about/ Google Sheets]:  
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}}
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}}
Line 337: Line 351:
=== Time data: Data was generated in N years ===
=== Time data: Data was generated in N years ===
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
* Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type.
* Verify the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. e.g.
* Verfiy the data were not newer than today
 
* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.  
* Verify the data were not newer than today
 
* Verify the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.  
** {{code | code = 1900/1/0}} (converted time formatted value from 0),  
** {{code | code = 1900/1/0}} (converted time formatted value from 0),  
** {{code | code = 1900/1/1}} (converted time formatted value from 1)
** {{code | code = 1900/1/1}} (converted time formatted value from 1)
* Verfiy the value of data were not {{kbd | key=0000-00-00 00:00:00}}
 
* Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]
* Verify the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]


Find the normal values:  
Find the normal values:  
Line 353: Line 369:
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
Abnormal values
* {{code | code = 1970-01-01 08:00:00}} (converted time formatted value from {{code | code =August 3, 2017}}) caused by the string contains special characters e.g. [https://en.wikipedia.org/wiki/Left-to-right_mark left-to-right mark (LRM) ]


Check if the date valid
Check if the date valid
Line 375: Line 394:
* [[Return symbol]]
* [[Return symbol]]
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)]
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]<ref>[https://stackoverflow.com/questions/1930009/how-to-strip-unicode-chars-left-to-right-mark-from-a-string-in-php regex - How to strip unicode chars (LEFT_TO_RIGHT_MARK) from a string in php - Stack Overflow]</ref>
== File Validation ==
=== Verify the file format of downloaded file ===
* PDF file format: [https://stackoverflow.com/questions/16152583/tell-if-a-file-is-pdf-in-bash Tell if a file is PDF in bash - Stack Overflow]


== Find and remove duplicates ==
== Find and remove duplicates ==
Line 432: Line 458:


== Tools ==
== Tools ==
* [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 ‎Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. "
* {{Mac}} [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 ‎Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. "
* [https://gchq.github.io/CyberChef/ CyberChef] (source code available on [https://github.com/gchq/CyberChef github])  The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis


== Further reading ==
== Further reading ==
Line 445: Line 472:
== References ==
== References ==
<references/>
<references/>
{{Template:Data factory flow}}


[[Category:Spreadsheet]] [[Category:Excel]]
[[Category:Spreadsheet]] [[Category:Excel]]

Navigation menu