Data cleaning: Difference between revisions
Jump to navigation
Jump to search
→Remove other string look like whitespace
m (→Counting) |
|||
(32 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== | == Check list == | ||
* row count | |||
* duplicate data | |||
== Check if field value was not fulfilled == | |||
=== By purpose === | === By purpose === | ||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
Line 234: | Line 239: | ||
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | * MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | ||
== | == Data Validation == | ||
Validate the format of field value. Related page: [[Regular expression]] | |||
=== Email contains @ symbol === | === Email contains @ symbol === | ||
Line 258: | Line 263: | ||
* If the data was imported from Excel, you should notice the 15 digit precision issue. | * If the data was imported from Excel, you should notice the 15 digit precision issue. | ||
=== | === Check if the column value is numeric === | ||
Possibile values | |||
<pre> | |||
test data: | |||
3.141592654 | |||
1.36184E+14 | |||
123,456.789 | |||
20740199601 | |||
346183773390240 | |||
="5" | |||
</pre> | |||
MySQL: | MySQL: | ||
* Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref> | |||
* Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | * Check if a value is integer e.g. 1234567 | ||
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref> | |||
* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414 | |||
** {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref> | |||
* Check if a value is NOT integer | |||
** Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | |||
If the digit of number is known, the SQL syntax could be more specific | If the digit of number is known, the SQL syntax could be more specific | ||
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | * The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | ||
PHP: | |||
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function | |||
* [https://www.php.net/manual/en/function.is-int.php is_int] function | |||
Excel & [https://www.google.com/sheets/about/ Google Sheets]: | Excel & [https://www.google.com/sheets/about/ Google Sheets]: | ||
Line 281: | Line 306: | ||
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ||
=== Time data: Validate the data format === | |||
[[Validate the datetime value]] | |||
=== Time data | |||
=== Time data: Data was generated in N years === | |||
* {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type | Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) | ||
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1) | * Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. | ||
* | * Verfiy the data were not newer than today | ||
* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g. | |||
** {{code | code = 1900/1/0}} (converted time formatted value from 0), | |||
** {{code | code = 1900/1/1}} (converted time formatted value from 1) | |||
* Verfiy the value of data were not {{kbd | key=0000-00-00 00:00:00}} | |||
* Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance] | |||
Find the normal values: | Find the normal values: | ||
Line 317: | Line 334: | ||
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | * PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | ||
=== | === Time data: Human birth year (age) data === | ||
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | ||
Line 333: | Line 350: | ||
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)] | * [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)] | ||
== Find and remove duplicates == | |||
[[Find and remove duplicates]] | |||
== Counting == | == Counting == | ||
[[Count occurrences of a word in string]] | * [[Count occurrences of a word in string]] | ||
* Count number of unique values | |||
** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial] | |||
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets] | |||
== Outlier / Anomaly detection == | == Outlier / Anomaly detection == | ||
Anomaly detection | [[Anomaly detection]] | ||
== unique number of data values == | == unique number of data values == | ||
Line 489: | Line 389: | ||
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}} | # ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}} | ||
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}} | # ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}} | ||
# [ | # [[Remove non breaking space]] | ||
== Remove control character == | == Remove control character == | ||
Line 533: | Line 401: | ||
=== Fix garbled message text === | === Fix garbled message text === | ||
[[Fix garbled message text]] | [[Fix garbled message text]] | ||
== Tools == | |||
* [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. " | |||
== Further reading == | == Further reading == | ||
Line 543: | Line 414: | ||
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}} | * [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}} | ||
== | == References == | ||
<references/> | <references/> | ||
Line 550: | Line 421: | ||
[[Category:Data Science]] | [[Category:Data Science]] | ||
[[Category:MySQL]] | [[Category:MySQL]] | ||
[[Category: | [[Category:String manipulation]] |