Data cleaning: Difference between revisions
Jump to navigation
Jump to search
→Check if the column value is integer
Tags: Mobile edit Mobile web edit |
|||
(38 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== | == Check if field value was not fulfilled == | ||
=== By purpose === | === By purpose === | ||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
Line 234: | Line 234: | ||
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | * MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | ||
== | == Data Validation == | ||
Validate the format of field value. Related page: [[Regular expression]] | |||
=== Email contains @ symbol === | === Email contains @ symbol === | ||
Line 258: | Line 258: | ||
* If the data was imported from Excel, you should notice the 15 digit precision issue. | * If the data was imported from Excel, you should notice the 15 digit precision issue. | ||
=== | === Check if the column value is integer === | ||
MySQL: | |||
Check if a value is integer | |||
* Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref> | |||
* [https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow] | |||
Check if a value is NOT integer | |||
* Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | |||
If the digit of number is known, the SQL syntax could be more specific | |||
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | |||
=== Check if the column value is numeric === | |||
List of the possible abnormal values: | List of the possible abnormal values: | ||
* All numeric values are odd or even if the data were generated by user naturally. | * All numeric values are odd or even if the data were generated by user naturally. | ||
Line 266: | Line 281: | ||
MySQL: | MySQL: | ||
* Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[ | * Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+\.?[0-9]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref> | ||
Excel & [https://www.google.com/sheets/about/ Google Sheets]: | Excel & [https://www.google.com/sheets/about/ Google Sheets]: | ||
Line 287: | Line 302: | ||
</pre> | </pre> | ||
=== Time data === | === Time data: Validate the data format === | ||
[[Validate the datetime value]] | |||
=== | === Time data: Data was generated in N years === | ||
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) | |||
* were generated | * Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. | ||
* Verfiy the data were not newer than today | |||
* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g. | |||
** {{code | code = 1900/1/0}} (converted time formatted value from 0), | |||
** {{code | code = 1900/1/1}} (converted time formatted value from 1) | |||
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1) | * Verfiy the value of data were not {{kbd | key=0000-00-00 00:00:00}} | ||
* | * Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance] | ||
Find the normal values: | Find the normal values: | ||
Line 314: | Line 330: | ||
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | * PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | ||
=== | === Time data: Human birth year (age) data === | ||
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | ||
Line 333: | Line 349: | ||
=== Find duplicate data === | === Find duplicate data === | ||
==== EXCEL ==== | ==== EXCEL ==== | ||
===== Finding duplicate rows that differ in one column ===== | |||
* one column data: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?] Using {{kbd | key = COUNTIF(range, criteria)}} {{access | date = 2015-08-25}} or using '''Pivot Tables'''(樞紐分析表) to find the occurrence of value >= 2 | * one column data: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?] Using {{kbd | key = COUNTIF(range, criteria)}} {{access | date = 2015-08-25}} or using '''Pivot Tables'''(樞紐分析表) to find the occurrence of value >= 2 | ||
===== Finding duplicate rows that differ in multiple columns ===== | |||
* two or multiple columns data: (approach 1) [https://support.microsoft.com/en-us/kb/213367 How to compare data in two columns to find duplicates in Excel] {{access | date = 2015-06-16}} {{exclaim}} It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000 (approach 2) Using [https://support.office.com/en-us/article/concat-function-9b1a9a3f-94ff-41af-9736-694cbd6b4ca2 CONCAT function] to concatenate two or multiple columns data. And then use {{kbd | key = COUNTIF(range, criteria)}}. | * two or multiple columns data: (approach 1) [https://support.microsoft.com/en-us/kb/213367 How to compare data in two columns to find duplicates in Excel] {{access | date = 2015-06-16}} {{exclaim}} It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000 (approach 2) Using [https://support.office.com/en-us/article/concat-function-9b1a9a3f-94ff-41af-9736-694cbd6b4ca2 CONCAT function] to concatenate two or multiple columns data. And then use {{kbd | key = COUNTIF(range, criteria)}}. | ||
==== Cygwin ==== | |||
* [https://www.computerhope.com/unix/uuniq.htm uniq command] on Cygwin of {{Win}} or {{Linux}}: {{kbd | key=<nowiki>uniq -d <file.txt> > <duplicated_items.txt></nowiki>}}<ref>[https://unix.stackexchange.com/questions/52534/how-to-print-only-the-duplicate-values-from-a-text-file shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange]</ref> | * [https://www.computerhope.com/unix/uuniq.htm uniq command] on Cygwin of {{Win}} or {{Linux}}: {{kbd | key=<nowiki>uniq -d <file.txt> > <duplicated_items.txt></nowiki>}}<ref>[https://unix.stackexchange.com/questions/52534/how-to-print-only-the-duplicate-values-from-a-text-file shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange]</ref> | ||
==== MySQL ==== | ==== MySQL ==== | ||
===== Finding duplicate rows that differ in one column ===== | |||
Find the duplicated data for one column<ref>[http://stackoverflow.com/questions/688549/finding-duplicate-values-in-mysql?rq=1 Finding duplicate values in MySQL - Stack Overflow]</ref> | Find the duplicated data for one column<ref>[http://stackoverflow.com/questions/688549/finding-duplicate-values-in-mysql?rq=1 Finding duplicate values in MySQL - Stack Overflow]</ref> | ||
<pre> | <pre> | ||
Line 369: | Line 389: | ||
</pre> | </pre> | ||
===== Finding duplicate rows that differ in multiple columns ===== | |||
Using {{kbd | key =CONCAT}} for multiple columns ex: column_1, column_2 | Using {{kbd | key =CONCAT}} for multiple columns ex: column_1, column_2 | ||
<pre> | |||
SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' | |||
FROM `table_name` | |||
GROUP BY CONCAT( `column_1`, `column_2` ) | |||
HAVING count > 1; | |||
</pre> | |||
or | |||
<pre> | <pre> | ||
SELECT tmp.key FROM | SELECT tmp.key FROM | ||
Line 380: | Line 409: | ||
</pre> | </pre> | ||
===== other cases ===== | |||
For counting purpose: find the count of repeated id (type: int) between table_a and table_b | For counting purpose: find the count of repeated id (type: int) between table_a and table_b | ||
<pre> | <pre> | ||
Line 417: | Line 447: | ||
=== Counting number of duplicate occurrence === | === Counting number of duplicate occurrence === | ||
MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name {{kbd | key = id}} | |||
* {{kbd | key = SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ; }} | |||
Excel: | |||
* [http://superuser.com/questions/307837/how-to-count-number-of-repeat-occurrences microsoft excel - How to count number of repeat occurrences - Super User] {{exclaim}} long number issue: [https://superuser.com/questions/783840/countif-incorrectly-matches-long-number microsoft excel - Countif incorrectly matches long number - Super User] | |||
=== Other === | === Other === | ||
* symbol e.g. data-mining or data_mining | * symbol e.g. data-mining or data_mining | ||
== | == Counting == | ||
* [[Count occurrences of a word in string]] | |||
* | * Count number of unique values | ||
* | ** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial] | ||
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets] | |||
== Outlier / Anomaly detection == | |||
[[Anomaly detection]] | |||
== unique number of data values == | == unique number of data values == | ||
Line 469: | Line 494: | ||
# [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] ({{kbd | key=<nowiki>nbsp;</nowiki>}}) Replace Non-breaking space with one whitespace using PHP: {{kbd | key=<nowiki>$result = str_replace("\xc2\xa0", ' ', $original_string);</nowiki>}}<ref>[https://stackoverflow.com/questions/40724543/how-to-replace-decoded-non-breakable-space-nbsp php - How to replace decoded Non-breakable space (nbsp) - Stack Overflow]</ref> | # [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] ({{kbd | key=<nowiki>nbsp;</nowiki>}}) Replace Non-breaking space with one whitespace using PHP: {{kbd | key=<nowiki>$result = str_replace("\xc2\xa0", ' ', $original_string);</nowiki>}}<ref>[https://stackoverflow.com/questions/40724543/how-to-replace-decoded-non-breakable-space-nbsp php - How to replace decoded Non-breakable space (nbsp) - Stack Overflow]</ref> | ||
How to display the Non-breaking space? In MySQL | Sentence spacing | ||
# [https://en.wikipedia.org/wiki/Sentence_spacing_in_digital_media Sentence spacing in digital media - Wikipedia] e.g. {{kbd | key=<nowiki>&Nbsp; &Ensp; &Emsp;</nowiki>}} | |||
How to display the Non-breaking space In PHP? | |||
<pre> | |||
$input = '12345678' . hex2bin('c2a0'); | |||
echo $input . PHP_EOL; | |||
## Result of above script: '12345678 ' (one whitespace at the end) | |||
echo bin2hex($input) . PHP_EOL; | |||
## Result of above script: 3132333435363738c2a0 | |||
echo bin2hex('12345678') . PHP_EOL; | |||
## Result of above script: 3132333435363738 (You mat notice the difference of script result is C2A0) | |||
</pre> | |||
How to display the Non-breaking space In MySQL? | |||
<pre> | <pre> | ||
SELECT CONCAT('12345678', UNHEX('C2A0')) | SELECT CONCAT('12345678', UNHEX('C2A0')) | ||
Line 486: | Line 528: | ||
-- Result of above query: 10 | -- Result of above query: 10 | ||
</pre> | </pre> | ||
== Remove control character == | |||
[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia] Using PHP to clean control character: | |||
<pre> | |||
$input = 'some string may contains control characters'; | |||
$replacement = ''; | |||
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input); | |||
</pre> | |||
=== Fix garbled message text === | |||
[[Fix garbled message text]] | |||
== Tools == | |||
* [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. " | |||
== Further reading == | == Further reading == | ||
Line 503: | Line 559: | ||
[[Category:Data Science]] | [[Category:Data Science]] | ||
[[Category:MySQL]] | [[Category:MySQL]] | ||
[[Category:String manipulation]] |