Data cleaning: Difference between revisions

Jump to navigation Jump to search
4,223 bytes added ,  19 November 2019
m
(31 intermediate revisions by the same user not shown)
Line 186: Line 186:
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref>  
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref>  
* R: is.null(): [https://stat.ethz.ch/R-manual/R-devel/library/base/html/NULL.html R: The Null Object]
* R: is.null(): [https://stat.ethz.ch/R-manual/R-devel/library/base/html/NULL.html R: The Null Object]
* Excel<ref>[https://errerrors.blogspot.com/2018/09/check-if-cell-value-is-null-in-excel.html 如何判斷 Excel 儲存格的欄位值是 NULL]</ref>


Find whether a variable is NOT NULL
Find whether a variable is NOT NULL
Line 257: Line 258:
* If the data was imported from Excel, you should notice the 15 digit precision issue.
* If the data was imported from Excel, you should notice the 15 digit precision issue.


=== Numeric only ===
=== Numeric ===
List of the possible abnormal values:
List of the possible abnormal values:
* All numeric values are odd or even if the data were generated by user naturally.
* All numeric values are odd or even if the data were generated by user naturally.
Line 265: Line 266:


MySQL:  
MySQL:  
* Find numeric values {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref>
* Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref>
* Find '''NOT''' numeric values {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}}
* Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}}
 
If the digit of number is known, the SQL syntax could be more specific
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}}


Excel & [https://www.google.com/sheets/about/ Google Sheets]:  
Excel & [https://www.google.com/sheets/about/ Google Sheets]:  
Line 273: Line 277:
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}


*  Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.tw/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}}
*  Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.com/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
Line 322: Line 326:


Using [https://www.w3resource.com/mysql/date-and-time-functions/mysql-unix_timestamp-function.php UNIX_TIMESTAMP() function] to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier {{kbd | key=<nowiki>1970-01-01 00:00:00 UTC</nowiki>}} will all become zero.
Using [https://www.w3resource.com/mysql/date-and-time-functions/mysql-unix_timestamp-function.php UNIX_TIMESTAMP() function] to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier {{kbd | key=<nowiki>1970-01-01 00:00:00 UTC</nowiki>}} will all become zero.
=== String contains special characters ===
* [[Byte order mark]] (BOM)
* [[Return symbol]]
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)]


== Duplicate data ==
== Duplicate data ==
=== Find duplicate data ===
=== Find duplicate data ===
* EXCEL:
==== EXCEL ====
** one column data: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?] Using {{kbd | key = COUNTIF(range, criteria)}} {{access | date = 2015-08-25}} or using '''Pivot Tables'''(樞紐分析表)  to find the occurrence of value >= 2
===== Finding duplicate rows that differ in one column =====
** two columns data: [https://support.microsoft.com/en-us/kb/213367 How to compare data in two columns to find duplicates in Excel] {{access | date = 2015-06-16}} {{exclaim}} It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000
* one column data: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?] Using {{kbd | key = COUNTIF(range, criteria)}} {{access | date = 2015-08-25}} or using '''Pivot Tables'''(樞紐分析表)  to find the occurrence of value >= 2
 
===== Finding duplicate rows that differ in multiple columns =====
* two or multiple columns data: (approach 1) [https://support.microsoft.com/en-us/kb/213367 How to compare data in two columns to find duplicates in Excel] {{access | date = 2015-06-16}} {{exclaim}} It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000 (approach 2) Using [https://support.office.com/en-us/article/concat-function-9b1a9a3f-94ff-41af-9736-694cbd6b4ca2 CONCAT function] to concatenate two or multiple columns data. And then use {{kbd | key = COUNTIF(range, criteria)}}.


==== Cygwin ====
* [https://www.computerhope.com/unix/uuniq.htm uniq command] on Cygwin of {{Win}} or {{Linux}}: {{kbd | key=<nowiki>uniq -d <file.txt> > <duplicated_items.txt></nowiki>}}<ref>[https://unix.stackexchange.com/questions/52534/how-to-print-only-the-duplicate-values-from-a-text-file shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange]</ref>
* [https://www.computerhope.com/unix/uuniq.htm uniq command] on Cygwin of {{Win}} or {{Linux}}: {{kbd | key=<nowiki>uniq -d <file.txt> > <duplicated_items.txt></nowiki>}}<ref>[https://unix.stackexchange.com/questions/52534/how-to-print-only-the-duplicate-values-from-a-text-file shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange]</ref>


* MySQL:
==== MySQL ====
 
===== Finding duplicate rows that differ in one column =====
Find the duplicated data for one column<ref>[http://stackoverflow.com/questions/688549/finding-duplicate-values-in-mysql?rq=1 Finding duplicate values in MySQL - Stack Overflow]</ref>
Find the duplicated data for one column<ref>[http://stackoverflow.com/questions/688549/finding-duplicate-values-in-mysql?rq=1 Finding duplicate values in MySQL - Stack Overflow]</ref>
<pre>
<pre>
Line 362: Line 376:
</pre>
</pre>


===== Finding duplicate rows that differ in multiple columns =====
Using {{kbd | key =CONCAT}} for multiple columns ex: column_1, column_2  
Using {{kbd | key =CONCAT}} for multiple columns ex: column_1, column_2  
<pre>
SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
FROM `table_name`
GROUP BY CONCAT(  `column_1`, `column_2`  )
HAVING count > 1;
</pre>
or
<pre>
<pre>
SELECT tmp.key FROM
SELECT tmp.key FROM
Line 373: Line 396:
</pre>
</pre>


===== other cases =====
For counting purpose: find the count of repeated id (type: int) between table_a and table_b
For counting purpose: find the count of repeated id (type: int) between table_a and table_b
<pre>
<pre>
Line 381: Line 405:
</pre>
</pre>


* [https://chrome.google.com/webstore/detail/power-tools/dofhceeoedodcaheeoacmadcpegkjobi Power Tools] for Google Spreadsheet {{access | date = 2016-02-26}}
==== Google Spreadsheet ====
 
* [https://www.ablebits.com/google-sheets-add-ons/remove-duplicates/index.php Remove duplicates in Google Sheets] 30 days free {{access | date = 2019-02-26}}
* [https://chrome.google.com/webstore/detail/power-tools/dofhceeoedodcaheeoacmadcpegkjobi Power Tools] for Google Spreadsheet {{access | date = 2019-02-26}}
** Menu: Data -> Remove duplicates
** Menu: Data -> Remove duplicates


Line 397: Line 424:


* [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html GNU Coreutils: sort invocation] OS: {{Linux}}, cygwin of {{Win}}. More details on [[Alternative_Linux_commands#Merge_multiple_plain_text_files | Merge multiple plain text files]].
* [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html GNU Coreutils: sort invocation] OS: {{Linux}}, cygwin of {{Win}}. More details on [[Alternative_Linux_commands#Merge_multiple_plain_text_files | Merge multiple plain text files]].
** To remove duplicate lines: {{kbd | key=sort -us -o output_unique.file input.file}} in a large text file (GB)<ref>[http://unix.stackexchange.com/questions/19641/how-to-remove-duplicate-lines-in-a-large-multi-gb-textfile linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange]</ref>  
** To remove duplicate lines:  
*** {{kbd | key=<nowiki>sort -us -o <output_unique.file> <input.file></nowiki>}} in a large text file (GB)<ref>[http://unix.stackexchange.com/questions/19641/how-to-remove-duplicate-lines-in-a-large-multi-gb-textfile linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange]</ref>  
*** {{kbd | key=<nowiki>cat <input.file> | grep <pattern> | sort | uniq</nowiki>}} Processes text line by line and prints the '''unique''' lines which match a specified pattern. Equal to these steps: (1) {{kbd | key=<nowiki>cat <input.file> | grep <pattern> > <tmp.file></nowiki>}} (2) {{kbd | key=<nowiki>sort <tmp.file> | uniq</nowiki>}}
** Ignore first n line(s) & remove duplicate lines<ref>[https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort sorting - Is there a way to ignore header lines in a UNIX sort? - Stack Overflow]</ref><ref>[http://linux.vbird.org/linux_basic/0320bash.php#redirect_com 命令執行的判斷依據: ; , &&, ||]</ref><ref>[https://www.computerhope.com/unix/utail.htm Linux tail command help and examples]</ref>
** Ignore first n line(s) & remove duplicate lines<ref>[https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort sorting - Is there a way to ignore header lines in a UNIX sort? - Stack Overflow]</ref><ref>[http://linux.vbird.org/linux_basic/0320bash.php#redirect_com 命令執行的判斷依據: ; , &&, ||]</ref><ref>[https://www.computerhope.com/unix/utail.htm Linux tail command help and examples]</ref>
*** (1) ignore first one line: {{kbd | key=<nowiki>(head -n 1 <file> && tail -n +2 <file> | sort -us) > newfile</nowiki>}}   
*** (1) ignore first one line: {{kbd | key=<nowiki>(head -n 1 <file> && tail -n +2 <file> | sort -us) > newfile</nowiki>}}   
Line 405: Line 434:


=== Counting number of duplicate occurrence ===
=== Counting number of duplicate occurrence ===
* MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name {{kbd | key = id}}
MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name {{kbd | key = id}}
** {{kbd | key = SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ; }}
* {{kbd | key = SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ; }}
* Excel:  
 
** [http://superuser.com/questions/307837/how-to-count-number-of-repeat-occurrences microsoft excel - How to count number of repeat occurrences - Super User] {{exclaim}} long number issue: [https://superuser.com/questions/783840/countif-incorrectly-matches-long-number microsoft excel - Countif incorrectly matches long number - Super User]
Excel:  
* [http://superuser.com/questions/307837/how-to-count-number-of-repeat-occurrences microsoft excel - How to count number of repeat occurrences - Super User] {{exclaim}} long number issue: [https://superuser.com/questions/783840/countif-incorrectly-matches-long-number microsoft excel - Countif incorrectly matches long number - Super User]


=== Other ===
=== Other ===
* symbol e.g. data-mining or data_mining
* symbol e.g. data-mining or data_mining
== Counting ==
[[Count occurrences of a word in string]]


== Outlier / Anomaly detection ==
== Outlier / Anomaly detection ==
Line 447: Line 481:
</pre>
</pre>


==== other text look like whitespace ====
=== Remove other string look like whitespace ===
[https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]
[https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]
# [http://www.fileformat.info/info/unicode/char/3000/index.htm IDEOGRAPHIC SPACE] (全形空白, U+3000)<ref>[https://www.ptt.cc/bbs/PHP/M.1473522044.A.4FC.html Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊]</ref>:
# [http://www.fileformat.info/info/unicode/char/3000/index.htm IDEOGRAPHIC SPACE] (全形空白、全型空白, U+3000)<ref>[https://www.ptt.cc/bbs/PHP/M.1473522044.A.4FC.html Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊]</ref>:
#* diaplay: <pre><nowiki><?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?></nowiki></pre>
#* diaplay: <pre><nowiki><?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?></nowiki></pre>
#* replace with space: <pre><nowiki><?php echo str_replace(json_decode('"\u3000"'), " ", $string);?></nowiki></pre>
#* replace with space: <pre><nowiki><?php echo str_replace(json_decode('"\u3000"'), " ", $string);?></nowiki></pre>
# ASCII Vertical Tab \v
# ASCII Vertical Tab {{kbd | key=<nowiki>\v</nowiki>}}
# ASCII Horizontal Tab (TAB) \t
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}}
# ASCII Backspace \b
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}}
# [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] ({{kbd | key=<nowiki>nbsp;</nowiki>}}) Replace Non-breaking space with one whitespace using PHP: {{kbd | key=<nowiki>$result = str_replace("\xc2\xa0", ' ', $original_string);</nowiki>}}<ref>[https://stackoverflow.com/questions/40724543/how-to-replace-decoded-non-breakable-space-nbsp php - How to replace decoded Non-breakable space (nbsp) - Stack Overflow]</ref>
 
How to display the Non-breaking space In PHP?
<pre>
$input = '12345678' . hex2bin('c2a0');
echo $input . PHP_EOL;
## Result of above script: '12345678 ' (one whitespace at the end)
 
echo bin2hex($input) . PHP_EOL;
## Result of above script: 3132333435363738c2a0
 
echo bin2hex('12345678') . PHP_EOL;
## Result of above script: 3132333435363738 (You mat notice the difference of script result is C2A0)
 
</pre>
 
How to display the Non-breaking space In MySQL?
<pre>
SELECT CONCAT('12345678', UNHEX('C2A0'))
-- Result of above query: '12345678 ' (one whitespace at the end)
 
SELECT HEX(CONCAT('12345678', UNHEX('C2A0')))
-- Result of above query: 3132333435363738C2A0
 
SELECT HEX('12345678')
-- Result of above query: 3132333435363738 (You mat notice the difference of query result is C2A0)
 
SELECT LENGTH('12345678')
-- Result of above query: 8
 
SELECT LENGTH(CONCAT('12345678', UNHEX('C2A0')))
-- Result of above query: 10
</pre>
 
== Remove control character ==
[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia] Using PHP to clean control character:
<pre>
$input = 'some string may contains control characters';
$replacement = '';
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input);
</pre>
 
=== Fix garbled message text ===
[[Fix garbled message text]]


== Further reading ==
== Further reading ==
Line 463: Line 541:
* [https://docs.google.com/spreadsheets/d/1UyumfruGjps7xqd4aSB-NKfmu54NIscI2OV9Uht8F7M/edit?usp=sharing  判斷 Excel 欄位值是否錯誤] 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
* [https://docs.google.com/spreadsheets/d/1UyumfruGjps7xqd4aSB-NKfmu54NIscI2OV9Uht8F7M/edit?usp=sharing  判斷 Excel 欄位值是否錯誤] 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
* [[Web user behavior]]
* [[Web user behavior]]
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}}


== references ==
== references ==
Line 471: Line 550:
[[Category:Data Science]]
[[Category:Data Science]]
[[Category:MySQL]]
[[Category:MySQL]]
[[Category:Text file processing]]

Navigation menu