Data cleaning: Difference between revisions
m (→duplicate data) |
Tags: Mobile edit Mobile web edit |
||
(139 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== | == Check list == | ||
=== | |||
* row count | |||
* duplicate data | |||
== Check if field value was not fulfilled == | |||
=== By purpose === | |||
<table border="1" style="width: 100%"> | |||
<tr style="background-color: #555555; color: #ffffff;"> | |||
<td style="background-color: #777; color: #ffffff;">Purpose</td> | |||
<td style="background-color: #777; color: #ffffff; width: 350px;">Method (MySQL query syntax)</td> | |||
<td style="vertical-align: top;">Value1: <br />Fulfilled value what I want</td> | |||
<td style="vertical-align: top;">Value2: <br />Fulfilled value NOT I want</td> | |||
<td style="vertical-align: top;">Value3: <br />0</td> | |||
<td style="vertical-align: top;">Value4: <br />NULL value</td> | |||
<td style="vertical-align: top;">Value5: <br />Empty or white-spaces characters</td> | |||
</tr> | |||
<tr> | |||
<td>values were not fulfilled or empty <br />(not contains 0)</td> | |||
<td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0}}</td> | |||
<td></td> | |||
<td></td> | |||
<td></td> | |||
<td>V</td> | |||
<td>V</td> | |||
</tr> | |||
<tr> | |||
<td>values were not fulfilled or empty<br />(contains 0)</td> | |||
<td></td> | |||
<td></td> | |||
<td></td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td>V</td> | |||
</tr> | |||
<tr> | |||
<td>values were fulfilled and non-empty<br />(not contains 0)</td> | |||
<td></td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td></td> | |||
<td></td> | |||
<td></td> | |||
</tr> | |||
<tr> | |||
<td>values were fulfilled and non-empty<br />(contains 0)</td> | |||
<td>{{kbd | key=<nowiki>WHERE LENGTH(TRIM( column_name )) > 0</nowiki>}}</td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td></td> | |||
<td></td> | |||
</tr> | |||
<tr> | |||
<td>values (1) were not fulfilled or empty values (2) NOT I want <br />(not contains 0)</td> | |||
<td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0 <br /> OR column_name LIKE 'values NOT I want'}}</td> | |||
<td></td> | |||
<td>V</td> | |||
<td></td> | |||
<td>V</td> | |||
<td>V</td> | |||
</tr> | |||
<tr> | |||
<td>values (1) were not fulfilled or empty values (2) NOT I want <br />(contains 0)</td> | |||
<td></td> | |||
<td></td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td>V</td> | |||
<td>V</td> | |||
</tr> | |||
</table> | |||
=== by datatype === | |||
==== VARCHAR and allows NULL value ==== | |||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
<tr style="background-color: #555555; color: #ffffff;"> | <tr style="background-color: #555555; color: #ffffff;"> | ||
Line 12: | Line 85: | ||
</tr> | </tr> | ||
<tr> | <tr> | ||
<td rowspan=" | <td rowspan="5">{{kbd | key = VARCHAR}} and allows {{kbd | key = NULL}}</td> | ||
<td>fulfilled value ex:123</td> | <td>fulfilled value ex:123</td> | ||
<td></td> | <td></td> | ||
Line 20: | Line 93: | ||
</tr> | </tr> | ||
<tr> | <tr> | ||
<td>{{kbd | key = NULL}}</td> | <td>{{kbd | key = NULL}} type:null</td> | ||
<td>V</td> | <td>V</td> | ||
<td></td> | <td></td> | ||
<td>V</td> | <td>V</td> | ||
<td></td> | |||
</tr> | |||
<tr> | |||
<td>{{kbd | key = 'NULL'}} type:string</td> | |||
<td></td> | |||
<td></td> | |||
<td></td> | |||
<td></td> | <td></td> | ||
</tr> | </tr> | ||
Line 56: | Line 136: | ||
* method4: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NOT NULL</nowiki>}} | * method4: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NOT NULL</nowiki>}} | ||
==== VARCHAR or numeric ==== | |||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
<tr style="background-color: #555555; color: #ffffff;"> | <tr style="background-color: #555555; color: #ffffff;"> | ||
Line 89: | Line 169: | ||
* method5: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max</nowiki>}} | * method5: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max</nowiki>}} | ||
* method6: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )</nowiki>}} | * method6: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )</nowiki>}} | ||
==== datetime and allows NULL value ==== | |||
possible column values | |||
# {{kbd | key = {{Template:Today}}}} | |||
# {{kbd | key = 00-00-00 00:00:00}} | |||
# {{kbd | key = NULL}} | |||
=== is null === | === is null === | ||
Fill 0 if the value is NA or NULL | |||
* MySQL SQL syntax: {{kbd | key = SELECT COALESCE(column_name, 0) }} | * MySQL SQL syntax: COALESCE(): {{kbd | key = SELECT COALESCE(column_name, 0) }} or {{kbd | key = SELECT COALESCE(column_name, 'other_filled_value') }} | ||
** (1) Using [http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php COALESCE() function] to replace the NULL value with 0. | ** (1) Using [http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php COALESCE() function] to replace the NULL value with 0. | ||
** (2) The case: 0/0 = null should be handled. | ** (2) The case: 0/0 = null should be handled. | ||
* MySQL SQL syntax: combined IF() & ISNULL():{{kbd | key = SELECT IF(ISNULL((column_name), 0, column_name) }} or {{kbd | key = SELECT IF(ISNULL((column_name), 'other_filled_value', column_name) }} | |||
* python: [http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html pandas.DataFrame.fillna — pandas 0.16.0 documentation] "Fill NA/NaN values using the specified method" | * python: [http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html pandas.DataFrame.fillna — pandas 0.16.0 documentation] "Fill NA/NaN values using the specified method" | ||
Find whether a variable is NULL. [http://ideone.com/tKv2gQ online demo] | Find whether a variable is NULL. [http://ideone.com/tKv2gQ online demo] | ||
* PHP [http://tw2.php.net/is_null is_null] | * PHP [http://tw2.php.net/is_null is_null] to find type:null null NOT type:string <nowiki>'null'</nowiki> {{exclaim}} | ||
* Google spreadsheet / Excel: | * Google spreadsheet / Excel: | ||
** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}} | ** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}} | ||
** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use {{kbd | key = <nowiki>COUNTIF(value, "NULL")</nowiki>}} or {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}} | ** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use {{kbd | key = <nowiki>COUNTIF(value, "NULL")</nowiki>}} or {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}} | ||
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref> | * MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref> | ||
* R: is.null(): [https://stat.ethz.ch/R-manual/R-devel/library/base/html/NULL.html R: The Null Object] | |||
* Excel<ref>[https://errerrors.blogspot.com/2018/09/check-if-cell-value-is-null-in-excel.html 如何判斷 Excel 儲存格的欄位值是 NULL]</ref> | |||
Find whether a variable is NOT NULL | Find whether a variable is NOT NULL | ||
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}} | * MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}} | ||
Find whether a variable is NOT {{kbd | key = #N/A}} | |||
* Excel: {{kbd | key = =NOT(ISERROR(cell_value))}} | |||
javascript | |||
* [http://www.w3schools.com/jsref/jsref_isnan.asp JavaScript isNaN() Function] | |||
=== check if field value was not fulfilled: NULL, empty value === | === check if field value was not fulfilled: NULL, empty value === | ||
{{exclaim}} NOT include those data which its field value fulfilled with default value automatically | {{exclaim}} NOT include those data which its field value fulfilled with default value automatically ([http://sqlfiddle.com/#!2/9b01e/3/0 demo on sqlfiddle]) | ||
# {{Gd}} quick solution: find records with {{kbd | key = NULL}} value OR empty, space value | |||
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;}} | |||
# find records with {{kbd | key = NULL}} value: (note: not {{kbd | key = #NULL!}}) | # find records with {{kbd | key = NULL}} value: (note: not {{kbd | key = #NULL!}}) | ||
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}} | #* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}} | ||
Line 120: | Line 216: | ||
#* solution: step1: Replace the year > 100 from this year with empty value at EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2))</nowiki>}} (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format | #* solution: step1: Replace the year > 100 from this year with empty value at EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2))</nowiki>}} (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format | ||
#* trivial approach : EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2))</nowiki>}} {{exclaim}} this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0. | #* trivial approach : EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2))</nowiki>}} {{exclaim}} this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0. | ||
# Using PHP [http://php.net/manual/en/function.empty.php empty()] function to find 0, null, false, empty string, empty array values. | # Using PHP | ||
#* [http://php.net/manual/en/function.empty.php empty()] function to find 0, null, false, empty string, empty array values. | |||
#* <code>if(empty($var) && $var !== 0 && $var !== "0"){ .. } </code> to find null, false, empty string, empty array values BUT not 0. | |||
# check if field value was NULL & not equal to some value | |||
#* MySQL: [http://stackoverflow.com/questions/17097890/mysql-select-on-fields-containing-null-values MySql select on fields containing null values - Stack Overflow] | |||
=== check if field value was fulfilled === | === check if field value was fulfilled === | ||
Line 139: | Line 239: | ||
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | * MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | ||
== | == Data Validation == | ||
Validate the format of field value. Related page: [[Regular expression]] | |||
=== | === Email contains @ symbol === | ||
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not | * EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not | ||
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found | ** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found | ||
Line 151: | Line 251: | ||
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net] | ** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net] | ||
=== | === Number precision in Excel === | ||
Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref> | Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref> | ||
Line 163: | Line 263: | ||
* If the data was imported from Excel, you should notice the 15 digit precision issue. | * If the data was imported from Excel, you should notice the 15 digit precision issue. | ||
=== numeric | === Check if the column value is numeric === | ||
Possibile values | |||
* | |||
* Excel: [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function] | <pre> | ||
test data: | |||
3.141592654 | |||
1.36184E+14 | |||
123,456.789 | |||
20740199601 | |||
346183773390240 | |||
="5" | |||
</pre> | |||
MySQL: | |||
* Check if a value is integer e.g. 1234567 | |||
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref> | |||
* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414 | |||
** {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref> | |||
* Check if a value is NOT integer | |||
** Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | |||
If the digit of number is known, the SQL syntax could be more specific | |||
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | |||
PHP: | |||
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function | |||
* [https://www.php.net/manual/en/function.is-int.php is_int] function | |||
Excel & [https://www.google.com/sheets/about/ Google Sheets]: | |||
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}} | |||
** Return 1 if the cell value is (1) Numbers (2) Numbers in [https://en.wikipedia.org/wiki/Scientific_notation scientific (exponential) notation] e.g. {{code | code = <nowiki>1.36184E+14</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | |||
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | |||
* Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.com/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}} | |||
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | |||
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | |||
=== | === Time data: Validate the data format === | ||
[[Validate the datetime value]] | |||
=== Time data: Data was generated in N years === | |||
* {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type | Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) | ||
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1) | * Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. | ||
* | * Verfiy the data were not newer than today | ||
* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g. | |||
** {{code | code = 1900/1/0}} (converted time formatted value from 0), | |||
** {{code | code = 1900/1/1}} (converted time formatted value from 1) | |||
* Verfiy the value of data were not {{kbd | key=0000-00-00 00:00:00}} | |||
* Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance] | |||
Find the normal values: | Find the normal values: | ||
Line 184: | Line 325: | ||
** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1); }} | ** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1); }} | ||
* MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach. | * MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach. | ||
** | ** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);}} | ||
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue) | *** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue) | ||
== duplicate data == | Check if the date valid | ||
* Excel: [https://support.office.com/en-us/article/WEEKDAY-function-60e44483-2ed1-439f-8bd0-e404c190949a?ui=en-US&rs=en-US&ad=US WEEKDAY function] <ref>[http://www.mrexcel.com/forum/excel-questions/322203-check-if-valid-date.html Check if a valid date?]</ref> | |||
** legal results: {{kbd | key=1}} (Sunday) ~ {{kbd | key=7}} (Saturday) | |||
** illegal result: {{kbd | key= <nowiki>#VALUE!</nowiki>}} | |||
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | |||
=== Time data: Human birth year (age) data === | |||
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | |||
MySQL query is as follows<ref>[https://stackoverflow.com/questions/5773405/calculate-age-in-mysql-innodb sql - Calculate Age in MySQL (InnoDb) - Stack Overflow]</ref> where the column {{kbd | key=<nowiki>`birthday`</nowiki>}} is {{kbd | key=<nowiki>date</nowiki>}} type. | |||
<pre> | |||
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122 | |||
</pre> | |||
Using [https://www.w3resource.com/mysql/date-and-time-functions/mysql-unix_timestamp-function.php UNIX_TIMESTAMP() function] to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier {{kbd | key=<nowiki>1970-01-01 00:00:00 UTC</nowiki>}} will all become zero. | |||
=== String contains special characters === | |||
* [[Byte order mark]] (BOM) | |||
* [[Return symbol]] | |||
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)] | |||
== Duplicate data == | |||
=== Find duplicate data === | |||
==== EXCEL ==== | |||
===== Finding duplicate rows that differ in one column ===== | |||
* one column data: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?] Using {{kbd | key = COUNTIF(range, criteria)}} {{access | date = 2015-08-25}} or using '''Pivot Tables'''(樞紐分析表) to find the occurrence of value >= 2 | |||
===== Finding duplicate rows that differ in multiple columns ===== | |||
* two or multiple columns data: (approach 1) [https://support.microsoft.com/en-us/kb/213367 How to compare data in two columns to find duplicates in Excel] {{access | date = 2015-06-16}} {{exclaim}} It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000 (approach 2) Using [https://support.office.com/en-us/article/concat-function-9b1a9a3f-94ff-41af-9736-694cbd6b4ca2 CONCAT function] to concatenate two or multiple columns data. And then use {{kbd | key = COUNTIF(range, criteria)}}. | |||
==== Cygwin ==== | |||
* [https://www.computerhope.com/unix/uuniq.htm uniq command] on Cygwin of {{Win}} or {{Linux}}: {{kbd | key=<nowiki>uniq -d <file.txt> > <duplicated_items.txt></nowiki>}}<ref>[https://unix.stackexchange.com/questions/52534/how-to-print-only-the-duplicate-values-from-a-text-file shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange]</ref> | |||
==== MySQL ==== | |||
===== Finding duplicate rows that differ in one column ===== | |||
Find the duplicated data for one column<ref>[http://stackoverflow.com/questions/688549/finding-duplicate-values-in-mysql?rq=1 Finding duplicate values in MySQL - Stack Overflow]</ref> | |||
<pre> | |||
-- Generate test data. | |||
CREATE TABLE `table_name` ( | |||
`id` int(11) NOT NULL, | |||
`content` varchar(5) NOT NULL | |||
) ENGINE=InnoDB DEFAULT CHARSET=utf8; | |||
INSERT INTO `table_name` (`id`, `content`) VALUES | |||
(1, 'apple'), | |||
(2, 'lemon'), | |||
(3, 'apple'); | |||
ALTER TABLE `table_name` | |||
ADD PRIMARY KEY (`id`); | |||
-- Find duplicated data | |||
SELECT `content`, COUNT(*) count | |||
FROM `table_name` | |||
GROUP BY `content` | |||
HAVING count > 1; | |||
SELECT tmp.* FROM | |||
( | |||
SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` | |||
) tmp | |||
WHERE tmp.count >1; | |||
</pre> | |||
===== Finding duplicate rows that differ in multiple columns ===== | |||
Using {{kbd | key =CONCAT}} for multiple columns ex: column_1, column_2 | |||
<pre> | |||
SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' | |||
FROM `table_name` | |||
GROUP BY CONCAT( `column_1`, `column_2` ) | |||
HAVING count > 1; | |||
</pre> | |||
or | |||
<pre> | |||
SELECT tmp.key FROM | |||
( | |||
SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' | |||
FROM `table_name` | |||
GROUP BY CONCAT( `column_1`, `column_2` ) | |||
) tmp | |||
WHERE tmp.count >=2 | |||
</pre> | |||
===== other cases ===== | |||
For counting purpose: find the count of repeated id (type: int) between table_a and table_b | |||
<pre> | |||
SELECT count(DISTINCT(id)) FROM table_a WHERE id IN | |||
( | |||
SELECT DISTINCT(id) FROM table_b | |||
) | |||
</pre> | |||
==== Google Spreadsheet ==== | |||
* [https://www.ablebits.com/google-sheets-add-ons/remove-duplicates/index.php Remove duplicates in Google Sheets] 30 days free {{access | date = 2019-02-26}} | |||
* [https://chrome.google.com/webstore/detail/power-tools/dofhceeoedodcaheeoacmadcpegkjobi Power Tools] for Google Spreadsheet {{access | date = 2019-02-26}} | |||
** Menu: Data -> Remove duplicates | |||
=== Deduplicate === | |||
* EXCEL: Data Tools -> Remove Duplicates: [https://support.office.com/en-us/article/Filter-for-unique-values-or-remove-duplicate-values-d6549cf0-357a-4acf-9df5-ca507915b704 Filter for unique values or remove duplicate values] {{access | date = 2015-10-20}} | |||
* PHP: [http://php.net/manual/en/function.array-unique.php PHP: array_unique], [http://php.net/manual/en/function.array-intersect.php PHP: array_intersect] | * PHP: [http://php.net/manual/en/function.array-unique.php PHP: array_unique], [http://php.net/manual/en/function.array-intersect.php PHP: array_intersect] | ||
* MySQL: | |||
** [http://www.mysqltutorial.org/mysql-distinct.aspx MySQL DISTINCT - Eliminate Duplicate Rows in a Result Set]. Using | * MySQL: select deduplicated records | ||
** [http://www.mysqltutorial.org/mysql-distinct.aspx MySQL DISTINCT - Eliminate Duplicate Rows in a Result Set]. Using {{kbd | key =GROUP_CONCAT}} to handle the multiple columns<ref>[http://stackoverflow.com/questions/12188027/mysql-select-distinct-multiple-columns sql - MySQL SELECT DISTINCT multiple columns - Stack Overflow]</ref> | |||
** [http://www.w3schools.com/sql/sql_unique.asp SQL UNIQUE Constraint] "Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table." Quoted from w3schools webpage. | ** [http://www.w3schools.com/sql/sql_unique.asp SQL UNIQUE Constraint] "Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table." Quoted from w3schools webpage. | ||
** "{{kbd | key = UNION}} removes duplicates, whereas {{kbd | key = UNION ALL}} does not." source: [http://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all sql - What is the difference between UNION and UNION ALL? - Stack Overflow] | |||
* MySQL: delete duplicated records | |||
** [http://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql sql - Delete all Duplicate Rows except for One in MySQL? - Stack Overflow] | |||
* [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html GNU Coreutils: sort invocation] OS: {{Linux}}, cygwin of {{Win}}. More details on [[Alternative_Linux_commands#Merge_multiple_plain_text_files | Merge multiple plain text files]]. | |||
* | ** To remove duplicate lines: | ||
** {{kbd | key = | *** {{kbd | key=<nowiki>sort -us -o <output_unique.file> <input.file></nowiki>}} in a large text file (GB)<ref>[http://unix.stackexchange.com/questions/19641/how-to-remove-duplicate-lines-in-a-large-multi-gb-textfile linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange]</ref> | ||
* | *** {{kbd | key=<nowiki>cat <input.file> | grep <pattern> | sort | uniq</nowiki>}} Processes text line by line and prints the '''unique''' lines which match a specified pattern. Equal to these steps: (1) {{kbd | key=<nowiki>cat <input.file> | grep <pattern> > <tmp.file></nowiki>}} (2) {{kbd | key=<nowiki>sort <tmp.file> | uniq</nowiki>}} | ||
** Ignore first n line(s) & remove duplicate lines<ref>[https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort sorting - Is there a way to ignore header lines in a UNIX sort? - Stack Overflow]</ref><ref>[http://linux.vbird.org/linux_basic/0320bash.php#redirect_com 命令執行的判斷依據: ; , &&, ||]</ref><ref>[https://www.computerhope.com/unix/utail.htm Linux tail command help and examples]</ref> | |||
*** (1) ignore first one line: {{kbd | key=<nowiki>(head -n 1 <file> && tail -n +2 <file> | sort -us) > newfile</nowiki>}} | |||
*** (2) ignore first two lines: {{kbd | key=<nowiki>(head -n 2 <file> && tail -n +3 <file> | sort -us) > newfile</nowiki>}} | |||
* Google spreadsheet add-on: [https://www.ablebits.com/google-sheets-add-ons/remove-duplicates/howto.php Remove Duplicates for Google Sheets help] | |||
== data handling == | === Counting number of duplicate occurrence === | ||
=== | MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name {{kbd | key = id}} | ||
* {{kbd | key = SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ; }} | |||
Excel: | |||
* [http://superuser.com/questions/307837/how-to-count-number-of-repeat-occurrences microsoft excel - How to count number of repeat occurrences - Super User] {{exclaim}} long number issue: [https://superuser.com/questions/783840/countif-incorrectly-matches-long-number microsoft excel - Countif incorrectly matches long number - Super User] | |||
=== Other === | |||
* symbol e.g. data-mining or data_mining | |||
== Counting == | |||
* [[Count occurrences of a word in string]] | |||
* Count number of unique values | |||
** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial] | |||
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets] | |||
== Outlier / Anomaly detection == | |||
[[Anomaly detection]] | |||
== unique number of data values == | |||
If the data values were generated by different users, the unique number of data values should be larger than ____ | |||
== Data handling == | |||
=== Remove first, last or certain characters from text === | |||
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref> | * Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref> | ||
* Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo]) | * Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo]) | ||
== | === Remove leading and trailing spaces from text === | ||
Data modeling: [[Data type]] | * PHP: [http://php.net/manual/en/function.trim.php trim function] | ||
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet] | |||
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ] | |||
<pre> | |||
UPDATE `table` | |||
SET `column` = TRIM( `column` ) | |||
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` ); | |||
</pre> | |||
=== Remove other string look like whitespace === | |||
[https://en.wikipedia.org/wiki/Whitespace_character Whitespace character] | |||
# [http://www.fileformat.info/info/unicode/char/3000/index.htm IDEOGRAPHIC SPACE] (全形空白、全型空白, U+3000)<ref>[https://www.ptt.cc/bbs/PHP/M.1473522044.A.4FC.html Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊]</ref>: | |||
#* diaplay: <pre><nowiki><?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?></nowiki></pre> | |||
#* replace with space: <pre><nowiki><?php echo str_replace(json_decode('"\u3000"'), " ", $string);?></nowiki></pre> | |||
# ASCII Vertical Tab {{kbd | key=<nowiki>\v</nowiki>}} | |||
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}} | |||
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}} | |||
# [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] ({{kbd | key=<nowiki>nbsp;</nowiki>}}) Replace Non-breaking space with one whitespace using PHP: {{kbd | key=<nowiki>$result = str_replace("\xc2\xa0", ' ', $original_string);</nowiki>}}<ref>[https://stackoverflow.com/questions/40724543/how-to-replace-decoded-non-breakable-space-nbsp php - How to replace decoded Non-breakable space (nbsp) - Stack Overflow]</ref> | |||
Sentence spacing | |||
# [https://en.wikipedia.org/wiki/Sentence_spacing_in_digital_media Sentence spacing in digital media - Wikipedia] e.g. {{kbd | key=<nowiki>&Nbsp; &Ensp; &Emsp;</nowiki>}} | |||
How to display the Non-breaking space In PHP? | |||
<pre> | |||
$input = '12345678' . hex2bin('c2a0'); | |||
echo $input . PHP_EOL; | |||
## Result of above script: '12345678 ' (one whitespace at the end) | |||
echo bin2hex($input) . PHP_EOL; | |||
## Result of above script: 3132333435363738c2a0 | |||
echo bin2hex('12345678') . PHP_EOL; | |||
## Result of above script: 3132333435363738 (You mat notice the difference of script result is C2A0) | |||
</pre> | |||
How to display the Non-breaking space In MySQL? | |||
<pre> | |||
SELECT CONCAT('12345678', UNHEX('C2A0')) | |||
-- Result of above query: '12345678 ' (one whitespace at the end) | |||
SELECT HEX(CONCAT('12345678', UNHEX('C2A0'))) | |||
-- Result of above query: 3132333435363738C2A0 | |||
SELECT HEX('12345678') | |||
-- Result of above query: 3132333435363738 (You mat notice the difference of query result is C2A0) | |||
SELECT LENGTH('12345678') | |||
-- Result of above query: 8 | |||
SELECT LENGTH(CONCAT('12345678', UNHEX('C2A0'))) | |||
-- Result of above query: 10 | |||
</pre> | |||
== Remove control character == | |||
[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia] Using PHP to clean control character: | |||
<pre> | |||
$input = 'some string may contains control characters'; | |||
$replacement = ''; | |||
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input); | |||
</pre> | |||
=== Fix garbled message text === | |||
[[Fix garbled message text]] | |||
== Tools == | |||
* [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. " | |||
== Further reading == | |||
* [https://g0v.gitbooks.io/data-design/content/book/ch08-data-cleaning.html Data Cleaning | DATA + DESIGN 資料+設計] | |||
* Data modeling: [[Data type]] | |||
* [http://www.icpsr.umich.edu/icpsrweb/NAHDAP/support/faqs/2006/01/what-is-codebook What is a codebook?] | |||
* [https://www.facebook.com/groups/1210634969026548/permalink/1540468379376537/ 台灣R軟體Club] 「由於真實社會的資料千奇百怪,遺失值、空字串、空格、甚至是假的 "NA" (NA兩邊多了雙引號)、假的"NULL" (NULL兩邊多了雙引號) 。。。,其中一個顯著的差異是,社會實務應用的 R 程式中,通常會多出一堆檢查機制。 ... ...」{{access | date = 2017-12-18}} | |||
* [https://docs.google.com/spreadsheets/d/1UyumfruGjps7xqd4aSB-NKfmu54NIscI2OV9Uht8F7M/edit?usp=sharing 判斷 Excel 欄位值是否錯誤] 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值 | |||
* [[Web user behavior]] | |||
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}} | |||
== references == | == references == | ||
Line 219: | Line 563: | ||
[[Category:Data Science]] | [[Category:Data Science]] | ||
[[Category:MySQL]] | [[Category:MySQL]] | ||
[[Category:String manipulation]] |
Revision as of 17:28, 8 December 2021
Check list
- row count
- duplicate data
Check if field value was not fulfilled
By purpose
Purpose | Method (MySQL query syntax) | Value1: Fulfilled value what I want |
Value2: Fulfilled value NOT I want |
Value3: 0 |
Value4: NULL value |
Value5: Empty or white-spaces characters |
values were not fulfilled or empty (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 |
V | V | |||
values were not fulfilled or empty (contains 0) |
V | V | V | |||
values were fulfilled and non-empty (not contains 0) |
V | V | ||||
values were fulfilled and non-empty (contains 0) |
WHERE LENGTH(TRIM( column_name )) > 0 | V | V | V | ||
values (1) were not fulfilled or empty values (2) NOT I want (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 OR column_name LIKE 'values NOT I want' |
V | V | V | ||
values (1) were not fulfilled or empty values (2) NOT I want (contains 0) |
V | V | V | V |
by datatype
VARCHAR and allows NULL value
data type of column | possible column values | method1: find not fulfilled or empty values |
method2: find fulfilled and non-empty values |
method3: find NULL values |
method4: find not NULL values |
VARCHAR and allows NULL | fulfilled value ex:123 | V | V | ||
NULL type:null | V | V | |||
'NULL' type:string | |||||
0 | V | V | |||
EMPTY ex: '' or space(s) ' ' | V | V |
symbol V: means the column value will be able to find by means of the method
- method1:
- method2:
- SELECT * FROM `my_table` WHERE column_name > ''
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
- method3: SELECT * FROM `my_table` WHERE column_name IS NULL
- method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL
VARCHAR or numeric
data type of column | possible column values | method5: find values within the range |
method6: find values out of the range, empty & NULL values |
VARCHAR or numeric | values within the range ex: min ≤ value ≤ max | V | |
values out of range | V | ||
NULL | V | ||
EMPTY ex: '' or space ' ' | V |
- method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
- method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )
datetime and allows NULL value
possible column values
- 2024-03-28
- 00-00-00 00:00:00
- NULL
is null
Fill 0 if the value is NA or NULL
- MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
- (1) Using COALESCE() function to replace the NULL value with 0.
- (2) The case: 0/0 = null should be handled.
- MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
- python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"
Find whether a variable is NULL. online demo
- PHP is_null to find type:null null NOT type:string 'null'
- Google spreadsheet / Excel:
- ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
- If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]
- R: is.null(): R: The Null Object
- Excel[4]
Find whether a variable is NOT NULL
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;
Find whether a variable is NOT #N/A
- Excel: =NOT(ISERROR(cell_value))
javascript
check if field value was not fulfilled: NULL, empty value
NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)
- quick solution: find records with NULL value OR empty, space value
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
- find records with NULL value: (note: not #NULL!)
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
- EXCEL: =EXACT(A2, "NULL")
- find records with empty value: (not contains NULL value)
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
- MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [5]
- Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
- solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
- trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
- Using PHP
- empty() function to find 0, null, false, empty string, empty array values.
if(empty($var) && $var !== 0 && $var !== "0"){ .. }
to find null, false, empty string, empty array values BUT not 0.
- check if field value was NULL & not equal to some value
check if field value was fulfilled
length of string > 0
column value is not null or 0
- Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[6]
find if number or cell value is positive integer
- EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[7] online demo
check numeric range
- MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )
find NOT empty records means records without NULL or empty value:
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
- MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;
Data Validation
Validate the format of field value. Related page: Regular expression
Email contains @ symbol
- EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
- result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
- EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
- syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
- result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
- PHP: PHP FILTER_VALIDATE_EMAIL Filter
- "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net
Number precision in Excel
Number precision: 15 digits (Excel中最多的有效位數為15位)[8][9]
raw data: 1234567890123456 ->
- (numeric format 數值格式) 1234567890123450.00 losing precision
- (general format 通用格式) 1.23457E+15 losing precision
- (text format 文字格式) 1234567890123456
large numbers
- If the data was imported from Excel, you should notice the 15 digit precision issue.
Check if the column value is numeric
Possibile values
test data: 3.141592654 1.36184E+14 123,456.789 20740199601 346183773390240 ="5"
MySQL:
- Check if a value is integer e.g. 1234567
- Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'
[12]
- Check if a value is NOT integer
- Find the records which the value of `my_column` is NOT numeric values entirely
SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'
- Find the records which the value of `my_column` is NOT numeric values entirely
If the digit of number is known, the SQL syntax could be more specific
- The tax_id column is 8 digits only. Find the well-formatted tax_id records by using
SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'
PHP:
- is_numeric function
- is_int function
Excel & Google Sheets:
- Using ISNUMBER Function:
=INT(ISNUMBER(A1))
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
1.36184E+14
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers - Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g.
="5"
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
- Google Sheets only: Using REGEXMATCH, TRIM & CONCAT[13] functions:
=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
="5"
- Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g.
1.23E+16
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
Time data: Validate the data format
Time data: Data was generated in N years
Define the abnormal values of the time data (time series)
- Verfiy the data were generated in N years. Possible abnormal values:
0001-01 00:00:00
occurred in MySQLdatetime
type. - Verfiy the data were not newer than today
- Verfiy the year of data were not 1900 if the data were imported from Microsoft Excel file. Datevalue[14] was started from the year 1900 e.g.
1900/1/0
(converted time formatted value from 0),1900/1/1
(converted time formatted value from 1)
- Verfiy the value of data were not 0000-00-00 00:00:00
- Verfiy the diversity of data values e.g. Variance
Find the normal values:
- MySQL: Assume the data was generated in recent 10 years & not newer than today
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
- NOT
`my_time_column` < CURDATE()
。 ex:CURDATE()
is 2024-03-28. Which is the same with 2024-03-28 00:00:00
- NOT
SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
- MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
- You need to check the
SELECT CURRENT_TIMESTAMP);
if correct or not before you delete the abnormal data (timezone issue)
- You need to check the
Check if the date valid
- Excel: WEEKDAY function [15]
- legal results: 1 (Sunday) ~ 7 (Saturday)
- illegal result: #VALUE!
- PHP: php - Correctly determine if date string is a valid date in that format - Stack Overflow
Time data: Human birth year (age) data
Based on the existing record, the longest-living person who lived to 122[16].
MySQL query is as follows[17] where the column `birthday` is date type.
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122
Using UNIX_TIMESTAMP() function to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier 1970-01-01 00:00:00 UTC will all become zero.
String contains special characters
Duplicate data
Find duplicate data
EXCEL
Finding duplicate rows that differ in one column
- one column data: How to count duplicate values in a column in Excel? Using COUNTIF(range, criteria) [Last visited: 2015-08-25] or using Pivot Tables(樞紐分析表) to find the occurrence of value >= 2
Finding duplicate rows that differ in multiple columns
- two or multiple columns data: (approach 1) How to compare data in two columns to find duplicates in Excel [Last visited: 2015-06-16] It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000 (approach 2) Using CONCAT function to concatenate two or multiple columns data. And then use COUNTIF(range, criteria).
Cygwin
- uniq command on Cygwin of Win or Linux : uniq -d <file.txt> > <duplicated_items.txt>[18]
MySQL
Finding duplicate rows that differ in one column
Find the duplicated data for one column[19]
-- Generate test data. CREATE TABLE `table_name` ( `id` int(11) NOT NULL, `content` varchar(5) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO `table_name` (`id`, `content`) VALUES (1, 'apple'), (2, 'lemon'), (3, 'apple'); ALTER TABLE `table_name` ADD PRIMARY KEY (`id`); -- Find duplicated data SELECT `content`, COUNT(*) count FROM `table_name` GROUP BY `content` HAVING count > 1; SELECT tmp.* FROM ( SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` ) tmp WHERE tmp.count >1;
Finding duplicate rows that differ in multiple columns
Using CONCAT for multiple columns ex: column_1, column_2
SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' FROM `table_name` GROUP BY CONCAT( `column_1`, `column_2` ) HAVING count > 1;
or
SELECT tmp.key FROM ( SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' FROM `table_name` GROUP BY CONCAT( `column_1`, `column_2` ) ) tmp WHERE tmp.count >=2
other cases
For counting purpose: find the count of repeated id (type: int) between table_a and table_b
SELECT count(DISTINCT(id)) FROM table_a WHERE id IN ( SELECT DISTINCT(id) FROM table_b )
Google Spreadsheet
- Remove duplicates in Google Sheets 30 days free [Last visited: 2019-02-26]
- Power Tools for Google Spreadsheet [Last visited: 2019-02-26]
- Menu: Data -> Remove duplicates
Deduplicate
- EXCEL: Data Tools -> Remove Duplicates: Filter for unique values or remove duplicate values [Last visited: 2015-10-20]
- MySQL: select deduplicated records
- MySQL DISTINCT - Eliminate Duplicate Rows in a Result Set. Using GROUP_CONCAT to handle the multiple columns[20]
- SQL UNIQUE Constraint "Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table." Quoted from w3schools webpage.
- "UNION removes duplicates, whereas UNION ALL does not." source: sql - What is the difference between UNION and UNION ALL? - Stack Overflow
- MySQL: delete duplicated records
- GNU Coreutils: sort invocation OS: Linux , cygwin of Win . More details on Merge multiple plain text files.
- To remove duplicate lines:
- sort -us -o <output_unique.file> <input.file> in a large text file (GB)[21]
- cat <input.file> | grep <pattern> | sort | uniq Processes text line by line and prints the unique lines which match a specified pattern. Equal to these steps: (1) cat <input.file> | grep <pattern> > <tmp.file> (2) sort <tmp.file> | uniq
- Ignore first n line(s) & remove duplicate lines[22][23][24]
- (1) ignore first one line: (head -n 1 <file> && tail -n +2 <file> | sort -us) > newfile
- (2) ignore first two lines: (head -n 2 <file> && tail -n +3 <file> | sort -us) > newfile
- To remove duplicate lines:
- Google spreadsheet add-on: Remove Duplicates for Google Sheets help
Counting number of duplicate occurrence
MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name id
- SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ;
Excel:
- microsoft excel - How to count number of repeat occurrences - Super User long number issue: microsoft excel - Countif incorrectly matches long number - Super User
Other
- symbol e.g. data-mining or data_mining
Counting
- Count occurrences of a word in string
- Count number of unique values
Outlier / Anomaly detection
unique number of data values
If the data values were generated by different users, the unique number of data values should be larger than ____
Data handling
Remove first, last or certain characters from text
- Excel: using RIGHT[25] + LEN[26] functions [27]
- Excel: if the length of text was fixed after removed, you may try to use REPLACE[28] + LEN functions (demo)
Remove leading and trailing spaces from text
- PHP: trim function
- Excel: Excel formula: Remove leading and trailing spaces from text | Exceljet
- MySQL: Using TRIM() function & LENGTH() function
UPDATE `table` SET `column` = TRIM( `column` ) WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
Remove other string look like whitespace
- IDEOGRAPHIC SPACE (全形空白、全型空白, U+3000)[29]:
- diaplay:
<?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
- replace with space:
<?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
- diaplay:
- ASCII Vertical Tab \v
- ASCII Horizontal Tab (TAB) \t
- ASCII Backspace \b
- Non-breaking space (nbsp;) Replace Non-breaking space with one whitespace using PHP: $result = str_replace("\xc2\xa0", ' ', $original_string);[30]
Sentence spacing
- Sentence spacing in digital media - Wikipedia e.g. &Nbsp; &Ensp; &Emsp;
How to display the Non-breaking space In PHP?
$input = '12345678' . hex2bin('c2a0'); echo $input . PHP_EOL; ## Result of above script: '12345678 ' (one whitespace at the end) echo bin2hex($input) . PHP_EOL; ## Result of above script: 3132333435363738c2a0 echo bin2hex('12345678') . PHP_EOL; ## Result of above script: 3132333435363738 (You mat notice the difference of script result is C2A0)
How to display the Non-breaking space In MySQL?
SELECT CONCAT('12345678', UNHEX('C2A0')) -- Result of above query: '12345678 ' (one whitespace at the end) SELECT HEX(CONCAT('12345678', UNHEX('C2A0'))) -- Result of above query: 3132333435363738C2A0 SELECT HEX('12345678') -- Result of above query: 3132333435363738 (You mat notice the difference of query result is C2A0) SELECT LENGTH('12345678') -- Result of above query: 8 SELECT LENGTH(CONCAT('12345678', UNHEX('C2A0'))) -- Result of above query: 10
Remove control character
Control character - Wikipedia Using PHP to clean control character:
$input = 'some string may contains control characters'; $replacement = ''; $result = preg_replace('/[\x00-\x1F]/', $replacement, $input);
Fix garbled message text
Tools
- IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress. (Boop on the Mac App Store) " ... to paste some plain text and run some basic text operations on it. "
Further reading
- Data Cleaning | DATA + DESIGN 資料+設計
- Data modeling: Data type
- What is a codebook?
- 台灣R軟體Club 「由於真實社會的資料千奇百怪,遺失值、空字串、空格、甚至是假的 "NA" (NA兩邊多了雙引號)、假的"NULL" (NULL兩邊多了雙引號) 。。。,其中一個顯著的差異是,社會實務應用的 R 程式中,通常會多出一堆檢查機制。 ... ...」[Last visited: 2017-12-18]
- 判斷 Excel 欄位值是否錯誤 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
- Web user behavior
- 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium [Last visited: 2018-01-13]
references
- ↑ MySQL COALESCE() function - w3resource
- ↑ How to check if field is null or empty mysql? - Stack Overflow
- ↑ MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
- ↑ 如何判斷 Excel 儲存格的欄位值是 NULL
- ↑ SQL TRIM 函數 - 1Keydata SQL 語法教學
- ↑ Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage
- ↑ Check if number is an Integer
- ↑ Excel specifications and limits
- ↑ A2
- ↑ regex - Mysql REGEXP with . and numbers only - Stack Overflow
- ↑ How do I check to see if a value is an integer in MySQL? - Stack Overflow
- ↑ How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)
- ↑ GOOGLE 試算表: 數字轉成文字
- ↑ DATEVALUE 函數 - Office 支援
- ↑ Check if a valid date?
- ↑ Maximum life span - Wikipedia
- ↑ sql - Calculate Age in MySQL (InnoDb) - Stack Overflow
- ↑ shell - How to print only the duplicate values from a text file? - Unix & Linux Stack Exchange
- ↑ Finding duplicate values in MySQL - Stack Overflow
- ↑ sql - MySQL SELECT DISTINCT multiple columns - Stack Overflow
- ↑ linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange
- ↑ sorting - Is there a way to ignore header lines in a UNIX sort? - Stack Overflow
- ↑ 命令執行的判斷依據: ; , &&, ||
- ↑ Linux tail command help and examples
- ↑ RIGHT、RIGHTB 函數 - Excel - Office.com
- ↑ LEN、LENB 函數 - Excel - Office.com
- ↑ How to remove first, last or certain characters from text in Excel?
- ↑ REPLACE、REPLACEB 函數 - Excel - Office.com
- ↑ Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊
- ↑ php - How to replace decoded Non-breakable space (nbsp) - Stack Overflow