Data cleaning: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
 
(217 intermediate revisions by the same user not shown)
Line 1: Line 1:


== is null ==
== Check list ==
Finds whether a variable is NULL. [http://ideone.com/tKv2gQ online demo]
 
* PHP [http://tw2.php.net/is_null is_null]
* Row count: The number of data entries is a fundamental item for data verification and is easy to observe and check. For instance, one can compare the number of entries displayed on a webpage to the number of entries after exporting to a CSV file.
* Duplicate data
 
== Check if field value was not fulfilled  ==
=== By purpose ===
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td style="background-color: #777; color: #ffffff;">Purpose</td>
    <td style="background-color: #777; color: #ffffff; width: 350px;">Method (MySQL query syntax)</td>
    <td style="vertical-align: top;">Value1: <br />Fulfilled value what I want</td>
    <td style="vertical-align: top;">Value2: <br />Fulfilled value NOT I want</td>
    <td style="vertical-align: top;">Value3: <br />0</td>
    <td style="vertical-align: top;">Value4: <br />NULL value</td>
    <td style="vertical-align: top;">Value5: <br />Empty or white-spaces characters</td>
  </tr>
  <tr>
    <td>values were not fulfilled or empty <br />(not contains 0)</td>
    <td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0}}</td>
    <td></td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values were not fulfilled or empty<br />(contains 0)</td>
    <td></td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values were fulfilled and non-empty<br />(not contains 0)</td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>values were fulfilled and non-empty<br />(contains 0)</td>
    <td>{{kbd | key=<nowiki>WHERE LENGTH(TRIM( column_name )) > 0</nowiki>}}</td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>values (1) were not fulfilled or empty values (2) NOT I want <br />(not contains 0)</td>
    <td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0 <br /> OR column_name LIKE 'values NOT I want'}}</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values (1) were not fulfilled or empty values (2) NOT I want <br />(contains 0)</td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
  </tr>
</table>
 
=== By datatype ===
==== VARCHAR and NOT allows NULL value ====
Using NULLIF() function<ref>[https://www.w3schools.com/sql/func_mysql_nullif.asp MySQL NULLIF() Function]</ref>
 
SQL query:
<pre>
SELECT NULLIF(TRIM(`my_column`), "")
</pre>
 
Example result:
 
<pre>
SELECT NULLIF(null, "");
-- return NULL
 
SELECT NULLIF("", "");
-- return NULL
 
SELECT NULLIF(TRIM("  "), "");
-- return NULL
 
SELECT NULLIF(TRIM("not empty string  "), "");
-- return "not empty string"
 
</pre>
 
 
==== VARCHAR and allows NULL value ====
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td>data type of column</td>
    <td>possible column values</td>
    <td>method1: <br />find not fulfilled or empty values</td>
    <td>method2: <br />find fulfilled and non-empty values</td>
    <td>method3: <br />find NULL values</td>
    <td>method4: <br />find not NULL values</td>
  </tr>
  <tr>
    <td rowspan="5">{{kbd | key = VARCHAR}} and allows {{kbd | key = NULL}}</td>
    <td>fulfilled value ex:123</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>{{kbd | key = NULL}} type:null</td>
    <td>V</td>
    <td></td>
    <td>V</td>
    <td></td>
  </tr>
  <tr>
    <td>{{kbd | key = 'NULL'}} type:string</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>0</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>EMPTY ex: <nowiki>'' or space(s) '  '</nowiki></td>
    <td>V</td>
    <td></td>
    <td></td>
    <td>V</td>
  </tr>
</table>
 
symbol V: means the column value will be able to find by means of the method
 
 
* method1:
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''</nowiki>}}<ref>[http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php MySQL COALESCE() function - w3resource]</ref>
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0</nowiki>}}
**{{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''</nowiki>}}<ref>[http://stackoverflow.com/questions/17832906/how-to-check-if-field-is-null-or-empty-mysql How to check if field is null or empty mysql? - Stack Overflow]</ref>
* method2:
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name > ''</nowiki>}}
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0</nowiki>}}
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0</nowiki>}}
* method3: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL</nowiki>}}
* method4: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NOT NULL</nowiki>}}
 
==== VARCHAR or numeric ====
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td>data type of column</td>
    <td>possible column values</td>
    <td>method5: <br />find values within the range</td>
    <td>method6: <br />find values out of the range, empty & NULL values</td>
  </tr>
  <tr>
    <td rowspan="4">{{kbd | key = VARCHAR}} or numeric</td>
    <td>values within the range ex:  min ≤ value ≤ max</td>
    <td>V</td>
    <td></td>
  </tr>
  <tr>
    <td>values out of range</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>{{kbd | key = NULL}}</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>EMPTY ex: <nowiki>'' or space '  '</nowiki></td>
    <td></td>
    <td>V</td>
  </tr>
</table>
 
* method5: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max</nowiki>}}
* method6: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE (  (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max)  )</nowiki>}}
 
==== datetime and allows NULL value ====
possible column values
# {{kbd | key = {{Template:Today}}}}
# {{kbd | key = 00-00-00 00:00:00}}
# {{kbd | key = NULL}}
 
=== is null ===
Fill 0 if the value is NA or NULL
* MySQL SQL syntax: COALESCE(): {{kbd | key = SELECT COALESCE(column_name, 0) }} or {{kbd | key = SELECT COALESCE(column_name, 'other_filled_value') }}
** (1) Using [http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php COALESCE() function] to replace the NULL value with 0.
** (2) The case: 0/0 = null should be handled.
* MySQL SQL syntax: combined IF() & ISNULL():{{kbd | key = SELECT IF(ISNULL((column_name), 0, column_name) }} or  {{kbd | key = SELECT IF(ISNULL((column_name), 'other_filled_value', column_name) }}
* python: [http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html pandas.DataFrame.fillna — pandas 0.16.0 documentation] "Fill NA/NaN values using the specified method"
 
Find whether a variable is NULL. [http://ideone.com/tKv2gQ online demo]
* PHP [http://tw2.php.net/is_null is_null] to find type:null null NOT type:string <nowiki>'null'</nowiki> {{exclaim}}
* Google spreadsheet / Excel:  
* Google spreadsheet / Excel:  
** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}}
** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}}
** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use  {{kbd | key = <nowiki>COUNTIF(value, "NULL")</nowiki>}} or  {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}}
** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use  {{kbd | key = <nowiki>COUNTIF(value, "NULL")</nowiki>}} or  {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}}
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref>  
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref>  
* R: is.null(): [https://stat.ethz.ch/R-manual/R-devel/library/base/html/NULL.html R: The Null Object]
* Excel<ref>[https://errerrors.blogspot.com/2018/09/check-if-cell-value-is-null-in-excel.html 如何判斷 Excel 儲存格的欄位值是 NULL]</ref>


Finds whether a variable is NOT NULL
Find whether a variable is NOT NULL
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}}
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}}


== check if field value was not fulfilled: NULL, empty value ==
Find whether a variable is NOT {{kbd | key = #N/A}}
{{exclaim}} NOT include those data which its field value  fulfilled with default value automatically
* Excel: {{kbd | key = =NOT(ISERROR(cell_value))}}


([http://sqlfiddle.com/#!2/9b01e/3/0 demo on sqlfiddle])
javascript
* [http://www.w3schools.com/jsref/jsref_isnan.asp JavaScript isNaN() Function]
 
=== check if field value was not fulfilled: NULL, empty value ===
{{exclaim}} NOT include those data which its field value  fulfilled with default value automatically ([http://sqlfiddle.com/#!2/9b01e/3/0 demo on sqlfiddle])
 
# {{Gd}} quick solution: find records with {{kbd | key = NULL}} value OR empty, space value
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;}}
# find records with {{kbd | key = NULL}} value: (note: not {{kbd | key = #NULL!}})
# find records with {{kbd | key = NULL}} value: (note: not {{kbd | key = #NULL!}})
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}}  
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}}  
Line 20: Line 238:
# find records with empty value: (not contains {{kbd | key = NULL}} value)
# find records with empty value: (not contains {{kbd | key = NULL}} value)
#* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0;}} {{Exclaim}} SQL query {{kbd | key =SELECT * FROM table_name WHERE column_name IS NOT NULL}} includes empty value  
#* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0;}} {{Exclaim}} SQL query {{kbd | key =SELECT * FROM table_name WHERE column_name IS NOT NULL}} includes empty value  
# find NOT empty records means records without NULL or empty value:
#* MS SQL Server: {{kbd | key = SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0;}} <ref>[http://www.1keydata.com/tw/sql/sql-trim.html SQL TRIM 函數 - 1Keydata SQL 語法教學]</ref>
#* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;</nowiki>}}
#* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}}
#  Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...  
#  Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...  
#* solution: step1: Replace the year > 100 from this year with empty value at EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2))</nowiki>}} (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
#* solution: step1: Replace the year > 100 from this year with empty value at EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2))</nowiki>}} (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
#* trivial approach : EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2))</nowiki>}}  {{exclaim}} this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
#* trivial approach : EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2))</nowiki>}}  {{exclaim}} this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
# Using PHP [http://php.net/manual/en/function.empty.php empty()] function to find 0, null, false, empty string, empty array values.
# Using PHP  
#* [http://php.net/manual/en/function.empty.php empty()] function to find 0, null, false, empty string, empty array values.
#* <code>if(empty($var) && $var !== 0 && $var !== "0"){ .. } </code> to find null, false, empty string, empty array values BUT not 0.
# check if field value was NULL & not equal to some value
#* MySQL: [http://stackoverflow.com/questions/17097890/mysql-select-on-fields-containing-null-values MySql select on fields containing null values - Stack Overflow]


== check if field value was fulfilled ==
=== check if field value was fulfilled ===
length of string > 0
length of string > 0
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;}} [http://sqlfiddle.com/#!9/e0e5f/1/0 demo][http://ideone.com/rt6D2A]
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;}} [http://sqlfiddle.com/#!9/e0e5f/1/0 demo][http://ideone.com/rt6D2A]
Line 41: Line 261:
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE column_name BETWEEN ''min_number'' AND ''max_number'';}} the value >= ''min_number'' AND value <= ''max_number''  ( ''min_number'' ≤ value ≤ ''max_number'' )
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE column_name BETWEEN ''min_number'' AND ''max_number'';}} the value >= ''min_number'' AND value <= ''max_number''  ( ''min_number'' ≤ value ≤ ''max_number'' )


== verify the format of field value ==
find NOT empty records means records without NULL or empty value:
email:  
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;</nowiki>}}
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}}
 
== Data Validation ==
Validate the format of field value. Related page: [[Regular expression]]
 
=== Verify the strings are in valid email format ===
Rule: Email contains @ symbol
 
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
Line 51: Line 279:
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]


== duplicate data ==
=== Verify the strings are in valid url format ===
* EXCEL: [http://www.extendoffice.com/documents/excel/1499-count-duplicate-values-in-column.html How to count duplicate values in a column in Excel?]
Rule: Begin with http or https
* PHP: [http://php.net/manual/en/function.array-unique.php PHP: array_unique], [http://php.net/manual/en/function.array-intersect.php PHP: array_intersect]


== outlier ==
* Google spreadsheet {{kbd | key =<nowiki>=REGEXMATCH(A1, "^http(s?)")</nowiki>}}
(left blank intentionally)


== data handling ==
=== Number precision in Excel ===
=== remove first, last or certain characters from text ===
Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref>
 
raw data: 1234567890123456
->
* (numeric format 數值格式) 1234567890123450.00 {{exclaim}} losing precision
* (general format 通用格式) 1.23457E+15 {{exclaim}} losing precision
* (text format 文字格式) 1234567890123456
 
large numbers
* If the data was imported from Excel, you should notice the 15 digit precision issue.
 
=== Verify the column values are numeric ===
 
Possible values
 
<pre>
test data:
3.141592654
1.36184E+14
123,456.789
20740199601
346183773390240
="5"
</pre>
 
==== Verify if value is number in MySQL ====
MySQL:
 
* Check if a value is integer e.g. 1234567
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref>
 
* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
** {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref>
 
* Check if a value is NOT integer
** Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}}
 
 
If the digit of number is known, the SQL syntax could be more specific
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}}
 
==== Verify if value is number in PHP ====
 
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function
* [https://www.php.net/manual/en/function.is-int.php is_int] function
 
==== Verify if value is number in Excel or Google sheet ====
Excel & [https://www.google.com/sheets/about/ Google Sheets]:
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers in [https://en.wikipedia.org/wiki/Scientific_notation scientific (exponential) notation] e.g. {{code | code = <nowiki>1.36184E+14</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
 
*  Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.com/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
 
=== Time data: Validate the data format ===
[[Validate the datetime value]]
 
=== Time data: Data was generated in N years ===
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
* Verify the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. e.g.
 
* Verify the data were not newer than today
 
* Verify the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.
** {{code | code = 1900/1/0}} (converted time formatted value from 0),
** {{code | code = 1900/1/1}} (converted time formatted value from 1)
 
* Verify the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]
 
Find the normal values:
* MySQL: Assume the data was generated in recent 10 years & not newer than today
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` < CURDATE() + 1);}}
*** {{exclaim}} NOT {{code | code =  `my_time_column` < CURDATE() }}。 ex: {{code | code = CURDATE() }} is {{CURRENTYEAR}}-{{CURRENTMONTH}}-{{CURRENTDAY2}}. Which is the same with {{CURRENTYEAR}}-{{CURRENTMONTH}}-{{CURRENTDAY2}} 00:00:00
** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 )  AND  ( `my_time_column` < CURDATE() + 1); }}
* MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
 
Abnormal values
* {{code | code = 1970-01-01 08:00:00}} (converted time formatted value from {{code | code =August 3, 2017}}) caused by the string contains special characters e.g. [https://en.wikipedia.org/wiki/Left-to-right_mark left-to-right mark (LRM) ]
 
Check if the date valid
* Excel: [https://support.office.com/en-us/article/WEEKDAY-function-60e44483-2ed1-439f-8bd0-e404c190949a?ui=en-US&rs=en-US&ad=US WEEKDAY function] <ref>[http://www.mrexcel.com/forum/excel-questions/322203-check-if-valid-date.html Check if a valid date?]</ref>
** legal results: {{kbd | key=1}} (Sunday) ~ {{kbd | key=7}} (Saturday)
** illegal result: {{kbd | key= <nowiki>#VALUE!</nowiki>}}
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow]
 
=== Time data: Human birth year (age) data ===
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>.
 
MySQL query is as follows<ref>[https://stackoverflow.com/questions/5773405/calculate-age-in-mysql-innodb sql - Calculate Age in MySQL (InnoDb) - Stack Overflow]</ref> where the column {{kbd | key=<nowiki>`birthday`</nowiki>}} is {{kbd | key=<nowiki>date</nowiki>}} type.
<pre>
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122
</pre>
 
Using [https://www.w3resource.com/mysql/date-and-time-functions/mysql-unix_timestamp-function.php UNIX_TIMESTAMP() function] to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier {{kbd | key=<nowiki>1970-01-01 00:00:00 UTC</nowiki>}} will all become zero.
 
 
=== String contains special characters ===
* [[Byte order mark]] (BOM)
* [[Return symbol]]
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]<ref>[https://stackoverflow.com/questions/1930009/how-to-strip-unicode-chars-left-to-right-mark-from-a-string-in-php regex - How to strip unicode chars (LEFT_TO_RIGHT_MARK) from a string in php - Stack Overflow]</ref>
 
== File Validation ==
 
=== Verify the file format of downloaded file ===
* PDF file format: [https://stackoverflow.com/questions/16152583/tell-if-a-file-is-pdf-in-bash Tell if a file is PDF in bash - Stack Overflow]
 
== Find and remove duplicates ==
[[Find and remove duplicates]] in Excel/BASH/MySQL/PHP
 
== Counting ==
 
* [[Count occurrences of a word in string]]
* Count number of unique values
** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial]
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets]
 
== Outlier / Anomaly detection ==
[[Anomaly detection]]
 
== unique number of data values ==
If the data values were generated by different users, the unique number of data values should be larger than ____
 
== Data handling ==
=== Remove first, last or certain characters from text ===
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: if the text length will be removed was fixed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
* Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
 
=== Remove leading and trailing spaces from text ===
* PHP: [http://php.net/manual/en/function.trim.php trim function]
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet]
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ]
<pre>
UPDATE `table`
SET `column` = TRIM( `column` )
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
</pre>
 
=== Remove other string look like whitespace ===
[https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]
# [http://www.fileformat.info/info/unicode/char/3000/index.htm IDEOGRAPHIC SPACE] (全形空白、全型空白, U+3000)<ref>[https://www.ptt.cc/bbs/PHP/M.1473522044.A.4FC.html Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊]</ref>:
#* diaplay: <pre><nowiki><?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?></nowiki></pre>
#* replace with space: <pre><nowiki><?php echo str_replace(json_decode('"\u3000"'), " ", $string);?></nowiki></pre>
# ASCII Vertical Tab {{kbd | key=<nowiki>\v</nowiki>}}
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}}
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}}
# [[Remove non breaking space]]
 
== Remove control character ==
[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia] Using PHP to clean control character:
<pre>
$input = 'some string may contains control characters';
$replacement = '';
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input);
</pre>
 
== Remove tracking parameter from link ==
[[Remove tracking parameter from link]]
 
=== Fix garbled message text ===
[[Fix garbled message text]]
 
== Tools ==
* {{Mac}} [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 ‎Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. "
* [https://gchq.github.io/CyberChef/ CyberChef] (source code available on [https://github.com/gchq/CyberChef github])  The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis


== related pages ==
== Further reading ==
Data modeling: [[Data type]]
* [https://g0v.gitbooks.io/data-design/content/book/ch08-data-cleaning.html Data Cleaning | DATA + DESIGN 資料+設計]
* Data modeling: [[Data type]]
* [http://www.icpsr.umich.edu/icpsrweb/NAHDAP/support/faqs/2006/01/what-is-codebook What is a codebook?]
* [https://www.facebook.com/groups/1210634969026548/permalink/1540468379376537/ 台灣R軟體Club] 「由於真實社會的資料千奇百怪,遺失值、空字串、空格、甚至是假的 "NA" (NA兩邊多了雙引號)、假的"NULL" (NULL兩邊多了雙引號) 。。。,其中一個顯著的差異是,社會實務應用的 R 程式中,通常會多出一堆檢查機制。 ... ...」{{access | date = 2017-12-18}}
* [https://docs.google.com/spreadsheets/d/1UyumfruGjps7xqd4aSB-NKfmu54NIscI2OV9Uht8F7M/edit?usp=sharing  判斷 Excel 欄位值是否錯誤] 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
* [[Web user behavior]]
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}}


== references ==
== References ==
<references/>
<references/>


[[Category:Spreadsheet]]
[[Category:Spreadsheet]] [[Category:Excel]]
[[Category:Data_hygiene]]
[[Category:Data_hygiene]]
[[Category:Data Science]]
[[Category:Data Science]]
[[Category:MySQL]]
[[Category:String manipulation]]

Latest revision as of 13:39, 8 April 2024

Check list[edit]

  • Row count: The number of data entries is a fundamental item for data verification and is easy to observe and check. For instance, one can compare the number of entries displayed on a webpage to the number of entries after exporting to a CSV file.
  • Duplicate data

Check if field value was not fulfilled[edit]

By purpose[edit]

Purpose Method (MySQL query syntax) Value1:
Fulfilled value what I want
Value2:
Fulfilled value NOT I want
Value3:
0
Value4:
NULL value
Value5:
Empty or white-spaces characters
values were not fulfilled or empty
(not contains 0)
WHERE column_name IS NULL
OR LENGTH(TRIM( column_name )) = 0
V V
values were not fulfilled or empty
(contains 0)
V V V
values were fulfilled and non-empty
(not contains 0)
V V
values were fulfilled and non-empty
(contains 0)
WHERE LENGTH(TRIM( column_name )) > 0 V V V
values (1) were not fulfilled or empty values (2) NOT I want
(not contains 0)
WHERE column_name IS NULL
OR LENGTH(TRIM( column_name )) = 0
OR column_name LIKE 'values NOT I want'
V V V
values (1) were not fulfilled or empty values (2) NOT I want
(contains 0)
V V V V

By datatype[edit]

VARCHAR and NOT allows NULL value[edit]

Using NULLIF() function[1]

SQL query:

SELECT NULLIF(TRIM(`my_column`), "")

Example result:

SELECT NULLIF(null, "");
-- return NULL

SELECT NULLIF("", "");
-- return NULL

SELECT NULLIF(TRIM("   "), "");
-- return NULL

SELECT NULLIF(TRIM("not empty string   "), "");
-- return "not empty string"


VARCHAR and allows NULL value[edit]

data type of column possible column values method1:
find not fulfilled or empty values
method2:
find fulfilled and non-empty values
method3:
find NULL values
method4:
find not NULL values
VARCHAR and allows NULL fulfilled value ex:123 V V
NULL type:null V V
'NULL' type:string
0 V V
EMPTY ex: '' or space(s) ' ' V V

symbol V: means the column value will be able to find by means of the method


  • method1:
    • SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''[2]
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''[3]
  • method2:
    • SELECT * FROM `my_table` WHERE column_name > ''
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
  • method3: SELECT * FROM `my_table` WHERE column_name IS NULL
  • method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL

VARCHAR or numeric[edit]

data type of column possible column values method5:
find values within the range
method6:
find values out of the range, empty & NULL values
VARCHAR or numeric values within the range ex: min ≤ value ≤ max V
values out of range V
NULL V
EMPTY ex: '' or space ' ' V
  • method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
  • method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )

datetime and allows NULL value[edit]

possible column values

  1. 2024-05-05
  2. 00-00-00 00:00:00
  3. NULL

is null[edit]

Fill 0 if the value is NA or NULL

  • MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
    • (1) Using COALESCE() function to replace the NULL value with 0.
    • (2) The case: 0/0 = null should be handled.
  • MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
  • python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"

Find whether a variable is NULL. online demo

  • PHP is_null to find type:null null NOT type:string 'null' Icon_exclaim.gif
  • Google spreadsheet / Excel:
    • ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
    • If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[4]
  • R: is.null(): R: The Null Object
  • Excel[5]

Find whether a variable is NOT NULL

  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;

Find whether a variable is NOT #N/A

  • Excel: =NOT(ISERROR(cell_value))

javascript

check if field value was not fulfilled: NULL, empty value[edit]

Icon_exclaim.gif NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)

  1. Good.gif quick solution: find records with NULL value OR empty, space value
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
  2. find records with NULL value: (note: not #NULL!)
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
    • EXCEL: =EXACT(A2, "NULL")
  3. find records with empty value: (not contains NULL value)
    • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; Icon_exclaim.gif SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
    • MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [6]
  4. Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
    • solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
    • trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) Icon_exclaim.gif this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
  5. Using PHP
    • empty() function to find 0, null, false, empty string, empty array values.
    • if(empty($var) && $var !== 0 && $var !== "0"){ .. } to find null, false, empty string, empty array values BUT not 0.
  6. check if field value was NULL & not equal to some value

check if field value was fulfilled[edit]

length of string > 0

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0; demo[1]

column value is not null or 0

  • Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[7]

find if number or cell value is positive integer

  • EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[8] online demo

check numeric range

  • MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )

find NOT empty records means records without NULL or empty value:

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
  • MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;

Data Validation[edit]

Validate the format of field value. Related page: Regular expression

Verify the strings are in valid email format[edit]

Rule: Email contains @ symbol

  • EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
    • result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
  • EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
    • syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
    • result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
  • PHP: PHP FILTER_VALIDATE_EMAIL Filter
    • "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net

Verify the strings are in valid url format[edit]

Rule: Begin with http or https

  • Google spreadsheet =REGEXMATCH(A1, "^http(s?)")

Number precision in Excel[edit]

Number precision: 15 digits (Excel中最多的有效位數為15位)[9][10]

raw data: 1234567890123456 ->

  • (numeric format 數值格式) 1234567890123450.00 Icon_exclaim.gif losing precision
  • (general format 通用格式) 1.23457E+15 Icon_exclaim.gif losing precision
  • (text format 文字格式) 1234567890123456

large numbers

  • If the data was imported from Excel, you should notice the 15 digit precision issue.

Verify the column values are numeric[edit]

Possible values

test data:
3.141592654
1.36184E+14
123,456.789
20740199601
346183773390240
="5"

Verify if value is number in MySQL[edit]

MySQL:

  • Check if a value is integer e.g. 1234567
    • Find the records which the value of `my_column` is numeric values entirely SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'[11][12]
  • Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
    • SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'[13]
  • Check if a value is NOT integer
    • Find the records which the value of `my_column` is NOT numeric values entirely SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'


If the digit of number is known, the SQL syntax could be more specific

  • The tax_id column is 8 digits only. Find the well-formatted tax_id records by using SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'

Verify if value is number in PHP[edit]

Verify if value is number in Excel or Google sheet[edit]

Excel & Google Sheets:

  • Using ISNUMBER Function: =INT(ISNUMBER(A1))
    • Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g. 1.36184E+14 (3) Decimal numbers e.g. 3.141592654 (4) Negative numbers
    • Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. ="5"
  • Google Sheets only: Using REGEXMATCH, TRIM & CONCAT[14] functions: =IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)
    • Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. ="5"
    • Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. 1.23E+16 (3) Decimal numbers e.g. 3.141592654 (4) Negative numbers

Time data: Validate the data format[edit]

Validate the datetime value

Time data: Data was generated in N years[edit]

Define the abnormal values of the time data (time series)

  • Verify the data were generated in N years. Possible abnormal values: 0001-01 00:00:00 occurred in MySQL datetime type. e.g.
  • Verify the data were not newer than today
  • Verify the year of data were not 1900 if the data were imported from Microsoft Excel file. Datevalue[15] was started from the year 1900 e.g.
    • 1900/1/0 (converted time formatted value from 0),
    • 1900/1/1 (converted time formatted value from 1)
  • Verify the diversity of data values e.g. Variance

Find the normal values:

  • MySQL: Assume the data was generated in recent 10 years & not newer than today
    • SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
      • Icon_exclaim.gif NOT `my_time_column` < CURDATE()。 ex: CURDATE() is 2024-05-05. Which is the same with 2024-05-05 00:00:00
    • SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
  • MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
    • SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
      • You need to check the SELECT CURRENT_TIMESTAMP); if correct or not before you delete the abnormal data (timezone issue)

Abnormal values

  • 1970-01-01 08:00:00 (converted time formatted value from August 3, 2017) caused by the string contains special characters e.g. left-to-right mark (LRM)

Check if the date valid

Time data: Human birth year (age) data[edit]

Based on the existing record, the longest-living person who lived to 122[17].

MySQL query is as follows[18] where the column `birthday` is date type.

WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122

Using UNIX_TIMESTAMP() function to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier 1970-01-01 00:00:00 UTC will all become zero.


String contains special characters[edit]

File Validation[edit]

Verify the file format of downloaded file[edit]

Find and remove duplicates[edit]

Find and remove duplicates in Excel/BASH/MySQL/PHP

Counting[edit]

Outlier / Anomaly detection[edit]

Anomaly detection

unique number of data values[edit]

If the data values were generated by different users, the unique number of data values should be larger than ____

Data handling[edit]

Remove first, last or certain characters from text[edit]

  • Excel: using RIGHT[20] + LEN[21] functions [22]
  • Excel: if the length of text was fixed after removed, you may try to use REPLACE[23] + LEN functions (demo)

Remove leading and trailing spaces from text[edit]

UPDATE `table` 
SET `column` = TRIM( `column` ) 
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );

Remove other string look like whitespace[edit]

Whitespace character

  1. IDEOGRAPHIC SPACE (全形空白、全型空白, U+3000)[24]:
    • diaplay:
      <?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
    • replace with space:
      <?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
  2. ASCII Vertical Tab \v
  3. ASCII Horizontal Tab (TAB) \t
  4. ASCII Backspace \b
  5. Remove non breaking space

Remove control character[edit]

Control character - Wikipedia Using PHP to clean control character:

$input = 'some string may contains control characters';
$replacement = '';
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input);

Remove tracking parameter from link[edit]

Remove tracking parameter from link

Fix garbled message text[edit]

Fix garbled message text

Tools[edit]

Further reading[edit]

References[edit]