Editing Data cleaning

Jump to navigation Jump to search
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then publish the changes below to finish undoing the edit.

Latest revision Your text
Line 1: Line 1:


== Check list ==
== is null ==
 
Finds whether a variable is NULL
* Row count: The number of data entries is a fundamental item for data verification and is easy to observe and check. For instance, one can compare the number of entries displayed on a webpage to the number of entries after exporting to a CSV file.
* PHP [http://tw2.php.net/is_null is_null]
* Duplicate data
* Google spreadsheet:  
 
== Check if field value was not fulfilled  ==
=== By purpose ===
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td style="background-color: #777; color: #ffffff;">Purpose</td>
    <td style="background-color: #777; color: #ffffff; width: 350px;">Method (MySQL query syntax)</td>
    <td style="vertical-align: top;">Value1: <br />Fulfilled value what I want</td>
    <td style="vertical-align: top;">Value2: <br />Fulfilled value NOT I want</td>
    <td style="vertical-align: top;">Value3: <br />0</td>
    <td style="vertical-align: top;">Value4: <br />NULL value</td>
    <td style="vertical-align: top;">Value5: <br />Empty or white-spaces characters</td>
  </tr>
  <tr>
    <td>values were not fulfilled or empty <br />(not contains 0)</td>
    <td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0}}</td>
    <td></td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values were not fulfilled or empty<br />(contains 0)</td>
    <td></td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values were fulfilled and non-empty<br />(not contains 0)</td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>values were fulfilled and non-empty<br />(contains 0)</td>
    <td>{{kbd | key=<nowiki>WHERE LENGTH(TRIM( column_name )) > 0</nowiki>}}</td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>values (1) were not fulfilled or empty values (2) NOT I want <br />(not contains 0)</td>
    <td>{{kbd | key=WHERE column_name IS NULL <br /> OR LENGTH(TRIM( column_name )) = 0 <br /> OR column_name LIKE 'values NOT I want'}}</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
    <td>V</td>
  </tr>
  <tr>
    <td>values (1) were not fulfilled or empty values (2) NOT I want <br />(contains 0)</td>
    <td></td>
    <td></td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
    <td>V</td>
  </tr>
</table>
 
=== By datatype ===
==== VARCHAR and NOT allows NULL value ====
Using NULLIF() function<ref>[https://www.w3schools.com/sql/func_mysql_nullif.asp MySQL NULLIF() Function]</ref>
 
SQL query:
<pre>
SELECT NULLIF(TRIM(`my_column`), "")
</pre>
 
Example result:
 
<pre>
SELECT NULLIF(null, "");
-- return NULL
 
SELECT NULLIF("", "");
-- return NULL
 
SELECT NULLIF(TRIM("  "), "");
-- return NULL
 
SELECT NULLIF(TRIM("not empty string  "), "");
-- return "not empty string"
 
</pre>
 
 
==== VARCHAR and allows NULL value ====
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td>data type of column</td>
    <td>possible column values</td>
    <td>method1: <br />find not fulfilled or empty values</td>
    <td>method2: <br />find fulfilled and non-empty values</td>
    <td>method3: <br />find NULL values</td>
    <td>method4: <br />find not NULL values</td>
  </tr>
  <tr>
    <td rowspan="5">{{kbd | key = VARCHAR}} and allows {{kbd | key = NULL}}</td>
    <td>fulfilled value ex:123</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>{{kbd | key = NULL}} type:null</td>
    <td>V</td>
    <td></td>
    <td>V</td>
    <td></td>
  </tr>
  <tr>
    <td>{{kbd | key = 'NULL'}} type:string</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>0</td>
    <td></td>
    <td>V</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>EMPTY ex: <nowiki>'' or space(s) '  '</nowiki></td>
    <td>V</td>
    <td></td>
    <td></td>
    <td>V</td>
  </tr>
</table>
 
symbol V: means the column value will be able to find by means of the method
 
 
* method1:
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''</nowiki>}}<ref>[http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php MySQL COALESCE() function - w3resource]</ref>
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0</nowiki>}}
**{{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''</nowiki>}}<ref>[http://stackoverflow.com/questions/17832906/how-to-check-if-field-is-null-or-empty-mysql How to check if field is null or empty mysql? - Stack Overflow]</ref>
* method2:
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name > ''</nowiki>}}
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0</nowiki>}}
** {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0</nowiki>}}
* method3: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NULL</nowiki>}}
* method4: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name IS NOT NULL</nowiki>}}
 
==== VARCHAR or numeric ====
<table border="1" style="width: 100%">
  <tr style="background-color: #555555; color: #ffffff;">
    <td>data type of column</td>
    <td>possible column values</td>
    <td>method5: <br />find values within the range</td>
    <td>method6: <br />find values out of the range, empty & NULL values</td>
  </tr>
  <tr>
    <td rowspan="4">{{kbd | key = VARCHAR}} or numeric</td>
    <td>values within the range ex:  min ≤ value ≤ max</td>
    <td>V</td>
    <td></td>
  </tr>
  <tr>
    <td>values out of range</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>{{kbd | key = NULL}}</td>
    <td></td>
    <td>V</td>
  </tr>
  <tr>
    <td>EMPTY ex: <nowiki>'' or space '  '</nowiki></td>
    <td></td>
    <td>V</td>
  </tr>
</table>
 
* method5: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max</nowiki>}}
* method6: {{kbd | key = <nowiki>SELECT * FROM `my_table` WHERE (  (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max)  )</nowiki>}}
 
==== datetime and allows NULL value ====
possible column values
# {{kbd | key = {{Template:Today}}}}
# {{kbd | key = 00-00-00 00:00:00}}
# {{kbd | key = NULL}}
 
=== is null ===
Fill 0 if the value is NA or NULL
* MySQL SQL syntax: COALESCE(): {{kbd | key = SELECT COALESCE(column_name, 0) }} or {{kbd | key = SELECT COALESCE(column_name, 'other_filled_value') }}
** (1) Using [http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php COALESCE() function] to replace the NULL value with 0.
** (2) The case: 0/0 = null should be handled.
* MySQL SQL syntax: combined IF() & ISNULL():{{kbd | key = SELECT IF(ISNULL((column_name), 0, column_name) }} or  {{kbd | key = SELECT IF(ISNULL((column_name), 'other_filled_value', column_name) }}
* python: [http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html pandas.DataFrame.fillna — pandas 0.16.0 documentation] "Fill NA/NaN values using the specified method"
 
Find whether a variable is NULL. [http://ideone.com/tKv2gQ online demo]
* PHP [http://tw2.php.net/is_null is_null] to find type:null null NOT type:string <nowiki>'null'</nowiki> {{exclaim}}
* Google spreadsheet / Excel:  
** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}}
** [https://support.google.com/drive/answer/3093348 ISERR(value)] " value - The value to be verified as an error type other than #N/A." ex: {{kbd | key = #NULL!}}
** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use {{kbd | key = <nowiki>COUNTIF(value, "NULL")</nowiki>}} or  {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}}
** If the cell value is exactly {{kbd | key = NULL}} not {{kbd | key = #NULL!}}, You may use {{kbd | key = <nowiki>EXACT(value, "NULL")</nowiki>}}
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref>
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NULL;}}<ref>[http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values]</ref> [http://sqlfiddle.com/#!2/ec1152/1/0 demo]
* R: is.null(): [https://stat.ethz.ch/R-manual/R-devel/library/base/html/NULL.html R: The Null Object]
* Excel<ref>[https://errerrors.blogspot.com/2018/09/check-if-cell-value-is-null-in-excel.html 如何判斷 Excel 儲存格的欄位值是 NULL]</ref>
 
Find whether a variable is NOT NULL
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}}
 
Find whether a variable is NOT {{kbd | key = #N/A}}
* Excel: {{kbd | key = =NOT(ISERROR(cell_value))}}
 
javascript
* [http://www.w3schools.com/jsref/jsref_isnan.asp JavaScript isNaN() Function]
 
=== check if field value was not fulfilled: NULL, empty value ===
{{exclaim}} NOT include those data which its field value  fulfilled with default value automatically ([http://sqlfiddle.com/#!2/9b01e/3/0 demo on sqlfiddle])
 
# {{Gd}} quick solution: find records with {{kbd | key = NULL}} value OR empty, space value
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;}}
# find records with {{kbd | key = NULL}} value: (note: not {{kbd | key = #NULL!}})
#* MySQL solution: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}}
#* EXCEL: {{kbd | key =<nowiki>=EXACT(A2, "NULL")</nowiki>}}
# find records with empty value: (not contains {{kbd | key = NULL}} value)
#* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0;}} {{Exclaim}} SQL query {{kbd | key =SELECT * FROM table_name WHERE column_name IS NOT NULL}} includes empty value
#* MS SQL Server: {{kbd | key = SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0;}} <ref>[http://www.1keydata.com/tw/sql/sql-trim.html SQL TRIM 函數 - 1Keydata SQL 語法教學]</ref>
#  Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
#* solution: step1: Replace the year > 100 from this year with empty value at EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2))</nowiki>}} (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
#* trivial approach : EXCEL: {{kbd | key =<nowiki>=IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2))</nowiki>}}  {{exclaim}} this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
# Using PHP
#* [http://php.net/manual/en/function.empty.php empty()] function to find 0, null, false, empty string, empty array values.
#* <code>if(empty($var) && $var !== 0 && $var !== "0"){ .. } </code> to find null, false, empty string, empty array values BUT not 0.
# check if field value was NULL & not equal to some value
#* MySQL: [http://stackoverflow.com/questions/17097890/mysql-select-on-fields-containing-null-values MySql select on fields containing null values - Stack Overflow]
 
=== check if field value was fulfilled ===
length of string > 0
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;}} [http://sqlfiddle.com/#!9/e0e5f/1/0 demo][http://ideone.com/rt6D2A]
 
column value is not {{kbd | key = null}} or {{kbd | key = 0}}
* Excel: {{kbd | key = COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")}}<ref>[https://www.ablebits.com/office-addins-blog/2014/07/10/excel-countifs-multiple-criteria/ Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage]</ref>
 
find if number or cell value is positive integer
* EXCEL: {{kbd | key =<nowiki>=IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)</nowiki>}}<ref>[http://www.ozgrid.com/forum/showthread.php?t=24464 Check if number is an Integer]</ref> [https://docs.google.com/spreadsheets/d/1JlY9RsFDO-I9BuFa_Oegnrq5-V5vJwt4jVYUZBlyFLM/edit?usp=sharing online demo]
 
check numeric range
* MySQL: {{kbd | key = SELECT * FROM table_name WHERE column_name BETWEEN ''min_number'' AND ''max_number'';}} the value >= ''min_number'' AND value <= ''max_number''  ( ''min_number'' ≤ value ≤ ''max_number'' )
 
find NOT empty records means records without NULL or empty value:
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;</nowiki>}}
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}}
 
== Data Validation ==
Validate the format of field value. Related page: [[Regular expression]]
 
=== Verify the strings are in valid email format ===
Rule: Email contains @ symbol
 
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found
* EXCEL: {{kbd | key =<nowiki>=FIND("@", A2, 2)</nowiki>}} only check the field if contains @ symbol or not
** syntax: <nowiki>FIND(find_text, with_text, [start_num])</nowiki> the start_num is 2 because the position of  @ symbol should be larger than 1 (position of first char is 1)
** result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return '''#VALUE!''' if @ symbol was not found
* PHP: [http://www.w3schools.com/php/filter_validate_email.asp PHP FILTER_VALIDATE_EMAIL Filter]
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]
 
=== Verify the strings are in valid url format ===
Rule: Begin with http or https
 
* Google spreadsheet {{kbd | key =<nowiki>=REGEXMATCH(A1, "^http(s?)")</nowiki>}}
 
=== Number precision in Excel ===
Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref>
 
raw data: 1234567890123456
->
* (numeric format 數值格式) 1234567890123450.00 {{exclaim}} losing precision
* (general format 通用格式) 1.23457E+15 {{exclaim}} losing precision
* (text format 文字格式) 1234567890123456
 
large numbers
* If the data was imported from Excel, you should notice the 15 digit precision issue.
 
=== Verify the column values are numeric ===
 
Possible values
 
<pre>
test data:
3.141592654
1.36184E+14
123,456.789
20740199601
346183773390240
="5"
</pre>
 
==== Verify if value is number in MySQL ====
MySQL:
 
* Check if a value is integer e.g. 1234567
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref>


* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
Finds whether a variable is NOT NULL
** {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref>
* MySQL SQL syntax: {{kbd | key = SELECT * FROM table WHERE column IS NOT NULL;}} [http://sqlfiddle.com/#!2/ec1152/2/0 demo]


* Check if a value is NOT integer
** Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}}


== check if field value was not fulfilled: NULL, empty value ==
([http://sqlfiddle.com/#!2/9b01e/3/0 demo on sqlfiddle])
# {{kbd | key = NULL}} value: {{kbd | key = SELECT * FROM table_name WHERE column_name IS NULL;}}
# empty value: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0;}} {{exclaim}} SQL query with the condition {{kbd | key =SELECT * FROM table_name WHERE column_name IS NOT NULL}} includes empty value


If the digit of number is known, the SQL syntax could be more specific
== check if field contains value ==
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}}
# MySQL: {{kbd | key = SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;}} [http://sqlfiddle.com/#!2/f491c/1/0 demo]


==== Verify if value is number in PHP ====
== remove first, last or certain characters from text ==
 
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function
* [https://www.php.net/manual/en/function.is-int.php is_int] function
 
==== Verify if value is number in Excel or Google sheet ====
Excel & [https://www.google.com/sheets/about/ Google Sheets]:
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers in [https://en.wikipedia.org/wiki/Scientific_notation scientific (exponential) notation] e.g. {{code | code = <nowiki>1.36184E+14</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
 
*  Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.com/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}}
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}}
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers
 
=== Time data: Validate the data format ===
[[Validate the datetime value]]
 
=== Time data: Data was generated in N years ===
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
* Verify the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. e.g.
 
* Verify the data were not newer than today
 
* Verify the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.
** {{code | code = 1900/1/0}} (converted time formatted value from 0),
** {{code | code = 1900/1/1}} (converted time formatted value from 1)
 
* Verify the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]
 
Find the normal values:
* MySQL: Assume the data was generated in recent 10 years & not newer than today
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` < CURDATE() + 1);}}
*** {{exclaim}} NOT {{code | code =  `my_time_column` < CURDATE() }}。 ex: {{code | code = CURDATE() }} is {{CURRENTYEAR}}-{{CURRENTMONTH}}-{{CURRENTDAY2}}. Which is the same with {{CURRENTYEAR}}-{{CURRENTMONTH}}-{{CURRENTDAY2}} 00:00:00
** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 )  AND  ( `my_time_column` < CURDATE() + 1); }}
* MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
 
Abnormal values
* {{code | code = 1970-01-01 08:00:00}} (converted time formatted value from {{code | code =August 3, 2017}}) caused by the string contains special characters e.g. [https://en.wikipedia.org/wiki/Left-to-right_mark left-to-right mark (LRM) ]
 
Check if the date valid
* Excel: [https://support.office.com/en-us/article/WEEKDAY-function-60e44483-2ed1-439f-8bd0-e404c190949a?ui=en-US&rs=en-US&ad=US WEEKDAY function] <ref>[http://www.mrexcel.com/forum/excel-questions/322203-check-if-valid-date.html Check if a valid date?]</ref>
** legal results: {{kbd | key=1}} (Sunday) ~ {{kbd | key=7}} (Saturday)
** illegal result: {{kbd | key= <nowiki>#VALUE!</nowiki>}}
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow]
 
=== Time data: Human birth year (age) data ===
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>.
 
MySQL query is as follows<ref>[https://stackoverflow.com/questions/5773405/calculate-age-in-mysql-innodb sql - Calculate Age in MySQL (InnoDb) - Stack Overflow]</ref> where the column {{kbd | key=<nowiki>`birthday`</nowiki>}} is {{kbd | key=<nowiki>date</nowiki>}} type.
<pre>
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122
</pre>
 
Using [https://www.w3resource.com/mysql/date-and-time-functions/mysql-unix_timestamp-function.php UNIX_TIMESTAMP() function] to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier {{kbd | key=<nowiki>1970-01-01 00:00:00 UTC</nowiki>}} will all become zero.
 
 
=== String contains special characters ===
* [[Byte order mark]] (BOM)
* [[Return symbol]]
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]
* [https://www.fileformat.info/info/unicode/char/200f/index.htm Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)]<ref>[https://stackoverflow.com/questions/1930009/how-to-strip-unicode-chars-left-to-right-mark-from-a-string-in-php regex - How to strip unicode chars (LEFT_TO_RIGHT_MARK) from a string in php - Stack Overflow]</ref>
 
== File Validation ==
 
=== Verify the file format of downloaded file ===
* PDF file format: [https://stackoverflow.com/questions/16152583/tell-if-a-file-is-pdf-in-bash Tell if a file is PDF in bash - Stack Overflow]
 
== Find and remove duplicates ==
[[Find and remove duplicates]] in Excel/BASH/MySQL/PHP
 
== Counting ==
 
* [[Count occurrences of a word in string]]
* Count number of unique values
** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial]
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets]
 
== Outlier / Anomaly detection ==
[[Anomaly detection]]
 
== unique number of data values ==
If the data values were generated by different users, the unique number of data values should be larger than ____
 
== Data handling ==
=== Remove first, last or certain characters from text ===
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: using {{kbd | key=RIGHT}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342854.aspx RIGHT、RIGHTB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342650.aspx LEN、LENB 函數 - Excel - Office.com]</ref> functions <ref>[http://www.extendoffice.com/documents/excel/560-excel-remove-character-from-string.html How to remove first, last or certain characters from text in Excel?]</ref>
* Excel: if the length of text was fixed after removed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
* Excel: if the text length will be removed was fixed, you may try to use {{kbd | key=REPLACE}}<ref>[http://office.microsoft.com/zh-tw/excel-help/HP010342844.aspx REPLACE、REPLACEB 函數 - Excel - Office.com]</ref> + {{kbd | key=LEN}} functions ([http://bit.ly/1i2tsgk demo])
 
=== Remove leading and trailing spaces from text ===
* PHP: [http://php.net/manual/en/function.trim.php trim function]
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet]
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ]
<pre>
UPDATE `table`
SET `column` = TRIM( `column` )
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
</pre>
 
=== Remove other string look like whitespace ===
[https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]
# [http://www.fileformat.info/info/unicode/char/3000/index.htm IDEOGRAPHIC SPACE] (全形空白、全型空白, U+3000)<ref>[https://www.ptt.cc/bbs/PHP/M.1473522044.A.4FC.html Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊]</ref>:
#* diaplay: <pre><nowiki><?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?></nowiki></pre>
#* replace with space: <pre><nowiki><?php echo str_replace(json_decode('"\u3000"'), " ", $string);?></nowiki></pre>
# ASCII Vertical Tab {{kbd | key=<nowiki>\v</nowiki>}}
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}}
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}}
# [[Remove non breaking space]]
 
== Remove control character ==
[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia] Using PHP to clean control character:
<pre>
$input = 'some string may contains control characters';
$replacement = '';
$result = preg_replace('/[\x00-\x1F]/', $replacement, $input);
</pre>
 
== Remove tracking parameter from link ==
[[Remove tracking parameter from link]]
 
=== Fix garbled message text ===
[[Fix garbled message text]]
 
== Tools ==
* {{Mac}} [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 ‎Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. "
* [https://gchq.github.io/CyberChef/ CyberChef] (source code available on [https://github.com/gchq/CyberChef github])  The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis
 
== Further reading ==
* [https://g0v.gitbooks.io/data-design/content/book/ch08-data-cleaning.html Data Cleaning | DATA + DESIGN 資料+設計]
* Data modeling: [[Data type]]
* [http://www.icpsr.umich.edu/icpsrweb/NAHDAP/support/faqs/2006/01/what-is-codebook What is a codebook?]
* [https://www.facebook.com/groups/1210634969026548/permalink/1540468379376537/ 台灣R軟體Club] 「由於真實社會的資料千奇百怪,遺失值、空字串、空格、甚至是假的 "NA" (NA兩邊多了雙引號)、假的"NULL" (NULL兩邊多了雙引號) 。。。,其中一個顯著的差異是,社會實務應用的 R 程式中,通常會多出一堆檢查機制。 ... ...」{{access | date = 2017-12-18}}
* [https://docs.google.com/spreadsheets/d/1UyumfruGjps7xqd4aSB-NKfmu54NIscI2OV9Uht8F7M/edit?usp=sharing  判斷 Excel 欄位值是否錯誤] 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
* [[Web user behavior]]
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}}


== References ==
== references ==
<references/>
<references/>


[[Category:Spreadsheet]] [[Category:Excel]]
[[Category:Spreadsheet]]
[[Category:Data_hygiene]]
[[Category:Data_hygiene]]
[[Category:Data Science]]
[[Category:MySQL]]
[[Category:String manipulation]]
Please note that all contributions to LemonWiki共筆 are considered to be released under the Creative Commons Attribution-NonCommercial-ShareAlike (see LemonWiki:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!
Cancel Editing help (opens in new window)