Data cleaning: Difference between revisions

Jump to navigation Jump to search
596 bytes added ,  6 September 2016
Line 232: Line 232:
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}}
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}}


== verify the format of field value ==
== Verify the format of field value ==
related page: [[Regular expression]]
related page: [[Regular expression]]


=== email contains @ symbol ===
=== Email contains @ symbol ===
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
* EXCEL: {{kbd | key =<nowiki>=IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE)</nowiki>}} only check the field if contains @ symbol or not
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
** result: (1) normal condition: return TRUE; (2) exceptional condition: return '''FALSE''' if @ symbol was not found  
Line 244: Line 244:
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]
** "Returns the filtered data, or '''FALSE''' if the filter fails." quoted from [http://php.net/manual/en/function.filter-var.php PHP.net]


=== number precision in Excel ===
=== Number precision in Excel ===
Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref>
Number precision: 15 digits (Excel中最多的有效位數為15位)<ref>[https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3 Excel specifications and limits]</ref><ref>[http://isvincent.pixnet.net/blog/post/30267208-excel-%E9%A1%AF%E7%A4%BApi%E8%BC%83%E5%A4%9A%E7%9A%84%E6%9C%89%E6%95%88%E4%BD%8D%E6%95%B8 A2]</ref>


Line 256: Line 256:
* If the data was imported from Excel, you should notice the 15 digit precision issue.
* If the data was imported from Excel, you should notice the 15 digit precision issue.


=== numeric only ===
=== Numeric only ===
* PHP: [http://php.net/manual/en/function.is-numeric.php is_numeric]
* PHP: [http://php.net/manual/en/function.is-numeric.php is_numeric]
* MySQL:  
* MySQL:  
Line 263: Line 263:
* Excel: [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]
* Excel: [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]


=== abnormal values of time data ===
=== Abnormal values of time data ===
Definition of abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) if they
Definition of abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) if they
* were generated 10 years before or
* were generated 10 years before or
Line 281: Line 281:
** :{{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
** :{{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
=== Remove leading and trailing spaces from text ===
* PHP: [http://php.net/manual/en/function.trim.php trim function]
* Excel: [https://exceljet.net/formula/remove-leading-and-trailing-spaces-from-text Excel formula: Remove leading and trailing spaces from text | Exceljet]
* MySQL: Using [http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php TRIM() function] & [http://www.w3resource.com/mysql/string-functions/mysql-length-function.php LENGTH() function ]
<pre>
UPDATE `table`
SET `column` = TRIM( `column` )
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
</pre>


== duplicate data ==
== duplicate data ==

Navigation menu