Data cleaning: Difference between revisions

Jump to navigation Jump to search
607 bytes added ,  27 November 2017
Line 286: Line 286:


=== Time data ===
=== Time data ===
==== Condition: Data was generated in 10 years ====
Definition of abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) if they
Definition of abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) if they
* were generated 10 years before or
* were generated 10 years before or
Line 294: Line 296:
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1), {{code | code = 1900/1/2}} ... occurred in MS Excel
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1), {{code | code = 1900/1/2}} ... occurred in MS Excel
* future data: the date after today
* future data: the date after today


Find the normal values:  
Find the normal values:  
Line 302: Line 303:
** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 )  AND  ( `my_time_column` < CURDATE() + 1); }}
** {{code | code = SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 )  AND  ( `my_time_column` < CURDATE() + 1); }}
* MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.  
* MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.  
** :{{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)


Line 310: Line 311:
** illegal result: {{kbd | key= <nowiki>#VALUE!</nowiki>}}
** illegal result: {{kbd | key= <nowiki>#VALUE!</nowiki>}}
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow]
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow]
==== Condition: Human birth year (age) data ====
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>.
MySQL query is as follows<ref>[https://stackoverflow.com/questions/5773405/calculate-age-in-mysql-innodb sql - Calculate Age in MySQL (InnoDb) - Stack Overflow]</ref> where the column {{kbd | key=<nowiki>`birthday`</nowiki>}} is {{kbd | key=<nowiki>date</nowiki>}} type.
<pre>
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122
</pre>


== Duplicate data ==
== Duplicate data ==

Navigation menu