Data cleaning: Difference between revisions

Jump to navigation Jump to search
173 bytes added ,  8 April 2024
Line 346: Line 346:
=== Time data: Data was generated in N years ===
=== Time data: Data was generated in N years ===
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series])
* Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. e.g.
* Verify the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. e.g.


* Verfiy the data were not newer than today
* Verify the data were not newer than today


* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.  
* Verify the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g.  
** {{code | code = 1900/1/0}} (converted time formatted value from 0),  
** {{code | code = 1900/1/0}} (converted time formatted value from 0),  
** {{code | code = 1900/1/1}} (converted time formatted value from 1)
** {{code | code = 1900/1/1}} (converted time formatted value from 1)


* Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]
* Verify the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance]


Find the normal values:  
Find the normal values:  
Line 364: Line 364:
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
** {{code | code = SELECT * FROM `my_table` WHERE ( `my_time_column` >=  CURDATE() - INTERVAL 10 YEAR )  AND  ( `my_time_column` <= CURRENT_TIMESTAMP);}}  
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
*** You need to check the {{code | code = SELECT CURRENT_TIMESTAMP);}} if correct or not before you delete the abnormal data (timezone issue)
Abnormal values
* {{code | code = 1970-01-01 08:00:00}} (converted time formatted value from {{code | code =August 3, 2017}}) caused by the string contains special symbols


Check if the date valid
Check if the date valid

Navigation menu