Data cleaning
check if field value was not filled in
quick table
data type | possible values | method1: find unfilled or empty values | method2: find filled, non-empty values | method3: find NULL values | method4: find non-NULL values |
VARCHAR, NULL allowed | filled value, ex: 123 | | V | | V |
| NULL | V | | V | |
| 0 | | V | | V |
| EMPTY, ex: '' or a space ' ' | V | | | V |
V: the value in that row is found by the method in that column
- method1: SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0
- method2:
- SELECT * FROM `my_table` WHERE column_name > ''
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
- method3: SELECT * FROM `my_table` WHERE column_name IS NULL
- method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL
data type | possible values | method5: find values within the range | method6: find values out of the range, empty & NULL values |
VARCHAR or numeric | values within the range, ex: min ≤ value ≤ max | V | |
| values out of range | | V |
| NULL | | V |
| EMPTY, ex: '' or a space ' ' | | V |
- method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
- method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )
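The same classification can be mirrored outside the database. The following Python sketch is only an illustration: the names `classify_value` and `in_range` are hypothetical, NULL is assumed to arrive as `None`, whitespace-only strings are assumed to count as empty (as in the TRIM-based queries), and non-numeric text is assumed to be out of range.

```python
def classify_value(value):
    """Return 'null', 'empty' or 'filled' for a single field value."""
    if value is None:
        return "null"       # what method3 (IS NULL) finds
    if str(value).strip() == "":
        return "empty"      # '' or whitespace only, like LENGTH(TRIM(col)) = 0
    return "filled"         # what method2 finds; note that 0 counts as filled


def in_range(value, minimum, maximum):
    """Rough equivalent of method5: True only when minimum <= value <= maximum."""
    if classify_value(value) != "filled":
        return False        # NULL and empty values belong to method6
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False        # assumption: non-numeric text is out of range
    return minimum <= number <= maximum


rows = [123, None, 0, "", " ", "42", "abc"]
labels = [classify_value(v) for v in rows]
within = [v for v in rows if in_range(v, 1, 100)]
```

Everything that `in_range` rejects corresponds to the rows method6 is meant to return.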
is null
Fill 0 if the value is NA or NULL
- MySQL SQL syntax: SELECT COALESCE(column_name, 0)
- (1) Using COALESCE() function to replace the NULL value with 0.
- (2) The case 0/0 = NULL should be handled as well: MySQL returns NULL for a division by zero.
- python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"
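For data that is already in plain Python structures, the same "fill 0" idea can be sketched without pandas. The helper names `fill_zero` and `safe_ratio` are hypothetical, and the sketch assumes that missing values arrive as `None` or as a float NaN.

```python
import math


def fill_zero(value):
    """Return 0 when value is None or NaN, otherwise return value unchanged."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 0
    return value


def safe_ratio(numerator, denominator):
    """Divide, mapping division by zero to None (MySQL maps it to NULL)."""
    if denominator == 0:
        return None
    return numerator / denominator


cleaned = [fill_zero(v) for v in [3, None, float("nan"), 0.5]]
ratio = fill_zero(safe_ratio(0, 0))
```

Wrapping `safe_ratio` in `fill_zero` plays the same role as wrapping a division in COALESCE(..., 0).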
Find whether a variable is NULL. online demo
- PHP is_null
- Google spreadsheet / Excel:
- ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
- If the cell contains the literal text NULL rather than the #NULL! error, you can use COUNTIF(value, "NULL") or EXACT(value, "NULL")
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]
Find whether a variable is NOT NULL
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;
check if field value was not filled in: NULL, empty value
This does NOT include records whose field was filled automatically with a default value
- find records with NULL value: (note: not #NULL!)
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
- EXCEL: =EXACT(A2, "NULL")
- find records with empty value: (NULL values are not included)
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; Note that the query SELECT * FROM table_name WHERE column_name IS NOT NULL also returns rows with an empty value
- MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [4]
- Excel starting date: 1900/1/0 (the date-formatted value of 0), 1900/1/1 (the date-formatted value of 1), 1900/1/2 ...
- solution: step 1: replace dates whose year is more than 100 years in the past (the formula hard-codes 1914 as the cutoff) with an empty value in EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handles empty cells and badly formatted values, ex: 0000-12-31); step 2: change the cell format to a date/time format
- trivial approach: EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) This formula cannot handle an empty cell because it returns 0. Once the cell format is changed to a date/time format, 0 is displayed as 1900/1/0.
- PHP: use the empty() function to find 0, null, false, empty string and empty array values.
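The Excel date clean-up above (blank out dates that are implausibly old, plus empty or badly formatted cells) could look roughly like this in Python. This is a sketch under assumptions: `blank_old_date` is a hypothetical name, the values are assumed to be already parsed into `datetime.date` objects where possible, and the 100-year threshold is passed in rather than hard-coded.

```python
from datetime import date


def blank_old_date(value, today, max_age_years=100):
    """Return '' for missing, unparsed or too-old dates, else the date itself."""
    if not isinstance(value, date):
        return ""           # empty cells and badly formatted values
    if value.year < today.year - max_age_years:
        return ""           # e.g. 1900-01-01, produced from the number 1
    return value


today = date(2014, 6, 1)
cells = [date(1900, 1, 1), date(2010, 5, 1), None, "0000-12-31"]
cleaned_dates = [blank_old_date(c, today) for c in cells]
```

With `today` in 2014 the cutoff year is 1914, which matches the constant used in the Excel formula.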
check if field value was filled in
length of string > 0
column value is not null or 0
- Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[5]
find whether a number or cell value is a positive integer
- EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[6] online demo
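A comparable check in Python might look like the sketch below. The function name `is_positive_integer` is hypothetical, and rejecting booleans is a choice made for this sketch, not something the Excel formula specifies.

```python
def is_positive_integer(value):
    """True when value is a whole number greater than zero."""
    if isinstance(value, bool):
        return False        # assumption: True/False are not accepted as numbers
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False        # same role as the outer IFERROR(..., FALSE)
    # is_integer() is False for inf and nan, so those are rejected as well
    return number > 0 and number.is_integer()


checks = [is_positive_integer(v) for v in [3, 3.0, "7", 2.5, 0, -4, "abc", None]]
```

The `float(value)` conversion lets numeric text such as "7" pass, much as Excel coerces numeric text inside INT().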
check numeric range
- MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; BETWEEN is inclusive, i.e. value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )
find NOT empty records, i.e. records without NULL or empty value:
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
- MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;
verify the format of field value
related page: Regular expression
email contains @ symbol
- EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only checks whether the field contains the @ symbol
- result: (1) normal condition: returns TRUE; (2) exceptional condition: returns FALSE if the @ symbol was not found
- EXCEL: =FIND("@", A2, 2) only checks whether the field contains the @ symbol
- syntax: FIND(find_text, within_text, [start_num]) start_num is 2 because the position of the @ symbol should be larger than 1 (the position of the first character is 1)
- result: (1) normal condition: returns a number larger than 1; (2) exceptional condition: returns #VALUE! if the @ symbol was not found
- PHP: PHP FILTER_VALIDATE_EMAIL Filter
- "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net
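The "@ must appear after the first character" rule used in the FIND("@", A2, 2) formula can be sketched in Python as follows. `has_at_symbol` is a hypothetical helper, and like the Excel formulas it is only a presence check, not a full email validation.

```python
def has_at_symbol(text):
    """True when '@' occurs at position 2 or later (1-based), like FIND("@", A2, 2)."""
    if not isinstance(text, str):
        return False        # None and non-text values cannot contain '@'
    # str.find uses 0-based positions, so start index 1 is the second character
    return text.find("@", 1) != -1


results = [has_at_symbol(v) for v in ["user@example.com", "@example.com", "no-at-sign", None]]
```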
number precision of Excel
Number precision: 15 digits (Excel keeps at most 15 significant digits)[7][8]
raw data: 1234567890123456 ->
- (numeric format) 1234567890123450.00 loses precision
- (general format) 1.23457E+15 loses precision
- (text format) 1234567890123456
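Before importing data into Excel, values that would exceed the 15-digit limit can be flagged so that they are imported as text. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it only handles plain decimal strings (no exponent notation, no thousands separators).

```python
def significant_digits(number_text):
    """Count the digits from the first to the last non-zero digit."""
    digits = number_text.strip().lstrip("+-").replace(".", "")
    # leading and trailing zeros do not need to be stored as significant digits
    return len(digits.strip("0"))


def loses_precision_in_excel(number_text, limit=15):
    """True when the value needs more significant digits than Excel keeps."""
    return significant_digits(number_text) > limit


flags = [loses_precision_in_excel(v) for v in
         ["1234567890123456", "1234567890123450", "123456789012345"]]
```

Values flagged `True` are candidates for the text format shown above.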
numeric only
- PHP: is_numeric
- MySQL: SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$' [9]
- Excel: ISNUMBER Function
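The digits-only test can be reproduced in Python with the `re` module. The name `is_digits_only` is hypothetical. `fullmatch` is used instead of `^...$` because `$` in Python also matches just before a trailing newline.

```python
import re

# same character class as the MySQL pattern '^[0-9]+$'
DIGITS_ONLY = re.compile(r"[0-9]+")


def is_digits_only(text):
    """True when text is a non-empty string made of the digits 0-9 only."""
    return isinstance(text, str) and DIGITS_ONLY.fullmatch(text) is not None


digit_checks = [is_digits_only(v) for v in ["12345", "12a", "", "1.5", "-3", None]]
```

Note that this is stricter than PHP is_numeric, which also accepts signs and decimal points.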
abnormal values of time data
Values of time data (time series) are considered abnormal if they
- were generated more than 10 years ago or
- are newer than today
List of the possible abnormal values:
- 0001-01 00:00:00: occurs in the MySQL datetime type
- 1900/1/0 (the date-formatted value of 0), 1900/1/1 (the date-formatted value of 1), 1900/1/2 ...: occur in MS Excel
- future data: dates after today
Find the normal values:
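- MySQL: Assume the data was generated in the last 10 years and is not newer than today
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + INTERVAL 1 DAY );
- Use CURDATE() + INTERVAL 1 DAY rather than CURDATE() + 1, because CURDATE() + 1 is evaluated as a number (ex: 20240330), not as a date.
- NOT `my_time_column` < CURDATE(). ex: if CURDATE() is 2024-03-29, it is the same as 2024-03-29 00:00:00, so records generated today would be excluded.
- NOT SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + INTERVAL 1 DAY ); comparing only the year numbers ignores the month and day, so the window is not exactly 10 years.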
- MySQL: Assume the data was generated in the last 10 years and is not newer than the current timestamp. This is precise to the second, unlike the approach above.
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
- Check whether the result of SELECT CURRENT_TIMESTAMP; is correct before you delete the abnormal data (timezone issue)
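The "last 10 years and not newer than now" window can also be checked in application code. The following Python sketch uses a hypothetical function `is_normal_timestamp` and assumes that `value` and `now` are naive `datetime` objects in the same time zone. The lower bound is midnight of today's date 10 years ago, in the spirit of CURDATE() - INTERVAL 10 YEAR.

```python
from datetime import datetime


def is_normal_timestamp(value, now, max_age_years=10):
    """True when value lies within the last max_age_years and is not after now."""
    if not isinstance(value, datetime):
        return False        # None, 0 and unparsed text are treated as abnormal
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    try:
        oldest = midnight.replace(year=now.year - max_age_years)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        oldest = midnight.replace(year=now.year - max_age_years, day=28)
    return oldest <= value <= now


now = datetime(2024, 3, 29, 12, 0, 0)
samples = [datetime(2020, 1, 1), datetime(2014, 3, 28, 23, 59, 59),
           datetime(2024, 3, 29, 12, 0, 1), datetime(1900, 1, 1), None]
normal_flags = [is_normal_timestamp(s, now) for s in samples]
```

Passing `now` in explicitly makes the time zone question visible: the caller decides which clock counts as "now" before anything is deleted.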
duplicate data
- EXCEL:
- one column data: How to count duplicate values in a column in Excel?
- two columns data: How to compare data in two columns to find duplicates in Excel [Last visited: 2016-06-16]
- PHP: PHP: array_unique, PHP: array_intersect
- MySQL:
- MySQL DISTINCT - Eliminate Duplicate Rows in a Result Set. Using GROUP_CONCAT to handle the multiple columns[10]
- SQL UNIQUE Constraint "Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table." Quoted from w3schools webpage.
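For data held in Python lists, duplicates can be found or removed with the standard library. This is a minimal sketch with hypothetical function names. Rows with several columns are assumed to be represented as tuples so that they can be counted.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


def unique_in_order(values):
    """Drop repeated values while keeping the first occurrence of each."""
    return list(dict.fromkeys(values))


single_column = ["a", "b", "a", "c", "b", "a"]
two_columns = [("x", 1), ("y", 2), ("x", 1)]
duplicate_values = find_duplicates(single_column)
duplicate_rows = find_duplicates(two_columns)
deduplicated = unique_in_order(single_column)
```

`unique_in_order` is the rough counterpart of SELECT DISTINCT, while `find_duplicates` reports which values would violate a UNIQUE constraint.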
outlier
(left blank intentionally)
data handling
remove first, last or certain characters from text
- Excel: using RIGHT[11] + LEN[12] functions [13]
- Excel: if the text has a fixed length after the characters are removed, you can use the REPLACE[14] + LEN functions (demo)
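In Python the same trimming is done with string slicing. The helper names below are hypothetical, `count` is assumed to be zero or positive, and `start` in `remove_middle` is 1-based to match the Excel REPLACE function.

```python
def remove_first(text, count):
    """Drop the first count characters, like RIGHT(text, LEN(text) - count)."""
    return text[count:]


def remove_last(text, count):
    """Drop the last count characters; returns '' when count exceeds the length."""
    return text[:max(len(text) - count, 0)]


def remove_middle(text, start, count):
    """Drop count characters from 1-based position start, like REPLACE(text, start, count, "")."""
    return text[:start - 1] + text[start - 1 + count:]


trimmed = [remove_first("abcdef", 2), remove_last("abcdef", 2), remove_middle("abcdef", 2, 3)]
```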
Data modeling: Data type
references
- [1] MySQL COALESCE() function - w3resource
- [2] How to check if field is null or empty mysql? - Stack Overflow
- [3] MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
- [4] SQL TRIM function - 1Keydata SQL syntax tutorial
- [5] Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage
- [6] Check if number is an Integer
- [7] Excel specifications and limits
- [8] A2
- [9] regex - Mysql REGEXP with . and numbers only - Stack Overflow
- [10] sql - MySQL SELECT DISTINCT multiple columns - Stack Overflow
- [11] RIGHT, RIGHTB functions - Excel - Office.com
- [12] LEN, LENB functions - Excel - Office.com
- [13] How to remove first, last or certain characters from text in Excel?
- [14] REPLACE, REPLACEB functions - Excel - Office.com