Data cleaning
Revision as of 12:15, 12 March 2015
quick table
data type | possible values | method1: find not fulfilled or empty values | method2: find fulfilled and non-empty values | method3: find NULL values | method4: find not NULL values
VARCHAR, allows NULL | fulfilled value, ex: 123 | | V | | V
VARCHAR, allows NULL | NULL | V | | V |
VARCHAR, allows NULL | 0 | | V | | V
VARCHAR, allows NULL | EMPTY, ex: '' or ' ' | V | | | V
symbol V: the method in that column finds rows holding this value
- method1:
- method2:
- SELECT * FROM `my_table` WHERE column_name > ''
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
- method3: SELECT * FROM `my_table` WHERE column_name IS NULL
- method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL
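As a rough check of methods 2 to 4, the queries can be run against a small sample table. The sketch below rests on several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the sample rows are invented for illustration, and the clause used for method1 is only an assumed candidate because no query is listed for method1. Rules for comparing strings with trailing spaces can differ between engines and collations, so the LENGTH(TRIM(...)) form is used here rather than column_name > ''.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, column_name VARCHAR(20) NULL)"
)
conn.executemany(
    "INSERT INTO my_table (id, column_name) VALUES (?, ?)",
    [(1, "123"), (2, None), (3, "0"), (4, ""), (5, " ")],
)


def ids(where_clause):
    """Return the ids of the rows matched by a WHERE clause, in id order."""
    sql = "SELECT id FROM my_table WHERE " + where_clause + " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


# method1 (assumed candidate, not taken from the list): NULL or blank values
not_fulfilled = ids("COALESCE(TRIM(column_name), '') = ''")
# method2: fulfilled and non-empty values
fulfilled = ids("LENGTH(TRIM(column_name)) > 0")
# method3 and method4
is_null = ids("column_name IS NULL")
is_not_null = ids("column_name IS NOT NULL")

print(not_fulfilled, fulfilled, is_null, is_not_null)
```

The four result lists line up with the V marks in the quick table: NULL and empty rows for method1, '123' and '0' for method2, only the NULL row for method3, and everything except the NULL row for method4.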
data type | possible values | method5: find values within the range | method6: find values out of the range, empty, NULL values
VARCHAR or numeric | values within the range, ex: min ≤ value ≤ max | V |
VARCHAR or numeric | values out of range | | V
VARCHAR or numeric | NULL | | V
VARCHAR or numeric | EMPTY, ex: '' or ' ' | | V
- method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
- method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )
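Methods 5 and 6 can be tried out the same way. This is a minimal sketch under assumptions: SQLite (via Python's sqlite3 module) replaces MySQL, the sample rows and the bounds 1 and 10 are invented, and how an empty string compares with a number is engine-specific, so the empty-string row is caught here by the COALESCE clause rather than by the range test.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; rows and bounds are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, column_name INTEGER NULL)")
conn.executemany(
    "INSERT INTO my_table (id, column_name) VALUES (?, ?)",
    [(1, 5), (2, 1), (3, 10), (4, 0), (5, 11), (6, None), (7, "")],
)
min_value, max_value = 1, 10

# method5: values within the inclusive range
in_range = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM my_table WHERE column_name BETWEEN ? AND ? ORDER BY id",
        (min_value, max_value),
    )
]

# method6: NULL, empty, or out-of-range values
rejected = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM my_table "
        "WHERE (COALESCE(column_name, '') = '') "
        "OR (column_name NOT BETWEEN ? AND ?) ORDER BY id",
        (min_value, max_value),
    )
]

print(in_range, rejected)
```

Every row lands in exactly one of the two lists, which is the property that makes method6 usable as the complement of method5.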
is null
Finds whether a variable is NULL. online demo
- PHP is_null
- Google spreadsheet / Excel:
- ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
- If the cell holds the literal text NULL rather than the error #NULL!, you can use COUNTIF(value, "NULL") or EXACT(value, "NULL")
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]
Finds whether a variable is NOT NULL
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;
check if field value was not fulfilled: NULL, empty value
This does NOT include records whose field was filled automatically with a default value.
- find records with NULL value: (note: not #NULL!)
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
- EXCEL: =EXACT(A2, "NULL")
- find records with empty value: (excluding NULL values)
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; note that the query SELECT * FROM table_name WHERE column_name IS NOT NULL also returns rows with empty values
- Excel starting date: a cell holding 0 displays as 1900/1/0 when formatted as a date, 1 displays as 1900/1/1, 2 as 1900/1/2 ...
- solution: step 1: replace dates more than 100 years before the current year with an empty value, EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handles empty values and badly formatted values, ex: 0000-12-31 ) ; step 2: change the cell format to a date format
- trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) this formula cannot handle empty values because it returns 0 for them. Once the cell format is changed to a date format, 0 displays as 1900/1/0.
- PHP: use the empty() function to find 0, null, false, empty string, and empty array values.
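The Excel date rule above (blank out values that do not parse as a date, and dates more than 100 years old) can be sketched outside a spreadsheet as well. The function name clean_date, the two accepted date layouts, and the today parameter are all assumptions made for this illustration.

```python
from datetime import date, datetime


def clean_date(text, today=None, max_age_years=100):
    """Return text unchanged if it is a plausible date, otherwise ''.

    A value is kept when it parses as YYYY-MM-DD or YYYY/MM/DD (assumed
    layouts) and its year is at most max_age_years before today's year.
    """
    today = today or date.today()
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            parsed = datetime.strptime(text.strip(), fmt).date()
        except (ValueError, AttributeError):
            # ValueError: not a valid date in this layout; AttributeError: not a string
            continue
        return text if parsed.year >= today.year - max_age_years else ""
    return ""


reference_day = date(2015, 3, 12)
for raw in ["1980-05-17", "1900-01-01", "0000-12-31", "", None]:
    print(repr(raw), "->", repr(clean_date(raw, today=reference_day)))
```

Passing today explicitly keeps the result reproducible; leaving it out uses the current date, which is closer to what YEAR(NOW()) does in the spreadsheet.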
check if field value was fulfilled
length of string > 0
column value is not null or 0
- Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[4]
check whether a number or cell value is a positive integer
- EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[5] online demo
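The same test (the value is numeric, equals its integer part, and is greater than zero) might look like this in a script. The function name is hypothetical, and accepting numeric strings such as "7" is a choice of this sketch, not something the Excel formula is stated to do.

```python
def is_positive_integer(value):
    """Return True when value is numeric, whole, and greater than zero."""
    if isinstance(value, bool):
        # True/False are technically ints in Python; treat them as non-numeric here
        return False
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Mirrors the IFERROR(..., FALSE) fallback for non-numeric input
        return False
    return number.is_integer() and number > 0


for sample in [3, "7", 3.0, 2.5, 0, -4, "abc", None]:
    print(repr(sample), is_positive_integer(sample))
```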
check numeric range
- MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; BETWEEN is inclusive, so it matches value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )
find non-empty records, i.e. records with neither NULL nor an empty value:
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
- MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;
verify the format of field value
email:
- EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only checks whether the field contains the @ symbol
- result: (1) normal condition: returns TRUE; (2) exceptional condition: returns FALSE if the @ symbol was not found
- EXCEL: =FIND("@", A2, 2) only checks whether the field contains the @ symbol after the first character
- syntax: FIND(find_text, within_text, [start_num]) the start_num is 2 because the position of the @ symbol should be larger than 1 (the position of the first char is 1)
- result: (1) normal condition: returns a number larger than 1; (2) exceptional condition: returns #VALUE! if the @ symbol was not found
- PHP: PHP FILTER_VALIDATE_EMAIL Filter
- "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net
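For comparison, the coarse "@ after the first character" check from the Excel formulas can be written as a one-line function. This is only a sketch with a hypothetical function name; like the spreadsheet version it does not validate the address, it only rejects values with no @ or with @ in first position.

```python
def has_at_after_first_char(text):
    """True when '@' occurs at 1-based position 2 or later, like FIND("@", A2, 2)."""
    # str.find returns -1 when nothing is found; searching from index 1 skips the first char
    return isinstance(text, str) and text.find("@", 1) >= 1


for sample in ["user@example.com", "@example.com", "no-at-sign", "", None]:
    print(repr(sample), has_at_after_first_char(sample))
```

A value such as "a@" still passes, so a stricter validator (for example the PHP filter listed above) remains the better choice when real address validation is needed.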
duplicate data
- EXCEL: How to count duplicate values in a column in Excel?
- PHP: PHP: array_unique, PHP: array_intersect
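Counting duplicates and dropping them can also be sketched in a few lines with the standard library. The function names and sample values below are invented for illustration; keeping the first occurrence of each value is an assumption about the desired behaviour.

```python
from collections import Counter


def duplicate_counts(values):
    """Map each value that occurs more than once to its number of occurrences."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


def unique_in_order(values):
    """Drop repeated values, keeping the first occurrence of each."""
    return list(dict.fromkeys(values))


sample = ["a", "b", "a", "c", "b", "a"]
print(duplicate_counts(sample))
print(unique_in_order(sample))
```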
outlier
(left blank intentionally)
data handling
remove first, last or certain characters from text
- Excel: use the RIGHT[6] + LEN[7] functions [8]
- Excel: if the length of the text to remove is fixed, you can use the REPLACE[9] + LEN functions (demo)
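The same trimming operations map onto string slicing. The helper names below are hypothetical, and remove_span takes a 1-based start position on the assumption that matching Excel's REPLACE(old_text, start_num, num_chars, "") is the most convenient convention here.

```python
def remove_first(text, n):
    """Drop the first n characters, like RIGHT(text, LEN(text) - n)."""
    return text[n:]


def remove_last(text, n):
    """Drop the last n characters, like LEFT(text, LEN(text) - n)."""
    return text[:-n] if n > 0 else text


def remove_span(text, start, length):
    """Drop length characters beginning at 1-based position start."""
    return text[: start - 1] + text[start - 1 + length :]


print(remove_first("ABC123", 3))
print(remove_last("ABC123", 3))
print(remove_span("ABC123", 2, 3))
```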
Data modeling: Data type
references
- [1] MySQL COALESCE() function - w3resource
- [2] How to check if field is null or empty mysql? - Stack Overflow
- [3] MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
- [4] Excel COUNTIFS and COUNTIF with multiple criteria - examples of usage
- [5] Check if number is an Integer
- [6] RIGHT, RIGHTB functions - Excel - Office.com
- [7] LEN, LENB functions - Excel - Office.com
- [8] How to remove first, last or certain characters from text in Excel?
- [9] REPLACE, REPLACEB functions - Excel - Office.com