Data cleaning: Difference between revisions
Line 266: | Line 266: | ||
Excel & [https://www.google.com/sheets/about/ Google Sheets]: | Excel & [https://www.google.com/sheets/about/ Google Sheets]: | ||
* Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}} | * Using [http://www.techonthenet.com/excel/formulas/isnumber.php ISNUMBER Function]: {{code | code = <nowiki>=INT(ISNUMBER(A1))</nowiki>}} | ||
** Return 1 if the cell value is (1) Numbers (2) Numbers in [https://en.wikipedia.org/wiki/Scientific_notation scientific (exponential) notation] e.g. {{code | code = <nowiki>1.36184E+14</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} | ** Return 1 if the cell value is (1) Numbers (2) Numbers in [https://en.wikipedia.org/wiki/Scientific_notation scientific (exponential) notation] e.g. {{code | code = <nowiki>1.36184E+14</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ||
** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | ** Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | ||
* Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.tw/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}} | * Google Sheets only: Using [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3094140?hl=zh-Hant TRIM] & [https://support.google.com/docs/answer/3093592?hl=zh-Hant CONCAT]<ref>[https://errerrors.blogspot.tw/2015/08/google.html GOOGLE 試算表: 數字轉成文字]</ref> functions: {{code | code = <nowiki>=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)</nowiki>}} | ||
** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | ** Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. {{code | code = <nowiki>="5"</nowiki>}} | ||
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} | ** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ||
<pre> | <pre> |
Revision as of 17:14, 7 July 2017
check if field value was not fulfilled
by purpose
purpose | method (MySQL query syntax) | value1: fulfilled value what I want |
value2: fulfilled value NOT I want |
value3: 0 |
value4: NULL value |
value5: empty or spaces value |
find not fulfilled or empty values (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 | V | V | |||
find not fulfilled or empty values (contains 0) |
V | V | V | |||
find fulfilled and non-empty values (not contains 0) |
V | V | ||||
find fulfilled and non-empty values (contains 0) |
WHERE LENGTH(TRIM( column_name )) > 0 | V | V | V | ||
find (1) not fulfilled or empty values (2) values NOT I want (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 OR column_name LIKE 'values NOT I want' | V | V | V | ||
find (1) not fulfilled or empty values (2) values NOT I want (contains 0) |
V | V | V | V |
by datatype
VARCHAR and allows NULL value
data type of column | possible column values | method1: find not fulfilled or empty values |
method2: find fulfilled and non-empty values |
method3: find NULL values |
method4: find not NULL values |
VARCHAR and allows NULL | fulfilled value ex:123 | V | V | ||
NULL type:null | V | V | |||
'NULL' type:string | |||||
0 | V | V | |||
EMPTY ex: '' or space(s) ' ' | V | V |
symbol V: means the column value will be able to find by means of the method
- method1:
- method2:
- SELECT * FROM `my_table` WHERE column_name > ''
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
- method3: SELECT * FROM `my_table` WHERE column_name IS NULL
- method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL
VARCHAR or numeric
data type of column | possible column values | method5: find values within the range |
method6: find values out of the range, empty & NULL values |
VARCHAR or numeric | values within the range ex: min ≤ value ≤ max | V | |
values out of range | V | ||
NULL | V | ||
EMPTY ex: '' or space ' ' | V |
- method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
- method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )
datetime and allows NULL value
possible column values
- 2024-04-26
- 00-00-00 00:00:00
- NULL
is null
Fill 0 if the value is NA or NULL
- MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
- (1) Using COALESCE() function to replace the NULL value with 0.
- (2) The case: 0/0 = null should be handled.
- MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
- python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"
Find whether a variable is NULL. online demo
- PHP is_null to find type:null null NOT type:string 'null'
- Google spreadsheet / Excel:
- ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
- If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]
Find whether a variable is NOT NULL
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;
Find whether a variable is NOT #N/A
- Excel: =NOT(ISERROR(cell_value))
javascript
check if field value was not fulfilled: NULL, empty value
NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)
- quick solution: find records with NULL value OR empty, space value
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
- find records with NULL value: (note: not #NULL!)
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
- EXCEL: =EXACT(A2, "NULL")
- find records with empty value: (not contains NULL value)
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
- MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [4]
- Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
- solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
- trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
- Using PHP
- empty() function to find 0, null, false, empty string, empty array values.
if(empty($var) && $var !== 0 && $var !== "0"){ .. }
to find null, false, empty string, empty array values BUT not 0.
- check if field value was NULL & not equal to some value
check if field value was fulfilled
length of string > 0
column value is not null or 0
- Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[5]
find if number or cell value is positive integer
- EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[6] online demo
check numeric range
- MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )
find NOT empty records means records without NULL or empty value:
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
- MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;
Verify the format of field value
related page: Regular expression
Email contains @ symbol
- EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
- result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
- EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
- syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
- result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
- PHP: PHP FILTER_VALIDATE_EMAIL Filter
- "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net
Number precision in Excel
Number precision: 15 digits (Excel中最多的有效位數為15位)[7][8]
raw data: 1234567890123456 ->
- (numeric format 數值格式) 1234567890123450.00 losing precision
- (general format 通用格式) 1.23457E+15 losing precision
- (text format 文字格式) 1234567890123456
large numbers
- If the data was imported from Excel, you should notice the 15 digit precision issue.
Numeric only
PHP:
MySQL:
- find numeric values
SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'
[9] - find NOT numeric values
SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'
Excel & Google Sheets:
- Using ISNUMBER Function:
=INT(ISNUMBER(A1))
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
1.36184E+14
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers - Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g.
="5"
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
- Google Sheets only: Using REGEXMATCH, TRIM & CONCAT[10] functions:
=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
="5"
- Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g.
1.23E+16
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
test data: 3.141592654 1.36184E+14 20740199601 346183773390240 ="5"
Abnormal values of time data
Definition of abnormal values of the time data (time series) if they
- were generated 10 years before or
- newer than today
List of the possible abnormal values:
0001-01 00:00:00
occurred in MySQLdatetime
type1900/1/0
(converted time formatted value from 0),1900/1/1
(converted time formatted value from 1),1900/1/2
... occurred in MS Excel- future data: the date after today
Find the normal values:
- MySQL: Assume the data was generated in recent 10 years & not newer than today
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
- NOT
`my_time_column` < CURDATE()
。 ex:CURDATE()
is 2024-04-26. Which is the same with 2024-04-26 00:00:00
- NOT
SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
- MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
- :
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
- You need to check the
SELECT CURRENT_TIMESTAMP);
if correct or not before you delete the abnormal data (timezone issue)
- You need to check the
- :
Check if the date valid
- Excel: WEEKDAY function [11]
- legal results: 1 (Sunday) ~ 7 (Saturday)
- illegal result: #VALUE!
- PHP: php - Correctly determine if date string is a valid date in that format - Stack Overflow
duplicate data
find duplicate data
- EXCEL:
- one column data: How to count duplicate values in a column in Excel? Using COUNTIF(range, criteria) [Last visited: 2015-08-25] or using Pivot Tables(樞紐分析表) to find the occurrence of value >= 2
- two columns data: How to compare data in two columns to find duplicates in Excel [Last visited: 2015-06-16] It may costs too much time (larger than one hour) if the number of records exceeds 1,000,000
- MySQL:
Find the duplicated data for one column[12]
-- Generate test data. CREATE TABLE `table_name` ( `id` int(11) NOT NULL, `content` varchar(5) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO `table_name` (`id`, `content`) VALUES (1, 'apple'), (2, 'lemon'), (3, 'apple'); ALTER TABLE `table_name` ADD PRIMARY KEY (`id`); -- Find duplicated data SELECT `content`, COUNT(*) count FROM `table_name` GROUP BY `content` HAVING count > 1; SELECT tmp.* FROM ( SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` ) tmp WHERE tmp.count >1;
Using CONCAT for multiple columns ex: column_1, column_2
SELECT tmp.key FROM ( SELECT count(*) count, CONCAT( `column_1`, `column_2` ) 'key' FROM `table_name` GROUP BY CONCAT( `column_1`, `column_2` ) ) tmp WHERE tmp.count >=2
For counting purpose: find the count of repeated id (type: int) between table_a and table_b
SELECT count(DISTINCT(id)) FROM table_a WHERE id IN ( SELECT DISTINCT(id) FROM table_b )
- Power Tools for Google Spreadsheet [Last visited: 2016-02-26]
- Menu: Data -> Remove duplicates
deduplicate
- EXCEL: Data Tools -> Remove Duplicates: Filter for unique values or remove duplicate values [Last visited: 2015-10-20]
- MySQL: select deduplicated records
- MySQL DISTINCT - Eliminate Duplicate Rows in a Result Set. Using GROUP_CONCAT to handle the multiple columns[13]
- SQL UNIQUE Constraint "Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table." Quoted from w3schools webpage.
- "UNION removes duplicates, whereas UNION ALL does not." source: sql - What is the difference between UNION and UNION ALL? - Stack Overflow
- MySQL: delete duplicated records
- GNU Coreutils: sort invocation ex: sort -us -o output_unique.file input.file to remove duplicate lines in a large text file (GB)[14] OS: Linux , cygwin of Win
- Google spreadsheet add-on: Remove Duplicates for Google Sheets help
counting number of duplicate occurrence
- MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name id
- SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ;
- Excel: microsoft excel - How to count number of repeat occurrences - Super User un-tested
other
- symbol e.g. data-mining or data_mining
outlier
- Median
- Range Checks
More on: Outlier - Wikipedia
unique number of data values
If the data values were generated by different users, the unique number of data values should be larger than ____
Data handling
Remove first, last or certain characters from text
- Excel: using RIGHT[15] + LEN[16] functions [17]
- Excel: if the length of text was fixed after removed, you may try to use REPLACE[18] + LEN functions (demo)
Remove leading and trailing spaces from text
- PHP: trim function
- Excel: Excel formula: Remove leading and trailing spaces from text | Exceljet
- MySQL: Using TRIM() function & LENGTH() function
UPDATE `table` SET `column` = TRIM( `column` ) WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
other text look like whitespace
- IDEOGRAPHIC SPACE (全形空白, U+3000)[19]:
- diaplay:
<?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
- replace with space:
<?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
- diaplay:
- ASCII Vertical Tab \v
- ASCII Horizontal Tab (TAB) \t
- ASCII Backspace \b
- Data Cleaning | DATA + DESIGN 資料+設計
- Data modeling: Data type
references
- ↑ MySQL COALESCE() function - w3resource
- ↑ How to check if field is null or empty mysql? - Stack Overflow
- ↑ MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
- ↑ SQL TRIM 函數 - 1Keydata SQL 語法教學
- ↑ Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage
- ↑ Check if number is an Integer
- ↑ Excel specifications and limits
- ↑ A2
- ↑ regex - Mysql REGEXP with . and numbers only - Stack Overflow
- ↑ GOOGLE 試算表: 數字轉成文字
- ↑ Check if a valid date?
- ↑ Finding duplicate values in MySQL - Stack Overflow
- ↑ sql - MySQL SELECT DISTINCT multiple columns - Stack Overflow
- ↑ linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange
- ↑ RIGHT、RIGHTB 函數 - Excel - Office.com
- ↑ LEN、LENB 函數 - Excel - Office.com
- ↑ How to remove first, last or certain characters from text in Excel?
- ↑ REPLACE、REPLACEB 函數 - Excel - Office.com
- ↑ Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊