Data cleaning

From LemonWiki共筆
Jump to: navigation, search

check if field value was not fulfilled[edit]

By purpose[edit]

Purpose Method (MySQL query syntax) Value1:
Fulfilled value what I want
Value2:
Fulfilled value NOT I want
Value3:
0
Value4:
NULL value
Value5:
Empty or white-spaces characters
values were not fulfilled or empty
(not contains 0)
WHERE column_name IS NULL
OR LENGTH(TRIM( column_name )) = 0
V V
values were not fulfilled or empty
(contains 0)
V V V
values were fulfilled and non-empty
(not contains 0)
V V
values were fulfilled and non-empty
(contains 0)
WHERE LENGTH(TRIM( column_name )) > 0 V V V
values (1) were not fulfilled or empty values (2) NOT I want
(not contains 0)
WHERE column_name IS NULL
OR LENGTH(TRIM( column_name )) = 0
OR column_name LIKE 'values NOT I want'
V V V
values (1) were not fulfilled or empty values (2) NOT I want
(contains 0)
V V V V

by datatype[edit]

VARCHAR and allows NULL value[edit]

data type of column possible column values method1:
find not fulfilled or empty values
method2:
find fulfilled and non-empty values
method3:
find NULL values
method4:
find not NULL values
VARCHAR and allows NULL fulfilled value ex:123 V V
NULL type:null V V
'NULL' type:string
0 V V
EMPTY ex: '' or space(s) ' ' V V

symbol V: means the column value will be able to find by means of the method


  • method1:
    • SELECT * FROM `my_table` WHERE COALESCE(column_name, '') = ''[1]
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0
    • SELECT * FROM `my_table` WHERE column_name IS NULL OR column_name = ''[2]
  • method2:
    • SELECT * FROM `my_table` WHERE column_name > ''
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
    • SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
  • method3: SELECT * FROM `my_table` WHERE column_name IS NULL
  • method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL

VARCHAR or numeric[edit]

data type of column possible column values method5:
find values within the range
method6:
find values out of the range, empty & NULL values
VARCHAR or numeric values within the range ex: min ≤ value ≤ max V
values out of range V
NULL V
EMPTY ex: '' or space ' ' V
  • method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
  • method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )

datetime and allows NULL value[edit]

possible column values

  1. 2017-08-21
  2. 00-00-00 00:00:00
  3. NULL

is null[edit]

Fill 0 if the value is NA or NULL

  • MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
    • (1) Using COALESCE() function to replace the NULL value with 0.
    • (2) The case: 0/0 = null should be handled.
  • MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
  • python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"

Find whether a variable is NULL. online demo

  • PHP is_null to find type:null null NOT type:string 'null' Icon exclaim.gif
  • Google spreadsheet / Excel:
    • ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
    • If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[3]

Find whether a variable is NOT NULL

  • MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;

Find whether a variable is NOT #N/A

  • Excel: =NOT(ISERROR(cell_value))

javascript

check if field value was not fulfilled: NULL, empty value[edit]

Icon exclaim.gif NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)

  1. Good! quick solution: find records with NULL value OR empty, space value
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
  2. find records with NULL value: (note: not #NULL!)
    • MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
    • EXCEL: =EXACT(A2, "NULL")
  3. find records with empty value: (not contains NULL value)
    • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; Icon exclaim.gif SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
    • MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [4]
  4. Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
    • solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
    • trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) Icon exclaim.gif this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
  5. Using PHP
    • empty() function to find 0, null, false, empty string, empty array values.
    • if(empty($var) && $var !== 0 && $var !== "0"){ .. } to find null, false, empty string, empty array values BUT not 0.
  6. check if field value was NULL & not equal to some value

check if field value was fulfilled[edit]

length of string > 0

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0; demo[1]

column value is not null or 0

  • Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[5]

find if number or cell value is positive integer

  • EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[6] online demo

check numeric range

  • MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )

find NOT empty records means records without NULL or empty value:

  • MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
  • MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;

Verify the format of field value[edit]

related page: Regular expression

Email contains @ symbol[edit]

  • EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
    • result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
  • EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
    • syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
    • result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
  • PHP: PHP FILTER_VALIDATE_EMAIL Filter
    • "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net

Number precision in Excel[edit]

Number precision: 15 digits (Excel中最多的有效位數為15位)[7][8]

raw data: 1234567890123456 ->

  • (numeric format 數值格式) 1234567890123450.00 Icon exclaim.gif losing precision
  • (general format 通用格式) 1.23457E+15 Icon exclaim.gif losing precision
  • (text format 文字格式) 1234567890123456

large numbers

  • If the data was imported from Excel, you should notice the 15 digit precision issue.

Numeric only[edit]

List of the possible abnormal values:

  • All numeric value are odd or even if the data were generated by user naturally.

PHP:

MySQL:

  • find numeric values SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'[9]
  • find NOT numeric values SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'

Excel & Google Sheets:

  • Using ISNUMBER Function: =INT(ISNUMBER(A1))
    • Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g. 1.36184E+14 (3) Decimal numbers e.g. 3.141592654 (4) Negative numbers
    • Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g. ="5"
  • Google Sheets only: Using REGEXMATCH, TRIM & CONCAT[10] functions: =IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)
    • Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g. ="5"
    • Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. 1.23E+16 (3) Decimal numbers e.g. 3.141592654 (4) Negative numbers
test data:
3.141592654
1.36184E+14
20740199601
346183773390240
="5"

Time data[edit]

Definition of abnormal values of the time data (time series) if they

  • were generated 10 years before or
  • newer than today

List of the possible abnormal values:

  • 0001-01 00:00:00 occurred in MySQL datetime type
  • 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ... occurred in MS Excel
  • future data: the date after today


Find the normal values:

  • MySQL: Assume the data was generated in recent 10 years & not newer than today
    • SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
      • Icon exclaim.gif NOT `my_time_column` < CURDATE()。 ex: CURDATE() is 2017-08-21. Which is the same with 2017-08-21 00:00:00
    • SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
  • MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
    •  :SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
      • You need to check the SELECT CURRENT_TIMESTAMP); if correct or not before you delete the abnormal data (timezone issue)

Check if the date valid

Duplicate data[edit]

Find duplicate data[edit]

  • MySQL:

Find the duplicated data for one column[12]

-- Generate test data.
CREATE TABLE `table_name` (
  `id` int(11) NOT NULL,
  `content` varchar(5) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `table_name` (`id`, `content`) VALUES
(1, 'apple'),
(2, 'lemon'),
(3, 'apple');

ALTER TABLE `table_name`
  ADD PRIMARY KEY (`id`);

-- Find duplicated data
SELECT `content`, COUNT(*) count FROM `table_name` GROUP BY `content` HAVING count > 1;

SELECT tmp.* FROM 
( 
  SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` 
) tmp 
WHERE tmp.count >1;

Using CONCAT for multiple columns ex: column_1, column_2

SELECT tmp.key FROM
(
	SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
	FROM `table_name`
	GROUP BY CONCAT(  `column_1`, `column_2`  )
) tmp
WHERE tmp.count >=2

For counting purpose: find the count of repeated id (type: int) between table_a and table_b

SELECT count(DISTINCT(id)) FROM table_a WHERE id IN
(
   SELECT DISTINCT(id) FROM table_b
)
  • Power Tools for Google Spreadsheet [Last visited: 2016-02-26]
    • Menu: Data -> Remove duplicates

Deduplicate[edit]

Counting number of duplicate occurrence[edit]

Other[edit]

  • symbol e.g. data-mining or data_mining

outlier[edit]

  • Median
  • Range Checks

More on: Outlier - Wikipedia

unique number of data values[edit]

If the data values were generated by different users, the unique number of data values should be larger than ____

Data handling[edit]

Remove first, last or certain characters from text[edit]

  • Excel: using RIGHT[15] + LEN[16] functions [17]
  • Excel: if the length of text was fixed after removed, you may try to use REPLACE[18] + LEN functions (demo)

Remove leading and trailing spaces from text[edit]

UPDATE `table` 
SET `column` = TRIM( `column` ) 
WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );

other text look like whitespace[edit]

Whitespace character

  1. IDEOGRAPHIC SPACE (全形空白, U+3000)[19]:
    • diaplay:
      <?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
    • replace with space:
      <?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
  2. ASCII Vertical Tab \v
  3. ASCII Horizontal Tab (TAB) \t
  4. ASCII Backspace \b

related pages[edit]

references[edit]

  1. MySQL COALESCE() function - w3resource
  2. How to check if field is null or empty mysql? - Stack Overflow
  3. MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
  4. SQL TRIM 函數 - 1Keydata SQL 語法教學
  5. Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage
  6. Check if number is an Integer
  7. Excel specifications and limits
  8. A2
  9. regex - Mysql REGEXP with . and numbers only - Stack Overflow
  10. GOOGLE 試算表: 數字轉成文字
  11. Check if a valid date?
  12. Finding duplicate values in MySQL - Stack Overflow
  13. sql - MySQL SELECT DISTINCT multiple columns - Stack Overflow
  14. linux - How to remove duplicate lines in a large multi-GB textfile? - Unix & Linux Stack Exchange
  15. RIGHT、RIGHTB 函數 - Excel - Office.com
  16. LEN、LENB 函數 - Excel - Office.com
  17. How to remove first, last or certain characters from text in Excel?
  18. REPLACE、REPLACEB 函數 - Excel - Office.com
  19. Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊