Data cleaning: Difference between revisions
m (→Counting) |
|||
(34 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== | == Check list == | ||
* row count | |||
* duplicate data | |||
== Check if field value was not fulfilled == | |||
=== By purpose === | === By purpose === | ||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
Line 69: | Line 74: | ||
=== by datatype === | === by datatype === | ||
==== VARCHAR and NOT allows NULL value ==== | |||
Using NULLIF() function<ref>[https://www.w3schools.com/sql/func_mysql_nullif.asp MySQL NULLIF() Function]</ref> | |||
SQL query: | |||
<pre> | |||
SELECT NULLIF(TRIM(`my_column`), "") | |||
</pre> | |||
Example result: | |||
<pre> | |||
SELECT NULLIF(null, ""); | |||
-- return NULL | |||
SELECT NULLIF("", ""); | |||
-- return NULL | |||
SELECT NULLIF(TRIM(" "), ""); | |||
-- return NULL | |||
SELECT NULLIF(TRIM("not empty string "), ""); | |||
-- return "not empty string" | |||
</pre> | |||
==== VARCHAR and allows NULL value ==== | ==== VARCHAR and allows NULL value ==== | ||
<table border="1" style="width: 100%"> | <table border="1" style="width: 100%"> | ||
Line 234: | Line 265: | ||
* MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | * MySQL: {{kbd | key =<nowiki>SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;</nowiki>}} | ||
== | == Data Validation == | ||
Validate the format of field value. Related page: [[Regular expression]] | |||
=== Email contains @ symbol === | === Email contains @ symbol === | ||
Line 258: | Line 289: | ||
* If the data was imported from Excel, you should notice the 15 digit precision issue. | * If the data was imported from Excel, you should notice the 15 digit precision issue. | ||
=== | === Check if the column value is numeric === | ||
Possibile values | |||
<pre> | |||
test data: | |||
3.141592654 | |||
1.36184E+14 | |||
123,456.789 | |||
20740199601 | |||
346183773390240 | |||
="5" | |||
</pre> | |||
MySQL: | MySQL: | ||
* Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref> | |||
* Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | * Check if a value is integer e.g. 1234567 | ||
** Find the records which the value of `my_column` is numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9]+$'}}<ref>[http://stackoverflow.com/questions/14343767/mysql-regexp-with-and-numbers-only regex - Mysql REGEXP with . and numbers only - Stack Overflow]</ref><ref>[https://stackoverflow.com/questions/75704/how-do-i-check-to-see-if-a-value-is-an-integer-in-mysql How do I check to see if a value is an integer in MySQL? - Stack Overflow]</ref> | |||
* Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414 | |||
** {{code | code = SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'}}<ref>[https://community.denodo.com/answers/question/details?questionId=9060g000000XelhAAC&title=How+to+identify+if+values+in+a+column+is+numeric+%28+Function+similar+to+Isnumeric+is+SQL%29 How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)]</ref> | |||
* Check if a value is NOT integer | |||
** Find the records which the value of `my_column` is '''NOT''' numeric values entirely {{code | code = SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'}} | |||
If the digit of number is known, the SQL syntax could be more specific | If the digit of number is known, the SQL syntax could be more specific | ||
* The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | * The {{kbd | key=tax_id}} column is 8 digits only. Find the well-formatted {{kbd | key=tax_id}} records by using {{code | code = SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'}} | ||
PHP: | |||
* [http://php.net/manual/en/function.is-numeric.php is_numeric] function | |||
* [https://www.php.net/manual/en/function.is-int.php is_int] function | |||
Excel & [https://www.google.com/sheets/about/ Google Sheets]: | Excel & [https://www.google.com/sheets/about/ Google Sheets]: | ||
Line 281: | Line 332: | ||
** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ** Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g. {{code | code = <nowiki>1.23E+16</nowiki>}} (3) Decimal numbers e.g. {{code | code = <nowiki>3.141592654</nowiki>}} (4) Negative numbers | ||
=== Time data: Validate the data format === | |||
[[Validate the datetime value]] | |||
=== Time data === | |||
=== | === Time data: Data was generated in N years === | ||
Define the abnormal values of the time data ([http://en.wikipedia.org/wiki/Time_series time series]) | |||
* were generated | * Verfiy the data were generated in N years. Possible abnormal values: {{code | code = 0001-01 00:00:00}} occurred in MySQL {{code | code = datetime}} type. | ||
* Verfiy the data were not newer than today | |||
* Verfiy the year of data were not {{kbd | key=1900}} if the data were imported from Microsoft Excel file. Datevalue<ref>[https://support.microsoft.com/zh-tw/office/datevalue-%E5%87%BD%E6%95%B8-df8b07d4-7761-4a93-bc33-b7471bbff252 DATEVALUE 函數 - Office 支援]</ref> was started from the year {{kbd | key=1900}} e.g. | |||
** {{code | code = 1900/1/0}} (converted time formatted value from 0), | |||
** {{code | code = 1900/1/1}} (converted time formatted value from 1) | |||
* {{code | code = 1900/1/0}} (converted time formatted value from 0), {{code | code = 1900/1/1}} (converted time formatted value from 1) | * Verfiy the value of data were not {{kbd | key=0000-00-00 00:00:00}} | ||
* | * Verfiy the diversity of data values e.g. [https://en.wikipedia.org/wiki/Variance Variance] | ||
Find the normal values: | Find the normal values: | ||
Line 317: | Line 360: | ||
* PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | * PHP: [http://stackoverflow.com/questions/19271381/correctly-determine-if-date-string-is-a-valid-date-in-that-format php - Correctly determine if date string is a valid date in that format - Stack Overflow] | ||
=== | === Time data: Human birth year (age) data === | ||
Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | Based on the existing record, the longest-living person who lived to 122<ref>[https://en.wikipedia.org/wiki/Maximum_life_span Maximum life span - Wikipedia]</ref>. | ||
Line 333: | Line 376: | ||
* [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)] | * [http://www.fileformat.info/info/unicode/char/a0/index.htm Unicode Character 'NO-BREAK SPACE' (U+00A0)] | ||
== Find and remove duplicates == | |||
[[Find and remove duplicates]] in Excel/BASH/MySQL/PHP | |||
Find | |||
== Counting == | == Counting == | ||
[[Count occurrences of a word in string]] | * [[Count occurrences of a word in string]] | ||
* Count number of unique values | |||
** Excel: [https://www.excel-easy.com/examples/count-unique-values.html Count Unique Values in Excel - Easy Excel Tutorial] | |||
** Google sheet: [https://support.google.com/docs/answer/3093405?hl=zh-Hant COUNTUNIQUE - 文件編輯器說明] & [https://infoinspired.com/google-docs/spreadsheet/unique-function-in-horizontal-data-range-in-google-sheets/ How to Use UNIQUE Function in Horizontal Data Range in Google Sheets] | |||
== Outlier / Anomaly detection == | == Outlier / Anomaly detection == | ||
Anomaly detection | [[Anomaly detection]] | ||
== unique number of data values == | == unique number of data values == | ||
Line 489: | Line 415: | ||
# ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}} | # ASCII Horizontal Tab (TAB) {{kbd | key=<nowiki>\t</nowiki>}} | ||
# ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}} | # ASCII Backspace {{kbd | key=<nowiki>\b</nowiki>}} | ||
# [ | # [[Remove non breaking space]] | ||
== Remove control character == | == Remove control character == | ||
Line 533: | Line 427: | ||
=== Fix garbled message text === | === Fix garbled message text === | ||
[[Fix garbled message text]] | [[Fix garbled message text]] | ||
== Tools == | |||
* [https://github.com/IvanMathy/Boop IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress.] ([https://apps.apple.com/us/app/boop/id1518425043?mt=12 Boop on the Mac App Store]) " ... to paste some plain text and run some basic text operations on it. " | |||
== Further reading == | == Further reading == | ||
Line 543: | Line 440: | ||
* [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}} | * [https://medium.com/bryanyang0528/%E8%B3%87%E6%96%99%E5%93%81%E8%B3%AA%E5%88%9D%E6%8E%A2-data-quality-b765eb56a7c2?fbclid=IwAR3NBb2BtFm9O3FeY7JgQ5HLE5VG5nFe3m5Zx8zNW9XkvOUPlqV9hXmaoXI 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium] {{access | date = 2018-01-13}} | ||
== | == References == | ||
<references/> | <references/> | ||
Line 550: | Line 447: | ||
[[Category:Data Science]] | [[Category:Data Science]] | ||
[[Category:MySQL]] | [[Category:MySQL]] | ||
[[Category: | [[Category:String manipulation]] |
Revision as of 16:28, 14 September 2022
Check list
- row count
- duplicate data
Check if field value was not fulfilled
By purpose
Purpose | Method (MySQL query syntax) | Value1: Fulfilled value what I want |
Value2: Fulfilled value NOT I want |
Value3: 0 |
Value4: NULL value |
Value5: Empty or white-spaces characters |
values were not fulfilled or empty (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 |
V | V | |||
values were not fulfilled or empty (contains 0) |
V | V | V | |||
values were fulfilled and non-empty (not contains 0) |
V | V | ||||
values were fulfilled and non-empty (contains 0) |
WHERE LENGTH(TRIM( column_name )) > 0 | V | V | V | ||
values (1) were not fulfilled or empty values (2) NOT I want (not contains 0) |
WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0 OR column_name LIKE 'values NOT I want' |
V | V | V | ||
values (1) were not fulfilled or empty values (2) NOT I want (contains 0) |
V | V | V | V |
by datatype
VARCHAR and NOT allows NULL value
Using NULLIF() function[1]
SQL query:
SELECT NULLIF(TRIM(`my_column`), "")
Example result:
SELECT NULLIF(null, ""); -- return NULL SELECT NULLIF("", ""); -- return NULL SELECT NULLIF(TRIM(" "), ""); -- return NULL SELECT NULLIF(TRIM("not empty string "), ""); -- return "not empty string"
VARCHAR and allows NULL value
data type of column | possible column values | method1: find not fulfilled or empty values |
method2: find fulfilled and non-empty values |
method3: find NULL values |
method4: find not NULL values |
VARCHAR and allows NULL | fulfilled value ex:123 | V | V | ||
NULL type:null | V | V | |||
'NULL' type:string | |||||
0 | V | V | |||
EMPTY ex: '' or space(s) ' ' | V | V |
symbol V: means the column value will be able to find by means of the method
- method1:
- method2:
- SELECT * FROM `my_table` WHERE column_name > ''
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) > 0
- SELECT * FROM `my_table` WHERE LENGTH(TRIM( column_name )) != 0
- method3: SELECT * FROM `my_table` WHERE column_name IS NULL
- method4: SELECT * FROM `my_table` WHERE column_name IS NOT NULL
VARCHAR or numeric
data type of column | possible column values | method5: find values within the range |
method6: find values out of the range, empty & NULL values |
VARCHAR or numeric | values within the range ex: min ≤ value ≤ max | V | |
values out of range | V | ||
NULL | V | ||
EMPTY ex: '' or space ' ' | V |
- method5: SELECT * FROM `my_table` WHERE column_name BETWEEN min AND max
- method6: SELECT * FROM `my_table` WHERE ( (COALESCE(column_name, '') = '') OR (column_name NOT BETWEEN min AND max) )
datetime and allows NULL value
possible column values
- 2024-03-28
- 00-00-00 00:00:00
- NULL
is null
Fill 0 if the value is NA or NULL
- MySQL SQL syntax: COALESCE(): SELECT COALESCE(column_name, 0) or SELECT COALESCE(column_name, 'other_filled_value')
- (1) Using COALESCE() function to replace the NULL value with 0.
- (2) The case: 0/0 = null should be handled.
- MySQL SQL syntax: combined IF() & ISNULL():SELECT IF(ISNULL((column_name), 0, column_name) or SELECT IF(ISNULL((column_name), 'other_filled_value', column_name)
- python: pandas.DataFrame.fillna — pandas 0.16.0 documentation "Fill NA/NaN values using the specified method"
Find whether a variable is NULL. online demo
- PHP is_null to find type:null null NOT type:string 'null'
- Google spreadsheet / Excel:
- ISERR(value) " value - The value to be verified as an error type other than #N/A." ex: #NULL!
- If the cell value is exactly NULL not #NULL!, You may use COUNTIF(value, "NULL") or EXACT(value, "NULL")
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NULL;[4]
- R: is.null(): R: The Null Object
- Excel[5]
Find whether a variable is NOT NULL
- MySQL SQL syntax: SELECT * FROM table WHERE column IS NOT NULL;
Find whether a variable is NOT #N/A
- Excel: =NOT(ISERROR(cell_value))
javascript
check if field value was not fulfilled: NULL, empty value
NOT include those data which its field value fulfilled with default value automatically (demo on sqlfiddle)
- quick solution: find records with NULL value OR empty, space value
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL OR LENGTH(TRIM( column_name )) = 0;
- find records with NULL value: (note: not #NULL!)
- MySQL solution: SELECT * FROM table_name WHERE column_name IS NULL;
- EXCEL: =EXACT(A2, "NULL")
- find records with empty value: (not contains NULL value)
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) = 0; SQL query SELECT * FROM table_name WHERE column_name IS NOT NULL includes empty value
- MS SQL Server: SELECT * FROM table_name WHERE LEN( LTRIM(RTRIM(column_name)) ) = 0; [6]
- Excel starting date: 1900/1/0 (converted time formatted value from 0), 1900/1/1 (converted time formatted value from 1), 1900/1/2 ...
- solution: step1: Replace the year > 100 from this year with empty value at EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)<1914, "", A2)) (this formula also handle empty value and non well-formatted column value ex: 0000-12-31 ) ; step2: change the format of cell to time format
- trivial approach : EXCEL: =IF(ISERR(YEAR(A2)), "", IF(YEAR(A2)-YEAR(NOW())>100, "", A2)) this formula could not handle empty value because it return 0. If I change the format of cell to time format, 0 will become 1900/1/0.
- Using PHP
- empty() function to find 0, null, false, empty string, empty array values.
if(empty($var) && $var !== 0 && $var !== "0"){ .. }
to find null, false, empty string, empty array values BUT not 0.
- check if field value was NULL & not equal to some value
check if field value was fulfilled
length of string > 0
column value is not null or 0
- Excel: COUNTIFS(criteria_range1, "<>NULL", criteria_range1, "<>0")[7]
find if number or cell value is positive integer
- EXCEL: =IFERROR(IF(AND(INT( value )= value, value>0), TRUE, FALSE), FALSE)[8] online demo
check numeric range
- MySQL: SELECT * FROM table_name WHERE column_name BETWEEN min_number AND max_number; the value >= min_number AND value <= max_number ( min_number ≤ value ≤ max_number )
find NOT empty records means records without NULL or empty value:
- MySQL: SELECT * FROM table_name WHERE LENGTH(TRIM( column_name )) != 0;
- MySQL: SELECT * FROM table_name WHERE column_name != '' AND column_name IS NOT NULL;
Data Validation
Validate the format of field value. Related page: Regular expression
Email contains @ symbol
- EXCEL: =IF(ISERR(FIND("@", A2, 1)), FALSE, TRUE) only check the field if contains @ symbol or not
- result: (1) normal condition: return TRUE; (2) exceptional condition: return FALSE if @ symbol was not found
- EXCEL: =FIND("@", A2, 2) only check the field if contains @ symbol or not
- syntax: FIND(find_text, with_text, [start_num]) the start_num is 2 because the position of @ symbol should be larger than 1 (position of first char is 1)
- result: (1) normal condition: return the number larger than 1; (2) exceptional condition: return #VALUE! if @ symbol was not found
- PHP: PHP FILTER_VALIDATE_EMAIL Filter
- "Returns the filtered data, or FALSE if the filter fails." quoted from PHP.net
Number precision in Excel
Number precision: 15 digits (Excel中最多的有效位數為15位)[9][10]
raw data: 1234567890123456 ->
- (numeric format 數值格式) 1234567890123450.00 losing precision
- (general format 通用格式) 1.23457E+15 losing precision
- (text format 文字格式) 1234567890123456
large numbers
- If the data was imported from Excel, you should notice the 15 digit precision issue.
Check if the column value is numeric
Possibile values
test data: 3.141592654 1.36184E+14 123,456.789 20740199601 346183773390240 ="5"
MySQL:
- Check if a value is integer e.g. 1234567
- Check if a value is integer which may contains comma and dot symbols e.g. 1,234.567 or 3.414
SELECT * FROM `my_table` WHERE `my_column` REGEXP '^[0-9,\.]+$'
[13]
- Check if a value is NOT integer
- Find the records which the value of `my_column` is NOT numeric values entirely
SELECT * FROM `my_table` WHERE `my_column` NOT REGEXP '^[0-9]+$'
- Find the records which the value of `my_column` is NOT numeric values entirely
If the digit of number is known, the SQL syntax could be more specific
- The tax_id column is 8 digits only. Find the well-formatted tax_id records by using
SELECT * FROM `tax_id` WHERE `tax_id` REGEXP '^[0-9]{8}$'
PHP:
- is_numeric function
- is_int function
Excel & Google Sheets:
- Using ISNUMBER Function:
=INT(ISNUMBER(A1))
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
1.36184E+14
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers - Return 0 if the cell value is (1) Text (2) Numbers that are stored as text e.g.
="5"
- Return 1 if the cell value is (1) Numbers (2) Numbers in scientific (exponential) notation e.g.
- Google Sheets only: Using REGEXMATCH, TRIM & CONCAT[14] functions:
=IF(REGEXMATCH(CONCAT("", TRIM(A1)), "^\d+$"), 1, 0)
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
="5"
- Return 0 if the cell value is (1) Text (2) Numbers in scientific (exponential) notation e.g.
1.23E+16
(3) Decimal numbers e.g.3.141592654
(4) Negative numbers
- Return 1 if the cell value is (1) Numbers (2) Numbers that are stored as text e.g.
Time data: Validate the data format
Time data: Data was generated in N years
Define the abnormal values of the time data (time series)
- Verfiy the data were generated in N years. Possible abnormal values:
0001-01 00:00:00
occurred in MySQLdatetime
type. - Verfiy the data were not newer than today
- Verfiy the year of data were not 1900 if the data were imported from Microsoft Excel file. Datevalue[15] was started from the year 1900 e.g.
1900/1/0
(converted time formatted value from 0),1900/1/1
(converted time formatted value from 1)
- Verfiy the value of data were not 0000-00-00 00:00:00
- Verfiy the diversity of data values e.g. Variance
Find the normal values:
- MySQL: Assume the data was generated in recent 10 years & not newer than today
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` < CURDATE() + 1);
- NOT
`my_time_column` < CURDATE()
。 ex:CURDATE()
is 2024-03-28. Which is the same with 2024-03-28 00:00:00
- NOT
SELECT * FROM `my_table` WHERE ( YEAR( CURDATE() ) - YEAR( `my_time_column`) <= 10 ) AND ( `my_time_column` < CURDATE() + 1);
- MySQL: Assume the data was generated in recent 10 years & not newer than current timestamp. More precision to second compared with the above approach.
SELECT * FROM `my_table` WHERE ( `my_time_column` >= CURDATE() - INTERVAL 10 YEAR ) AND ( `my_time_column` <= CURRENT_TIMESTAMP);
- You need to check the
SELECT CURRENT_TIMESTAMP);
if correct or not before you delete the abnormal data (timezone issue)
- You need to check the
Check if the date valid
- Excel: WEEKDAY function [16]
- legal results: 1 (Sunday) ~ 7 (Saturday)
- illegal result: #VALUE!
- PHP: php - Correctly determine if date string is a valid date in that format - Stack Overflow
Time data: Human birth year (age) data
Based on the existing record, the longest-living person who lived to 122[17].
MySQL query is as follows[18] where the column `birthday` is date type.
WHERE TIMESTAMPDIFF(YEAR, `birthday`, CURDATE()) <= 122
Using UNIX_TIMESTAMP() function to check the abnormality of birthday data is not appropriate. Because the birthdays which are earlier 1970-01-01 00:00:00 UTC will all become zero.
String contains special characters
Find and remove duplicates
Find and remove duplicates in Excel/BASH/MySQL/PHP
Counting
- Count occurrences of a word in string
- Count number of unique values
Outlier / Anomaly detection
unique number of data values
If the data values were generated by different users, the unique number of data values should be larger than ____
Data handling
Remove first, last or certain characters from text
- Excel: using RIGHT[19] + LEN[20] functions [21]
- Excel: if the length of text was fixed after removed, you may try to use REPLACE[22] + LEN functions (demo)
Remove leading and trailing spaces from text
- PHP: trim function
- Excel: Excel formula: Remove leading and trailing spaces from text | Exceljet
- MySQL: Using TRIM() function & LENGTH() function
UPDATE `table` SET `column` = TRIM( `column` ) WHERE LENGTH(TRIM( `column` )) != LENGTH( `column` );
Remove other string look like whitespace
- IDEOGRAPHIC SPACE (全形空白、全型空白, U+3000)[23]:
- diaplay:
<?php $string = "111" . json_decode('"\u3000"') . "222"; echo $string;?>
- replace with space:
<?php echo str_replace(json_decode('"\u3000"'), " ", $string);?>
- diaplay:
- ASCII Vertical Tab \v
- ASCII Horizontal Tab (TAB) \t
- ASCII Backspace \b
- Remove non breaking space
Remove control character
Control character - Wikipedia Using PHP to clean control character:
$input = 'some string may contains control characters'; $replacement = ''; $result = preg_replace('/[\x00-\x1F]/', $replacement, $input);
Fix garbled message text
Tools
- IvanMathy/Boop: A scriptable scratchpad for developers. In slow yet steady progress. (Boop on the Mac App Store) " ... to paste some plain text and run some basic text operations on it. "
Further reading
- Data Cleaning | DATA + DESIGN 資料+設計
- Data modeling: Data type
- What is a codebook?
- 台灣R軟體Club 「由於真實社會的資料千奇百怪,遺失值、空字串、空格、甚至是假的 "NA" (NA兩邊多了雙引號)、假的"NULL" (NULL兩邊多了雙引號) 。。。,其中一個顯著的差異是,社會實務應用的 R 程式中,通常會多出一堆檢查機制。 ... ...」[Last visited: 2017-12-18]
- 判斷 Excel 欄位值是否錯誤 除了常見的 Excel 錯誤,如何判斷來自其他資料來源的可能錯誤欄位值
- Web user behavior
- 資料品質初探(Data Quality) – 亂點技能的跨界人生 – Medium [Last visited: 2018-01-13]
References
- ↑ MySQL NULLIF() Function
- ↑ MySQL COALESCE() function - w3resource
- ↑ How to check if field is null or empty mysql? - Stack Overflow
- ↑ MySQL :: MySQL 5.0 Reference Manual :: 3.3.4.6 Working with NULL Values
- ↑ 如何判斷 Excel 儲存格的欄位值是 NULL
- ↑ SQL TRIM 函數 - 1Keydata SQL 語法教學
- ↑ Excel COUNTIFS and COUNTIF with multiple criteria – examples of usage
- ↑ Check if number is an Integer
- ↑ Excel specifications and limits
- ↑ A2
- ↑ regex - Mysql REGEXP with . and numbers only - Stack Overflow
- ↑ How do I check to see if a value is an integer in MySQL? - Stack Overflow
- ↑ How to identify if values in a column is numeric ( Function similar to Isnumeric is SQL)
- ↑ GOOGLE 試算表: 數字轉成文字
- ↑ DATEVALUE 函數 - Office 支援
- ↑ Check if a valid date?
- ↑ Maximum life span - Wikipedia
- ↑ sql - Calculate Age in MySQL (InnoDb) - Stack Overflow
- ↑ RIGHT、RIGHTB 函數 - Excel - Office.com
- ↑ LEN、LENB 函數 - Excel - Office.com
- ↑ How to remove first, last or certain characters from text in Excel?
- ↑ REPLACE、REPLACEB 函數 - Excel - Office.com
- ↑ Re: 請益 mysql空百=? - 看板 PHP - 批踢踢實業坊