Byte order mark: Difference between revisions
m (→MySQL way) |
|||
| (14 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
[https://en.wikipedia.org/wiki/Byte_order_mark Byte order mark] (BOM, [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號]) | [https://en.wikipedia.org/wiki/Byte_order_mark Byte order mark] (BOM, [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號], 部分編輯器稱為「簽名」) | ||
== How to see Byte order mark == | == How to see Byte order mark == | ||
| Line 20: | Line 20: | ||
ADD UNIQUE KEY `id` (`id`) USING BTREE; | ADD UNIQUE KEY `id` (`id`) USING BTREE; | ||
SELECT HEX(id), id, notes FROM articles; | SELECT HEX(`id`), `id`, `notes` FROM `articles`; | ||
</pre> | </pre> | ||
| Line 33: | Line 33: | ||
<td>31323334353637383930</td> | <td>31323334353637383930</td> | ||
<td>1234567890</td> | <td>1234567890</td> | ||
<td> | <td>UTF-8 without BOM</td> | ||
</tr> | </tr> | ||
<tr> | <tr> | ||
| Line 40: | Line 40: | ||
<td>EFBBBF31323334353637383930</td> | <td>EFBBBF31323334353637383930</td> | ||
<td>1234567890</td> | <td>1234567890</td> | ||
<td>BOM</td> | <td>UTF-8 with BOM</td> | ||
</tr> | </tr> | ||
<tr> | <tr> | ||
| Line 46: | Line 46: | ||
</table> | </table> | ||
If the column {{kbd | key=`id`}} was only allowed integer in column value, you can use the following sql query to find the records contains BOM: | |||
<pre> | |||
SELECT * | |||
FROM `articles` | |||
WHERE HEX(`id`) REGEXP '[^0-9]+' | |||
</pre> | |||
=== PHP way === | |||
PHP code<ref>[https://stackoverflow.com/questions/14674834/php-convert-string-to-hex-and-hex-to-string PHP convert string to hex and hex to string - Stack Overflow]</ref><ref>[https://www.w3schools.com/php/func_misc_unpack.asp PHP unpack() Function]</ref><ref>[https://stackoverflow.com/questions/10290849/how-to-remove-multiple-utf-8-bom-sequences-before-doctype php - How to remove multiple UTF-8 BOM sequences before "<!DOCTYPE>"? - Stack Overflow]</ref>: | |||
<pre> | |||
$string = "1234567890"; | |||
echo $string . " NOT contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL; | |||
$string = "\xEF\xBB\xBF" . "1234567890"; | |||
echo $string . " contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL; | |||
function str2hex($string) { | |||
$hexstr = unpack('H*', $string); | |||
return array_shift($hexstr); | |||
} | |||
< | </pre> | ||
< | |||
< | Result: | ||
< | <pre> | ||
</ | 1234567890 NOT contains BOM --> after str2hex: 31323334353637383930 | ||
1234567890 contains BOM --> after str2hex: efbbbf31323334353637383930 | |||
</pre> | |||
=== Excel / Google sheet way === | |||
Using the [https://support.office.com/en-us/article/code-function-c32b692b-2ed0-4a04-bdd9-75640144b928 CODE function] to check the "numeric code for the first character in a text string". If the cell {{kbd | key=A1}} contains BOM, | |||
* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=63}} on Excel 2016 of {{Win}}<ref>[http://www.asciitable.com/ Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion]</ref> | |||
* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=95}} on Excel 2016 of {{Mac}} | |||
* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=65279}} or other numeric value e.g. {{kbd | key=28201}} on Google sheet | |||
=== BASH command === | |||
Check if a UTF-8 encoded file contains a BOM. The first line result of {{kbd | key=hexdump}} mentioning ef bb bf indicates it contains a BOM | |||
<pre> | |||
% hexdump -n 3 -C filename | |||
00000000 ef bb bf |...| | |||
00000003 | |||
</pre> | |||
=== File command === | |||
Using [https://en.wikipedia.org/wiki/File_(command) file (command)]: {{kbd | key=<nowiki>file filename.txt</nowiki>}} on {{Linux}}, {{Mac}}<ref>[https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man1/file.1.html file(1) Mac OS X Manual Page]</ref> & Cygwin on {{Win}}. See details on [[Text file encoding]] | |||
=== Hex editor === | |||
Using Hext editor to open the text file. [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號 - 維基百科,自由的百科全書] | |||
== How to remove Byte order mark == | |||
PHP | |||
* [https://stackoverflow.com/questions/10290849/how-to-remove-multiple-utf-8-bom-sequences php - How to remove multiple UTF-8 BOM sequences - Stack Overflow] | |||
* [https://stackoverflow.com/questions/22600235/remove-or-match-a-unicode-zero-width-space-php replace - Remove or match a Unicode Zero Width Space PHP - Stack Overflow] | |||
< | <pre> | ||
// 原始資料:程式碼編輯器會顯示為 ZWNBSP,但是一般編輯器試看不到 | |||
$text = "\xef\xbb\xbf" . "單位名稱"; | |||
</ | // 移除 BOM | ||
$text = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $text); | |||
</pre> | |||
== References == | == References == | ||
| Line 72: | Line 116: | ||
<references /> | <references /> | ||
[[Category:Programming]] [[Category:Data Science]] | [[Category:Programming]] [[Category:Data Science]] [[Category:String manipulation]] | ||
Latest revision as of 18:15, 17 September 2023
Byte order mark (BOM, 位元組順序記號, 部分編輯器稱為「簽名」)
How to see Byte order mark[edit]
MySQL way[edit]
Using MySQL HEX() function "returns a string representation of a hexadecimal value of a decimal or string value specified as an argument."
Run sql on sqlfiddle.com or Download the Sql file directly.
CREATE TABLE `articles` (
`id` varchar(50) NOT NULL,
`notes` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `articles` (`id`, `notes`) VALUES
('1234567890', 'no BOM'),
('1234567890', 'BOM');
ALTER TABLE `articles`
ADD UNIQUE KEY `id` (`id`) USING BTREE;
SELECT HEX(`id`), `id`, `notes` FROM `articles`;
| HEX(id) | id | notes |
|---|---|---|
| 31323334353637383930 | 1234567890 | UTF-8 without BOM |
| EFBBBF31323334353637383930 | 1234567890 | UTF-8 with BOM |
If the column `id` was only allowed integer in column value, you can use the following sql query to find the records contains BOM:
SELECT * FROM `articles` WHERE HEX(`id`) REGEXP '[^0-9]+'
PHP way[edit]
$string = "1234567890";
echo $string . " NOT contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL;
$string = "\xEF\xBB\xBF" . "1234567890";
echo $string . " contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL;
function str2hex($string) {
$hexstr = unpack('H*', $string);
return array_shift($hexstr);
}
Result:
1234567890 NOT contains BOM --> after str2hex: 31323334353637383930 1234567890 contains BOM --> after str2hex: efbbbf31323334353637383930
Excel / Google sheet way[edit]
Using the CODE function to check the "numeric code for the first character in a text string". If the cell A1 contains BOM,
- =CODE(A1) returns 63 on Excel 2016 of Win
[4] - =CODE(A1) returns 95 on Excel 2016 of macOS
- =CODE(A1) returns 65279 or other numeric value e.g. 28201 on Google sheet
BASH command[edit]
Check if a UTF-8 encoded file contains a BOM. The first line result of hexdump mentioning ef bb bf indicates it contains a BOM
% hexdump -n 3 -C filename 00000000 ef bb bf |...| 00000003
File command[edit]
Using file (command): file filename.txt on Linux
, macOS
[5] & Cygwin on Win
. See details on Text file encoding
Hex editor[edit]
Using Hext editor to open the text file. 位元組順序記號 - 維基百科,自由的百科全書
How to remove Byte order mark[edit]
PHP
- php - How to remove multiple UTF-8 BOM sequences - Stack Overflow
- replace - Remove or match a Unicode Zero Width Space PHP - Stack Overflow
// 原始資料:程式碼編輯器會顯示為 ZWNBSP,但是一般編輯器試看不到
$text = "\xef\xbb\xbf" . "單位名稱";
// 移除 BOM
$text = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $text);