Byte order mark: Difference between revisions

Latest revision as of 18:15, 17 September 2023

Byte order mark (BOM, 位元組順序記號, 部分編輯器稱為「簽名」)

How to see Byte order mark[edit]

MySQL way[edit]

Using MySQL HEX() function "returns a string representation of a hexadecimal value of a decimal or string value specified as an argument."

Run sql on sqlfiddle.com or Download the Sql file directly.

CREATE TABLE `articles` (
  `id` varchar(50) NOT NULL,
  `notes` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `articles` (`id`, `notes`) VALUES
('1234567890', 'no BOM'),
('1234567890', 'BOM');

ALTER TABLE `articles`
  ADD UNIQUE KEY `id` (`id`) USING BTREE;

SELECT HEX(`id`), `id`, `notes` FROM `articles`;

HEX(id)	id	notes
31323334353637383930	1234567890	UTF-8 without BOM
EFBBBF31323334353637383930	1234567890	UTF-8 with BOM

If the column `id` was only allowed integer in column value, you can use the following sql query to find the records contains BOM:

SELECT * 
FROM `articles`
WHERE HEX(`id`) REGEXP '[^0-9]+'

PHP way[edit]

PHP code^[1]^[2]^[3]:


$string = "1234567890";
echo $string . " NOT contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL;

$string = "\xEF\xBB\xBF" . "1234567890";
echo $string . " contains BOM  --> after str2hex: " . str2hex($string) . PHP_EOL;

function str2hex($string) {
	$hexstr = unpack('H*', $string);
	return array_shift($hexstr);
}

Result:

1234567890 NOT contains BOM --> after str2hex: 31323334353637383930
1234567890 contains BOM  --> after str2hex: efbbbf31323334353637383930

Excel / Google sheet way[edit]

Using the CODE function to check the "numeric code for the first character in a text string". If the cell A1 contains BOM,

=CODE(A1) returns 63 on Excel 2016 of Win ^[4]
=CODE(A1) returns 95 on Excel 2016 of macOS
=CODE(A1) returns 65279 or other numeric value e.g. 28201 on Google sheet

BASH command[edit]

Check if a UTF-8 encoded file contains a BOM. The first line result of hexdump mentioning ef bb bf indicates it contains a BOM

% hexdump -n 3 -C filename

00000000  ef bb bf                                          |...|
00000003

File command[edit]

Using file (command): file filename.txt on Linux , macOS ^[5] & Cygwin on Win . See details on Text file encoding

Hex editor[edit]

Using Hext editor to open the text file. 位元組順序記號 - 維基百科，自由的百科全書

How to remove Byte order mark[edit]

PHP

// 原始資料：程式碼編輯器會顯示為 ZWNBSP，但是一般編輯器試看不到
$text = "\xef\xbb\xbf" . "單位名稱";

// 移除 BOM
$text = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $text);

References[edit]

[1] PHP convert string to hex and hex to string - Stack Overflow

[2] PHP unpack() Function

[3] - How to remove multiple UTF-8 BOM sequences before "<!DOCTYPE>"? - Stack Overflow

[4] Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion

[5] (1) Mac OS X Manual Page

[1]

[2]

[3]

[4]

[5]

@@ Line 1: / Line 1: @@
-[https://en.wikipedia.org/wiki/Byte_order_mark Byte order mark] (BOM, [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號])
+[https://en.wikipedia.org/wiki/Byte_order_mark Byte order mark] (BOM, [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號], 部分編輯器稱為「簽名」)
 == How to see Byte order mark ==
@@ Line 20: / Line 20: @@
    ADD UNIQUE KEY `id` (`id`) USING BTREE;
-SELECT HEX(id), id, notes FROM articles;
+SELECT HEX(`id`), `id`, `notes` FROM `articles`;
 </pre>
@@ Line 33: / Line 33: @@
 <td>31323334353637383930</td>
 <td>1234567890</td>
-<td>no BOM</td>
+<td>UTF-8 without BOM</td>
 </tr>
 <tr>
@@ Line 40: / Line 40: @@
 <td>EFBBBF31323334353637383930</td>
 <td>1234567890</td>
-<td>BOM</td>
+<td>UTF-8 with BOM</td>
 </tr>
 <tr>
@@ Line 46: / Line 46: @@
 </table>
-=== File command way ===
-Using [https://en.wikipedia.org/wiki/File_(command) file (command)]: {{kbd | key=<nowiki>file filename.txt</nowiki>}} on {{Linux}}, {{Mac}}<ref>[https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man1/file.1.html file(1) Mac OS X Manual Page]</ref> & Cygwin on {{Win}}
+If the column {{kbd | key=`id`}} was only allowed integer in column value, you can use the following sql query to find the records contains BOM:
+<pre>
+SELECT *
+FROM `articles`
+WHERE HEX(`id`) REGEXP '[^0-9]+'
+</pre>
+=== PHP way ===
+PHP code<ref>[https://stackoverflow.com/questions/14674834/php-convert-string-to-hex-and-hex-to-string PHP convert string to hex and hex to string - Stack Overflow]</ref><ref>[https://www.w3schools.com/php/func_misc_unpack.asp PHP unpack() Function]</ref><ref>[https://stackoverflow.com/questions/10290849/how-to-remove-multiple-utf-8-bom-sequences-before-doctype php - How to remove multiple UTF-8 BOM sequences before "<!DOCTYPE>"? - Stack Overflow]</ref>:
+<pre>
+$string = "1234567890";
+echo $string . " NOT contains BOM --> after str2hex: " . str2hex($string) . PHP_EOL;
+$string = "\xEF\xBB\xBF" . "1234567890";
+echo $string . " contains BOM  --> after str2hex: " . str2hex($string) . PHP_EOL;
+function str2hex($string) {
+	$hexstr = unpack('H*', $string);
+	return array_shift($hexstr);
+}
-<table border="1" class="wikitable" >
+</pre>
-<tr>
-<th>File content</th>
+Result:
-<th>Example result returned by file command</th>
+<pre>
-</tr>
+1234567890 NOT contains BOM --> after str2hex: 31323334353637383930
+1234567890 contains BOM  --> after str2hex: efbbbf31323334353637383930
+</pre>
+=== Excel / Google sheet way ===
+Using the [https://support.office.com/en-us/article/code-function-c32b692b-2ed0-4a04-bdd9-75640144b928 CODE function] to check the "numeric code for the first character in a text string". If the cell {{kbd | key=A1}} contains BOM,
+* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=63}} on Excel 2016 of {{Win}}<ref>[http://www.asciitable.com/ Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion]</ref>
+* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=95}} on Excel 2016 of {{Mac}}
+* {{kbd | key=<nowiki>=CODE(A1)</nowiki>}} returns {{kbd | key=65279}} or other numeric value e.g. {{kbd | key=28201}} on Google sheet
+=== BASH command ===
+Check if a UTF-8 encoded file contains a BOM. The first line result of {{kbd | key=hexdump}} mentioning ef bb bf indicates it contains a BOM
+<pre>
+% hexdump -n 3 -C filename
+00000000  ef bb bf                                          |...|
+00000003
+</pre>
+=== File command ===
+Using [https://en.wikipedia.org/wiki/File_(command) file (command)]: {{kbd | key=<nowiki>file filename.txt</nowiki>}} on {{Linux}}, {{Mac}}<ref>[https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man1/file.1.html file(1) Mac OS X Manual Page]</ref> & Cygwin on {{Win}}. See details on [[Text file encoding]]
+=== Hex editor ===
+Using Hext editor to open the text file. [https://zh.wikipedia.org/wiki/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F 位元組順序記號 - 維基百科，自由的百科全書]
-<tr>
+== How to remove Byte order mark ==
-<td>File contains BOM</td>
+PHP
-<td>UTF-8 Unicode (with BOM) text</td>
+* [https://stackoverflow.com/questions/10290849/how-to-remove-multiple-utf-8-bom-sequences php - How to remove multiple UTF-8 BOM sequences - Stack Overflow]
-</tr>
+* [https://stackoverflow.com/questions/22600235/remove-or-match-a-unicode-zero-width-space-php replace - Remove or match a Unicode Zero Width Space PHP - Stack Overflow]
-<tr>
-<td>File NOT contains BOM</td>
+<pre>
-<td>UTF-8 Unicode text</td>
+// 原始資料：程式碼編輯器會顯示為 ZWNBSP，但是一般編輯器試看不到
-</tr>
+$text = "\xef\xbb\xbf" . "單位名稱";
-</table>
+// 移除 BOM
+$text = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $text);
+</pre>
 == References ==
@@ Line 72: / Line 116: @@
 <references />
-[[Category:Programming]] [[Category:Data Science]]
+[[Category:Programming]] [[Category:Data Science]] [[Category:String manipulation]]

Byte order mark: Difference between revisions

Latest revision as of 18:15, 17 September 2023

Contents

How to see Byte order mark[edit]

MySQL way[edit]

PHP way[edit]

Excel / Google sheet way[edit]

BASH command[edit]

File command[edit]

Hex editor[edit]

How to remove Byte order mark[edit]

References[edit]

Navigation menu

Byte order mark: Difference between revisions

Latest revision as of 18:15, 17 September 2023

How to see Byte order mark[edit]

MySQL way[edit]

PHP way[edit]

Excel / Google sheet way[edit]

BASH command[edit]

File command[edit]

Hex editor[edit]

How to remove Byte order mark[edit]

References[edit]

Navigation menu

Search