Fix garbled message text

== How to fix garbled message text ==
Ideas on how to fix garbled message text
# Possible cause
#* Encoding issue: choose the correct language/encoding for the message text, or auto-detect the encoding with a tool
# (optional) Convert the current encoding to UTF-8
# (optional) Wrap the text to the window size
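
The encoding steps above can be sketched in Python. This is only a minimal sketch, not a robust detector: the candidate list and the helper name <code>decode_with_candidates</code> are assumptions made for illustration, and a dedicated detection tool will usually do better on short or ambiguous input.

```python
import textwrap

# Hypothetical candidate list; order matters, so strict encodings come first.
CANDIDATES = ("utf-8", "big5", "shift_jis")


def decode_with_candidates(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


raw = "中文網址".encode("big5")           # simulate a message stored as Big5
text, used = decode_with_candidates(raw)  # guess the encoding
utf8_bytes = text.encode("utf-8")         # (optional) convert to UTF-8
wrapped = textwrap.fill(text, width=2)    # (optional) wrap to a tiny window width
print(used, text)
print(wrapped)
```

Trying encodings in a fixed order can pick a wrong but technically valid encoding, so the result should be checked by eye when the candidates overlap.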
List of text that looks garbled but is not, with the possible root cause:
<table border="1" style="width: 100%; table-layout: fixed;" class="wikitable sortable">
<tr>
<th>Feature</th>
<th>Example</th>
<th>Meaning</th>
<th>Decode to human-readable text ↔ encode text</th>
</tr>
<tr>
<td>String containing {{kbd | key=<nowiki>%2</nowiki>}} or {{kbd | key=<nowiki>%20</nowiki>}} symbols mixed with readable English characters</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F</nowiki>}}</td>
<td>"converts characters into a format that can be transmitted over the Internet ... " Cited from [http://www.w3schools.com/tags/ref_urlencode.asp w3schools]</td>
<td>URL decode ↔ URL encode</td>
</tr>
<tr>
<td>String starting with {{kbd | key=<nowiki>\u</nowiki>}}, {{kbd | key=<nowiki>\U</nowiki>}} or {{kbd | key=<nowiki>U+</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>\u8c61</nowiki>}}, {{kbd | key=<nowiki>\U0001f418</nowiki>}} or {{kbd | key=<nowiki>U+1F418</nowiki>}}</td>
<td>Unicode number: "Unicode code point is referred to by writing 'U+' followed by its hexadecimal number."<ref>[https://en.wikipedia.org/wiki/Unicode Unicode - Wikipedia]</ref> (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"<ref>[http://php.net/manual/en/function.json-encode.php PHP: json_encode - Manual]</ref><ref>[http://www.faqs.org/rfcs/rfc7159.html RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format]</ref></td>
<td>JSON decode ↔ JSON encode</td>
</tr>
<tr>
<td>String starting with {{kbd | key=<nowiki>0x</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>0x8c61</nowiki>}}</td>
<td>hexadecimal string<ref>[https://www.programiz.com/python-programming/methods/built-in/hex Python hex() - Python Standard Library]</ref></td>
<td></td>
</tr>
<tr>
<td>String starting with {{kbd | key=<nowiki>\x</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>\xe8\xa8\xb1</nowiki>}}</td>
<td>"\x is a string escape code, which happens to use hex notation" (hexadecimal notation)<ref>[https://stackoverflow.com/questions/13123877/difference-between-different-hex-types-representations-in-python Difference between different hex types/representations in Python - Stack Overflow]</ref></td>
<td>hexadecimal to text ↔ text to hexadecimal</td>
</tr>
<tr>
<td>String starting with {{kbd | key=<nowiki>&#</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>&amp;#35937;</nowiki>}}</td>
<td>Unicode HTML code. "Unicode number in decimal, hex or octal"<ref>[http://www.amp-what.com/help.html &what Help]</ref></td>
<td>[https://www.php.net/manual/en/function.html-entity-decode.php PHP: html_entity_decode] ↔ (See the following section to understand how to encode)</td>
</tr>
<tr>
<td>HTML source code containing {{kbd | key=<nowiki>& ... ;</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>& a m p ;</nowiki>}} (without whitespace) is {{kbd | key=<nowiki>&amp;</nowiki>}}</td>
<td>"all characters which have HTML character entity equivalents are translated into these entities"<ref>[https://www.php.net/manual/en/function.htmlentities.php PHP: htmlentities - Manual]</ref></td>
<td>[https://www.php.net/manual/en/function.htmlspecialchars-decode.php PHP: htmlspecialchars_decode] ↔ [https://www.php.net/manual/en/function.htmlentities.php PHP: htmlentities]</td>
</tr>
</table>
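
The "Feature" column above can be turned into a rough classifier. The following Python sketch is an illustration only: the function name <code>guess_escape_style</code>, the labels and the regular expressions are assumptions, and the patterns are searched anywhere in the string rather than only at its start.

```python
import re

# (label, pattern) pairs checked in order; the labels are illustrative only.
PATTERNS = [
    ("URL encoding", re.compile(r"%[0-9A-Fa-f]{2}")),
    ("Unicode escape",
     re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}|U\+[0-9A-Fa-f]{4,6}")),
    ("hex escape", re.compile(r"\\x[0-9A-Fa-f]{2}")),
    ("hexadecimal string", re.compile(r"0x[0-9A-Fa-f]+")),
    ("numeric HTML entity", re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")),
    ("named HTML entity", re.compile(r"&[A-Za-z][A-Za-z0-9]*;")),
]


def guess_escape_style(text):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


for sample in ["http%3A%2F%2Fwww", r"\u8c61", r"\xe8\xa8\xb1",
               "0x8c61", "&#35937;", "&amp;", "plain text"]:
    print(repr(sample), "->", guess_escape_style(sample))
```

A match only suggests which decoder to try first; ordinary text can contain these sequences by coincidence.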




* [https://www.w3schools.com/js/js_json_parse.asp JSON.parse()] or [http://api.jquery.com/jquery.parsejson/ jQuery.parseJSON() | jQuery API Documentation]




=== String contains {{kbd | key=<nowiki>%2</nowiki>}} or {{kbd | key=<nowiki>%20</nowiki>}} symbols ===
Use the following functions to URL-encode a string:
* PHP [http://php.net/manual/en/function.urlencode.php urlencode]
* JavaScript [https://www.w3schools.com/jsref/jsref_encodeuri.asp encodeURI() Function]
* Excel [https://support.microsoft.com/en-us/office/encodeurl-function-07c7fb90-7c60-4bff-8687-fac50fe33d0e ENCODEURL function]
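
For comparison with the functions listed above, here is a minimal Python sketch that uses the standard library's <code>urllib.parse</code> module. Passing <code>safe=""</code> is a choice made here so that <code>:</code> and <code>/</code> are encoded too, matching the example in the table.

```python
from urllib.parse import quote, unquote

encoded = "http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F"

# URL decode: percent-escapes are read as UTF-8 bytes by default.
decoded = unquote(encoded)
print(decoded)

# URL encode: safe="" also escapes ":" and "/", which quote() keeps by default.
re_encoded = quote(decoded, safe="")
print(re_encoded)
```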


=== String starting with \u, \U or U+ symbols ===
<pre>
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"
</pre>
When using heredoc syntax (<<<EOT ... EOT;), stray whitespace or hidden characters at the beginning or end of the block may cause json_decode to fail to parse the string. Assigning the string directly avoids these whitespace and formatting issues.
<pre>
$encoded = '"\u8c61"';
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"
</pre>
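
The same round trip can be sketched in Python with the standard <code>json</code> module. This is an illustration alongside the PHP example, not a replacement for it; the <code>U+</code> formatting at the end is an extra added here for the notation mentioned in the table.

```python
import json

# JSON decode: the JSON text "\u8c61" (quotes included) becomes the character itself.
decoded = json.loads('"\\u8c61"')

# JSON encode: non-ASCII characters are escaped by default (ensure_ascii=True).
encoded = json.dumps("象")
encoded_elephant = json.dumps("🐘")   # outside the BMP, so a surrogate pair

# U+ notation is the code point written in hexadecimal.
code_point = f"U+{ord('🐘'):04X}"
from_code_point = chr(int("1F418", 16))

print(decoded, encoded, encoded_elephant, code_point, from_code_point)
```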


Using the PHP 7.0 [https://wiki.php.net/rfc/unicode_escape Unicode Codepoint Escape Syntax]<ref>[https://secure.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax PHP: New features - Manual]</ref>
<pre>
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)
</pre>
Using PHP<ref>[https://stackoverflow.com/questions/7320516/how-to-convert-text-to-x-codes php - How to convert text to \x codes? - Stack Overflow]</ref>: [https://www.ideone.com/m58rEZ See it in action]
<pre>
echo preg_replace_callback("/./s", function($matched) {
    // Without the /u modifier "." matches a single byte; %02x keeps a leading zero
    return sprintf('\x%02x', ord($matched[0]));
}, '🐘');
# print \xf0\x9f\x90\x98
</pre>
</pre>
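
Both directions can also be sketched in Python. The helper names below are hypothetical, and the decoder assumes the input consists only of <code>\xNN</code> escapes with no other text mixed in. The last lines show the related <code>0x</code> form from the table.

```python
def text_to_hex_escapes(text, encoding="utf-8"):
    """Encode text to bytes and write each byte as a \\xNN escape."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode(encoding))


def hex_escapes_to_text(escaped, encoding="utf-8"):
    """Turn a string made only of \\xNN escapes back into text."""
    return bytes.fromhex(escaped.replace("\\x", "")).decode(encoding)


print(text_to_hex_escapes("🐘"))
print(hex_escapes_to_text(r"\xe8\xa8\xb1"))

# A 0x-prefixed value is a plain hexadecimal number, here a code point.
print(chr(int("0x8c61", 16)), hex(ord("象")))
```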


=== String starting with &# symbols ===
Using PHP [https://www.w3schools.com/php/func_string_html_entity_decode.asp html_entity_decode() Function]<ref>[https://blog.longwin.com.tw/2011/06/php-html-unicode-convert-2011/ PHP 將 文字 轉換成 &#xxxxx; UNICODE 碼 | Tsung's Blog]</ref><ref>[http://hinablue.blogspot.com/2008/01/php-tech-unicode-html-convert.html &#91;php tech.&#93; unicode html convert | HINA::工程幼稚園] A Unicode HTML code is derived by converting the text from its original encoding to UCS-2, taking the binary value, converting it from hexadecimal to decimal, and then prepending &# to obtain the code.</ref>
To decode the text:
<pre>
$unicode_html = '&amp;#128024;';
// Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2; html_entity_decode() is an alternative
echo mb_convert_encoding($unicode_html, 'UTF-8', 'HTML-ENTITIES') . PHP_EOL; // print 🐘
</pre>
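
A rough Python counterpart, using the standard <code>html</code> module, is sketched here for both directions. The helper <code>to_numeric_entities</code> is hypothetical and converts every character, including plain ASCII, to a decimal entity.

```python
import html


def to_numeric_entities(text):
    """Write each character as a decimal numeric entity (&#NNNN;)."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# Decode: numeric and named entities are both handled by html.unescape().
print(html.unescape("&#128024;"))
print(html.unescape("&amp;"))

# Encode: numeric entities via the helper, special characters via html.escape().
print(to_numeric_entities("🐘"))
print(html.escape("&"))
```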


To encode the text:
<pre>
$input = "🐘";
// Take the UTF-32 bytes as hex, then convert the value from base 16 to base 10
$unicode_html = base_convert(bin2hex(mb_convert_encoding($input, 'UTF-32', 'utf-8')), 16, 10);
</pre>
* [http://en.wikipedia.org/wiki/Character_encoding Character encoding - Wikipedia, the free encyclopedia]
* [https://pjchender.blogspot.com/2018/06/guide-unicode-javascript.html (Guide) 瞭解網頁中看不懂的編碼:Unicode 在 JavaScript 中的使用 ~ PJCHENder 那些沒告訴你的小細節]
* [https://www.multiutil.com/base64-to-text-converter/ Base64 to Text Converter]
* [https://www.multiutil.com/gzip-to-text-decompress/ Gzip to Text Decompress using gzip, deflate and brotli algorithms]
* [[URL Encoding]]


Unicode table
[[Category:Software]]
[[Category:Data Science]]
[[Category:Text file processing]]
[[Category:String manipulation]]
[[Category:Programming]]
