Fix garbled message text
Latest revision as of 11:13, 19 June 2023
How to fix garbled message text
Ideas on how to fix garbled message text:
- Possible cause
  - Encoding issue: choose the correct language/encoding for the message text, or let a tool auto-detect the encoding
  - PHP utf8_encode() & utf8_decode()
- (optional) Convert the current encoding to UTF-8
- (optional) Wrap the text to the window size
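The encoding issue above can be reproduced in a few lines. The following is a minimal Python sketch, not a definitive fix: it assumes the message bytes are Big5 (an example encoding chosen only for illustration) and shows how decoding with the wrong codec garbles the text, while decoding with the right one allows a clean conversion to UTF-8. The function names are hypothetical.

```python
def decode_message(raw: bytes, source_encoding: str) -> str:
    """Decode raw message bytes with the encoding the sender actually used."""
    return raw.decode(source_encoding)


def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode message bytes from their source encoding to UTF-8."""
    return decode_message(raw, source_encoding).encode("utf-8")


# Assumption: the message was written as Big5 (illustrative choice only).
raw = "許".encode("big5")

# Wrong codec: Latin-1 never fails, it just produces garbled characters.
garbled = decode_message(raw, "latin-1")

# Right codec: the original character comes back and can be stored as UTF-8.
restored = decode_message(raw, "big5")
utf8_bytes = convert_to_utf8(raw, "big5")

print(repr(garbled))
print(restored, utf8_bytes)
```

If the source encoding is unknown, trying the likely candidates one by one and inspecting the result is the manual equivalent of the "choose encoding" menu in the tools listed later on this page.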
List of text patterns that look garbled (but are actually encoded) and their likely cause
Feature | Example | Meaning | Restore to human-readable ↔ encode text |
---|---|---|---|
String contains %2 or %20 symbols mixed with plain English characters | http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F | "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools | URL decode ↔ URL encode |
String starting with \u, \U or U+ symbols | \u8c61, \U0001f418 or U+1F418 | Unicode number: "Unicode code point is referred to by writing "U+" followed by its hexadecimal number.[1]" (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"[2][3] | JSON decode ↔ JSON encode |
String starting with 0x symbols | 0x8c61 | hexadecimal string[4] | |
String starting with \x symbols | \xe8\xa8\xb1 | "\x is a string escape code, which happens to use hex notation" (hexadecimal notation)[5] | hexadecimal to text ↔ text to hexadecimal |
String starting with &# symbols | &#35937; | Unicode HTML code (numeric character reference). "Unicode number in decimal, hex or octal"[6] | PHP: html_entity_decode ↔ (see the section "String starting from &# symbols" for how to encode) |
HTML source code containing & ... ; symbols | &amp; decodes to & | "all characters which have HTML character entity equivalents are translated into these entities"[7] | PHP: htmlspecialchars_decode ↔ PHP: htmlentities |
Possible approaches to encode the message text:
Approach | Goal | Is Chinese text garbled/encoded? | Sample text before or after encoding |
---|---|---|---|
JavaScript encodeURIComponent() ↔ JavaScript decodeURIComponent()[8] | "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools | TRUE | |
URL Decoder/Encoder[9] | (same as above) | TRUE | (same as above) |
PHP: json_encode ↔ PHP: json_decode | Save array in MySQL database | TRUE | |
PHP: serialize ↔ PHP: unserialize | Save array in MySQL database | FALSE | |
PHP: htmlentities[1] ↔ PHP: html_entity_decode | Replace reserved characters, e.g. the double quote symbol | FALSE | |
Other functions
String contains %2 or %20 symbols
Using the following functions
- PHP urlencode
- JavaScript encodeURI() Function
- Excel ENCODEURL function
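For comparison with the functions listed above, here is a small Python sketch using the standard library's urllib.parse module. It is an illustration only, not one of the methods named in the list; the sample string is the URL-encoded example from the table earlier on this page, and the helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def url_decode(text: str) -> str:
    """Turn percent-encoded text back into human-readable text (UTF-8 assumed)."""
    return unquote(text)


def url_encode(text: str) -> str:
    """Percent-encode every reserved or non-ASCII character, including ':' and '/'."""
    return quote(text, safe="")


encoded = "http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F"
decoded = url_decode(encoded)
print(decoded)              # http://www.中文網址.tw/
print(url_encode(decoded))  # round-trips to the original encoded string
```

Passing safe="" makes quote() encode the slashes as well, which roughly mirrors the behaviour of encodeURIComponent() rather than encodeURI().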
String starting from \u, \U or U+ symbol
Using PHP. Type is string
$encoded = <<<EOT
"\u8c61"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"

$encoded = <<<EOT
"\ud83d\udc18"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 🐘
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"
Using PHP v. 7.0 Unicode Codepoint Escape Syntax[10]
echo "\u{8c61}" . PHP_EOL; // print 象
echo "\u{0001f418}" . PHP_EOL; // print 🐘
Using Python. Type is string
x = u'象'
x.encode('ascii', 'backslashreplace') # print b'\\u8c61'
x = u'🐘'
x.encode('ascii', 'backslashreplace') # print b'\\U0001f418'
Using PHP. Type is array
$input = <<<EOT
["\u8c61"]
EOT;
$input = trim($input);
var_dump(json_decode($input, true)); // print array("象")
var_dump(json_encode(array("象"))); // print ["\u8c61"]
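The PHP json_decode/json_encode round trip above has a close counterpart in Python's standard json module. The sketch below is an illustration under that assumption, not part of the original examples; the helper names are hypothetical.

```python
import json


def json_unescape(encoded: str) -> str:
    """Decode a JSON string literal such as '"\\u8c61"' into readable text."""
    return json.loads(encoded)


def json_escape(text: str) -> str:
    """Encode text as an ASCII-only JSON string literal with \\uXXXX escapes."""
    return json.dumps(text)


print(json_unescape('"\\u8c61"'))        # 象
print(json_escape("象"))                 # "\u8c61"
print(json_unescape('"\\ud83d\\udc18"'))  # 🐘 (surrogate pair is combined)
print(json_escape("🐘"))                 # "\ud83d\udc18"

# ensure_ascii=False keeps the characters readable instead of escaping them.
print(json.dumps(["象"], ensure_ascii=False))  # ["象"]
```

Characters outside the Basic Multilingual Plane, such as 🐘, are written as a pair of \u escapes (a surrogate pair), which is why the escaped form has two parts.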
String starting from 0x symbol
Using Python chr() Function ↔ hex() function
int('0x8c61', 16) # print 35937 -- "An integer representing a valid Unicode code point" cited from w3schools
chr(int('0x8c61', 16)) # print '象' -- "returns the character that represents the specified unicode." cited from w3schools
hex(ord('象')) # print '0x8c61' -- "converts an integer number to the corresponding hexadecimal string." cited from programiz.com
chr(int('0x1f418', 16)) # print '🐘'
hex(ord('🐘')) # print '0x1f418'
String starting from \x symbol
data = u"象"
data
hex_notation = data.encode('utf-8')
hex_notation # print b'\xe8\xb1\xa1'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)

data = u"🐘"
data
hex_notation = data.encode('utf-8')
hex_notation # print b'\xf0\x9f\x90\x98'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)

data = u"だいじょうぶ"
data
hex_notation = data.encode('utf-8')
hex_notation # print b'\xe3\x81\xa0\xe3\x81\x84\xe3\x81\x98\xe3\x82\x87\xe3\x81\x86\xe3\x81\xb6'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)
Using PHP[14]: See it in action
echo preg_replace_callback("/./", function($matched) {
    return '\x'.dechex(ord($matched[0]));
}, '🐘'); # print \xf0\x9f\x90\x98
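The examples above start from readable text. When the message itself contains the literal characters \xe8\xa8\xb1 (backslash, x, two hex digits), the reverse direction is needed. The following Python sketch shows one possible way; it assumes the text is plain ASCII apart from the \x escapes and that the escaped bytes are UTF-8. Both function names are hypothetical.

```python
def unescape_hex(text: str) -> str:
    """Convert literal \\xNN escapes in ASCII text into the UTF-8 text they spell."""
    # 'unicode_escape' maps each \xNN to the code point U+00NN;
    # 'latin-1' then maps those code points back to the raw bytes 0xNN.
    raw = text.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")


def escape_hex(text: str) -> str:
    """Write each UTF-8 byte of the text as a literal \\xNN escape."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


print(unescape_hex(r"\xe8\xa8\xb1"))  # 許
print(escape_hex("🐘"))               # \xf0\x9f\x90\x98
```

If the escaped bytes were produced from an encoding other than UTF-8, the final decode step would need that encoding instead.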
String starting from &# symbols
Using PHP html_entity_decode() Function[15][16]
To decode the text
$unicode_html = '&#128024;';
echo html_entity_decode($unicode_html) . PHP_EOL; // print 🐘

$unicode_html = '&#128024;';
echo mb_convert_encoding($unicode_html, 'UTF-8', 'HTML-ENTITIES') . PHP_EOL; // print 🐘
To encode the text
$input = "🐘";
$unicode_html = base_convert(bin2hex(mb_convert_encoding($input, 'UTF-32', 'utf-8')), 16, 10);
$unicode_html = '&#' . $unicode_html . ';';
echo 'unicode_html: ' . $unicode_html . PHP_EOL; // print unicode_html: &#128024;
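The same decode/encode pair can be sketched in Python with the standard html module. This is an illustration rather than the method used on this page; to_numeric_entities is a hypothetical helper that writes every character as a decimal numeric character reference.

```python
import html


def from_entities(text: str) -> str:
    """Decode numeric (&#NNN;) and named (&amp;) HTML entities into text."""
    return html.unescape(text)


def to_numeric_entities(text: str) -> str:
    """Encode each character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


print(from_entities("&#35937;"))   # 象
print(from_entities("&#128024;"))  # 🐘
print(from_entities("&amp;"))      # &
print(to_numeric_entities("🐘"))   # &#128024;
```

Because ord() returns the full code point, no UTF-32 conversion step is needed here, unlike in the PHP encoding example.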
Ways to fix garbled message text
ConvertZ v.8.02
- choose encoding: manual (mainly Asian languages)
- convert to UTF-8: available
- convert from UTF-8 to Big5: available, but the software may change the wording, e.g. 余美人 -> 於美人
- wrap long text: available
EmEditor v.14.3.1 ($)
- choose encoding: manual and auto-detect
- convert to UTF-8: available
- wrap long text: available
- command-line support: EmEditor FAQ: How can I convert file encodings with the command line?
Google Chrome v.10 (viewer)
- choose encoding: manual and auto-detect
- wrap long text: available (auto)
MadEdit v.0.2.9.1
- choose encoding: manual and auto-detect
- convert to UTF-8: available
- wrap long text: available
Microsoft Internet Explorer v.8 (viewer)
- choose encoding: manual and auto-detect
- wrap long text:
Microsoft Notepad (記事本) for Windows
Method 1: Notepad + Notepad++ (see Err: 解決用記事本(notepad)開啟簡體字txt檔,出現亂碼的問題 [Fixing garbled text when opening a Simplified Chinese txt file in Notepad], 2010)
- choose encoding: manual
- convert to UTF-8: available via Notepad++
- wrap long text: available
Method 2: Microsoft AppLocale utility (patched: piaip pAppLocale) + Notepad
- choose encoding: manual
- convert to UTF-8: not available
- wrap long text: available
Microsoft Office Word 2003 ($)
- choose encoding: manual
- convert to UTF-8: available
- wrap long text: available
Mozilla Firefox v.3.6 (viewer)
- choose encoding: manual and auto-detect
- wrap long text: not built in, but you can paste the following code into the address bar to wrap long text (thanks to Return of the Sasquatch: word wrap for Firefox bookmarklet)
javascript:(function() { var D = document; F(D.body); function F(n) { var u, r, c, x; if (n.nodeType == 3) { u = n.data.search(/\S{45}/); if (u >= 0) { r = n.splitText(u + 45); n.parentNode.insertBefore(D.createElement('wbr'), r); } } else if ((n.tagName != 'STYLE') && (n.tagName != 'SCRIPT')) { for (c = 0; x = n.childNodes[c]; ++c) { F(x); } } } D.body.innerHTML += ' '; })();
Notepad++ v.5.8
- choose encoding: manual
- convert to UTF-8: available
- wrap long text: available
Not supported at this moment
- LibreOffice 3.3.0 - Writer
- OpenOffice.org 3.3.0 - Writer (OpenOffice.org Calc, however, is supported)
Further reading
- 簡繁體文件轉換
- Character encoding - Wikipedia, the free encyclopedia
- (Guide) 瞭解網頁中看不懂的編碼:Unicode 在 JavaScript 中的使用 ~ PJCHENder 那些沒告訴你的小細節
- Base64 to Text Converter
- Gzip to Text Decompress using gzip, deflate and brotli algorithms
- URL Encoding
Unicode table
- Unicode® Character Table
- &what: Discover Unicode & HTML Character Entities
- HTML Symbols, Entities, Characters and Codes — HTML Arrows
- Unicode/UTF-8-character table
References
1. Unicode - Wikipedia
2. PHP: json_encode - Manual
3. RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format
4. Python hex() - Python Standard Library
5. Difference between different hex types/representations in Python - Stack Overflow
6. &what Help
7. PHP: htmlentities - Manual
8. urlencode - How to Encode URL Contains Unicode Characters with PHP - Stack Overflow
9. PHP urlencode()
10. PHP: New features - Manual
11. bytes.decode()
12. str.encode()
13. python - How to decode unicode in a Chinese text - Stack Overflow
14. php - How to convert text to \x codes? - Stack Overflow
15. PHP 將 文字 轉換成 &#xxxxx; UNICODE 碼 | Tsung's Blog
16. [php tech.] unicode html convert | HINA::工程幼稚園. Note: the Unicode HTML code is obtained by converting the text from its original encoding to UCS-2, taking the binary value, converting it from hexadecimal to decimal, and then prefixing it with &#.