Fix garbled message text: Difference between revisions

From LemonWiki共筆
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Ideas on how to fix garbled message text ==
== How to fix garbled message text ==
Ideas on how to fix garbled message text
# Possible cause
# Possible cause
#* Encoding issue: Choose the correct the language/encode of message text or auto detect the encode by tools
#* Encoding issue: Choose the correct the language/encode of message text or auto detect the encode by tools
Line 6: Line 7:
# (optional) Making text wrap to window size
# (optional) Making text wrap to window size


List of the (look like but not) garbled text and possible cause
List of the (look like but not) garbled text and possible root cause
<table border="1" style="width: 100%; table-layout: fixed;" class="wikitable sortable">
<table border="1" style="width: 100%; table-layout: fixed;" class="wikitable sortable">
<tr>
<tr>
Line 43: Line 44:
<td>Unicode HTML code. "Unicode number in decimal, hex or octal"<ref>[http://www.amp-what.com/help.html &what Help]</ref></td>
<td>Unicode HTML code. "Unicode number in decimal, hex or octal"<ref>[http://www.amp-what.com/help.html &what Help]</ref></td>
<td>[https://www.php.net/manual/en/function.html-entity-decode.php PHP: html_entity_decode] ↔ (See the following section to understand how to encode)</td>
<td>[https://www.php.net/manual/en/function.html-entity-decode.php PHP: html_entity_decode] ↔ (See the following section to understand how to encode)</td>
</tr>
<tr>
<td>HTML source code starting from {{kbd | key=<nowiki>& ... ;</nowiki>}} symbols</td>
<td style="word-wrap: break-word;">{{kbd | key=<nowiki>& a m p ;</nowiki>}} (without whitespace) is {{kbd | key=<nowiki>&amp;</nowiki>}}</td>
<td>"all characters which have HTML character entity equivalents are translated into these entities"<ref>[https://www.php.net/manual/en/function.htmlentities.php PHP: htmlentities - Manual]</ref></td>
<td>[https://www.php.net/manual/en/function.htmlspecialchars-decode.php PHP: htmlspecialchars_decode] ↔ [https://www.php.net/manual/en/function.htmlentities.php PHP: htmlentities]</td>
</tr>
</tr>
</table>
</table>
Line 141: Line 148:
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"
</pre>
</pre>
when using the heredoc syntax (<<<EOT ... EOT;), it's possible that unnecessary whitespace or hidden characters at the beginning or end of the block might cause json_decode to fail in parsing the string correctly. Direct assignment avoids potential whitespace or format issues from heredoc.
<pre>
$encoded = '"\u8c61"';
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"
</pre>


Using PHP v. 7.0 [https://wiki.php.net/rfc/unicode_escape Unicode Codepoint Escape Syntax]<ref>[https://secure.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax PHP: New features - Manual]</ref>
Using PHP v. 7.0 [https://wiki.php.net/rfc/unicode_escape Unicode Codepoint Escape Syntax]<ref>[https://secure.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax PHP: New features - Manual]</ref>
Line 315: Line 330:
* [http://en.wikipedia.org/wiki/Character_encoding Character encoding - Wikipedia, the free encyclopedia]
* [http://en.wikipedia.org/wiki/Character_encoding Character encoding - Wikipedia, the free encyclopedia]
* [https://pjchender.blogspot.com/2018/06/guide-unicode-javascript.html (Guide) 瞭解網頁中看不懂的編碼:Unicode 在 JavaScript 中的使用 ~ PJCHENder 那些沒告訴你的小細節]
* [https://pjchender.blogspot.com/2018/06/guide-unicode-javascript.html (Guide) 瞭解網頁中看不懂的編碼:Unicode 在 JavaScript 中的使用 ~ PJCHENder 那些沒告訴你的小細節]
* [https://www.multiutil.com/base64-to-text-converter/ Base64 to Text Converter]
* [https://www.multiutil.com/gzip-to-text-decompress/ Gzip to Text Decompress using gzip, deflate and brotli algoithms]
* [[URL Encoding]]
* [[URL Encoding]]


Unicode table
Unicode table

Latest revision as of 14:21, 2 May 2024

How to fix garbled message text

Ideas on how to fix garbled message text

  1. Possible cause
    • Encoding issue: choose the correct language/encoding for the message text, or auto-detect the encoding with a tool
    • PHP utf8_encode() & utf8_decode()
  2. (optional) Convert the current encoding to UTF-8
  3. (optional) Make the text wrap to the window size
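
As a minimal sketch of the "choose the correct encoding, then convert to UTF-8" steps above, the following Python snippet tries a short list of candidate encodings in order and re-encodes the first successful result as UTF-8. The helper name decode_with_candidates and the candidate list are assumptions made for illustration; they do not come from any tool listed on this page. A decode that succeeds is not guaranteed to be the right one, so the result still needs a visual check.

```python
def decode_with_candidates(raw, candidates=("utf-8", "big5", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw strictly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the input")


# Hypothetical input: bytes that were originally saved as Big5.
raw = "中文訊息".encode("big5")
text, encoding = decode_with_candidates(raw)
utf8_bytes = text.encode("utf-8")  # step 2: convert to UTF-8
print(encoding, text)
```

Putting utf-8 first is a deliberate choice in this sketch: UTF-8 is strict enough that bytes in a legacy encoding rarely decode as valid UTF-8 by accident.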

List of text that looks garbled (but is actually encoded) and the possible root cause

Feature | Example | Meaning | Restore to human readable ↔ encode text
--- | --- | --- | ---
String contains %2 or %20 symbols mixed with English characters | http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F | "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools | URL decode ↔ URL encode
String starting from \u, \U or U+ symbols | \u8c61, \U0001f418 or U+1F418 | Unicode number: "Unicode code point is referred to by writing "U+" followed by its hexadecimal number.[1]" (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"[2][3] | JSON decode ↔ JSON encode
String starting from 0x symbols | 0x8c61 | hexadecimal string[4] |
String starting from \x symbols | \xe8\xa8\xb1 | "\x is a string escape code, which happens to use hex notation" (hexadecimal notation)[5] | hexadecimal to text ↔ text to hexadecimal
String starting from &# symbols | &#35937; | Unicode HTML code. "Unicode number in decimal, hex or octal"[6] | PHP: html_entity_decode ↔ (See the following section to understand how to encode)
HTML source code starting from & ... ; symbols | & a m p ; (without whitespace) is & | "all characters which have HTML character entity equivalents are translated into these entities"[7] | PHP: htmlspecialchars_decode ↔ PHP: htmlentities


Possible approaches to encode the message text:

Approach | Goal | Is Chinese text garbled/encoded? | Sample text before and after encoding
--- | --- | --- | ---
JavaScript encodeURIComponent() / JavaScript decodeURIComponent()[8] | "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools | TRUE | before: http://www.中文網址.tw/my test.asp?name=ståle&car=saab
 | | | after: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab
URL Decoder/Encoder[9] | (same as above) | TRUE | (same as above)
PHP: json_encode / PHP: json_decode | Save an array in a MySQL database | TRUE | before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
 | | | after: {"\u4f5c\u8005":"\u99ac\u514b\u5410\u6eab","\u540d\u8a00":"\"To a man with a hammer, everything looks like a nail.\" He said."}
PHP: serialize / PHP: unserialize | Save an array in a MySQL database | FALSE | before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
 | | | after: a:2:{s:6:"作者";s:12:"馬克吐溫";s:6:"名言";s:64:""To a man with a hammer, everything looks like a nail." He said.";}
PHP: htmlentities[1] / PHP: html_entity_decode | Replace reserved characters, e.g. the double quote symbol | FALSE | before: 馬克吐溫名言 "To a man with a hammer, everything looks like a nail."
 | | | after: 馬克吐溫名言 &quot;To a man with a hammer, everything looks like a nail.&quot;
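
The JSON row above shows Chinese text turning into \uXXXX escapes. As a rough illustration in Python (the variable names are made up for this sketch), json.dumps escapes non-ASCII characters by default in the same way, and its ensure_ascii=False option keeps them readable:

```python
import json

record = {"作者": "馬克吐溫"}

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # characters are kept as-is

print(escaped)
print(readable)

# Both forms decode back to the same data.
restored_from_escaped = json.loads(escaped)
restored_from_readable = json.loads(readable)
```

Either form is valid JSON, so the choice only affects how the stored text looks to a human reader. In PHP, the JSON_UNESCAPED_UNICODE flag of json_encode plays a similar role.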

Other functions


String contains %2 or %20 symbols

Using the following functions
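
As one possible illustration, a minimal Python sketch using the standard urllib.parse module decodes the example string from the table above and encodes it again. Passing safe="" is a choice made here so that ":" and "/" are escaped too, which matches the sample string.

```python
from urllib.parse import quote, unquote

encoded = "http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F"

decoded = unquote(encoded)            # percent-escapes -> UTF-8 text
re_encoded = quote(decoded, safe="")  # safe="" also escapes ":" and "/"

print(decoded)
print(re_encoded)
```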

String starting from \u, \U or U+ symbol

Using PHP. Type is string

// JSON string containing an escaped character from the Basic Multilingual Plane
$encoded = <<<EOT
"\u8c61"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"

// Characters outside the Basic Multilingual Plane are escaped as a surrogate pair
$encoded = <<<EOT
"\ud83d\udc18"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 🐘
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"

When using the heredoc syntax (<<<EOT ... EOT;), stray whitespace or hidden characters (for example a byte order mark or a non-breaking space) at the beginning or end of the block may prevent json_decode from parsing the string, in which case it returns null. Assigning the string directly avoids these whitespace and formatting issues.

$encoded = '"\u8c61"';
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"


Using the Unicode Codepoint Escape Syntax (PHP 7.0 and later)[10]

echo "\u{8c61}" . PHP_EOL; // print 象
echo "\u{0001f418}" . PHP_EOL; // print 🐘

Using Python. Type is string

x = u'象'
x.encode('ascii', 'backslashreplace') 
# print b'\\u8c61'

x = u'🐘'
x.encode('ascii', 'backslashreplace') 
# print b'\\U0001f418'
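
To go in the other direction in Python, a minimal sketch could look like this. It assumes the escaped text is either a JSON string literal (where characters beyond U+FFFF appear as a surrogate pair) or the Python-style output of backslashreplace shown above. The variable names are illustrative only.

```python
import json

# JSON-style escapes: \uXXXX, with a surrogate pair for characters beyond U+FFFF
from_json_bmp = json.loads('"\\u8c61"')
from_json_pair = json.loads('"\\ud83d\\udc18"')

# Python-style escapes as produced by backslashreplace: \uXXXX and \UXXXXXXXX
from_python_u = b"\\u8c61".decode("unicode_escape")
from_python_big_u = b"\\U0001f418".decode("unicode_escape")

print(from_json_bmp, from_json_pair, from_python_u, from_python_big_u)
```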

Using PHP. Type is array

$input = <<<EOT
["\u8c61"]
EOT;

$input = trim($input); // remove whitespace around the JSON text
var_dump(json_decode($input, true)); // print array("象")
var_dump(json_encode(array("象"))); // print ["\u8c61"]

String starting from 0x symbol

Using the Python chr() function and hex() function

int('0x8c61', 16)
# print 35937 -- "An integer representing a valid Unicode code point" cited from w3schools
chr(int('0x8c61', 16))
# print '象' -- "returns the character that represents the specified unicode." cited from w3schools
hex(ord('象'))
# print '0x8c61' -- "converts an integer number to the corresponding hexadecimal string." cited from programiz.com

chr(int('0x1f418', 16))
# print '🐘'
hex(ord('🐘'))
# print '0x1f418'
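
The same chr()/ord() approach also covers the U+ notation listed in the table above. A small sketch follows; the helper names from_u_plus and to_u_plus are hypothetical and chosen only for this example.

```python
def from_u_plus(notation):
    """Convert a 'U+XXXX' string to the character it names."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected a string such as 'U+1F418'")
    return chr(int(notation[2:], 16))


def to_u_plus(character):
    """Convert a single character to 'U+XXXX' notation (at least 4 hex digits)."""
    return "U+{:04X}".format(ord(character))


print(from_u_plus("U+1F418"))
print(to_u_plus("象"))
```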

String starting from \x symbol

Using Python[11][12][13]

data = u"象"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xe8\xb1\xa1'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"🐘"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xf0\x9f\x90\x98'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"だいじょうぶ"
data
hex_notation = data.encode('utf-8')
hex_notation 
# print b'\xe3\x81\xa0\xe3\x81\x84\xe3\x81\x98\xe3\x82\x87\xe3\x81\x86\xe3\x81\xb6'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)

Using PHP[14]: See it in action

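echo preg_replace_callback("/./", function($matched) {
    // without the /u modifier the pattern matches one byte at a time;
    // %02x zero-pads bytes below 0x10
    return sprintf('\x%02x', ord($matched[0]));
}, '🐘');

# print \xf0\x9f\x90\x98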
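
When the \x sequence arrives as literal text (backslashes included) rather than as a bytes object, one way to restore it in Python is sketched below. The helper name decode_backslash_x is hypothetical, and the sketch assumes the input is pure ASCII whose \x escapes spell valid UTF-8 bytes.

```python
def decode_backslash_x(escaped_text):
    r"""Turn literal text such as \xe8\xb1\xa1 into the UTF-8 string it spells."""
    # 1) interpret the \x escapes; each one becomes a code point in 0-255
    as_code_points = escaped_text.encode("ascii").decode("unicode_escape")
    # 2) map those code points back to raw bytes, then decode the bytes as UTF-8
    return as_code_points.encode("latin-1").decode("utf-8")


print(decode_backslash_x(r"\xe8\xb1\xa1"))
print(decode_backslash_x(r"\xf0\x9f\x90\x98"))
```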

String starting from &# symbols

Using PHP html_entity_decode() Function[15][16]

To decode the text

$unicode_html = '&#128024;';
echo html_entity_decode($unicode_html) . PHP_EOL; // print 🐘

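$unicode_html = '&#128024;';
// note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2; html_entity_decode() above is the alternative
echo mb_convert_encoding($unicode_html, 'UTF-8', 'HTML-ENTITIES') . PHP_EOL; // print 🐘

To encode the text

// works for a single character: UTF-32 bytes -> hex string -> decimal code point
$input = "🐘";
$unicode_html = base_convert(bin2hex(mb_convert_encoding($input, 'UTF-32', 'utf-8')), 16, 10);
$unicode_html = '&#' . $unicode_html . ';';
echo 'unicode_html: ' . $unicode_html . PHP_EOL; // print unicode_html: &#128024;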
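
For comparison, a minimal Python sketch of the same round trip uses the standard html module for decoding and the xmlcharrefreplace error handler for encoding. The helper name to_numeric_entities is made up for this example.

```python
import html

# Decode numeric and named character references in one call.
decoded = html.unescape("&#128024; &#35937; &amp;")


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal &#N; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(decoded)
print(to_numeric_entities("象🐘"))
```

Unlike the PHP snippet above, this helper handles strings of any length because the error handler converts each non-ASCII character separately.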

Ways to fix garbled message text

ConvertZ v.8.02

  • choose encoding: manually (mainly for Asian languages)
  • convert to UTF-8: available
  • convert to Big5 from UTF-8: available (caution: the software may change the wording, e.g. 余美人 -> 於美人)
  • wrap long text: available

EmEditor v.14.3.1 ($)

Google Chrome v.10 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text: available (auto) (good)

MadEdit v.0.2.9.1

  • choose encoding: manually and auto-detect (good)
  • convert to UTF-8: available
  • wrap long text: available

Microsoft Internet Explorer v.8 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text:

Microsoft Notepad (記事本) for Windows

Method 1: Err: 解決用記事本(notepad)開啟簡體字txt檔,出現亂碼的問題(2010) (Fixing garbled text when opening a Simplified Chinese txt file in Notepad): Notepad + Notepad++

  • choose encoding: manually
  • convert to UTF-8: available by Notepad++
  • wrap long text: available


Method 2: Microsoft AppLocale utility (patched: piaip pAppLocale) + Notepad

  • choose encoding: manually
  • convert to UTF-8: not available
  • wrap long text: available

Microsoft Office Word 2003 ($)

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available

Mozilla Firefox v.3.6 (viewer)

Bookmarklet that inserts a <wbr> line-break opportunity after every 45 consecutive non-whitespace characters, so that long text can wrap:

javascript:(function() { var D = document; F(D.body); function F(n) { var u, r, c, x; if (n.nodeType == 3) { u = n.data.search(/\S{45}/); if (u >= 0) { r = n.splitText(u + 45); n.parentNode.insertBefore(D.createElement('wbr'), r); } } else if ((n.tagName != 'STYLE') && (n.tagName != 'SCRIPT')) { for (c = 0; x = n.childNodes[c]; ++c) { F(x); } } } D.body.innerHTML += ' '; })();


Notepad++ v.5.8

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available


Not supported at this moment

Further reading


Unicode table

References