Fix garbled message text

From LemonWiki共筆

How to fix garbled message text

Ideas on how to fix garbled message text

  1. Possible cause
    • Encoding issue: choose the correct language/encoding for the message text, or auto-detect the encoding with a tool
    • PHP utf8_encode() & utf8_decode()
  2. (optional) Convert the current encoding to UTF-8
  3. (optional) Wrap the text to the window size
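
Steps 1 and 2 can be sketched in Python as follows. This is a minimal sketch rather than a robust detector: the function name `to_utf8` and the candidate list are assumptions made for illustration, and trying codecs in order can pick the wrong one when several of them decode the bytes without error.

```python
def to_utf8(raw, candidates=("utf-8", "big5", "gb18030")):
    """Decode raw bytes with the first candidate codec that succeeds.

    Returns (codec_name, utf8_bytes). The candidate order is an assumption.
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes, try the next one
        return name, text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the input")


big5_bytes = "中文".encode("big5")  # simulate a Big5-encoded message
codec, utf8_bytes = to_utf8(big5_bytes)
print(codec, utf8_bytes.decode("utf-8"))
```

Putting "utf-8" first is a deliberate choice: UTF-8 is strict, so bytes in another encoding usually fail to decode and fall through to the next candidate.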

List of text that looks garbled (but is not) and the possible root causes

Feature: String contains %2 or %20 symbols and meaningful English characters
  • Example: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F
  • Meaning: "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools
  • Restore to human readable ↔ encode text: URL decode ↔ URL encode

Feature: String starting from \u, \U or U+ symbols
  • Example: \u8c61, \U0001f418 or U+1F418
  • Meaning: Unicode number: "Unicode code point is referred to by writing "U+" followed by its hexadecimal number.[1]" (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"[2][3]
  • Restore to human readable ↔ encode text: JSON decode ↔ JSON encode

Feature: String starting from 0x symbols
  • Example: 0x8c61
  • Meaning: hexadecimal string[4]

Feature: String starting from \x symbols
  • Example: \xe8\xa8\xb1
  • Meaning: "\x is a string escape code, which happens to use hex notation" (hexadecimal notation)[5]
  • Restore to human readable ↔ encode text: hexadecimal to text ↔ text to hexadecimal

Feature: String starting from &# symbols
  • Example: &#35937;
  • Meaning: Unicode HTML code. "Unicode number in decimal, hex or octal"[6]
  • Restore to human readable ↔ encode text: PHP: html_entity_decode ↔ (See the following section to understand how to encode)

Feature: HTML source code starting from & ... ; symbols
  • Example: &amp; is &
  • Meaning: "all characters which have HTML character entity equivalents are translated into these entities"[7]
  • Restore to human readable ↔ encode text: PHP: htmlspecialchars_decode ↔ PHP: htmlentities


Possible approaches to encode the message text:

Approach: JavaScript encodeURIComponent() / JavaScript decodeURIComponent()[8]
  • Goal: "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools
  • Is Chinese text garbled/encoded? TRUE
  • before: http://www.中文網址.tw/my test.asp?name=ståle&car=saab
  • after: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab

Approach: URL Decoder/Encoder[9]
  • Goal: (same as above)
  • Is Chinese text garbled/encoded? TRUE
  • Sample text: (same as above)

Approach: PHP: json_encode / PHP: json_decode
  • Goal: Save array in mysql database
  • Is Chinese text garbled/encoded? TRUE
  • before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
  • after: {"\u4f5c\u8005":"\u99ac\u514b\u5410\u6eab","\u540d\u8a00":"\"To a man with a hammer, everything looks like a nail.\" He said."}

Approach: PHP: serialize / PHP: unserialize
  • Goal: Save array in mysql database
  • Is Chinese text garbled/encoded? FALSE
  • before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
  • after: a:2:{s:6:"作者";s:12:"馬克吐溫";s:6:"名言";s:64:""To a man with a hammer, everything looks like a nail." He said.";}

Approach: PHP: htmlentities[1] / PHP: html_entity_decode
  • Goal: Replace reserved characters e.g. double quote symbol
  • Is Chinese text garbled/encoded? FALSE
  • before: 馬克吐溫名言 "To a man with a hammer, everything looks like a nail."
  • after: 馬克吐溫名言 &quot;To a man with a hammer, everything looks like a nail.&quot;
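
For comparison, two of these approaches (JSON escaping and HTML entities) can be sketched with Python's standard `json` and `html` modules. This is an illustration in Python, not the PHP functions named above, and the variable names are hypothetical.

```python
import html
import json

record = {"作者": "馬克吐溫"}

# json.dumps escapes non-ASCII characters by default, so Chinese text looks encoded
escaped = json.dumps(record)
# ensure_ascii=False keeps the Chinese text readable in the output
readable = json.dumps(record, ensure_ascii=False)
print(escaped)
print(readable)

quote_text = '馬克吐溫名言 "To a man with a hammer, everything looks like a nail."'
# html.escape replaces reserved characters such as the double quote, not Chinese text
entity_text = html.escape(quote_text)
print(entity_text)
print(html.unescape(entity_text) == quote_text)
```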

Other functions


String contains %2 or %20 symbols

Using the following functions
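
As one illustration, Python's standard `urllib.parse` module provides `unquote` and `quote` for this. A minimal sketch using the example string from the table above (passing `safe=""` is a choice made here so that ":" and "/" are escaped too, roughly like JavaScript encodeURIComponent()):

```python
from urllib.parse import quote, unquote

encoded = "http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F"

# Percent-decoding; the bytes are interpreted as UTF-8 by default
decoded = unquote(encoded)
print(decoded)

# Percent-encoding; safe="" means no reserved character is left unescaped
print(quote(decoded, safe=""))
```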

String starting from \u, \U or U+ symbol

Using PHP. Type is string

$encoded = <<<EOT
"\u8c61"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"

$encoded = <<<EOT
"\ud83d\udc18"
EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 🐘
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"

Using PHP v. 7.0 Unicode Codepoint Escape Syntax[10]

echo "\u{8c61}" . PHP_EOL; // print 象
echo "\u{0001f418}" . PHP_EOL; // print 🐘

Using Python. Type is string

x = u'象'
x.encode('ascii', 'backslashreplace') 
# print b'\\u8c61'

x = u'🐘'
x.encode('ascii', 'backslashreplace') 
# print b'\\U0001f418'
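
The reverse direction (escaped text back to characters) is not shown above. A minimal sketch in Python, assuming the input contains only ASCII characters and escape sequences; `from_u_plus` is a hypothetical helper name introduced for illustration:

```python
import json

# "\\u8c61" is the six-character text \u8c61, not the character itself
bmp_char = "\\u8c61".encode("ascii").decode("unicode_escape")
astral_char = "\\U0001f418".encode("ascii").decode("unicode_escape")
print(bmp_char, astral_char)

# JSON writes characters beyond U+FFFF as a surrogate pair of \uXXXX escapes
from_json = json.loads('"\\ud83d\\udc18"')
to_json = json.dumps("🐘")
print(from_json, to_json)


def from_u_plus(notation):
    """Convert 'U+1F418' style notation to the character it names."""
    return chr(int(notation[2:], 16))


print(from_u_plus("U+1F418"))
```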

Using PHP. Type is array

$input = <<<EOT
["\u8c61"]
EOT;

$input = trim($input);
var_dump(json_decode($input, true)); // print array("象")
var_dump(json_encode(array("象"))); // print ["\u8c61"]

String starting from 0x symbol

Using the Python chr() function and hex() function

int('0x8c61', 16)
# print 35937 -- "An integer representing a valid Unicode code point" cited from w3schools
chr(int('0x8c61', 16))
# print '象' -- "returns the character that represents the specified unicode." cited from w3schools
hex(ord('象'))
# print '0x8c61' -- "converts an integer number to the corresponding hexadecimal string." cited from programiz.com

chr(int('0x1f418', 16))
# print '🐘'
hex(ord('🐘'))
# print '0x1f418'

String starting from \x symbol

Using Python[11][12][13]

data = u"象"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xe8\xb1\xa1'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"🐘"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xf0\x9f\x90\x98'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"だいじょうぶ"
data
hex_notation = data.encode('utf-8')
hex_notation 
# print b'\xe3\x81\xa0\xe3\x81\x84\xe3\x81\x98\xe3\x82\x87\xe3\x81\x86\xe3\x81\xb6'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)
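
When the \x sequences arrive as plain text (for example copied from a log) rather than as a bytes literal, they need to be converted back to bytes first. A minimal sketch, assuming the text contains only ASCII characters and \x escapes that together form valid UTF-8; both function names are hypothetical:

```python
def decode_hex_escapes(text):
    """Turn literal '\\xe8\\xb1\\xa1' style text back into readable text."""
    # unicode_escape maps each \xNN to the code point NN; latin-1 maps it back to byte NN
    raw = text.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")


def encode_hex_escapes(text):
    """Write each UTF-8 byte of the text as a \\xNN escape."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


escaped = r"\xe8\xb1\xa1"
print(decode_hex_escapes(escaped))
print(encode_hex_escapes("象"))
```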

Using PHP[14]: See it in action

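// "/./" without the u modifier matches one byte at a time
echo preg_replace_callback("/./", function($matched) {
    // %02x zero-pads bytes below 0x10 to two hex digits
    return sprintf('\x%02x', ord($matched[0]));
}, '🐘');

// print \xf0\x9f\x90\x98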

String starting from &# symbols

Using PHP html_entity_decode() Function[15][16]

To decode the text

$unicode_html = '&#128024;';
echo html_entity_decode($unicode_html) . PHP_EOL; // print 🐘

$unicode_html = '&#128024;';
echo mb_convert_encoding($unicode_html, 'UTF-8', 'HTML-ENTITIES') . PHP_EOL; // print 🐘

To encode the text

$input = "🐘";
$unicode_html = base_convert(bin2hex(mb_convert_encoding($input, 'UTF-32', 'utf-8')), 16, 10);
$unicode_html = '&#' . $unicode_html . ';';
echo 'unicode_html: ' . $unicode_html . PHP_EOL; // print unicode_html: &#128024;
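
The same round trip can be sketched in Python with the standard `html` module and the `xmlcharrefreplace` error handler. This is an illustration in Python rather than the PHP approach above, and `to_numeric_entities` is a hypothetical helper name:

```python
import html

# Decode numeric character references (decimal and hexadecimal forms)
from_decimal = html.unescape("&#128024;")
from_hex = html.unescape("&#x1F418;")
print(from_decimal, from_hex)


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal &#...; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_entities("🐘 and 象"))
```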

Ways to fix garbled message text

ConvertZ v.8.02

  • choose encoding: manually (mainly Asian languages)
  • convert to UTF-8: available
  • convert to big5 from UTF-8: available (warning: the software may change the wording, e.g. 余美人 -> 於美人)
  • wrap long text: available

EmEditor v.14.3.1 ($)

Google Chrome v.10 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text: available (auto; works well)

MadEdit v.0.2.9.1

  • choose encoding: manually and auto-detect (works well)
  • convert to UTF-8: available
  • wrap long text: available

Microsoft Internet Explorer v.8 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text:

Microsoft notepad (記事本) for Windows

method 1: Err: "Solving garbled text when opening a Simplified Chinese txt file with Notepad" (2010): notepad + Notepad++

  • choose encoding: manually
  • convert to UTF-8: available by Notepad++
  • wrap long text: available


method 2: Microsoft AppLocale utility (patched: piaip pAppLocale) + notepad

  • choose encoding: manually
  • convert to UTF-8: not available
  • wrap long text: available

Microsoft Office Word 2003 ($)

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available

Mozilla Firefox v.3.6 (viewer)

javascript:(function() { var D = document; F(D.body); function F(n) { var u, r, c, x; if (n.nodeType == 3) { u = n.data.search(/\S{45}/); if (u >= 0) { r = n.splitText(u + 45); n.parentNode.insertBefore(D.createElement('wbr'), r); } } else if ((n.tagName != 'STYLE') && (n.tagName != 'SCRIPT')) { for (c = 0; x = n.childNodes[c]; ++c) { F(x); } } } D.body.innerHTML += ' '; })();


Notepad++ v.5.8

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available


not supported at this moment

Further reading


Unicode table

References