Fix garbled message text: Difference between revisions

From LemonWiki共筆

Revision as of 08:55, 2 February 2019

Ideas on how to fix garbled message text

  1. Possible cause
    • Encoding issue: choose the correct language/encoding for the message text, or let a tool auto-detect the encoding
    • PHP utf8_encode() & utf8_decode()
  2. (optional) Convert the current encoding to UTF-8
  3. (optional) Wrap the text to the window size
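
The three steps above can be sketched in Python, assuming the garbled message is available as raw bytes. The helper name `to_utf8` and the candidate list are illustrative assumptions, not part of any tool on this page. Trial decoding is only a rough stand-in for real auto-detection.

```python
import textwrap

# Hypothetical candidate list. Order matters: permissive codecs such as
# gb18030 accept many byte sequences, so stricter ones are tried first.
CANDIDATES = ("utf-8", "big5", "gb18030")


def to_utf8(raw, candidates=CANDIDATES):
    """Return (encoding, utf8_bytes) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding, text.encode("utf-8")
    raise ValueError("none of the candidate encodings matched")


# Steps 1 and 2: guess the encoding, then convert to UTF-8.
big5_bytes = "中文訊息".encode("big5")
encoding, utf8_bytes = to_utf8(big5_bytes)
print(encoding, utf8_bytes.decode("utf-8"))

# Step 3: wrap long text to a fixed width (40 columns here).
print(textwrap.fill("garbled message text " * 6, width=40))
```

A successful decode does not prove the guess is right, so the result should still be checked by eye.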


Possible approaches to encode the message text:

Approach: JavaScript encodeURIComponent() / JavaScript decodeURIComponent()[1]
  • Goal: "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools
  • Is Chinese text garbled/encoded? TRUE
  • before: http://www.中文網址.tw/my test.asp?name=ståle&car=saab
  • after: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab

Approach: URL Decoder/Encoder[2]
  • Goal: (same as above)
  • Is Chinese text garbled/encoded? TRUE
  • Sample text: (same as above)

Approach: PHP: json_encode / PHP: json_decode
  • Goal: Save array in mysql database
  • Is Chinese text garbled/encoded? TRUE
  • before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
  • after: {"\u4f5c\u8005":"\u99ac\u514b\u5410\u6eab","\u540d\u8a00":"\"To a man with a hammer, everything looks like a nail.\" He said."}

Approach: PHP: serialize / PHP: unserialize
  • Goal: Save array in mysql database
  • Is Chinese text garbled/encoded? FALSE
  • before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said.");
  • after: a:2:{s:6:"作者";s:12:"馬克吐溫";s:6:"名言";s:64:""To a man with a hammer, everything looks like a nail." He said.";}

Approach: PHP: htmlentities[1] / PHP: html_entity_decode
  • Goal: Replace reserved characters e.g. double quote symbol
  • Is Chinese text garbled/encoded? FALSE
  • before: 馬克吐溫名言 "To a man with a hammer, everything looks like a nail."
  • after: 馬克吐溫名言 &quot;To a man with a hammer, everything looks like a nail.&quot;
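
For comparison, the URL and HTML approaches can be tried from Python's standard library. This is a sketch, not an exact equivalent: `urllib.parse.quote` with `safe=""` behaves like encodeURIComponent() for this sample but also escapes a few characters, such as `!` and `*`, that encodeURIComponent() leaves alone. `html.escape` only replaces the reserved characters `&`, `<`, `>` and quotes.

```python
import html
from urllib.parse import quote, unquote

url = "http://www.中文網址.tw/my test.asp?name=ståle&car=saab"

# safe="" also escapes "/" (quote() leaves "/" unescaped by default).
encoded_url = quote(url, safe="")
print(encoded_url)
print(unquote(encoded_url) == url)

saying = '馬克吐溫名言 "To a man with a hammer, everything looks like a nail."'

# The Chinese text is left as is; only the double quotes become &quot;
escaped = html.escape(saying)
print(escaped)
print(html.unescape(escaped) == saying)
```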

Other functions

List of the garbled text and possible cause

Feature: Website address contains % symbols (e.g. %2F)
  • Example: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F
  • Meaning: "converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools
  • Restore to human readable ↔ encode text: URL decode ↔ URL encode

Feature: Downloaded JSON or JavaScript file whose content contains \u or \U symbols
  • Example: \u8c61 or \U0001f418
  • Meaning: (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"[3][4]
  • Restore to human readable ↔ encode text: JSON decode ↔ JSON encode

Feature: String starting with 0x symbols
  • Example: 0x8c61
  • Meaning: hexadecimal string[5]

Feature: String starting with \x symbols
  • Example: \xe8\xa8\xb1
  • Meaning: "\x is a string escape code, which happens to use hex notation" (hexadecimal notation)[6]
  • Restore to human readable ↔ encode text: hexadecimal to text ↔ text to hexadecimal
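
The features listed above can be told apart with a few regular expressions. The sketch below is illustrative only: the function name `restore` is hypothetical, the input is assumed to be plain ASCII, and `\x` bytes are assumed to be UTF-8.

```python
import re
from urllib.parse import unquote


def restore(text):
    """Guess the escape form of an ASCII-only string and decode it."""
    if re.fullmatch(r"0x[0-9A-Fa-f]+", text):
        # "0x8c61": one code point written as a hexadecimal number.
        return chr(int(text, 16))
    if re.search(r"\\x[0-9A-Fa-f]{2}", text):
        # "\xe8\xa8\xb1": escaped bytes, assumed here to be UTF-8.
        raw = text.encode("ascii").decode("unicode_escape").encode("latin-1")
        return raw.decode("utf-8")
    if re.search(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}", text):
        # "\u8c61" or "\U0001f418": escaped code points.
        chars = text.encode("ascii").decode("unicode_escape")
        # Re-join UTF-16 surrogate pairs such as "\ud83d\udc18".
        return chars.encode("utf-16", "surrogatepass").decode("utf-16")
    if re.search(r"%[0-9A-Fa-f]{2}", text):
        # "%E4%B8%AD": URL (percent) encoding.
        return unquote(text)
    return text


for sample in (r"0x8c61", r"\xe8\xa8\xb1", r"\u8c61", r"\ud83d\udc18",
               "%E4%B8%AD%E6%96%87"):
    print(sample, "->", restore(sample))
```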

String starting with \u or \U symbols

Using PHP. Type is string

$encoded = json_encode("象");
echo "encoded string: " . $encoded . PHP_EOL; // print "\u8c61"
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象

$encoded = json_encode("🐘");
echo "encoded string: " . $encoded . PHP_EOL; // print "\ud83d\udc18"
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 🐘

Using Python. Type is string

x = u'象'
x.encode('ascii', 'backslashreplace') 
# print b'\\u8c61'

x = u'🐘'
x.encode('ascii', 'backslashreplace') 
# print b'\\U0001f418'
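
The reverse direction, turning the escaped bytes back into text, can be sketched with the `unicode_escape` codec. It assumes the escaped data is pure ASCII, which is what `backslashreplace` produces here.

```python
# Escape non-ASCII characters, then restore them.
escaped = "象🐘".encode("ascii", "backslashreplace")
print(escaped)

restored = escaped.decode("unicode_escape")
print(restored)
```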

Using PHP. Type is array

$input = <<<EOT

["\u8c61"]

EOT;

$input = trim($input);
var_dump(json_decode($input, true)); // print array("象")
var_dump(json_encode(array("象"))); // print ["\u8c61"]
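
For comparison, a Python sketch of the same JSON round trip using the standard `json` module. Passing `ensure_ascii=False` keeps the Chinese text readable instead of escaping it.

```python
import json

decoded = json.loads('["\\u8c61"]')
print(decoded)

# By default non-ASCII characters are written as \uXXXX escapes.
print(json.dumps(decoded))

# ensure_ascii=False writes the characters themselves.
print(json.dumps(decoded, ensure_ascii=False))

# A surrogate pair is joined back into one character when decoding.
print(json.loads('"\\ud83d\\udc18"'))
```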

String starting with 0x symbols

Using Python chr() function and hex() function

int('0x8c61', 16)
# print 35937 -- "An integer representing a valid Unicode code point" cited from w3schools
chr(int('0x8c61', 16))
# print '象' -- "returns the character that represents the specified unicode." cited from w3schools
hex(ord('象'))
# print '0x8c61' -- "converts an integer number to the corresponding hexadecimal string." cited from programiz.com

chr(int('0x1f418', 16))
# print '🐘'
hex(ord('🐘'))
# print '0x1f418'
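
The same calls extend to a whole string, one hexadecimal value per character. The two helper names below are hypothetical and only wrap `hex()`, `ord()`, `chr()` and `int()`.

```python
def to_hex_code_points(text):
    """Return one '0x...' string per character of text."""
    return [hex(ord(ch)) for ch in text]


def from_hex_code_points(hex_values):
    """Rebuild the text from a list of '0x...' strings."""
    return "".join(chr(int(value, 16)) for value in hex_values)


codes = to_hex_code_points("象🐘")
print(codes)
print(from_hex_code_points(codes))
```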

String starting with \x symbols

Using Python[7][8][9]

data = u"象"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xe8\xb1\xa1'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"🐘"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xf0\x9f\x90\x98'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"だいじょうぶ"
data
hex_notation = data.encode('utf-8')
hex_notation 
# print b'\xe3\x81\xa0\xe3\x81\x84\xe3\x81\x98\xe3\x82\x87\xe3\x81\x86\xe3\x81\xb6'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)
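
When the \x sequence arrives as literal text rather than as a bytes object, it can be converted through a plain hex string. A minimal sketch, assuming the bytes are UTF-8:

```python
data = "象".encode("utf-8")

# Text to hexadecimal: a plain hex string and the \x-escaped form.
hex_text = data.hex()
escaped_text = "".join(f"\\x{byte:02x}" for byte in data)
print(hex_text)
print(escaped_text)

# Hexadecimal to text: strip the "\x" markers, then decode the bytes.
restored = bytes.fromhex(escaped_text.replace("\\x", "")).decode("utf-8")
print(restored)
```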

Ways to fix garbled message text

ConvertZ v.8.02

  • choose encoding: manually (mainly Asian languages)
  • convert to UTF-8: available
  • convert to Big5 from UTF-8: available (caution: the software may change the wording, e.g. 余美人 -> 於美人)
  • wrap long text: available

EmEditor v.14.3.1 ($)

Google Chrome v.10 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text: available (auto; works well)

MadEdit v.0.2.9.1

  • choose encoding: manually and auto-detect (works well)
  • convert to UTF-8: available
  • wrap long text: available

Microsoft Internet Explorer v.8 (viewer)

  • choose encoding: manually and auto-detect
  • wrap long text:

Microsoft Notepad (記事本) for Windows

method 1: "Fixing garbled text when opening a Simplified Chinese txt file in Notepad" (2010): Notepad + Notepad++

  • choose encoding: manually
  • convert to UTF-8: available via Notepad++
  • wrap long text: available


method 2: Microsoft AppLocale utility (patched: piaip pAppLocale) + Notepad

  • choose encoding: manually
  • convert to UTF-8: not available
  • wrap long text: available

Microsoft Office Word 2003 ($)

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available

Mozilla Firefox v.3.6 (viewer)

javascript:(function() { var D = document; F(D.body); function F(n) { var u, r, c, x; if (n.nodeType == 3) { u = n.data.search(/\S{45}/); if (u >= 0) { r = n.splitText(u + 45); n.parentNode.insertBefore(D.createElement('wbr'), r); } } else if ((n.tagName != 'STYLE') && (n.tagName != 'SCRIPT')) { for (c = 0; x = n.childNodes[c]; ++c) { F(x); } } } D.body.innerHTML += ' '; })();
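
The bookmarklet above appears to walk the page's text nodes and insert a `<wbr>` element after each run of 45 non-whitespace characters, so that long unbroken strings can wrap. A rough plain-text sketch of the same idea in Python follows. The function name and parameters are hypothetical, and it is not a line-by-line translation of the bookmarklet.

```python
import re


def insert_break_hints(text, run_length=45, hint="<wbr>"):
    """Insert hint after every run_length non-whitespace characters
    that are followed by more non-whitespace characters."""
    pattern = re.compile(r"(\S{%d})(?=\S)" % run_length)
    return pattern.sub(lambda match: match.group(1) + hint, text)


long_word = "a" * 100
print(insert_break_hints(long_word))
print(insert_break_hints("short text stays unchanged"))
```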


Notepad++ v.5.8

  • choose encoding: manually
  • convert to UTF-8: available
  • wrap long text: available


not supported at this moment

Further reading

References