Fix garbled message text: Difference between revisions

Latest revision as of 14:21, 2 May 2024

How to fix garbled message text

Ideas on how to fix garbled message text

Possible cause
- Encoding issue: choose the correct language/encoding for the message text, or let a tool auto-detect the encoding
- PHP utf8_encode() & utf8_decode() (these convert only between ISO-8859-1 and UTF-8)
(optional) Convert the text from its current encoding to UTF-8
(optional) Wrap the text to the window size
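
The "choose the correct encoding, then convert to UTF-8" steps can be sketched in Python. This is only an illustration: the function name decode_with_candidates and the candidate list are assumptions, not part of any tool listed on this page, and trying encodings in order is a heuristic, since the same bytes can be valid in several legacy encodings.

```python
def decode_with_candidates(raw, candidates=("utf-8", "big5", "gb18030", "shift_jis")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of `candidates` decides ties, so put the
    most likely encoding first.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Simulate a Big5 message, detect its encoding, then convert it to UTF-8.
raw = "中文網址".encode("big5")
text, used = decode_with_candidates(raw)
utf8_bytes = text.encode("utf-8")
print(used, text)
```

A successful decode does not prove the guess is right, so the result still needs a visual check.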

Text that looks garbled but is actually encoded, and the likely root cause

Feature	Example	Meaning	Restore to human-readable text ↔ encode text
String contains sequences such as %2F or %20 mixed with readable English characters	http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2F	"converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools	URL decode ↔ URL encode
String starting with \u, \U or U+	\u8c61, \U0001f418 or U+1F418	Unicode number: "Unicode code point is referred to by writing "U+" followed by its hexadecimal number.^[1]" (1) 16-bit or 32-bit hex value (2) "JSON representation of the supplied value"^[2]^[3]	JSON decode ↔ JSON encode
String starting with 0x	0x8c61	hexadecimal string^[4]	Python: chr(int(..., 16)) ↔ hex(ord(...))
String starting with \x	\xe8\xa8\xb1	"\x is a string escape code, which happens to use hex notation" (hexadecimal notation)^[5]	hexadecimal to text ↔ text to hexadecimal
String starting with &#	&#35937;	Unicode HTML code. "Unicode number in decimal, hex or octal"^[6]	PHP: html_entity_decode ↔ (see the "String starting with &#" section for how to encode)
HTML source code containing & ... ; sequences	& a m p ; (without whitespace) is &	"all characters which have HTML character entity equivalents are translated into these entities"^[7]	PHP: htmlspecialchars_decode ↔ PHP: htmlentities
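
The "Feature" column above can be turned into a rough classifier. The sketch below is an assumption-laden illustration: the function name guess_escape_format, the labels and the regular expressions are made up for this page, and a match only suggests a format, it does not prove it.

```python
import re

# (label, pattern) pairs derived from the "Feature" column; the first match wins.
PATTERNS = [
    ("url", re.compile(r"%[0-9A-Fa-f]{2}")),
    ("unicode_escape", re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}|U\+[0-9A-Fa-f]{4,6}")),
    ("hex_bytes", re.compile(r"\\x[0-9A-Fa-f]{2}")),
    ("hex_number", re.compile(r"\b0x[0-9A-Fa-f]+")),
    ("html_numeric", re.compile(r"&#(?:\d+|[xX][0-9A-Fa-f]+);")),
    ("html_named", re.compile(r"&[A-Za-z][A-Za-z0-9]*;")),
]


def guess_escape_format(text):
    """Return the label of the first pattern found in `text`, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


print(guess_escape_format(r"\u8c61"))
print(guess_escape_format("&#35937;"))
```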

Possible approaches to encode the message text:

Approach	Goal	Is Chinese text garbled/encoded?	Sample text before and after encoding
JavaScript encodeURIComponent() ↔ JavaScript decodeURIComponent()^[8]	"converts characters into a format that can be transmitted over the Internet ... " Cited from w3schools	TRUE	before: http://www.中文網址.tw/my test.asp?name=ståle&car=saab after: http%3A%2F%2Fwww.%E4%B8%AD%E6%96%87%E7%B6%B2%E5%9D%80.tw%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab
URL Decoder/Encoder^[9]	(same as above)	TRUE	(same as above)
PHP: json_encode ↔ PHP: json_decode	Save array in mysql database	TRUE	before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said."); after: {"\u4f5c\u8005":"\u99ac\u514b\u5410\u6eab","\u540d\u8a00":"\"To a man with a hammer, everything looks like a nail.\" He said."}
PHP: serialize ↔ PHP: unserialize	Save array in mysql database	FALSE	before: array("作者" => "馬克吐溫", "名言" => "\"To a man with a hammer, everything looks like a nail.\" He said."); after: a:2:{s:6:"作者";s:12:"馬克吐溫";s:6:"名言";s:64:""To a man with a hammer, everything looks like a nail." He said.";}
PHP: htmlentities ↔ PHP: html_entity_decode	Replace reserved characters e.g. double quote symbol	FALSE	before: 馬克吐溫名言 "To a man with a hammer, everything looks like a nail." after: 馬克吐溫名言 &quot;To a man with a hammer, everything looks like a nail.&quot;

Other functions

JSON.parse() or jQuery.parseJSON() | jQuery API Documentation

String contains %2 or %20 symbols

Using the following functions
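
In Python, for instance, urllib.parse.quote() and urllib.parse.unquote() cover this case. A minimal sketch, assuming UTF-8 and that every reserved character should be escaped (safe=""), which for this input gives the same result as the JavaScript encodeURIComponent() sample in the table above:

```python
from urllib.parse import quote, unquote

original = "http://www.中文網址.tw/"

# safe="" also escapes "/" and ":"; by default quote() leaves "/" untouched.
encoded = quote(original, safe="")
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

quote() and encodeURIComponent() differ on a few characters such as ! ' ( ) *, so the two are not interchangeable for every input.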

String starting with \u, \U or U+

Using PHP. Type is string

$encoded = <<<EOT

"\u8c61"

EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"

$encoded = <<<EOT

"\ud83d\udc18"

EOT;
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 🐘
echo "encoded string: " . json_encode("🐘") . PHP_EOL; // print "\ud83d\udc18"

When using the heredoc syntax (<<<EOT ... EOT;), stray whitespace or hidden characters at the beginning or end of the block may prevent json_decode from parsing the string correctly. Assigning the string directly avoids these whitespace and formatting issues.

$encoded = '"\u8c61"';
echo "decoded string: " . json_decode($encoded, true) . PHP_EOL; // print 象
echo "encoded string: " . json_encode("象") . PHP_EOL; // print "\u8c61"

Using the Unicode codepoint escape syntax (PHP 7.0 and later)^[10]

echo "\u{8c61}" . PHP_EOL; // print 象
echo "\u{0001f418}" . PHP_EOL; // print 🐘

Using Python. Type is string

x = u'象'
x.encode('ascii', 'backslashreplace') 
# print b'\\u8c61'

x = u'🐘'
x.encode('ascii', 'backslashreplace') 
# print b'\\U0001f418'
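
The Python snippet above only encodes. For the reverse direction, and as a counterpart to the PHP json_decode/json_encode calls, the standard json module can be used. A short sketch, assuming the input is a complete JSON string literal including its double quotes:

```python
import json

# Decode: JSON string literals with \u escapes, including a surrogate pair.
decoded = json.loads('"\\u8c61"')
decoded_emoji = json.loads('"\\ud83d\\udc18"')

# Encode: ensure_ascii=True is the default and produces \u escapes.
encoded = json.dumps("象")
encoded_emoji = json.dumps("🐘")

# ensure_ascii=False keeps the characters readable instead of escaping them.
readable = json.dumps("象", ensure_ascii=False)

print(decoded, decoded_emoji, encoded, encoded_emoji, readable)
```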

Using PHP. Type is array

$input = <<<EOT

["\u8c61"]

EOT;

$input = trim($input);
var_dump(json_decode($input, true)); // print array("象")
var_dump(json_encode(array("象"))); // print ["\u8c61"]
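
The U+ notation named in this section's heading has no example above. A possible sketch in Python follows; the helper names from_u_plus and to_u_plus are hypothetical, and the input is assumed to be exactly "U+" followed by 4 to 6 hexadecimal digits.

```python
import re


def from_u_plus(notation):
    """Convert a string such as 'U+1F418' to the character it names."""
    match = re.fullmatch(r"U\+([0-9A-Fa-f]{4,6})", notation)
    if match is None:
        raise ValueError("not a U+XXXX code point: %r" % notation)
    return chr(int(match.group(1), 16))


def to_u_plus(char):
    """Convert a single character to U+ notation, padded to at least 4 digits."""
    return "U+%04X" % ord(char)


print(from_u_plus("U+8C61"), from_u_plus("U+1F418"))
print(to_u_plus("象"), to_u_plus("🐘"))
```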

String starting with 0x

Using the Python chr() function ↔ hex() function

int('0x8c61', 16)
# print 35937 -- "An integer representing a valid Unicode code point" cited from w3schools
chr(int('0x8c61', 16))
# print '象' -- "returns the character that represents the specified unicode." cited from w3schools
hex(ord('象'))
# print '0x8c61' -- "converts an integer number to the corresponding hexadecimal string." cited from programiz.com

chr(int('0x1f418', 16))
# print '🐘'
hex(ord('🐘'))
# print '0x1f418'

String starting with \x

Using Python^[11]^[12]^[13]

data = u"象"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xe8\xb1\xa1'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"🐘"
data
hex_notation = data.encode('utf-8')
hex_notation
# print b'\xf0\x9f\x90\x98'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)


data = u"だいじょうぶ"
data
hex_notation = data.encode('utf-8')
hex_notation 
# print b'\xe3\x81\xa0\xe3\x81\x84\xe3\x81\x98\xe3\x82\x87\xe3\x81\x86\xe3\x81\xb6'
for each_unicode_character in hex_notation.decode('utf-8'):
    print(each_unicode_character)

Using PHP^[14]

echo preg_replace_callback('/./s', function ($matched) {
    // Without the /u modifier each match is a single byte; pad it to two hex digits.
    return sprintf('\x%02x', ord($matched[0]));
}, '🐘');

# print \xf0\x9f\x90\x98
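
The examples above go from text to \x notation. When the input is the escaped text itself (literal backslash-x sequences received as a string), both directions can be sketched in Python as follows. The function names are hypothetical, and x_escapes_to_text assumes the string consists only of \xNN sequences.

```python
def text_to_x_escapes(text, encoding="utf-8"):
    """Return one \\xNN escape per byte of the encoded text."""
    return "".join("\\x%02x" % byte for byte in text.encode(encoding))


def x_escapes_to_text(escaped, encoding="utf-8"):
    """Decode a string made up solely of \\xNN escapes back to text."""
    hex_digits = escaped.replace("\\x", "")
    return bytes.fromhex(hex_digits).decode(encoding)


print(text_to_x_escapes("象"))
print(x_escapes_to_text(r"\xe8\xa8\xb1"))
```

If the wrong encoding is passed, decode() raises UnicodeDecodeError or returns different characters, which is itself a hint about the real encoding.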

String starting with &#

Using the PHP html_entity_decode() function^[15]^[16]

To decode the text

$unicode_html = '&#128024;';
echo html_entity_decode($unicode_html) . PHP_EOL; // print 🐘

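$unicode_html = '&#128024;';
// Note: the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2; prefer html_entity_decode()
echo mb_convert_encoding($unicode_html, 'UTF-8', 'HTML-ENTITIES') . PHP_EOL; // print 🐘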

To encode the text

$input = "🐘"; // a single character; this approach does not handle longer strings
$unicode_html = base_convert(bin2hex(mb_convert_encoding($input, 'UTF-32', 'utf-8')), 16, 10);
$unicode_html = '&#' . $unicode_html . ';';
echo 'unicode_html: ' . $unicode_html . PHP_EOL; // print unicode_html: &#128024;
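
For comparison, a Python sketch of the same round trip using the standard html module. The helper to_numeric_references is hypothetical; it assumes that only non-ASCII characters should be turned into decimal references and it handles strings of any length.

```python
import html


def to_numeric_references(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


# Decode: html.unescape() handles numeric and named references.
print(html.unescape("&#128024;"))
print(html.unescape("&#35937; &amp; &quot;"))

# Encode
print(to_numeric_references("🐘"))
print(to_numeric_references("abc 象"))
```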

Ways to fix garbled message text

ConvertZ v.8.02

choose encoding: manually (mainly Asian languages)
convert to UTF-8: available
convert to Big5 from UTF-8: available, but the software may change the wording, e.g. 余美人 -> 於美人
allow wrapping long text: available

EmEditor v.14.3.1 ($)

choose encoding: manually and auto-detect
convert to UTF-8: available
allow wrapping long text: available
command-line support: see EmEditor FAQ: How can I convert file encodings with the command line?

Google Chrome v.10 (viewer)

choose encoding: manually and auto-detect
allow wrapping long text: available (auto)

MadEdit v.0.2.9.1

choose encoding: manually and auto-detect
convert to UTF-8: available
allow wrapping long text: available

Microsoft Internet Explorer v.8 (viewer)

choose encoding: manually and auto-detect
allow wrapping long text:

Microsoft Notepad (記事本) for Windows

method 1: Err: Fixing garbled text when opening a Simplified Chinese txt file with Notepad (2010): Notepad + Notepad++

choose encoding: manually
convert to UTF-8: available via Notepad++
allow wrapping long text: available

method 2: Microsoft AppLocale utility (patched: piaip pAppLocale) + Notepad

choose encoding: manually
convert to UTF-8: not available
allow wrapping long text: available

Microsoft Office Word 2003 ($)

choose encoding: manually
convert to UTF-8: available
allow wrapping long text: available

Mozilla Firefox v.3.6 (viewer)

choose encoding: manually and auto-detect
allow wrapping long text: not built in, but you can paste the following code into the address bar to wrap long text (thanks to Return of the Sasquatch: word wrap for Firefox bookmarklet)

javascript:(function() { var D = document; F(D.body); function F(n) { var u, r, c, x; if (n.nodeType == 3) { u = n.data.search(/\S{45}/); if (u >= 0) { r = n.splitText(u + 45); n.parentNode.insertBefore(D.createElement('wbr'), r); } } else if ((n.tagName != 'STYLE') && (n.tagName != 'SCRIPT')) { for (c = 0; x = n.childNodes[c]; ++c) { F(x); } } } D.body.innerHTML += ' '; })();

Notepad++ v.5.8

choose encoding: manually
convert to UTF-8: available
allow wrapping long text: available

Not supported at this moment

LibreOffice 3.3.0 - Writer
OpenOffice.org 3.3.0 - Writer is not supported, but OpenOffice.org Calc is supported.

References

[1] Unicode - Wikipedia
[2] PHP: json_encode - Manual
[3] RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format
[4] Python hex() - Python Standard Library
[5] Difference between different hex types/representations in Python - Stack Overflow
[6] &what Help
[7] PHP: htmlentities - Manual
[8] urlencode - How to Encode URL Contains Unicode Characters with PHP - Stack Overflow
[9] PHP urlencode()
[10] PHP: New features - Manual
[11] bytes.decode()
[12] str.encode()
[13] python - How to decode unicode in a Chinese text - Stack Overflow
[14] php - How to convert text to \x codes? - Stack Overflow
[15] PHP: Converting text to &#xxxxx; Unicode codes | Tsung's Blog
[16] php tech. unicode html convert | HINA::工程幼稚園. Summary: the Unicode HTML code is obtained by converting the text from its original encoding to UCS-2, taking the binary value, converting it from base 16 to base 10, and then adding the &# prefix.

