
=== Find Non-ASCII Characters (Chinese/Non-English Text) ===


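==== Find Non-ASCII Characters in LibreOffice ====

In the Find & Replace dialog, enable ''Regular expressions'' and search for:

<pre>[^\u0000-\u0080]+</pre>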


==== Find Chinese Characters in Google Sheets ====

Example: If cell {{kbd | key=A2}} contains any Chinese character, display “Chinese”, otherwise display “English”:

<pre>=IF(REGEXMATCH(A2, "[一-龥]"), "Chinese", "English")</pre>

==== Find Non-ASCII Characters in Google Sheets ====
Extract the first run of non-ASCII characters (such as Chinese, Japanese, or emoji) from cell {{kbd | key=A2}}, or return an empty string if there is none:
<pre>
=IF(ISERROR(REGEXEXTRACT(A2, "[^\x00-\x80]+")), "", REGEXEXTRACT(A2, "[^\x00-\x80]+"))
</pre>

Explanation of regular expression {{kbd | key=<nowiki>[^\x00-\x80]+</nowiki>}}:

* {{kbd | key=<nowiki>[\x00-\x80]</nowiki>}}: Represents character codes 0-128, that is, the ASCII range plus one extra code.
** The standard ASCII range is 0-127 ({{kbd | key=<nowiki>0x00-0x7F</nowiki>}}, written as {{kbd | key=<nowiki>[\x00-\x7F]</nowiki>}}).<ref>[https://www.commfront.com/pages/ascii-chart ASCII Chart – CommFront]</ref>
** Character 128 ({{kbd | key=<nowiki>0x80</nowiki>}}) is not part of the original ASCII standard; it is the first character in the extended ASCII range.<ref>[https://en.wikipedia.org/wiki/UTF-8 UTF-8 - Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia]</ref> To match strictly non-ASCII characters, use {{kbd | key=<nowiki>[^\x00-\x7F]+</nowiki>}} instead.
* {{kbd | key=<nowiki>[^...]</nowiki>}}: Means "not" these characters
* {{kbd | key=<nowiki>+</nowiki>}}: Means one or more

Overall meaning: Matches one or more non-ASCII characters
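
The difference between the two ranges can be checked outside Google Sheets. The following is a minimal Python sketch; it uses Python's <code>re</code> module rather than the engine behind Google Sheets, so it is only an approximation, and the names <code>LOOSE</code>, <code>STRICT</code>, and <code>first_non_ascii</code> are illustrative.

<syntaxhighlight lang="python">import re

# Pattern from this section: treats U+0080 as if it were ASCII.
LOOSE = re.compile(r"[^\x00-\x80]+")
# Strict variant: everything above U+007F counts as non-ASCII.
STRICT = re.compile(r"[^\x00-\x7F]+")


def first_non_ascii(text, pattern=STRICT):
    """Return the first run of non-ASCII characters, or "" if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else ""


print(first_non_ascii("Hello 世界!"))            # 世界
print(repr(first_non_ascii("a\x80b", LOOSE)))   # '' (U+0080 is skipped)
print(repr(first_non_ascii("a\x80b", STRICT)))  # '\x80' (U+0080 is reported)</syntaxhighlight>

For typical text (Chinese, Japanese, emoji) both patterns return the same result; they differ only for the single character U+0080.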

==== Find Chinese Characters in MySQL ====

Find rows where <code>column_name</code> contains Chinese characters:

<syntaxhighlight lang="sql">SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';</syntaxhighlight>
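
The <code>HEX()</code> approach works because, in UTF-8, every character from U+4000 to U+9FFF is encoded as three bytes whose first byte is <code>E4</code> through <code>E9</code>, while <code>(..)*</code> keeps the match aligned to whole bytes. A minimal Python sketch of the same check, assuming the column is stored as UTF-8 (the helper name <code>hex_has_cjk_lead_byte</code> is illustrative):

<syntaxhighlight lang="python">import re

# Same pattern as the SQL query; (..)* skips whole bytes (two hex digits each).
HEX_PATTERN = re.compile(r"^(..)*(E[4-9])")


def hex_has_cjk_lead_byte(text):
    """Mimic HEX(column) REGEXP '^(..)*(E[4-9])' for a UTF-8 string."""
    hex_string = text.encode("utf-8").hex().upper()
    return HEX_PATTERN.search(hex_string) is not None


print("中".encode("utf-8").hex().upper())  # E4B8AD
print(hex_has_cjk_lead_byte("abc 中文"))   # True
print(hex_has_cjk_lead_byte("café"))       # False</syntaxhighlight>

Because the byte range corresponds to U+4000-U+9FFF, the check also accepts a few non-ideograph characters in that range, such as the Yijing hexagram symbols (U+4DC0-U+4DFF).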

Find rows where <code>column_name</code> contains only Chinese characters. This relies on a multibyte-aware <code>REGEXP</code>, which MySQL provides from version 8.0 onward:

<syntaxhighlight lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` REGEXP '^[一-龯]+$';</syntaxhighlight>

Explanation:
* {{kbd | key=<nowiki>[一-龯]</nowiki>}}: Character set that matches every character from "一" to "龯" in Unicode order
* "一" has Unicode code point {{kbd | key=<nowiki>U+4E00</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+4E00 “一” U+4E00 CJK Unified Ideograph-4E00 Unicode Character]</ref>
* "龯" has Unicode code point {{kbd | key=<nowiki>U+9FAF</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+9FAF “龯” U+9FAF CJK Unified Ideograph-9FAF Unicode Character]</ref>
* The range U+4E00-U+9FAF lies inside the CJK Unified Ideographs block (U+4E00-U+9FFF) and already covers over 99% of daily Chinese usage requirements. [https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B Extension B] and later blocks mainly contain ancient Chinese characters, variant characters, etc., which rarely appear in modern texts.
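
The code points listed above can be verified with a short Python sketch. Python's <code>re</code> module stands in for MySQL's regular expression engine here, which is an assumption made for illustration, and <code>is_all_chinese</code> is a hypothetical helper name.

<syntaxhighlight lang="python">import re

# Same character class as the SQL query: U+4E00 ("一") through U+9FAF ("龯").
ONLY_CHINESE = re.compile(r"[一-龯]+")


def is_all_chinese(text):
    """Return True when every character of text lies in U+4E00-U+9FAF."""
    return ONLY_CHINESE.fullmatch(text) is not None


print(f"U+{ord('一'):04X}")        # U+4E00
print(f"U+{ord('龯'):04X}")        # U+9FAF
print(is_all_chinese("中文"))      # True
print(is_all_chinese("中文 ABC"))  # False</syntaxhighlight>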

==== Find Non-ASCII Characters in MySQL ====

Find rows where <code>column_name</code> contains at least one non-ASCII character:

<syntaxhighlight lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII);</syntaxhighlight>
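
<code>CONVERT(... USING ASCII)</code> replaces every character that has no ASCII equivalent with <code>?</code>, so the converted value differs from the original exactly when a non-ASCII character is present. A rough Python analogue of that comparison follows; it is an illustration only, not MySQL's implementation, and <code>has_non_ascii</code> is a hypothetical name.

<syntaxhighlight lang="python">def has_non_ascii(text):
    """Return True if text changes when forced into ASCII with '?' substitution."""
    # errors="replace" substitutes "?" for each character ASCII cannot represent.
    ascii_version = text.encode("ascii", errors="replace").decode("ascii")
    return ascii_version != text


print(has_non_ascii("plain text"))  # False
print(has_non_ascii("naïve"))       # True
print(has_non_ascii("what?"))       # False (a literal "?" is unchanged)</syntaxhighlight>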

==== Find Chinese Characters in PHP ====

'''Exact match:'''

<syntaxhighlight lang="php">$string = '繁體中文'; // sample input

// Approach 1: explicit code point range U+4E00-U+9FA5
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}

// Approach 2: Unicode script property (covers every Han character, including those used in Japanese and Korean text)
if (preg_match('/^[\p{Han}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}</syntaxhighlight>
'''Partial match:'''

<syntaxhighlight lang="php">// Approach 1: Unicode script property
$string = '繁體中文-简体中文-English-12345-。，！-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
// PREG_OFFSET_CAPTURE returns each match together with its byte offset
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

// Approach 2: explicit code point range U+4E00-U+9FA5
$string = '繁體中文-简体中文-English-12345-。，！-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);</syntaxhighlight>