Regular expression: Difference between revisions
mNo edit summary |
mNo edit summary |
||
| Line 1: | Line 1: | ||
= Regular Expression Guide = | |||
When processing text files through regular expressions, you can quickly search for or replace strings that match specific rules. Processing is done on a line-by-line basis for string manipulation. | |||
Regular expressions are also known as regex, regexp, or pattern matching expressions. | |||
<blockquote>'''Need Help?''' You can use the provided explanatory [[#regular-expression-online-tools|online tools]] to try debugging yourself. | |||
</blockquote> | |||
<span id="quick-reference-table"></span> | |||
== Quick Reference Table == | |||
Note: (1) Blue highlighted areas in samples represent text matching the rules, (2) The same text rule can have multiple representations | |||
== | {| class="wikitable" | ||
=== | |- | ||
! Text Rule | |||
< | ! Sample | ||
! Opposite Text Rule | |||
! Sample | |||
|- | |||
| Any single character (including spaces, but not newline) <br> <code>.</code> | |||
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34 | |||
| | |||
| | |||
|- | |||
| Any character (including spaces), appears 1 or 0 times <br> <code>.?</code> = <code>.{0,1}</code> | |||
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34 | |||
| | |||
| | |||
|- | |||
| Any number of multiple characters (including spaces) <br> <code>.*</code> = <code>.{0,}</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span> | |||
| | |||
| | |||
|- | |||
| Any number of characters (including spaces), at least 1 occurrence <br> <code>.+</code> = <code>.{1,}</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span> | |||
| | |||
| | |||
|- | |||
| Any number of spaces or newlines (at least 1 occurrence) <br> <code>\s+</code> | |||
| What<span style="background:#C6E3FF"> </span>Does the Fox Say? 12 狐狸怎叫 34 | |||
| Any number of characters (not including spaces or newlines) <br> <code>[^\s]+</code> = <code>[^\s]{1,}</code> = <code>[\S]+</code> = <code>[^ ]+</code> | |||
| <span style="background:#C6E3FF">What</span> Does the Fox Say? 12 狐狸怎叫 34 | |||
|- | |||
| Any number of ASCII characters (including English, numbers and spaces) <br> <code>[\x00-\x80]+</code> or <code>[[:ascii:]]+</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12</span> 狐狸怎叫 34 | |||
| Non-ASCII, i.e., Chinese characters appearing any number of times <br> <code>[^\x00-\x80]+</code> | |||
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34 | |||
|- | |||
| Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces) <br> <code>[\w]+</code> = <code>[a-zA-Z0-9_]+</code> <br> PHP with <code>u</code> modifier supports Chinese characters | |||
| <span style="background:#C6E3FF">What</span> <span style="background:#C6E3FF">Does</span> <span style="background:#C6E3FF">the</span> <span style="background:#C6E3FF">Fox</span> <span style="background:#C6E3FF">Say</span>? <span style="background:#C6E3FF">12</span> 狐狸怎叫 <span style="background:#C6E3FF">_34</span> | |||
| Any number of characters that are not English letters, numbers and underscore (_) <br> <code>\W+</code> = <code>[^a-zA-Z0-9_]+</code> | |||
| | |||
|- | |||
| Any number of digits (not including spaces) <br> <code>[\d]+</code> = <code>[0-9]+</code> | |||
| What Does the Fox Say? <span style="background:#C6E3FF">12</span> 狐狸怎叫 34 | |||
| Any number of characters not including digits (including spaces) <br> <code>[^\d]+</code> = <code>[^0-9]+</code> = <code>\D+</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? </span>12 狐狸怎叫 34 | |||
|- | |||
| Any number of Chinese characters <br> <code>[\p{Han}]+</code> | |||
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34 | |||
| Any number of characters not including Chinese <br> <code>[^\p{Han}]+</code> | |||
| | |||
|- | |||
| Lines starting with “狐狸” <br> <code>^狐狸.*$</code> | |||
| <span style="background:#C6E3FF">狐狸怎叫 34 What Does the Fox Say?</span><br>柴犬怎叫 What Does the shiba inu say? | |||
| Lines not starting with “狐狸” <br> <code>^(?!狐狸).*$</code> | |||
| 狐狸怎叫 34 What Does the Fox Say?<br><span style="background:#C6E3FF">柴犬怎叫 What Does the shiba inu say?</span> | |||
|- | |||
| Lines ending with “怎叫” <br> <code>^.*怎叫$</code> | |||
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span> | |||
| Lines not ending with “怎叫” <br> <code>.*(?<!怎叫)$</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫 | |||
|- | |||
| Lines containing “狐狸” <br> <code>^.*狐狸.*$</code> or <code>(狐狸)</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫 | |||
| Lines not containing “狐狸” <br> <code>^((?!狐狸).)*$</code> | |||
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span> | |||
|- | |||
| Boolean logic AND: Lines containing both “狐狸” and “叫” <br> <code>(?=.*狐狸)(?=.*叫).*</code> or <code>狐狸.*叫\|叫.*狐狸</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br><span style="background:#C6E3FF">What Does the Fox Say? 12 不叫狐狸 34</span><br>What Does the shiba inu say? 柴犬怎叫 | |||
| | |||
| | |||
|- | |||
| Boolean logic OR: Lines containing “狐狸” or “叫” <br> <code>.*(狐狸\|叫).*</code> | |||
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫</span><br>What Does the shiba inu say? 柴犬怎了 | |||
| Boolean logic: Lines not containing “狐狸” and not containing “柴犬” <br> <code>^((?!狐狸\|柴犬).)*$</code> | |||
| What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫<br><span style="background:#C6E3FF">What Does the Husky say? 哈士奇怎叫</span> | |||
|- | |||
| Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬” <br> <code>^((?!狐狸).)*(柴犬).*$</code> = <code>^(柴犬).*((?!狐狸).)*$</code> = <code>(柴犬).*((?!狐狸).)*</code> | |||
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span> | |||
| | |||
| | |||
|} | |||
<span id="regular-expression-online-tools"></span> | |||
== Regular Expression Online Tools == | |||
</ | |||
Websites for testing regular expression syntax: * '''RegEx101''' - “Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript” - Provides syntax explanations * '''RegExr''' - Learn, Build, & Test RegEx - Provides syntax explanations * '''Regexper''' - Visual explanation of syntax using diagrams * '''Regulex''' - JavaScript Regular Expression Visualizer - Visual explanation using diagrams * '''Rubular''' - A Ruby regular expression editor and tester * '''PHP Live Regex''' * '''Regex Tester and Debugger Online''' - Javascript, PCRE, PHP | |||
Examples: * '''Regular Expression Library''' - Pattern examples provided by users | |||
< | <span id="common-use-cases"></span> | ||
< | == Common Use Cases == | ||
==== | |||
<span id="replace-newlines-with-commas"></span> | |||
=== Replace Newlines with Commas === | |||
< | |||
Converting email lists into a format usable by email software: | |||
<pre>Original: | |||
Convert to: | |||
[email protected],bbb@email.com,ccc@email.com</pre> | |||
<span id="method-1-sublime-text-emeditor"></span> | |||
==== Method 1: Sublime Text, EmEditor ==== | |||
< | |||
==== | |||
# Menu: Search -> Replace | |||
# Check “Use Regular Expression” | |||
#* Find: <code>\n</code> (newline character) | |||
#* Replace with: <code>,</code> | |||
# Click “Replace all” | |||
==== | <span id="method-2-notepad"></span> | ||
==== Method 2: Notepad++ ==== | |||
# Menu: Find -> Replace | |||
# Search mode: Check “Extended mode” (not “Regular expression”) | |||
#* Find: <code>\n</code> | |||
#* Replace with: <code>,</code> | |||
# Click “Replace All” | |||
<span id="method-3-microsoft-word"></span> | |||
==== Method 3: Microsoft Word ==== | |||
# Menu: Edit -> Replace | |||
( | # Check extended mode | ||
#* Find: <code>^p</code> (paragraph mark) | |||
#* Replace with: <code>,</code> | |||
# Click “Replace All” | |||
<span id="method-4-sed-command-for-linux"></span> | |||
==== Method 4: Sed command for Linux ==== | |||
==== | <syntaxhighlight lang="bash">sed ':a;N;$!ba;s/\n/; /g' old.filename > new.filename</syntaxhighlight> | ||
<span id="find-ip-addresses-ipv4"></span> | |||
=== Find IP Addresses (IPv4) === | |||
For Notepad++ v.5.9.5: - Find: <code>\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?</code> | |||
For Sublime Text v. 3.2.21: - Find: <code>(?:\d{1,3}\.){3}\d{1,3}</code> | |||
<span id="remove-black-squares-unix-line-endings-lf"></span> | |||
=== Remove Black Squares (UNIX Line Endings LF) === | |||
Using Notepad++: 1. Menu: Find -> Replace 2. Search mode: Check “Extended mode” - Find: <code>\n\n</code> (2 LF characters) - Replace with: <code>\r\n</code> (CR and LF) | |||
<span id="add-quotes-around-elements"></span> | |||
=== Add Quotes Around Elements === | |||
<span id="add-quotes-around-array-elements"></span> | |||
==== Add Quotes Around Array Elements ==== | |||
<pre>Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸 | |||
After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'</pre> | |||
'''Method 1: PHP''' | |||
= | <syntaxhighlight lang="php">$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'); | ||
// Single quotes around each element | |||
$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'); | |||
// | |||
$result = implode(",", preg_replace('/^(.*?)$/', "'$1'", $users)); | $result = implode(",", preg_replace('/^(.*?)$/', "'$1'", $users)); | ||
// Double quotes around each element | |||
// | |||
$result = implode(",", preg_replace('/^(.*?)$/', "\"$1\"", $users)); | $result = implode(",", preg_replace('/^(.*?)$/', "\"$1\"", $users)); | ||
echo $result; | echo $result;</syntaxhighlight> | ||
</ | '''Method 2: Sublime Text or EmEditor''' - Find: <code>([^\s|,]+)</code> - Replace with: <code>'\1'</code> (for single quotes) or <code>"\1"</code> (for double quotes) | ||
'''Method 3: Notepad++''' (Enable “Regular expression” search mode) - Find: <code>([^\s|,]+)</code> - Replace with: <code>'$1'</code> (for single quotes) or <code>"$1"</code> (for double quotes) | |||
<span id="find-non-ascii-characters-chinesenon-english-text"></span> | |||
=== Find Non-ASCII Characters (Chinese/Non-English Text) === | |||
<span id="in-google-sheets"></span> | |||
==== In Google Sheets ==== | |||
<pre>[^\x00-\x80]+</pre> | |||
<span id="in-libreoffice"></span> | |||
==== In LibreOffice ==== | |||
< | <pre>[^\u0000-\u0080]+</pre> | ||
< | <span id="find-chinese-characters-in-google-sheets"></span> | ||
==== Find Chinese Characters in Google Sheets ==== | |||
Example: If A2 contains any Chinese character, display “Chinese”, otherwise display “English”: | |||
<pre>=IF(REGEXMATCH(A2, "[\一-\龥]"), "Chinese", "English")</pre> | |||
<span id="find-chinese-characters-in-mysql"></span> | |||
</pre> | ==== Find Chinese Characters in MySQL ==== | ||
Find rows where <code>column_name</code> contains Chinese characters: | |||
<syntaxhighlight lang="sql">SELECT `column_name` | |||
FROM `table_name` | |||
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';</syntaxhighlight> | |||
<span id="find-non-ascii-characters-in-mysql"></span> | |||
==== Find Non-ASCII Characters in MySQL ==== | |||
Find rows where <code>column_name</code> is not entirely ASCII characters: | |||
<syntaxhighlight lang="sql">SELECT `column_name` | |||
< | |||
SELECT `column_name` | |||
FROM `table_name` | FROM `table_name` | ||
WHERE | WHERE `column_name` <> CONVERT(`column_name` USING ASCII)</syntaxhighlight> | ||
</ | <span id="find-chinese-characters-in-php"></span> | ||
==== Find Chinese Characters in PHP ==== | |||
'''Exact match:''' | |||
<syntaxhighlight lang="php">// Approach 1 | |||
= | |||
// | |||
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) { | if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) { | ||
echo "All text is Chinese characters" . PHP_EOL; | |||
}else{ | } else { | ||
echo "Some text is not Chinese characters" . PHP_EOL; | |||
} | } | ||
// | // Approach 2 | ||
if (preg_match('/^[\p{Han}]+$/u', $string)) { | if (preg_match('/^[\p{Han}]+$/u', $string)) { | ||
echo "All text is Chinese characters" . PHP_EOL; | |||
}else{ | } else { | ||
echo "Some text is not Chinese characters" . PHP_EOL; | |||
} | }</syntaxhighlight> | ||
</ | '''Partial match:''' | ||
<syntaxhighlight lang="php">// Approach 1 | |||
< | |||
// | |||
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐'; | $string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐'; | ||
$pattern = '/[\p{Han}]+/u'; | $pattern = '/[\p{Han}]+/u'; | ||
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE); | preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE); | ||
var_dump($matches); | var_dump($matches); | ||
// | // Approach 2 | ||
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐'; | $string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐'; | ||
$pattern = '/[\x{4e00}-\x{9fa5}]+/u'; | $pattern = '/[\x{4e00}-\x{9fa5}]+/u'; | ||
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE); | preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE); | ||
var_dump($matches);</syntaxhighlight> | |||
<span id="find-ascii-characters-in-php"></span> | |||
=== Find ASCII Characters in PHP === | |||
'''Code I:''' | |||
<syntaxhighlight lang="php">if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) { | |||
< | |||
if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) { | |||
echo "The keyword is ASCII only"; | echo "The keyword is ASCII only"; | ||
} else { | } else { | ||
echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)"; | echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)"; | ||
} | }</syntaxhighlight> | ||
</ | '''Code II:''' | ||
<syntaxhighlight lang="php">$pattern = '/^[[:ascii:]]+$/i'; | |||
$pattern = '/^[[:ascii:]]+$/i'; | |||
$text = "Hello World"; // ASCII only | $text = "Hello World"; // ASCII only | ||
if (preg_match($pattern, $text)) { | if (preg_match($pattern, $text)) { | ||
echo " | echo "Pure ASCII characters"; | ||
} else { | } else { | ||
echo " | echo "Contains non-ASCII characters"; | ||
} | }</syntaxhighlight> | ||
<span id="remove-empty-lines"></span> | |||
=== Remove Empty Lines === | |||
'''Original:''' | |||
<pre>Neo | |||
Trinity | |||
Morpheus | |||
Smith | |||
Oracle</pre> | |||
'''After:''' | |||
<pre>Neo | |||
Trinity | |||
Morpheus | |||
Smith | |||
Oracle</pre> | |||
'''Using Sublime Text & EmEditor:''' - Find: <code>^[\s\t]*$\n</code> - Replace with: (empty) | |||
'''Using Notepad++ v7.8.7:''' - Menu: Edit -> Line Operations -> Remove Empty Lines (Including Blank Lines) | |||
=== | <span id="find-non-whitespace-text"></span> | ||
=== Find Non-Whitespace Text === | |||
* Find: <code>[^\s]+</code> | |||
<span id="convert-symbol-separated-text-to-line-by-line-display"></span> | |||
=== Convert Symbol-Separated Text to Line-by-Line Display === | |||
'''Example:''' | |||
<pre>Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師 | |||
After: | |||
< | |||
尼歐 | 尼歐 | ||
莫斐斯 | 莫斐斯 | ||
崔妮蒂 | 崔妮蒂 | ||
史密斯 | 史密斯 | ||
祭師 | 祭師</pre> | ||
</pre> | '''Using Sublime Text or EmEditor:''' - Find: <code>([^、]+)([、]{1})</code> - Replace with: <code>\1\n</code> | ||
</ | |||
<span id="replace-multiple-spaces-with-tab-characters"></span> | |||
=== Replace Multiple Spaces with Tab Characters === | |||
'''Before:''' <code>aaa bbb ccc</code> '''After:''' <code>aaa\tbbb\tccc</code> | |||
'''Using Sublime Text:''' - Find: <code>([^\S\n]+)</code> or <code>([^\S\r\n]+)</code> or <code>\s\s+</code> - Replace with: <code>\t</code> | |||
=== | <span id="remove-leadingtrailing-whitespace"></span> | ||
=== Remove Leading/Trailing Whitespace === | |||
=== | <span id="remove-leading-whitespace"></span> | ||
==== Remove Leading Whitespace ==== | |||
* Find: <code>^\s+</code> | |||
< | * Replace with: (empty) | ||
<span id="remove-trailing-whitespace"></span> | |||
</ | ==== Remove Trailing Whitespace ==== | ||
* Find: <code>\s+$</code> | |||
* Replace with: (empty) | |||
< | <span id="remove-both-leading-and-trailing-whitespace"></span> | ||
==== Remove Both Leading and Trailing Whitespace ==== | |||
</ | |||
* Find: <code>(^\s+|\s+$)</code> | |||
* Replace with: (empty) | |||
= | <span id="text-editors-supporting-regular-expressions"></span> | ||
== Text Editors Supporting Regular Expressions == | |||
Various text editors support regular expressions including: - Sublime Text - EmEditor - Notepad++ - Visual Studio Code - Atom - Vim/Neovim | |||
= | <span id="syntax-reference"></span> | ||
== Syntax Reference == | |||
* Newline character: <code>\r\n</code> (for Notepad++: Extended mode & Regular expression mode) | |||
* Tab character: <code>\t</code> (for Notepad++: Extended mode) | |||
* | * Digits: <code>\d</code> (for Notepad++: Regular expression mode only) | ||
* | * Non-whitespace: <code>\S</code> - Does not include half-width spaces and full-width spaces | ||
* | |||
<span id="troubleshooting-regular-expressions"></span> | |||
== Troubleshooting Regular Expressions == | |||
'''Tips:''' 1. Use online tools like regex101 to understand your syntax 2. Test with small data: Prepare small file data to verify syntax 3. Highlight or output matched text for debugging 4. Simplify the syntax when encountering issues 5. Try alternative syntax due to compatibility issues (e.g., <code>\d</code> to <code>[0-9]+</code>) | |||
<span id="alternative-solutions"></span> | |||
< | == Alternative Solutions == | ||
* Use Tab-separated data that can be easily pasted into Google Sheets or MS Excel | |||
* | * Copy multiple rows and paste between different applications (compatibility varies) | ||
<span id="further-reading"></span> | |||
== Further Reading == | |||
* Regular-Expressions.info - Regex Tutorial, Examples and Reference | |||
* Unicode character properties documentation | |||
* Platform-specific regular expression documentation | |||
{{Template:Data factory flow}} | {{Template:Data factory flow}} | ||
[[Category:Regular expression]] [[Category:Software]] [[Category:Programming]] [[Category:Data Science]] [[Category:Search]] [[Category:String manipulation]] | [[Category:Regular expression]] [[Category:Software]] [[Category:Programming]] [[Category:Data Science]] [[Category:Search]] [[Category:String manipulation]] | ||
Revision as of 21:48, 21 July 2025
Regular Expression Guide
When processing text files through regular expressions, you can quickly search for or replace strings that match specific rules. Processing is done on a line-by-line basis for string manipulation.
Regular expressions are also known as regex, regexp, or pattern matching expressions.
Need Help? You can use the provided explanatory online tools to try debugging yourself.
Quick Reference Table
Note: (1) Blue highlighted areas in samples represent text matching the rules, (2) The same text rule can have multiple representations
| Text Rule | Sample | Opposite Text Rule | Sample |
|---|---|---|---|
Any single character (including spaces, but not newline) .
|
What Does the Fox Say? 12 狐狸怎叫 34 | ||
Any character (including spaces), appears 1 or 0 times .? = .{0,1}
|
What Does the Fox Say? 12 狐狸怎叫 34 | ||
Any number of multiple characters (including spaces) .* = .{0,}
|
What Does the Fox Say? 12 狐狸怎叫 34 | ||
Any number of characters (including spaces), at least 1 occurrence .+ = .{1,}
|
What Does the Fox Say? 12 狐狸怎叫 34 | ||
Any number of spaces or newlines (at least 1 occurrence) \s+
|
What Does the Fox Say? 12 狐狸怎叫 34 | Any number of characters (not including spaces or newlines) [^\s]+ = [^\s]{1,} = [\S]+ = [^ ]+
|
What Does the Fox Say? 12 狐狸怎叫 34 |
Any number of ASCII characters (including English, numbers and spaces) [\x00-\x80]+ or ascii:+
|
What Does the Fox Say? 12 狐狸怎叫 34 | Non-ASCII, i.e., Chinese characters appearing any number of times [^\x00-\x80]+
|
What Does the Fox Say? 12 狐狸怎叫 34 |
Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces) [\w]+ = [a-zA-Z0-9_]+ PHP with u modifier supports Chinese characters
|
What Does the Fox Say? 12 狐狸怎叫 _34 | Any number of characters that are not English letters, numbers and underscore (_) \W+ = [^a-zA-Z0-9_]+
|
|
Any number of digits (not including spaces) [\d]+ = [0-9]+
|
What Does the Fox Say? 12 狐狸怎叫 34 | Any number of characters not including digits (including spaces) [^\d]+ = [^0-9]+ = \D+
|
What Does the Fox Say? 12 狐狸怎叫 34 |
Any number of Chinese characters [\p{Han}]+
|
What Does the Fox Say? 12 狐狸怎叫 34 | Any number of characters not including Chinese [^\p{Han}]+
|
|
Lines starting with “狐狸” ^狐狸.*$
|
狐狸怎叫 34 What Does the Fox Say? 柴犬怎叫 What Does the shiba inu say? |
Lines not starting with “狐狸” ^(?!狐狸).*$
|
狐狸怎叫 34 What Does the Fox Say? 柴犬怎叫 What Does the shiba inu say? |
Lines ending with “怎叫” ^.*怎叫$
|
What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 |
Lines not ending with “怎叫” .*(?<!怎叫)$
|
What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 |
Lines containing “狐狸” ^.*狐狸.*$ or (狐狸)
|
What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 |
Lines not containing “狐狸” ^((?!狐狸).)*$
|
What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 |
| 叫.*狐狸 | What Does the Fox Say? 12 狐狸怎叫 34 What Does the Fox Say? 12 不叫狐狸 34 What Does the shiba inu say? 柴犬怎叫 |
||
| 叫).* | What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 What Does the shiba inu say? 柴犬怎了 |
柴犬).)*$ | What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 What Does the Husky say? 哈士奇怎叫 |
Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬” ^((?!狐狸).)*(柴犬).*$ = ^(柴犬).*((?!狐狸).)*$ = (柴犬).*((?!狐狸).)*
|
What Does the Fox Say? 12 狐狸怎叫 34 What Does the shiba inu say? 柴犬怎叫 |
Regular Expression Online Tools
Websites for testing regular expression syntax: * RegEx101 - “Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript” - Provides syntax explanations * RegExr - Learn, Build, & Test RegEx - Provides syntax explanations * Regexper - Visual explanation of syntax using diagrams * Regulex - JavaScript Regular Expression Visualizer - Visual explanation using diagrams * Rubular - A Ruby regular expression editor and tester * PHP Live Regex * Regex Tester and Debugger Online - Javascript, PCRE, PHP
Examples: * Regular Expression Library - Pattern examples provided by users
Common Use Cases
Replace Newlines with Commas
Converting email lists into a format usable by email software:
Original: [email protected] [email protected] [email protected] Convert to: [email protected],[email protected],[email protected]
Method 1: Sublime Text, EmEditor
- Menu: Search -> Replace
- Check “Use Regular Expression”
- Find:
\n(newline character) - Replace with:
,
- Find:
- Click “Replace all”
Method 2: Notepad++
- Menu: Find -> Replace
- Search mode: Check “Extended mode” (not “Regular expression”)
- Find:
\n - Replace with:
,
- Find:
- Click “Replace All”
Method 3: Microsoft Word
- Menu: Edit -> Replace
- Check extended mode
- Find:
^p(paragraph mark) - Replace with:
,
- Find:
- Click “Replace All”
Method 4: Sed command for Linux
sed ':a;N;$!ba;s/\n/; /g' old.filename > new.filename
Find IP Addresses (IPv4)
For Notepad++ v.5.9.5: - Find: \d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?
For Sublime Text v. 3.2.21: - Find: (?:\d{1,3}\.){3}\d{1,3}
Remove Black Squares (UNIX Line Endings LF)
Using Notepad++: 1. Menu: Find -> Replace 2. Search mode: Check “Extended mode” - Find: \n\n (2 LF characters) - Replace with: \r\n (CR and LF)
Add Quotes Around Elements
Add Quotes Around Array Elements
Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸 After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'
Method 1: PHP
$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸');
// Single quotes around each element
$result = implode(",", preg_replace('/^(.*?)$/', "'$1'", $users));
// Double quotes around each element
$result = implode(",", preg_replace('/^(.*?)$/', "\"$1\"", $users));
echo $result;Method 2: Sublime Text or EmEditor - Find: ([^\s|,]+) - Replace with: '\1' (for single quotes) or "\1" (for double quotes)
Method 3: Notepad++ (Enable “Regular expression” search mode) - Find: ([^\s|,]+) - Replace with: '$1' (for single quotes) or "$1" (for double quotes)
Find Non-ASCII Characters (Chinese/Non-English Text)
In Google Sheets
[^\x00-\x80]+
In LibreOffice
[^\u0000-\u0080]+
Find Chinese Characters in Google Sheets
Example: If A2 contains any Chinese character, display “Chinese”, otherwise display “English”:
=IF(REGEXMATCH(A2, "[\一-\龥]"), "Chinese", "English")
Find Chinese Characters in MySQL
Find rows where column_name contains Chinese characters:
SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';
Find Non-ASCII Characters in MySQL
Find rows where column_name is not entirely ASCII characters:
SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII)
Find Chinese Characters in PHP
Exact match:
// Approach 1
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
echo "All text is Chinese characters" . PHP_EOL;
} else {
echo "Some text is not Chinese characters" . PHP_EOL;
}
// Approach 2
if (preg_match('/^[\p{Han}]+$/u', $string)) {
echo "All text is Chinese characters" . PHP_EOL;
} else {
echo "Some text is not Chinese characters" . PHP_EOL;
}Partial match:
// Approach 1
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
// Approach 2
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
Find ASCII Characters in PHP
Code I:
if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) {
echo "The keyword is ASCII only";
} else {
echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)";
}Code II:
$pattern = '/^[[:ascii:]]+$/i';
$text = "Hello World"; // ASCII only
if (preg_match($pattern, $text)) {
echo "Pure ASCII characters";
} else {
echo "Contains non-ASCII characters";
}
Remove Empty Lines
Original:
Neo Trinity Morpheus Smith Oracle
After:
Neo Trinity Morpheus Smith Oracle
Using Sublime Text & EmEditor: - Find: ^[\s\t]*$\n - Replace with: (empty)
Using Notepad++ v7.8.7: - Menu: Edit -> Line Operations -> Remove Empty Lines (Including Blank Lines)
Find Non-Whitespace Text
- Find:
[^\s]+
Convert Symbol-Separated Text to Line-by-Line Display
Example:
Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師 After: 尼歐 莫斐斯 崔妮蒂 史密斯 祭師
Using Sublime Text or EmEditor: - Find: ([^、]+)([、]{1}) - Replace with: \1\n
Replace Multiple Spaces with Tab Characters
Before: aaa bbb ccc After: aaa\tbbb\tccc
Using Sublime Text: - Find: ([^\S\n]+) or ([^\S\r\n]+) or \s\s+ - Replace with: \t
Remove Leading/Trailing Whitespace
Remove Leading Whitespace
- Find:
^\s+ - Replace with: (empty)
Remove Trailing Whitespace
- Find:
\s+$ - Replace with: (empty)
Remove Both Leading and Trailing Whitespace
- Find:
(^\s+|\s+$) - Replace with: (empty)
Text Editors Supporting Regular Expressions
Various text editors support regular expressions including: - Sublime Text - EmEditor - Notepad++ - Visual Studio Code - Atom - Vim/Neovim
Syntax Reference
- Newline character:
\r\n(for Notepad++: Extended mode & Regular expression mode) - Tab character:
\t(for Notepad++: Extended mode) - Digits:
\d(for Notepad++: Regular expression mode only) - Non-whitespace:
\S- Does not include half-width spaces and full-width spaces
Troubleshooting Regular Expressions
Tips: 1. Use online tools like regex101 to understand your syntax 2. Test with small data: Prepare small file data to verify syntax 3. Highlight or output matched text for debugging 4. Simplify the syntax when encountering issues 5. Try alternative syntax due to compatibility issues (e.g., \d to [0-9]+)
Alternative Solutions
- Use Tab-separated data that can be easily pasted into Google Sheets or MS Excel
- Copy multiple rows and paste between different applications (compatibility varies)
Further Reading
- Regular-Expressions.info - Regex Tutorial, Examples and Reference
- Unicode character properties documentation
- Platform-specific regular expression documentation
Data factory flow
- 🔍 Data Collection: 1. How to extract content from websites
- 🧹 Data Cleaning: 2. Data cleaning, 3. Regular expression
- ⚙️ Data Processing: 4. Json encode and decode, 5. Convert between date and unix timestamp, 6. MySQL commands
- 📊 Data Analysis: 7. Data exploration
- 📤 Data Output: 8. Comparison of common data file formats, 9. Export MySQL query to Excel file, 10. Microsoft Excel
- 🔧 Troubleshooting: 11. Troubleshooting of MySQL errors