Regular expression: Difference between revisions

From LemonWiki共筆
A regular expression (regex) lets you quickly search for or replace strings that match specific rules when processing text files. Matching is done line by line.<ref>[http://linux.vbird.org/linux_basic/0330regularex.php 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref> Regular expressions are also known as regex, regexp, or pattern matching expressions.<ref>[https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F 正規表示式 - 維基百科,自由的百科全書]</ref>


{{LanguageSwitcher | content = [[Regular expression | English]], [[Regular expression in Mandarin|漢字]]}}

{{Raise hand | text = '''Need Help?''' You can use the explanatory [[#Regular Expression Online Tools|online tools]] to try debugging yourself. You can also ask on the [http://www.ptt.cc/bbs/RegExp/index.html PTT RegExp board] or other [[問答服務|Q&A services]]. }}


== Quick Reference Table ==
Note: (1) Blue highlighted areas in samples represent text matching the rules, (2) The same text rule can have multiple representations

{| class="wikitable"
|-
! Text Rule
! Sample
! Opposite Text Rule
! Sample
|-
| Any single character (including spaces, but not newline) <br> <code>.</code>
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any character (including spaces), appears 1 or 0 times <br> <code>.?</code> = <code>.{0,1}</code>
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any character (including spaces), zero or more times <br> <code>.*</code> = <code>.{0,}</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span>
|
|
|-
| Any number of characters (including spaces), at least 1 occurrence <br> <code>.+</code> = <code>.{1,}</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span>
|
|
|-
| Any number of spaces or newlines (at least 1 occurrence) <br> <code>\s+</code>
| What<span style="background:#C6E3FF"> </span>Does the Fox Say? 12 狐狸怎叫 34
| Any number of characters (not including spaces or newlines) <br> <code>[^\s]+</code> = <code>[^\s]{1,}</code> = <code>[\S]+</code> = <code>[^ ]+</code>
| <span style="background:#C6E3FF">What</span> Does the Fox Say? 12 狐狸怎叫 34
|-
| Any number of ASCII characters (including English, numbers and spaces) <br> <code>[\x00-\x80]+</code> or <code>[[:ascii:]]+</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12</span> 狐狸怎叫 34
| Non-ASCII, i.e., Chinese characters appearing any number of times <br> <code>[^\x00-\x80]+</code>
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34
|-
| Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces) <br> <code>[\w]+</code> = <code>[a-zA-Z0-9_]+</code> <br> PHP with <code>u</code> modifier supports Chinese characters
| <span style="background:#C6E3FF">What</span> <span style="background:#C6E3FF">Does</span> <span style="background:#C6E3FF">the</span> <span style="background:#C6E3FF">Fox</span> <span style="background:#C6E3FF">Say</span>? <span style="background:#C6E3FF">12</span> 狐狸怎叫 <span style="background:#C6E3FF">_34</span>
| Any number of characters that are not English letters, numbers and underscore (_) <br> <code>\W+</code> = <code>[^a-zA-Z0-9_]+</code>
|
|-
| Any number of digits (not including spaces) <br> <code>[\d]+</code> = <code>[0-9]+</code>
| What Does the Fox Say? <span style="background:#C6E3FF">12</span> 狐狸怎叫 34
| Any number of characters not including digits (including spaces) <br> <code>[^\d]+</code> = <code>[^0-9]+</code> = <code>\D+</code>
| <span style="background:#C6E3FF">What Does the Fox Say? </span>12 狐狸怎叫 34
|-
| Any number of Chinese characters <br> <code>[\p{Han}]+</code>
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34
| Any number of characters not including Chinese <br> <code>[^\p{Han}]+</code>
|
|-
| Lines starting with “狐狸” <br> <code>^狐狸.*$</code>
| <span style="background:#C6E3FF">狐狸怎叫 34 What Does the Fox Say?</span><br>柴犬怎叫 What Does the shiba inu say?
| Lines not starting with “狐狸” <br> <code>^(?!狐狸).*$</code>
| 狐狸怎叫 34 What Does the Fox Say?<br><span style="background:#C6E3FF">柴犬怎叫 What Does the shiba inu say?</span>
|-
| Lines ending with “怎叫” <br> <code>^.*怎叫$</code>
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
| Lines not ending with “怎叫” <br> <code>.*(?&lt;!怎叫)$</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫
|-
| Lines containing “狐狸” <br> <code>^.*狐狸.*$</code> or <code>(狐狸)</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫
| Lines not containing “狐狸” <br> <code>^((?!狐狸).)*$</code>
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
|-
| Boolean logic AND: Lines containing both “狐狸” and “叫” <br> <code>(?=.*狐狸)(?=.*叫).*</code> or <code><nowiki>狐狸.*叫|叫.*狐狸</nowiki></code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br><span style="background:#C6E3FF">What Does the Fox Say? 12 不叫狐狸 34</span><br>What Does the shiba inu say? 柴犬怎叫
|
|
|-
| Boolean logic OR: Lines containing “狐狸” or “叫” <br> <code><nowiki>.*(狐狸|叫).*</nowiki></code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫</span><br>What Does the shiba inu say? 柴犬怎了
| Boolean logic: Lines not containing “狐狸” and not containing “柴犬” <br> <code><nowiki>^((?!狐狸|柴犬).)*$</nowiki></code>
| What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫<br><span style="background:#C6E3FF">What Does the Husky say? 哈士奇怎叫</span>
|-
| Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬” <br> <code>^((?!狐狸).)*(柴犬).*$</code> = <code>^(柴犬).*((?!狐狸).)*$</code> = <code>(柴犬).*((?!狐狸).)*</code>
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
|
|
|}
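
The AND, OR and NOT rows of the table can be tried out with a short script. The following is only a minimal sketch in Python (the table itself is not tied to any language); the sample lines and the helper name <code>matching</code> are assumptions made for illustration.

```python
import re

# Sample lines modelled on the table above (the third one is made up).
lines = [
    "What Does the Fox Say? 12 狐狸怎叫 34",
    "What Does the shiba inu say? 柴犬怎叫",
    "What Does the Husky say? 哈士奇怎了",
]

# AND: two lookaheads from the start of the line, both must succeed.
both = re.compile(r"^(?=.*狐狸)(?=.*叫).*$")
# NOT: no position in the line may be the start of 狐狸.
no_fox = re.compile(r"^((?!狐狸).)*$")
# OR: plain alternation inside a group.
either = re.compile(r"^.*(狐狸|叫).*$")


def matching(pattern, rows):
    """Return the rows that the compiled pattern matches from the start."""
    return [row for row in rows if pattern.match(row)]


and_hits = matching(both, lines)
not_hits = matching(no_fox, lines)
or_hits = matching(either, lines)
print(and_hits, not_hits, or_hits, sep="\n")
```

Lookaheads do not consume text, which is why several of them can be stacked at the start of a line to express AND.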
 
 
== Regular Expression Online Tools ==
 
Websites for testing regular expression syntax:
* {{Gd}} [http://regex101.com/ RegEx101] - “Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript” - Provides syntax explanations
* {{Gd}} [http://gskinner.com/RegExr/ RegExr] - Learn, Build, &amp; Test RegEx - Provides syntax explanations
* [https://regexper.com/ Regexper] - Visual explanation of syntax using diagrams
* [https://jex.im/regulex/ Regulex: JavaScript Regular Expression Visualizer] - Visual explanation using diagrams
* [http://www.rubular.com/ Rubular] - A Ruby regular expression editor and tester
* [http://www.phpliveregex.com/ PHP Live Regex]
* [http://www.regextester.com/ Regex Tester and Debugger Online] - JavaScript, PCRE, PHP
 
 
 
== Common Use Cases ==
 
 
=== Replace Newlines with Commas ===
 
Converting email lists into a format usable by email software:
 
<pre>Original:


Convert to:
</pre>
 
==== Method 1: Sublime Text, EmEditor ====
 
# Menu: Search -&gt; Replace
# Check “Use Regular Expression”
#* Find: <code>\n</code> (newline character)
#* Replace with: <code>,</code>
# Click “Replace all”
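
The same replacement can also be scripted. The following is a minimal sketch in Python, assuming the list fits in memory; the addresses are placeholder data, not taken from the original example.

```python
import re

# Hypothetical sample list; one line uses a Windows line ending (CRLF).
text = "alice@example.com\nbob@example.com\r\ncarol@example.com\n"

# \r?\n covers both Unix (LF) and Windows (CRLF) line endings.
# strip() drops the final newline so the result has no trailing comma.
joined = re.sub(r"\r?\n", ",", text.strip())
print(joined)
```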
 
 
==== Method 2: Notepad++ ====

# Menu: Find -&gt; Replace
# Search mode: Check “Extended mode” (not “Regular expression”)
#* Find: <code>\n</code> (use <code>\r\n</code> for files with Windows line endings)
#* Replace with: <code>,</code>
# Click “Replace All”

Related: [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Replacing_Newlines How To Replace Line Ends, thus changing the line layout] (last visited: 2010-01-27)

==== Method 3: Microsoft Word ====

Using Microsoft Word 2002:
# Menu: Edit -&gt; Replace
# Check extended mode
#* Find: <code>^p</code> (paragraph mark)
#* Replace with: <code>,</code>
# Click “Replace All”

==== Method 4: Sed command for Linux ====

General form: {{kbd | key=<nowiki>sed 's/string to replace/new string/g' old.filename > new.filename</nowiki>}}<ref>[http://linux.vbird.org/linux_basic/0330regularex.php#sed_replace 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref>

To join lines, the command below first reads the whole file into the pattern space (<code>:a;N;$!ba</code>) and then replaces every newline with <code>; </code>:<ref>[http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n unix - sed: How can I replace a newline?]</ref>

<syntaxhighlight lang="bash">sed ':a;N;$!ba;s/\n/; /g' old.filename > new.filename</syntaxhighlight>

=== Find IP Addresses (IPv4) ===

For [http://notepad-plus-plus.org/ Notepad++] v.5.9.5 (search mode “Regular expression”; this version does not support the <code>{n}</code> syntax):
* Find: <code>\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?</code>

For Sublime Text v. 3.2.21:
* Find: <code>(?:\d{1,3}\.){3}\d{1,3}</code>
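
Both patterns accept any one to three digits per group, so a string such as 999.1.1.1 also matches. The following is a minimal sketch in Python of one way to filter such candidates; the function name <code>find_ipv4</code> and the sample text are assumptions for illustration.

```python
import re

# Same shape as the Sublime Text pattern above: three "digits + dot" groups
# followed by a final group of one to three digits.
CANDIDATE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")


def find_ipv4(text):
    """Return dotted quads from text whose four parts are all in 0-255."""
    candidates = CANDIDATE.findall(text)
    return [c for c in candidates if all(int(part) <= 255 for part in c.split("."))]


sample = "gateway 192.168.0.1, dns 8.8.8.8, bogus 999.1.1.1"
found = find_ipv4(sample)
print(found)
```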
References for IP address patterns:
* [http://sourceforge.net/projects/notepad-plus/forums/forum/331754/topic/4780602 SourceForge.net: Notepad++: Regular expression for IP addresses]
* [http://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses regex - Regular expression that matches valid IPv6 addresses - Stack Overflow] {{access | date = 2015-08-10}}

=== Remove Black Squares (UNIX Line Endings LF) ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Extended mode”
#* Find: <code>\n\n</code> (2 LF characters)
#* Replace with: <code>\r\n</code> (CR and LF)

After the replacement, the black squares no longer appear when the plain text file is opened in Notepad.

=== Add Quotes Around Elements ===
==== Add Quotes Around Array Elements ====


<pre>Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸
After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'</pre>
'''Method 1: PHP'''


<syntaxhighlight lang="php">$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸');
// Single quotes around each element
$result = implode(",", preg_replace('/^(.*?)$/', "'$1'", $users));
// Double quotes around each element
$result = implode(",", preg_replace('/^(.*?)$/', "\"$1\"", $users));
echo $result;</syntaxhighlight>
Thanks, Joshua! More on [http://melikedev.com/2010/02/24/php-wrap-implode-array-elements-in-quotes/ PHP - Wrap Implode Array Elements in Quotes » Me Like Dev]

'''Method 2: [http://www.sublimetext.com/ Sublime Text] or [https://zh-tw.emeditor.com/ EmEditor]'''
* Find: <code>([^\s|,]+)</code>
* Replace with: <code>'\1'</code> (for single quotes) or <code>"\1"</code> (for double quotes)

'''Method 3: [https://notepad-plus-plus.org/ Notepad++]''' (enable the “Regular expression” search mode)
* Find: <code>([^\s|,]+)</code>
* Replace with: <code>'$1'</code> (for single quotes) or <code>"$1"</code> (for double quotes)

Note: this pattern treats whitespace as a separator, so a multi-word element such as Granny Bird is quoted word by word.

==== Add Quotes Around Each Line ====

Using Sublime Text or EmEditor ({{exclaim}} this method does not handle trailing spaces at the end of a line):
* Find what: {{kbd | key = <nowiki>^(.*)$\n</nowiki>}}
* Replace with: {{kbd | key = <nowiki>'\1', </nowiki>}}

=== Find Non-ASCII Characters (Chinese/Non-English Text) ===

==== In LibreOffice ====

<pre>[^\u0000-\u0080]+</pre>

==== Find Chinese Characters in Google Sheets ====

Example: If cell {{kbd | key=A2}} contains any Chinese character, display “Chinese”, otherwise display “English”:

<pre>=IF(REGEXMATCH(A2, "[\一-\龥]"), "Chinese", "English")</pre>

==== Find Non-ASCII Characters in Google Sheets ====

Extract non-ASCII characters (such as Chinese, Japanese, emoji, etc.) from cell {{kbd | key=A2}}:
<pre>
=IF(ISERROR(REGEXEXTRACT(A2, "[^\x00-\x80]+")), "", REGEXEXTRACT(A2, "[^\x00-\x80]+"))
</pre>

The same pattern also applies to the Google Sheets RegExReplace function and to Notepad++ search.

Explanation of regular expression {{kbd | key=<nowiki>[^\x00-\x80]+</nowiki>}}:

* {{kbd | key=<nowiki>[\x00-\x80]</nowiki>}}: Represents character codes 0-128. (1) The standard ASCII range is 0-127 ({{kbd | key=<nowiki>0x00-0x7F</nowiki>}}, i.e. {{kbd | key=<nowiki>[\x00-\x7F]</nowiki>}}).<ref>[https://www.commfront.com/pages/ascii-chart ASCII Chart – CommFront]</ref> (2) Character 128 ({{kbd | key=<nowiki>0x80</nowiki>}}) is the first character of the extended ASCII range and is not part of the original ASCII standard.<ref>[https://en.wikipedia.org/wiki/UTF-8 UTF-8 - Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia]</ref>
* {{kbd | key=<nowiki>[^...]</nowiki>}}: Means "not" these characters
* {{kbd | key=<nowiki>+</nowiki>}}: Means one or more
 
Overall meaning: Matches one or more non-ASCII characters
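
As a minimal sketch of the same idea outside a spreadsheet, the Python lines below use the strict ASCII range <code>\x00-\x7F</code> rather than <code>\x00-\x80</code>; the sample sentence is the one from the Quick Reference Table and the variable names are illustrative.

```python
import re

sample = "What Does the Fox Say? 12 狐狸怎叫 34"

# One or more consecutive characters outside the 7-bit ASCII range.
non_ascii_runs = re.findall(r"[^\x00-\x7F]+", sample)

# A string is pure ASCII when every character falls inside the range.
ascii_only = re.fullmatch(r"[\x00-\x7F]*", "Hello World") is not None
mixed = re.fullmatch(r"[\x00-\x7F]*", sample) is not None

print(non_ascii_runs, ascii_only, mixed)
```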
 
==== Find Chinese Characters in MySQL ====
 
Find rows where <code>column_name</code> contains Chinese characters:
 
<pre lang="sql">SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';</pre>
 
Find rows where <code>column_name</code> contains only Chinese characters:
<pre lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` REGEXP '^[一-龯]+$';</pre>
 
Explanation:
* {{kbd | key=<nowiki>[一-龯]</nowiki>}} - Character set that matches all characters from "一" to "龯" in Unicode
* "一" has Unicode code point {{kbd | key=<nowiki>U+4E00</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+4E00 “一” U+4E00 CJK Unified Ideograph-4E00 Unicode Character]</ref>
* "龯" has Unicode code point {{kbd | key=<nowiki>U+9FAF</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+9FAF “龯” U+9FAF CJK Unified Ideograph-9FAF Unicode Character]</ref>
* This range lies within the CJK Unified Ideographs block (U+4E00-U+9FFF) and already covers over 99% of daily Chinese usage requirements. [https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B Extension B] and later blocks mainly contain ancient Chinese characters, variant characters, etc., which rarely appear in modern texts.
 
==== Find Non-ASCII Characters in MySQL ====
 
Find rows where <code>column_name</code> is not entirely ASCII characters:
 
<syntaxhighlight lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII)</syntaxhighlight>
 
==== Find Chinese Characters in PHP ====
 
'''Exact match:'''
 
<syntaxhighlight lang="php">// Approach 1
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}
 
// Approach 2
if (preg_match('/^[\p{Han}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}</syntaxhighlight>
'''Partial match:'''
 
<syntaxhighlight lang="php">// Approach 1
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
 
// Approach 2
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);</syntaxhighlight>
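
For comparison, here is a minimal sketch of the exact and partial match in Python. It assumes the standard <code>re</code> module, which has no <code>\p{Han}</code> property, so the explicit range from Approach 1 is used; the sample string is a shortened version of the one above.

```python
import re

# Explicit code point range, same bounds as the PHP examples (U+4E00-U+9FA5).
HAN_RUN = re.compile(r"[\u4e00-\u9fa5]+")

text = "繁體中文-简体中文-English-12345"

# Partial match: every run of Chinese characters with its start offset.
runs = [(m.start(), m.group()) for m in HAN_RUN.finditer(text)]

# Exact match: the whole string must consist of Chinese characters.
all_chinese = HAN_RUN.fullmatch("狐狸怎叫") is not None
not_all_chinese = HAN_RUN.fullmatch(text) is not None

print(runs, all_chinese, not_all_chinese)
```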
 
==== Find ASCII Characters in PHP ====

'''Code I:'''

<syntaxhighlight lang="php">if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) {
    echo "The keyword is ASCII only";
} else {
    echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)";
}</syntaxhighlight>
'''Code II:'''

<syntaxhighlight lang="php">$pattern = '/^[[:ascii:]]+$/i';
$text = "Hello World"; // ASCII only
if (preg_match($pattern, $text)) {
    echo "Pure ASCII characters";
} else {
    echo "Contains non-ASCII characters";
}</syntaxhighlight>

Reference: [http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters javascript - Regular expression to match non-english characters? - Stack Overflow]

=== Add a Comma at the Start of Each Line ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Regular expression”
#* Find: {{kbd | key=(.*)}} or {{kbd | key=^(.*)$}}
#* Replace with: {{kbd | key=,\1}} or {{kbd | key=,$1}}

Reference: [http://stackoverflow.com/questions/8413237/notepad-regex-search-replace-how-to-append-and-prepend-a-character-at-start-a Notepad++ RegEx Search/Replace: How to append and prepend a character at start and end of each file line? - Stack Overflow]

=== Find Text When Only the Beginning and End Are Known ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Regular expression”
#* Find: {{kbd | key=a(.*)le}} matches text that starts with a and ends with le, such as (1) apple and (2) apps lesson; the middle part may contain spaces. {{exclaim}} When searching for Chinese strings, converting the file encoding to UTF-8 is recommended.

=== Remove Empty Lines ===

Removes one or more empty lines, including lines that contain only whitespace.

'''Original:'''

<pre>Neo
Trinity


Morpheus

Smith
Oracle</pre>
'''After:'''

<pre>Neo
Trinity
Morpheus
Smith
Oracle</pre>
'''Using Sublime Text &amp; EmEditor''' ({{exclaim}} not applicable to Notepad++):<ref>[http://www.sitepoint.com/forums/showthread.php?448843-Regex-delete-multiple-blank-lines Regex: delete multiple blank lines]</ref>
* Find: <code>^[\s\t]*$\n</code>
* Replace with: (empty)

'''Using Notepad++ v7.8.7:'''<ref>[http://stackoverflow.com/questions/3866034/removing-empty-lines-in-notepad regex - Removing empty lines in Notepad++ - Stack Overflow]</ref>
* Menu: Edit -&gt; Line Operations -&gt; Remove Empty Lines (Containing Blank characters)

Other patterns that remove completely empty lines:
* Find: <code>^$\n</code>, Replace with: (empty). For Sublime Text and EmEditor; {{exclaim}} not applicable to Notepad++.
* Find: <code>\r\n[\r\n]+</code>, Replace with: <code>\r\n</code>. For Notepad++ with the “Regular expression” search mode.
* Find: <code>\n(\n)+</code>, Replace with: <code>\n</code>. For Sublime Text with “regular expression” enabled.
* Find: <code>\n\n</code>, Replace with: <code>\n</code>. Removes a single empty line; for Sublime Text and EmEditor with “Use Regular Expression” enabled.

=== Find Non-Whitespace Text ===

* Find: <code>[^\s]+</code> ([https://regex101.com/r/zH7wV3/1 online demo])



=== Convert Symbol-Separated Text to Line-by-Line Display ===

'''Example:''' text separated by the enumeration comma (、)

<pre>Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師
After:
尼歐
莫斐斯
崔妮蒂
史密斯
祭師</pre>
'''Using [http://www.sublimetext.com/ Sublime Text] or [https://zh-tw.emeditor.com/ EmEditor]:'''
* Find: <code>([^、]+)([、]{1})</code>
* Replace with: <code>\1\n</code>

Syntax explanation:
* <nowiki>[^、]</nowiki>: any single character that is not the enumeration comma (、)
* <nowiki>[^、]+</nowiki>: one or more characters that are not 、
* <nowiki>([^、]+)</nowiki>: captures that run of characters as group 1
* <nowiki>[、]</nowiki>: a single 、
* <nowiki>[、]{1}</nowiki>: exactly one 、
* <nowiki>([、]{1})</nowiki>: captures that single 、 as group 2

=== Replace Multiple Spaces with Tab Characters ===

Converting space-separated fields into Tab-separated fields makes the output easy to paste into MS Excel or [[Google spreadsheet]].

* '''Before:''' <code>aaa bbb    ccc</code>
* '''After:''' <code>aaa\tbbb\tccc</code> (<code>\t</code> stands for the Tab character)

'''Using Sublime Text:'''
* Find: <code>([^\S\n]+)</code> or <code>([^\S\r\n]+)</code> or <code>\s\s+</code>
* Replace with: <code>\t</code>

<code>\S</code> is a non-whitespace character and <code>\r\n</code> is a line break, so <code>[^\S\r\n]</code> means “neither a non-whitespace character nor a line break”, that is, whitespace other than line breaks.

=== Remove Leading/Trailing Whitespace ===

==== Remove Leading Whitespace ====

* Find: <code>^\s+</code>
* Replace with: (empty)





==== Remove Trailing Whitespace ====

* Find: <code>\s+$</code>
* Replace with: (empty)

==== Remove Both Leading and Trailing Whitespace ====

* Find: <code>(^\s+|\s+$)</code>
* Replace with: (empty)

=== Append a Space to the End of Each Line ===

Applies to: Sublime Text, EmEditor
# Menu: Search -&gt; Replace
# Check “Use Regular Expression”
#* Find: {{kbd | key = <nowiki>\n</nowiki>}}
#* Replace with: {{kbd | key = <nowiki>_\n</nowiki>}} (replace the _ before {{kbd | key = <nowiki>\n</nowiki>}} with a half-width space)
# Click “Replace all”

{{exclaim}} Check whether the file ends with an empty line; if it does not, the rule is not applied to the last line.

== Text Editors Supporting Regular Expressions ==

Various text editors support regular expressions, including:
* Sublime Text
* EmEditor
* Notepad++
* Visual Studio Code
* Atom
* Vim/Neovim

== Search for Unmatched Strings ==
=== Case: Find Un-commented console.log ===
Original format: some lines contain un-commented [[Javascript debug]] information.
<pre>
  console.log("un-commented debug information");

  //console.log("commented debug information");
</pre>

Search pattern: find the string "console.log" that is not immediately preceded by the / symbol.

<pre>
  [^/](console\.log)
</pre>

== Syntax Reference ==

* Newline character: <code>\r\n</code> (for Notepad++: Extended mode &amp; Regular expression mode)
* Tab character: <code>\t</code> (for Notepad++: Extended mode)
* Digits: <code>\d</code> (for Notepad++: Regular expression mode only; {{exclaim}} not available in Extended mode)
* Non-whitespace: <code>\S</code> - Does not include half-width spaces and full-width spaces


== batch action ==
== Troubleshooting Regular Expressions ==
* {{Gd}} [https://github.com/facelessuser/RegReplace RegReplace] 執行多個取代命令 "Simple find and replace sequencer plugin for Sublime Text" Quoted from official webpage. {{access | date=2014-10-25}}


== syntax ==
'''Tips:''' 1. Use online tools like regex101 to understand your syntax 2. Test with small data: Prepare small file data to verify syntax 3. Highlight or output matched text for debugging 4. Simplify the syntax when encountering issues 5. Try alternative syntax due to compatibility issues (e.g., <code>\d</code> to <code>[0-9]+</code>)
* 換行符號: \r\n (適用: Notepad++選項: 增強模式 & 用類型表式)
* tab鍵的固定空白分隔: \t  (適用: Notepad++選項: 增強模式)
* 數字: \d (適用: Notepad++選項: 用類型表式。{{exclaim}} 不適用: Notepad++選項: 增強模式)
* {{kbd | key=<nowiki>\S</nowiki>}} 非空白的文字: 不會含括半形空白與全行空白


== trouble shooting ==
* [http://errerrors.blogspot.com/2015/07/sublime-text-invalid-lookbehind.html Err: 解決 Sublime Text 正則表示式搜尋,遇到的「Invalid lookbehind assertion」錯誤]


== Alternative Solutions ==
* Use Tab-separated data: in the raw data, put a {{kbd |key=Tab}} before and after each piece you want to extract, then copy the data and paste it into Google Sheets or MS Excel. Each Tab-separated piece is stored in its own column, which makes further processing easier.
* Copy multiple rows and paste between different applications (compatibility varies):
** Copy to Dreamweaver from MS Excel 2002: ok
** Copy to Dreamweaver from Google Docs: not ok {{exclaim}}
** Copy to MS Excel 2002 from Google Docs: ok

== Further Reading ==
* Regular-Expressions.info - Regex Tutorial, Examples and Reference
* [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Searching_And_Replacing SourceForge.net: Searching And Replacing - notepad-plus], [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions SourceForge.net: Regular Expressions - notepad-plus]
* [http://stackoverflow.com/questions/23020856/text-extraction-with-sublime-text regex - text extraction with sublime text - Stack Overflow] {{access | date=2014-09-26}}
* [https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F Regular expression - Wikipedia (Chinese)]
* Unicode character properties documentation:
** [http://www.regular-expressions.info/unicode.html Regex Tutorial - Unicode Characters and Properties] {{access | date = 2014-04-02}}
** [http://php.net/manual/en/regexp.reference.unicode.php PHP: Unicode character properties - Manual] {{access | date = 2014-04-02}}
* Platform-specific regular expression documentation

== References ==
<references/>

{{Template: Data factory flow}}


[[Category: Regular expression]]  
[[Category: Software]]  
[[Category: Programming]]  
[[Category: Data Science]]  
[[Category: Search]]
[[Category: String manipulation]]
[[Category: Revised with LLMs]]

Latest revision as of 11:55, 11 December 2025

When processing text files through regular expressions, you can quickly search for or replace strings that match specific rules. Processing is done on a line-by-line basis for string manipulation. Regular expressions are also known as regex, regexp, or pattern matching expressions.



Need help? You can use the explanatory online tools to try debugging yourself.


Quick Reference Table

Note: (1) Blue highlighted areas in samples represent text matching the rules, (2) The same text rule can have multiple representations

{| class="wikitable"
! Text Rule !! Sample !! Opposite Text Rule !! Sample
|-
| Any single character (including spaces, but not newline) <br /> <code>.</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any character (including spaces), appearing 0 or 1 times <br /> <code>.?</code> = <code>.{0,1}</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any characters (including spaces), appearing 0 or more times <br /> <code>.*</code> = <code>.{0,}</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any characters (including spaces), appearing at least once <br /> <code>.+</code> = <code>.{1,}</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any number of spaces or newlines (at least 1 occurrence) <br /> <code>\s+</code>
| What Does the Fox Say? 12 狐狸怎叫 34
| Any number of characters (not including spaces or newlines) <br /> <code>[^\s]+</code> = <code>[^\s]{1,}</code> = <code>[\S]+</code>; <code>[^ ]+</code> is similar but excludes only the half-width space
| What Does the Fox Say? 12 狐狸怎叫 34
|-
| Any number of ASCII characters (including English letters, numbers and spaces) <br /> <code>[\x00-\x80]+</code> or <code><nowiki>[[:ascii:]]+</nowiki></code>
| What Does the Fox Say? 12 狐狸怎叫 34
| Any number of non-ASCII characters (e.g., Chinese characters) <br /> <code>[^\x00-\x80]+</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|-
| Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces) <br /> <code>[\w]+</code> = <code>[a-zA-Z0-9_]+</code> <br /> PHP with the u modifier also matches Chinese characters
| What Does the Fox Say? 12 狐狸怎叫 _34
| Any number of characters that are not English letters, numbers or underscore (_) <br /> <code>\W+</code> = <code>[^a-zA-Z0-9_]+</code>
|
|-
| Any number of digits (not including spaces) <br /> <code>[\d]+</code> = <code>[0-9]+</code>
| What Does the Fox Say? 12 狐狸怎叫 34
| Any number of characters that are not digits (including spaces) <br /> <code>[^\d]+</code> = <code>[^0-9]+</code> = <code>\D+</code>
| What Does the Fox Say? 12 狐狸怎叫 34
|-
| Any number of Chinese characters <br /> <code>[\p{Han}]+</code>
| What Does the Fox Say? 12 狐狸怎叫 34
| Any number of characters that are not Chinese <br /> <code>[^\p{Han}]+</code>
|
|-
| Lines starting with “狐狸” <br /> <code>^狐狸.*$</code>
| 狐狸怎叫 34 What Does the Fox Say? <br /> 柴犬怎叫 What Does the shiba inu say?
| Lines not starting with “狐狸” <br /> <code>^(?!狐狸).*$</code>
| 狐狸怎叫 34 What Does the Fox Say? <br /> 柴犬怎叫 What Does the shiba inu say?
|-
| Lines ending with “怎叫” <br /> <code>^.*怎叫$</code>
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫
| Lines not ending with “怎叫” <br /> <code>.*(?<!怎叫)$</code>
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫
|-
| Lines containing “狐狸” <br /> <code>^.*狐狸.*$</code> or <code>(狐狸)</code>
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫
| Lines not containing “狐狸” <br /> <code>^((?!狐狸).)*$</code>
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫
|-
| 叫.*狐狸
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the Fox Say? 12 不叫狐狸 34 <br /> What Does the shiba inu say? 柴犬怎叫
|
|
|-
| 叫).*
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫 <br /> What Does the shiba inu say? 柴犬怎了
|
|
|-
| 柴犬).)*$
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫 <br /> What Does the Husky say? 哈士奇怎叫
|
|
|-
| Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬” <br /> <code>^(?!.*狐狸).*柴犬.*$</code> <br /> The shorter form <code>^((?!狐狸).)*(柴犬).*$</code> only rules out “狐狸” appearing before “柴犬”
| What Does the Fox Say? 12 狐狸怎叫 34 <br /> What Does the shiba inu say? 柴犬怎叫
|
|
|}
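
The line-level rules above can also be tried outside an editor. The following is a minimal Python sketch, assuming the standard `re` module; the helper name `matching_lines` is illustrative only, and the sample lines are the ones used in the table.

```python
import re

SAMPLE_LINES = [
    "What Does the Fox Say? 12 狐狸怎叫 34",
    "What Does the shiba inu say? 柴犬怎叫",
]

def matching_lines(pattern, lines):
    """Return the lines that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.fullmatch(line)]

# Lines containing 狐狸
with_fox = matching_lines(r"^.*狐狸.*$", SAMPLE_LINES)
# Lines not containing 狐狸 (the negative lookahead is checked before every character)
without_fox = matching_lines(r"^((?!狐狸).)*$", SAMPLE_LINES)
# Lines not ending with 怎叫 (negative lookbehind at the end of the line)
not_ending = matching_lines(r".*(?<!怎叫)$", SAMPLE_LINES)

print(with_fox)
print(without_fox)
print(not_ending)
```

Lookahead and lookbehind support differs between tools, so a pattern that works here may still need adjusting in a given editor.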


Regular Expression Online Tools

Websites for testing regular expression syntax:


Common Use Cases

Replace Newlines with Commas

Converting email lists into a format usable by email software:

Original:
[email protected]
[email protected]
[email protected]

Convert to:
[email protected],[email protected],[email protected]

Method 1: Sublime Text, EmEditor

  1. Menu: Search -> Replace
  2. Check “Use Regular Expression”
    • Find: \n (newline character)
    • Replace with: ,
  3. Click “Replace all”


Method 2: Notepad++

  1. Menu: Find -> Replace
  2. Search mode: Check “Extended mode” (not “Regular expression”)
    • Find: \n
    • Replace with: ,
  3. Click “Replace All”


Method 3: Microsoft Word

  1. Menu: Edit -> Replace
  2. Check extended mode
    • Find: ^p (paragraph mark)
    • Replace with: ,
  3. Click “Replace All”

Method 4: Sed command for Linux

sed ':a;N;$!ba;s/\n/,/g' old.filename > new.filename
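
The same conversion can be sketched in Python, assuming the standard `re` module. The function name `join_lines` and the example.com addresses are placeholders for illustration, not part of the original page.

```python
import re

def join_lines(text, separator=","):
    """Replace each line break (LF or CRLF) between lines with the separator."""
    # strip() drops the trailing line break so the result has no dangling separator
    return re.sub(r"\r?\n", separator, text.strip())

emails = "a@example.com\nb@example.com\r\nc@example.com\n"
joined = join_lines(emails)
print(joined)
```

Matching `\r?\n` rather than `\n` alone keeps the result clean when the file uses Windows line endings.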

Find IP Addresses (IPv4)

For Notepad++ v.5.9.5:
  • Find: \d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?

For Sublime Text v. 3.2.21:
  • Find: (?:\d{1,3}\.){3}\d{1,3}
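
Both patterns accept any one to three digits per part, so a value such as 999.1.1.1 also matches. The Python sketch below reuses the second pattern and adds a range check on each part. The range check and the name `find_ipv4` are additions for illustration, not part of the original page.

```python
import re

IPV4_CANDIDATE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")

def find_ipv4(text):
    """Return dotted-quad candidates whose four parts are all in 0-255."""
    results = []
    for candidate in IPV4_CANDIDATE.findall(text):
        if all(int(part) <= 255 for part in candidate.split(".")):
            results.append(candidate)
    return results

log = "client 192.168.1.10 connected; bad value 999.1.1.1; gateway 10.0.0.1"
found = find_ipv4(log)
print(found)
```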

Remove Black Squares (UNIX Line Endings LF)

Using Notepad++:
  1. Menu: Find -> Replace
  2. Search mode: Check “Extended mode”
    • Find: \n\n (2 LF characters)
    • Replace with: \r\n (CR and LF)

Add Quotes Around Elements

Add Quotes Around Array Elements

Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸
After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'

Method 1: PHP

$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸');
// Single quotes around each element
$single = implode(", ", preg_replace('/^(.*?)$/', "'$1'", $users));
echo $single . PHP_EOL;
// Double quotes around each element
$double = implode(", ", preg_replace('/^(.*?)$/', "\"$1\"", $users));
echo $double . PHP_EOL;

Method 2: Sublime Text or EmEditor
  • Find: ([^,\s][^,]*)
  • Replace with: '\1' (for single quotes) or "\1" (for double quotes)

Method 3: Notepad++ (enable “Regular expression” search mode)
  • Find: ([^,\s][^,]*)
  • Replace with: '$1' (for single quotes) or "$1" (for double quotes)

In Methods 2 and 3 the pattern starts each element at a character that is neither a comma nor whitespace and extends it to the next comma, so elements that contain spaces (such as Granny Bird) are quoted as a whole.
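
A Python sketch of the same transformation, assuming the standard `re` module. It uses the pattern `[^,\s][^,]*` so that elements containing spaces, such as Granny Bird, stay in one piece; the function name `quote_elements` is illustrative.

```python
import re

def quote_elements(text, quote="'"):
    """Wrap each comma-separated element in the given quote character."""
    # [^,\s] starts an element at a character that is neither comma nor whitespace;
    # [^,]* then extends it up to the next comma, so inner spaces are kept.
    return re.sub(r"([^,\s][^,]*)", lambda m: quote + m.group(1) + quote, text)

names = "Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸"
single_quoted = quote_elements(names)
double_quoted = quote_elements(names, '"')
print(single_quoted)
print(double_quoted)
```

If an element is followed by spaces before its comma, those spaces end up inside the quotes; trimming the input first avoids that.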

Find Non-ASCII Characters (Chinese/Non-English Text)

In LibreOffice

[^\u0000-\u0080]+


Find Chinese Characters in Google Sheets

Example: If cell A2 contains any Chinese character, display “Chinese”, otherwise display “English”:

=IF(REGEXMATCH(A2, "[\一-\龥]"), "Chinese", "English")

Find Non-ASCII Characters in Google Sheets

Extract non-ASCII characters (such as Chinese, Japanese, emoji, etc.) from cell A2

=IF(ISERROR(REGEXEXTRACT(A2, "[^\x00-\x80]+")), "", REGEXEXTRACT(A2, "[^\x00-\x80]+"))

Explanation of regular expression [^\x00-\x80]+

  • [\x00-\x80]: Represents character codes 0-128
    • The standard ASCII range is 0-127 (0x00-0x7F, i.e. [\x00-\x7F])[1]
    • Character 128 (0x80) is the first character of the extended ASCII range, not part of the original ASCII standard.[2][3] Use [^\x00-\x7F]+ to match strictly non-ASCII characters
  • [^...]: Means "not" these characters
  • +: Means one or more

Overall meaning: Matches one or more non-ASCII characters
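
The effect of the upper bound can be checked with a short Python sketch, assuming the standard `re` module. It uses the strict 7-bit range `[^\x00-\x7F]+`; the function name `extract_non_ascii` is illustrative.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def extract_non_ascii(text):
    """Return every run of characters outside the 7-bit ASCII range."""
    return NON_ASCII.findall(text)

mixed = extract_non_ascii("What Does the Fox Say? 12 狐狸怎叫 34")
plain = extract_non_ascii("plain ASCII")
# U+0080 is outside 7-bit ASCII, but a class ending at \x80 treats it as ASCII
boundary_strict = NON_ASCII.findall("\x80")
boundary_loose = re.findall(r"[^\x00-\x80]+", "\x80")

print(mixed, plain, boundary_strict, boundary_loose)
```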

Find Chinese Characters in MySQL

Find rows where column_name contains Chinese characters:

SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';

Find rows where column_name contains only Chinese characters:

SELECT `column_name`
FROM `table_name`
WHERE `column_name` REGEXP '^[一-龯]+$';

Explanation:

  • [一-龯] - Character set that matches all characters from "一" to "龯" in Unicode
  • "一" has Unicode code point U+4E00[4]
  • "龯" has Unicode code point U+9FAF[5]
  • This range (U+4E00-U+9FAF) lies inside the CJK Unified Ideographs block (U+4E00-U+9FFF) and already covers over 99% of daily Chinese usage requirements. Extension B and later blocks mainly contain ancient Chinese characters, variant characters, etc., which rarely appear in modern texts
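
The end points of the range can be inspected with a short Python sketch, assuming the standard `re` module and Python's built-in `ord`. The function name `is_all_chinese` is illustrative, and the check mirrors the `^[一-龯]+$` condition rather than MySQL's own matching behaviour.

```python
import re

CJK_RANGE = re.compile(r"[一-龯]+")

def is_all_chinese(text):
    """True when the whole string consists of characters from 一 to 龯."""
    return CJK_RANGE.fullmatch(text) is not None

# Print the code points at both ends of the range.
for char in ("一", "龯"):
    print(char, f"U+{ord(char):04X}")

print(is_all_chinese("狐狸怎叫"))
print(is_all_chinese("狐狸 Fox"))
```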

Find Non-ASCII Characters in MySQL

Find rows where column_name is not entirely ASCII characters:

SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII)

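Find Chinese Characters in PHP

Exact match (the whole string must consist of Chinese characters):

$string = '狐狸怎叫'; // sample input
// Approach 1: explicit code point range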
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}

// Approach 2
if (preg_match('/^[\p{Han}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}

Partial match:

// Approach 1
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

// Approach 2
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

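Find ASCII Characters in PHP

Code I:

$keyword = "Hello World"; // sample input
// [^\x20-\x7f] matches any character outside the range space (0x20) to DEL (0x7F),
// so tabs, newlines and other control characters are also reported as non-ASCII.
if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) {
    echo "The keyword is ASCII only";
} else {
    echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)";
}

Code II:

// [[:ascii:]] is the POSIX class for character codes 0-127; the i modifier is not needed.
$pattern = '/^[[:ascii:]]+$/';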
$text = "Hello World"; // ASCII only
if (preg_match($pattern, $text)) {
    echo "Pure ASCII characters";
} else {
    echo "Contains non-ASCII characters";
}
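
For comparison, a Python sketch of the two checks, assuming Python 3.7 or later for `str.isascii`. The function names are illustrative; the first mirrors the `[[:ascii:]]` check (codes 0-127) and the second mirrors the `\x20-\x7f` range of Code I.

```python
import re

# Same range as Code I: space (0x20) through 0x7F
CODE_I_RANGE = re.compile(r"[\x20-\x7f]*")

def is_ascii_only(text):
    """True when every character is in the 7-bit ASCII range (0-127)."""
    return text.isascii()

def is_in_code_i_range(text):
    """True when every character lies between 0x20 and 0x7F."""
    return CODE_I_RANGE.fullmatch(text) is not None

print(is_ascii_only("Hello World"), is_ascii_only("Hello 狐狸"))
print(is_ascii_only("tab\there"), is_in_code_i_range("tab\there"))
```

The last line shows where the two checks differ: a tab is an ASCII character, yet it falls outside the range used by Code I.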

Remove Empty Lines

Original:

Neo
Trinity

Morpheus


Smith
Oracle

After:

Neo
Trinity
Morpheus
Smith
Oracle

Using Sublime Text & EmEditor:
  • Find: ^[\s\t]*$\n (equivalent to ^\s*$\n, because \s already includes the tab character)
  • Replace with: (empty)

Using Notepad++ v7.8.7:
  • Menu: Edit -> Line Operations -> Remove Empty Lines (Including Blank Lines)
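
A Python sketch of the same find-and-replace, assuming the standard `re` module with the `re.MULTILINE` flag so that `^` and `$` work per line as they do in the editors. The function name `remove_blank_lines` is illustrative.

```python
import re

def remove_blank_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    return re.sub(r"^\s*$\n", "", text, flags=re.MULTILINE)

original = "Neo\nTrinity\n\nMorpheus\n\n\nSmith\nOracle\n"
cleaned = remove_blank_lines(original)
print(cleaned)
```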

Find Non-Whitespace Text

  • Find: [^\s]+

Convert Symbol-Separated Text to Line-by-Line Display

Example:

Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師
After:
尼歐
莫斐斯
崔妮蒂
史密斯
祭師

Using Sublime Text or EmEditor:
  • Find: ([^、]+)([、]{1})
  • Replace with: \1\n
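
The same replacement as a Python sketch, assuming the standard `re` module. The pattern is a simplified form of the one above (`[、]{1}` is the same as a single `、`), and the function name `items_to_lines` is illustrative.

```python
import re

def items_to_lines(text):
    """Put each 、-separated item on its own line."""
    # Group 1 is the item; the separator that follows it is replaced by a newline.
    return re.sub(r"([^、]+)、", r"\1\n", text)

before = "尼歐、莫斐斯、崔妮蒂、史密斯、祭師"
after = items_to_lines(before)
print(after)
```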

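Replace Multiple Spaces with Tab Characters

Before: aaa bbb ccc
After: aaa\tbbb\tccc

Using Sublime Text:
  • Find: ([^\S\n]+) or ([^\S\r\n]+) or \s\s+
  • Replace with: \t

The first two patterns match one or more whitespace characters other than line breaks, so single spaces are replaced as well. \s\s+ requires at least two whitespace characters but can also match across line breaks.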
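
A Python sketch of this replacement, assuming the standard `re` module. It uses `[^\S\r\n]{2,}`, a variant that requires at least two spaces or tabs and never touches line breaks; that quantifier is a choice made for this example, and the function name `spaces_to_tabs` is illustrative.

```python
import re

def spaces_to_tabs(text):
    """Replace each run of two or more spaces or tabs with one tab."""
    return re.sub(r"[^\S\r\n]{2,}", "\t", text)

converted = spaces_to_tabs("aaa   bbb  ccc")
print(repr(converted))
print(repr(spaces_to_tabs("Granny Bird    Elmo\nNeo  Trinity")))
```

Requiring two or more characters leaves the single space inside a value such as Granny Bird untouched.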


Remove Leading/Trailing Whitespace

Remove Leading Whitespace

  • Find: ^\s+
  • Replace with: (empty)


Remove Trailing Whitespace

  • Find: \s+$
  • Replace with: (empty)


Remove Both Leading and Trailing Whitespace

  • Find: (^\s+|\s+$)
  • Replace with: (empty)
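
The three patterns can be applied line by line in Python, as in this sketch that assumes the standard `re` module. Processing each line separately mirrors how the editors apply `^` and `$` per line; the function name `trim_lines` is illustrative.

```python
import re

def trim_lines(text):
    """Strip leading and trailing whitespace from every line."""
    return "\n".join(re.sub(r"^\s+|\s+$", "", line) for line in text.splitlines())

leading_only = re.sub(r"^\s+", "", "  Neo  ")
trailing_only = re.sub(r"\s+$", "", "  Neo  ")
both = trim_lines("  Neo  \n\tTrinity\t\nMorpheus")

print(repr(leading_only), repr(trailing_only), repr(both))
```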


Text Editors Supporting Regular Expressions

Various text editors support regular expressions, including:
  • Sublime Text
  • EmEditor
  • Notepad++
  • Visual Studio Code
  • Atom
  • Vim/Neovim


Syntax Reference

  • Newline character: \r\n (for Notepad++: Extended mode & Regular expression mode)
  • Tab character: \t (for Notepad++: Extended mode)
  • Digits: \d (for Notepad++: Regular expression mode only)
  • Non-whitespace: \S - Does not include half-width spaces and full-width spaces
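
These escapes behave the same way in Python's `re` module, which makes them easy to try out. The sketch below is illustrative only: the sample string is made up, and it relies on Python treating the full-width space (U+3000) as whitespace.

```python
import re

# Sample text: a full-width space after 狐狸, a CRLF line ending, and a tab
text = "狐狸\u3000怎叫 34\r\nFox\t12"

# \S+ : runs of non-whitespace; half-width and full-width spaces both act as separators
tokens = re.findall(r"\S+", text)
# \d+ : runs of digits
numbers = re.findall(r"\d+", text)
# \r\n : the Windows line ending
lines = re.split(r"\r\n", text)
# \t : the tab character
tab_count = len(re.findall(r"\t", text))

print(tokens, numbers, lines, tab_count)
```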

Troubleshooting Regular Expressions

Tips:
  1. Use online tools like regex101 to understand your syntax
  2. Test with small data: prepare a small sample file to verify the syntax
  3. Highlight or output the matched text for debugging
  4. Simplify the syntax when encountering issues
  5. Try alternative syntax when compatibility is an issue (e.g., \d to [0-9])
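
Tips 3 and 5 can be sketched in Python, assuming the standard `re` module: print each match with its position, then compare `\d` with `[0-9]` on input where the two differ. The helper name `show_matches` and the sample strings are illustrative.

```python
import re

def show_matches(pattern, text):
    """Return (start, end, matched_text) for every match, for inspection."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

sample = "Fox 12, shiba 345"
for start, end, matched in show_matches(r"[0-9]+", sample):
    print(f"{start}-{end}: {matched!r}")

# In Python, \d also matches full-width digits, while [0-9] matches ASCII digits only.
full_width = "\uff11\uff12"  # full-width digits 1 and 2
print(show_matches(r"\d+", full_width))
print(show_matches(r"[0-9]+", full_width))
```

Differences like the last one are a common reason why a pattern behaves differently from one tool to another.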


Alternative Solutions

  • Use Tab-separated data that can be easily pasted into Google Sheets or MS Excel
  • Copy multiple rows and paste between different applications (compatibility varies)

Further Reading

  • Regular-Expressions.info - Regex Tutorial, Examples and Reference
  • Unicode character properties documentation
  • Platform-specific regular expression documentation


[[Category: Revised with LLMs]]