Regular expression: Difference between revisions

From LemonWiki共筆
Regular expressions (also known as regex, regexp, or pattern-matching expressions) let you quickly search for or replace strings that match specific rules when processing text files. Matching is usually applied line by line.<ref>[http://linux.vbird.org/linux_basic/0330regularex.php 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref>
{{Raise hand | text = Have a question? See the [http://www.ptt.cc/bbs/RegExp/index.html RegExp board article list on PTT (批踢踢實業坊)] or other [[問答服務|Q&A services]] }}


{{LanguageSwitcher | content = [[Regular expression | English]], [[Regular expression in Mandarin|漢字]]}}


== Quick Reference Table ==
{{Raise hand | text = '''Need Help?''' You can use the provided explanatory [[#regular-expression-online-tools|online tools]] to try debugging yourself. }}
[https://regex101.com/r/zH7wV3/1 online demo]

Note: (1) Blue-highlighted areas in the samples mark the text that matches the rule. (2) The same text rule can often be written in more than one way.

{| class="wikitable"
|-
! Text Rule
! Sample
! Opposite Text Rule
! Sample
|-
| Any single character (including spaces, but not newline) <br> <code>.</code>
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any character (including spaces), appears 1 or 0 times <br> <code>.?</code> = <code>.{0,1}</code>
| <span style="background:#C6E3FF">W</span>hat Does the Fox Say? 12 狐狸怎叫 34
|
|
|-
| Any number of multiple characters (including spaces) <br> <code>.*</code> = <code>.{0,}</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span>
|
|
|-
| Any number of characters (including spaces), at least 1 occurrence <br> <code>.+</code> = <code>.{1,}</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span>
|
|
|-
| Any number of spaces or newlines (at least 1 occurrence) <br> <code>\s+</code>
| What<span style="background:#C6E3FF"> </span>Does the Fox Say? 12 狐狸怎叫 34
| Any number of characters (not including spaces or newlines) <br> <code>[^\s]+</code> = <code>[^\s]{1,}</code> = <code>[\S]+</code> = <code>[^ ]+</code>
| <span style="background:#C6E3FF">What</span> Does the Fox Say? 12 狐狸怎叫 34
|-
| Any number of ASCII characters (including English, numbers and spaces) <br> <code>[\x00-\x80]+</code> or <code>[[:ascii:]]+</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12</span> 狐狸怎叫 34
| Non-ASCII, i.e., Chinese characters appearing any number of times <br> <code>[^\x00-\x80]+</code>
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34
|-
| Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces) <br> <code>[\w]+</code> = <code>[a-zA-Z0-9_]+</code> <br> PHP with <code>u</code> modifier supports Chinese characters
| <span style="background:#C6E3FF">What</span> <span style="background:#C6E3FF">Does</span> <span style="background:#C6E3FF">the</span> <span style="background:#C6E3FF">Fox</span> <span style="background:#C6E3FF">Say</span>? <span style="background:#C6E3FF">12</span> 狐狸怎叫 <span style="background:#C6E3FF">_34</span>
| Any number of characters that are not English letters, numbers and underscore (_) <br> <code>\W+</code> = <code>[^a-zA-Z0-9_]+</code>
|
|-
| Any number of digits (not including spaces) <br> <code>[\d]+</code> = <code>[0-9]+</code>
| What Does the Fox Say? <span style="background:#C6E3FF">12</span> 狐狸怎叫 34
| Any number of characters not including digits (including spaces) <br> <code>[^\d]+</code> = <code>[^0-9]+</code> = <code>\D+</code>
| <span style="background:#C6E3FF">What Does the Fox Say? </span>12 狐狸怎叫 34
|-
| Any number of Chinese characters <br> <code>[\p{Han}]+</code>
| What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34
| Any number of characters not including Chinese <br> <code>[^\p{Han}]+</code>
|
|-
| Lines starting with “狐狸” <br> <code>^狐狸.*$</code>
| <span style="background:#C6E3FF">狐狸怎叫 34 What Does the Fox Say?</span><br>柴犬怎叫 What Does the shiba inu say?
| Lines not starting with “狐狸” <br> <code>^(?!狐狸).*$</code>
| 狐狸怎叫 34 What Does the Fox Say?<br><span style="background:#C6E3FF">柴犬怎叫 What Does the shiba inu say?</span>
|-
| Lines ending with “怎叫” <br> <code>^.*怎叫$</code>
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
| Lines not ending with “怎叫” <br> <code>.*(?&lt;!怎叫)$</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫
|-
| Lines containing “狐狸” <br> <code>^.*狐狸.*$</code> or <code>(狐狸)</code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br>What Does the shiba inu say? 柴犬怎叫
| Lines not containing “狐狸” <br> <code>^((?!狐狸).)*$</code>
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
|-
| Boolean logic AND: Lines containing both “狐狸” and “叫” <br> <code>(?=.*狐狸)(?=.*叫).*</code> or <code><nowiki>狐狸.*叫|叫.*狐狸</nowiki></code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34</span><br><span style="background:#C6E3FF">What Does the Fox Say? 12 不叫狐狸 34</span><br>What Does the shiba inu say? 柴犬怎叫
|
|
|-
| Boolean logic OR: Lines containing “狐狸” or “叫” <br> <code><nowiki>.*(狐狸|叫).*</nowiki></code>
| <span style="background:#C6E3FF">What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫</span><br>What Does the shiba inu say? 柴犬怎了
| Boolean logic: Lines not containing “狐狸” and not containing “柴犬” <br> <code><nowiki>^((?!狐狸|柴犬).)*$</nowiki></code>
| What Does the Fox Say? 12 狐狸怎叫 34<br>What Does the shiba inu say? 柴犬怎叫<br><span style="background:#C6E3FF">What Does the Husky say? 哈士奇怎叫</span>
|-
| Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬” <br> <code>^(?!.*狐狸).*柴犬.*$</code> <br> Note: <code>^((?!狐狸).)*(柴犬).*$</code> only rules out “狐狸” before “柴犬”, so it still matches a line where “狐狸” appears after “柴犬”.
| What Does the Fox Say? 12 狐狸怎叫 34<br><span style="background:#C6E3FF">What Does the shiba inu say? 柴犬怎叫</span>
|
|
|}
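
The lookahead rows of the table can be tried outside an editor. The following is a minimal sketch in Python, assuming the standard <code>re</code> module; the sample lines, the <code>PATTERNS</code> dictionary and the helper <code>matching_lines</code> are illustrative names, not part of the original page. The third pattern uses a single leading lookahead to express "contains 柴犬 but not 狐狸".

```python
import re

lines = [
    "What Does the Fox Say? 12 狐狸怎叫 34",
    "What Does the shiba inu say? 柴犬怎叫",
    "What Does the Husky say? 哈士奇怎了",
]

# Whole-line patterns built from lookaheads.
PATTERNS = {
    "contains 狐狸 AND 叫": r"(?=.*狐狸)(?=.*叫).*",
    "does not contain 狐狸": r"^((?!狐狸).)*$",
    "contains 柴犬, NOT 狐狸": r"^(?!.*狐狸).*柴犬.*$",
}


def matching_lines(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [line for line in candidates if compiled.fullmatch(line)]


for label, pattern in PATTERNS.items():
    print(label, "->", matching_lines(pattern, lines))
```

Each lookahead checks a condition without consuming text, which is why several of them can be stacked at the start of a pattern to combine conditions.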
 
 
== Regular Expression Online Tools ==
 
Websites for testing regular expression syntax:
* {{Gd}} [http://regex101.com/ RegEx101] - “Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript” - Provides syntax explanations
* {{Gd}} [http://gskinner.com/RegExr/ RegExr] - Learn, Build, &amp; Test RegEx - Provides syntax explanations
* [https://regexper.com/ Regexper] - Visual explanation of syntax using diagrams
* [https://jex.im/regulex/ Regulex: JavaScript Regular Expression Visualizer] - Visual explanation using diagrams
* [http://www.rubular.com/ Rubular] - A Ruby regular expression editor and tester
* [http://www.phpliveregex.com/ PHP Live Regex]
* [http://www.regextester.com/ Regex Tester and Debugger Online] - JavaScript, PCRE, PHP
 
 
 
== Common Use Cases ==
 
 
=== Replace Newlines with Commas ===
 
Converting email lists into a format usable by email software:
 
<pre>Original:


Convert to:
</pre>
 
==== Method 1: Sublime Text, EmEditor ====

The syntax applies to [http://www.sublimetext.com/ Sublime Text] and [http://www.emeditor.com/ EmEditor] (the steps below are for EmEditor).

# Menu: Search -&gt; Replace
# Check “Use Regular Expression”
#* Find: <code>\n</code> (newline character)
#* Replace with: <code>,</code>
# Click “Replace all”

==== Method 2: Notepad++ ====

Using [http://notepad-plus-plus.org/ Notepad++]:

# Menu: Find -&gt; Replace
# Search mode: Check “Extended mode” (not “Regular expression”)
#* Find: <code>\n</code> (use <code>\r\n</code> for files with Windows line endings)
#* Replace with: <code>,</code>
# Click “Replace All”

Related: [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Replacing_Newlines How To Replace Line Ends, thus changing the line layout] (last visited: 2010-01-27)


==== Method 3: Microsoft Word ====

Using Microsoft Word 2002:

# Menu: Edit -&gt; Replace
# Check extended mode
#* Find: <code>^p</code> (paragraph mark)
#* Replace with: <code>,</code>
# Click “Replace All”

==== Method 4: Sed command for Linux ====

General form: {{kbd | key=<nowiki>sed 's/string to replace/new string/g' old.filename > new.filename</nowiki>}}<ref>[http://linux.vbird.org/linux_basic/0330regularex.php#sed_replace 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref>

<syntaxhighlight lang="bash">sed ':a;N;$!ba;s/\n/; /g' old.filename > new.filename</syntaxhighlight>
The script first collects the whole file: <code>:a</code> defines a label, <code>N</code> appends the next line to the pattern space, and <code>$!ba</code> jumps back to the label unless the last line has been read. It then replaces every newline with <code>; </code>. Use <code>,</code> as the replacement to get a comma-separated list.<ref>See [http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n unix - sed: How can I replace a newline?]</ref>
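
The same newline-to-comma conversion can be scripted. This is a minimal sketch in Python, assuming the standard <code>re</code> module; the function name <code>newlines_to_commas</code> and the example addresses are hypothetical and only serve as illustration.

```python
import re


def newlines_to_commas(text):
    """Join lines with commas, accepting both LF and CRLF line endings."""
    # strip() removes the trailing newline so the result does not end with a comma.
    return re.sub(r"(?:\r?\n)+", ",", text.strip())


# Hypothetical sample data mixing LF and CRLF line endings.
sample = "alice@example.com\nbob@example.com\r\ncarol@example.com\n"
print(newlines_to_commas(sample))
```

Matching <code>\r?\n</code> rather than <code>\n</code> alone avoids leaving stray carriage returns behind when the file comes from Windows.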
=== Find IP Addresses (IPv4) ===

For [http://notepad-plus-plus.org/ Notepad++] v.5.9.5 (Menu: Find -&gt; Replace, Search mode: “Regular expression”). This version does not support the <code>{n}</code> quantifier syntax:
* Find: <code>\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?</code>

For Sublime Text v. 3.2.21:
* Find: <code>(?:\d{1,3}\.){3}\d{1,3}</code>

Both patterns only check the shape of the address (four groups of one to three digits), so they also match out-of-range values such as <code>999.1.1.1</code>.

Reference: [http://sourceforge.net/projects/notepad-plus/forums/forum/331754/topic/4780602 SourceForge.net: Notepad++: Regular expression for IP addresses]
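
When the matches are processed in a script, the range check that the patterns above lack can be added after matching. The following is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>CANDIDATE</code>, <code>find_ipv4</code> and the sample log line are hypothetical names and data.

```python
import re

# Four dot-separated groups of 1-3 digits that are not part of a longer
# digit/dot sequence (so "1.2.3.4.5" is skipped entirely).
CANDIDATE = re.compile(r"(?<!\d)(?<!\d\.)(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def find_ipv4(text):
    """Return dotted quads whose four parts are all in the range 0-255."""
    return [
        match.group()
        for match in CANDIDATE.finditer(text)
        if all(int(part) <= 255 for part in match.group().split("."))
    ]


log = "ok 192.168.0.1, bad 999.1.1.1, version 1.2.3.4.5, host 10.0.0.255"
print(find_ipv4(log))
```

Checking the numeric range in code keeps the regular expression short; a pure-regex range check is possible but much harder to read.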
=== Remove Black Squares (UNIX Line Endings LF) ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Extended mode”
#* Find: <code>\n\n</code> (2 LF characters)
#* Replace with: <code>\r\n</code> (CR and LF)

After the replacement, the plain-text file no longer shows black squares when it is opened in Windows Notepad.

=== Add Quotes Around Elements ===
==== Add Quotes Around Array Elements ====

<pre>Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸
After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'</pre>
'''Method 1: PHP'''

<syntaxhighlight lang="php">$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸');
// Single quotes around each element
$result = implode(", ", preg_replace('/^(.*?)$/', "'$1'", $users));
// Double quotes around each element (uncomment to use this variant instead)
// $result = implode(", ", preg_replace('/^(.*?)$/', "\"$1\"", $users));
echo $result;</syntaxhighlight>
Thanks, Joshua! More on [http://melikedev.com/2010/02/24/php-wrap-implode-array-elements-in-quotes/ PHP - Wrap Implode Array Elements in Quotes » Me Like Dev]

'''Method 2: Sublime Text or EmEditor'''
* Find: <code>([^\s|,]+)</code>
* Replace with: <code>'\1'</code> (for single quotes) or <code>&quot;\1&quot;</code> (for double quotes)

'''Method 3: Notepad++''' (Enable “Regular expression” search mode)
* Find: <code>([^\s|,]+)</code>
* Replace with: <code>'$1'</code> (for single quotes) or <code>&quot;$1&quot;</code> (for double quotes)

Note: <code>([^\s|,]+)</code> stops at whitespace, so a multi-word item such as <code>Granny Bird</code> is quoted word by word. To keep such items whole, use <code>([^,\s][^,]*)</code> instead.

=== Find Non-ASCII Characters (Chinese/Non-English Text) ===

==== In LibreOffice ====

<pre>[^\u0000-\u0080]+</pre>

==== Find Chinese Characters in Google Sheets ====

Example: If cell {{kbd | key=A2}} contains any Chinese character, display “Chinese”, otherwise display “English”:

<pre>=IF(REGEXMATCH(A2, &quot;[\一-\龥]&quot;), &quot;Chinese&quot;, &quot;English&quot;)</pre>

==== Find Non-ASCII Characters in Google Sheets ====
Extract non-ASCII characters (such as Chinese, Japanese, emoji, etc.) from cell {{kbd | key=A2}}:
<pre>
=IF(ISERROR(REGEXEXTRACT(A2, "[^\x00-\x80]+")), "", REGEXEXTRACT(A2, "[^\x00-\x80]+"))
</pre>

The bare pattern <code>[^\x00-\x80]+</code> also works with the Google Sheets REGEXREPLACE function and in Notepad++ search. For the Multi-Rename tool of Total Commander, use <code>[^\u0000-\u0080]+</code>.<ref>To replace non-English text but keep the . character: <nowiki>[^\u0000-\u0080|.]+ </nowiki></ref>

Explanation of regular expression {{kbd | key=<nowiki>[^\x00-\x80]+</nowiki>}}

* {{kbd | key=<nowiki>[\x00-\x80]</nowiki>}}: Represents character codes 0-128. (1) The standard ASCII range is 0-127 ({{kbd | key=<nowiki>0x00-0x7F</nowiki>}}, i.e. {{kbd | key=<nowiki>[\x00-\x7F]</nowiki>}}).<ref>[https://www.commfront.com/pages/ascii-chart ASCII Chart – CommFront]</ref> (2) Character 128 ({{kbd | key=<nowiki>0x80</nowiki>}}) is actually the first character in the extended ASCII range, not part of the original ASCII standard.<ref>[https://en.wikipedia.org/wiki/UTF-8 UTF-8 - Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/Control_character Control character - Wikipedia]</ref>
* {{kbd | key=<nowiki>[^...]</nowiki>}}: Means "not" these characters
* {{kbd | key=<nowiki>+</nowiki>}}: Means one or more

Overall meaning: Matches one or more non-ASCII characters
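
The effect of the upper bound <code>\x80</code> versus <code>\x7F</code> can be checked directly. The following is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>NON_ASCII</code>, <code>NON_ASCII_LOOSE</code> and <code>non_ascii_runs</code> are illustrative names, and the strict variant with <code>\x7F</code> is this sketch's choice, not the page's pattern.

```python
import re

# Strict variant: everything outside the standard ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
# Pattern as used on this page: also treats character 0x80 as "ASCII".
NON_ASCII_LOOSE = re.compile(r"[^\x00-\x80]+")


def non_ascii_runs(text):
    """Return every run of consecutive non-ASCII characters in text."""
    return NON_ASCII.findall(text)


print(non_ascii_runs("What Does the Fox Say? 12 狐狸怎叫 34"))
# The two patterns differ only for the single character U+0080.
print(NON_ASCII.findall("\x80"), NON_ASCII_LOOSE.findall("\x80"))
```

For typical Chinese or Japanese text the two patterns give the same result, since those characters are far above 0x80.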
 
==== Find Chinese Characters in MySQL ====
 
Find rows where <code>column_name</code> contains Chinese characters:
 
<pre lang="sql">SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';</pre>
 
Find rows where <code>column_name</code> contains only Chinese characters:
<pre lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` REGEXP '^[一-龯]+$';</pre>
 
Explanation:
* {{kbd | key=<nowiki>[一-龯]</nowiki>}} - Character set that matches all characters from "一" to "龯" in Unicode
* "一" has Unicode code point {{kbd | key=<nowiki>U+4E00</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+4E00 “一” U+4E00 CJK Unified Ideograph-4E00 Unicode Character]</ref>
* "龯" has Unicode code point {{kbd | key=<nowiki>U+9FAF</nowiki>}}<ref>[https://www.compart.com/en/unicode/U+9FAF “龯” U+9FAF CJK Unified Ideograph-9FAF Unicode Character]</ref>
* This range (U+4E00-U+9FAF) lies inside the CJK Unified Ideographs block (U+4E00-U+9FFF), which already covers over 99% of daily Chinese usage requirements. [https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B Extension B] and later blocks mainly contain ancient Chinese characters, variant characters, etc., which rarely appear in modern texts.
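
The code points behind the range, and the UTF-8 lead bytes that the <code>HEX(...) REGEXP</code> query relies on, can be inspected with a few lines of code. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>CJK_RANGE</code> and <code>is_all_cjk</code> are illustrative names.

```python
import re

# Same range as the character class [一-龯]: U+4E00 to U+9FAF.
CJK_RANGE = re.compile(r"[\u4e00-\u9faf]+")


def is_all_cjk(text):
    """True when every character of text falls inside U+4E00-U+9FAF."""
    return CJK_RANGE.fullmatch(text) is not None


# Code points of the two range endpoints.
print(hex(ord("一")), hex(ord("龯")))
# UTF-8 bytes of the endpoints; the first byte is what E[4-9] matches in the HEX query.
print("一".encode("utf-8").hex(), "龯".encode("utf-8").hex())
print(is_all_cjk("狐狸怎叫"), is_all_cjk("狐狸 34"))
```

Because the first UTF-8 byte of every character in this range lies between E4 and E9, the hexadecimal form of a matching column contains such a byte at an even offset.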
 
==== Find Non-ASCII Characters in MySQL ====
 
Find rows where <code>column_name</code> is not entirely ASCII characters:
 
<syntaxhighlight lang="sql">SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII)</syntaxhighlight>
 
==== Find Chinese Characters in PHP ====
 
'''Exact match:'''
 
<syntaxhighlight lang="php">$string = '繁體中文'; // sample input
// Approach 1
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}
 
// Approach 2
if (preg_match('/^[\p{Han}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}</syntaxhighlight>
'''Partial match:'''
 
<syntaxhighlight lang="php">// Approach 1
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
 
// Approach 2
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);</syntaxhighlight>
 
=== Find ASCII Characters in PHP ===
 
'''Code I:'''
 
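<syntaxhighlight lang="php">// Note: \x20-\x7f excludes control characters, so tabs and newlines are also reported as non-ASCII.
if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) {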
    echo "The keyword is ASCII only";
} else {
    echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)";
}</syntaxhighlight>
'''Code II:'''
 
<syntaxhighlight lang="php">$pattern = '/^[[:ascii:]]+$/i';
$text = "Hello World"; // ASCII only
if (preg_match($pattern, $text)) {
    echo "Pure ASCII characters";
} else {
    echo "Contains non-ASCII characters";
}</syntaxhighlight>
 
=== Remove Empty Lines ===
 
'''Original:'''
 
<pre>Neo
Trinity
 
Morpheus
 
 
Smith
Oracle</pre>
'''After:'''
 
<pre>Neo
Trinity
Morpheus
Smith
Oracle</pre>
'''Using Sublime Text &amp; EmEditor:'''
* Find: <code>^[\s\t]*$\n</code>
* Replace with: (empty)

'''Using Notepad++ v7.8.7:'''
* Menu: Edit -&gt; Line Operations -&gt; Remove Empty Lines (Including Blank Lines)
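
For files processed in a script, the same clean-up can be written as one substitution. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>remove_blank_lines</code> is an illustrative name, and the pattern is a variant of the editor pattern above that uses <code>[^\S\n]</code> so that a match never runs past its own line.

```python
import re


def remove_blank_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    # (?m) lets ^ match at the start of every line; [^\S\n] is
    # "whitespace except newline", so each match covers exactly one line.
    return re.sub(r"(?m)^[^\S\n]*\n", "", text)


sample = "Neo\nTrinity\n \nMorpheus\n\n \nSmith\nOracle"
print(remove_blank_lines(sample))
```

A whitespace-only last line without a trailing newline is left in place by this sketch.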
 
=== Find Non-Whitespace Text ===
 
* Find: <code>[^\s]+</code>
 
=== Convert Symbol-Separated Text to Line-by-Line Display ===
 
'''Example:'''
 
<pre>Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師
After:
尼歐
莫斐斯
崔妮蒂
史密斯
祭師</pre>
'''Using Sublime Text or EmEditor:'''
* Find: <code>([^、]+)([、]{1})</code>
* Replace with: <code>\1\n</code>
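
The find-and-replace above can also be reproduced in a script. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>one_item_per_line</code> is an illustrative name.

```python
import re


def one_item_per_line(text, separator="、"):
    """Replace each separator that follows an item with a newline."""
    pattern = "([^{0}]+){0}".format(re.escape(separator))
    return re.sub(pattern, r"\1\n", text)


before = "尼歐、莫斐斯、崔妮蒂、史密斯、祭師"
print(one_item_per_line(before))
```

The last item has no separator after it, so it is left untouched and simply ends the output.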
 
=== Replace Multiple Spaces with Tab Characters ===
 
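'''Before:''' <code>aaa bbb    ccc</code>

'''After:''' <code>aaa\tbbb\tccc</code>

'''Using Sublime Text:'''
* Find: <code>([^\S\n]+)</code> or <code>([^\S\r\n]+)</code> (one or more whitespace characters, excluding line breaks)
* Replace with: <code>\t</code>

The alternative <code>\s\s+</code> matches only runs of two or more whitespace characters and can also span line breaks, so it leaves single spaces unchanged.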
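
The same conversion is handy before pasting data into a spreadsheet. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>spaces_to_tabs</code> is an illustrative name.

```python
import re


def spaces_to_tabs(text):
    """Collapse each run of spaces or tabs within a line into one tab."""
    # [^\S\r\n] means "whitespace except CR and LF", so line breaks survive.
    return re.sub(r"[^\S\r\n]+", "\t", text)


print(repr(spaces_to_tabs("aaa bbb    ccc\nx  y")))
```

Excluding <code>\r</code> and <code>\n</code> from the character class is what keeps every record on its own line.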
 
 
=== Add a Comma at the Start of Each Line ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Regular expression”
#* Find: {{kbd | key=(.*)}} or {{kbd | key=^(.*)$}}
#* Replace with: {{kbd | key=,\1}} or {{kbd | key=,$1}}

Reference: [http://stackoverflow.com/questions/8413237/notepad-regex-search-replace-how-to-append-and-prepend-a-character-at-start-a Notepad++ RegEx Search/Replace: How to append and prepend a character at start and end of each file line? - Stack Overflow]

=== Find Text When Only the Beginning and the End Are Known ===

Using Notepad++:
# Menu: Find -&gt; Replace
# Search mode: Check “Regular expression”
#* Find: {{kbd | key=a(.*)le}} matches text that starts with “a” and ends with “le”, such as (1) apple and (2) apps lesson. The middle part may contain spaces. {{exclaim}} When searching for Chinese strings, converting the file to UTF-8 encoding is recommended.

=== Remove Empty Lines: Other Patterns ===

Remove one or more empty lines, including lines that contain only whitespace:
* Find: {{kbd | key=<nowiki>^[\s\t]*$\n</nowiki>}} --> Replace with: (empty) (works in Sublime Text and EmEditor, {{exclaim}} does not work in Notepad++)<ref>[http://www.sitepoint.com/forums/showthread.php?448843-Regex-delete-multiple-blank-lines Regex: delete multiple blank lines]</ref>
* Notepad++ menu: Edit -&gt; Line Operations -&gt; Remove Empty Lines (Containing Blank characters)<ref>[http://stackoverflow.com/questions/3866034/removing-empty-lines-in-notepad regex - Removing empty lines in Notepad++ - Stack Overflow]</ref>

Remove one or more completely empty lines:
* Find: {{kbd | key=<nowiki>^$\n</nowiki>}} --> Replace with: (empty) (works in Sublime Text and EmEditor, {{exclaim}} does not work in Notepad++)
* Find: {{kbd | key=<nowiki>\r\n[\r\n]*</nowiki>}} or {{kbd | key=<nowiki>\r\n[\r\n]+</nowiki>}} --> Replace with: {{kbd | key=<nowiki>\r\n</nowiki>}} (works in Notepad++ with “Regular expression” checked)
* Find: {{kbd | key=<nowiki>\n(\n)+</nowiki>}} --> Replace with: {{kbd | key=<nowiki>\n</nowiki>}} (works in Sublime Text with “regular expression” checked)

Remove a single empty line:
* Find: {{kbd | key=<nowiki>\n\n</nowiki>}} --> Replace with: {{kbd | key=<nowiki>\n</nowiki>}} (works in Sublime Text and EmEditor with “Use Regular Expression” checked)

=== Remove Leading/Trailing Whitespace ===

==== Remove Leading Whitespace ====

* Find: <code>^\s+</code>
* Replace with: (empty)

==== Remove Trailing Whitespace ====

* Find: <code>\s+$</code>
* Replace with: (empty)

==== Remove Both Leading and Trailing Whitespace ====

* Find: <code>(^\s+|\s+$)</code>
* Replace with: (empty)
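
Trimming leading and trailing whitespace on every line can also be done in a script. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>trim_lines</code> is an illustrative name. Unlike the editor patterns <code>^\s+</code> and <code>\s+$</code>, the sketch uses <code>[^\S\n]</code> (whitespace except newline) so that empty lines are not swallowed along the way.

```python
import re


def trim_lines(text):
    """Strip leading and trailing spaces or tabs from every line."""
    # (?m) makes ^ and $ match at each line boundary, not only at the
    # start and end of the whole string.
    return re.sub(r"(?m)^[^\S\n]+|[^\S\n]+$", "", text)


print(repr(trim_lines("  apple  \n\tbanana\t\n cherry")))
```

Blank lines stay in place with this variant; they only lose any whitespace they contained.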


== Text Editors Supporting Regular Expressions ==

Various text editors support regular expressions, including:
* Sublime Text
* EmEditor
* Notepad++
* Visual Studio Code
* Atom
* Vim/Neovim

== Batch Actions ==
* {{Gd}} [https://github.com/facelessuser/RegReplace RegReplace]: runs multiple replace commands in sequence. “Simple find and replace sequencer plugin for Sublime Text” (quoted from the official webpage). {{access | date=2014-10-25}}

== Syntax Reference ==

* Newline character: <code>\r\n</code> (for Notepad++: Extended mode &amp; Regular expression mode)
* Tab character: <code>\t</code> (for Notepad++: Extended mode)
* Digits: <code>\d</code> (for Notepad++: Regular expression mode only, {{exclaim}} not available in Extended mode)
* Non-whitespace: <code>\S</code> - Does not include half-width spaces and full-width spaces

== Troubleshooting Regular Expressions ==

'''Tips:'''
# Use online tools like regex101 to understand your syntax
# Test with small data: prepare a small file to verify the syntax
# Highlight or output matched text for debugging
# Simplify the syntax when encountering issues
# Try alternative syntax when compatibility is an issue (e.g., <code>\d</code> to <code>[0-9]</code>)

== Search Unmatched Strings ==
=== Case: Find Un-commented console.log ===
Original format: some lines contain un-commented [[Javascript debug]] information
<pre>
  console.log("un-commented debug information");
  //console.log("commented debug information");
</pre>

Search pattern: find the string "console.log" when the character before it is not the / symbol. The pattern needs at least one character before "console.log", so a call at the very start of a line is not matched.
<pre>
  [^/](console\.log)
</pre>
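
The search for un-commented <code>console.log</code> calls can be scripted with a negative lookahead, which also handles calls at the very start of a line. This is a minimal sketch in Python, assuming the standard <code>re</code> module; <code>UNCOMMENTED_LOG</code>, <code>uncommented_log_lines</code> and the sample source are illustrative.

```python
import re

# A line qualifies when it does not begin with "//" (after optional
# indentation) and contains console.log somewhere.
UNCOMMENTED_LOG = re.compile(r"(?!\s*//).*console\.log")


def uncommented_log_lines(source):
    """Return the lines that contain an un-commented console.log call."""
    return [line for line in source.splitlines() if UNCOMMENTED_LOG.match(line)]


source = (
    '  console.log("un-commented debug information");\n'
    '  //console.log("commented debug information");\n'
    'console.log("at line start");\n'
    '// console.log("commented, with a space");'
)
print(uncommented_log_lines(source))
```

This sketch only looks at whole-line <code>//</code> comments; a call inside a trailing comment or a <code>/* ... */</code> block would still be reported.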


== Alternative Solutions ==

* Use Tab-separated data that can be easily pasted into Google Sheets or MS Excel: values separated by {{kbd |key=Tab}} are stored in separate columns automatically. In the raw data, put a {{kbd |key=Tab}} before and after each piece of data you want to extract, then copy and paste it into the spreadsheet for further processing.
* Copy multiple rows and paste between different applications (compatibility varies):
** Copy to Dreamweaver from MS Excel 2002: ok
** Copy to Dreamweaver from Google Docs: not ok {{exclaim}}
** Copy to MS Excel 2002 from Google Docs: ok

== Further Reading ==

* {{Gd}} [http://regexlib.com/ Regular Expression Library]
* Regular-Expressions.info - Regex Tutorial, Examples and Reference: [http://www.regular-expressions.info/unicode.html Regex Tutorial - Unicode Characters and Properties] {{access | date = 2014-04-02}}
* Unicode character properties documentation: [http://php.net/manual/en/regexp.reference.unicode.php PHP: Unicode character properties - Manual] {{access | date = 2014-04-02}}
* Platform-specific regular expression documentation: [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Searching_And_Replacing SourceForge.net: Searching And Replacing - notepad-plus], [http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions SourceForge.net: Regular Expressions - notepad-plus]
* [http://stackoverflow.com/questions/23020856/text-extraction-with-sublime-text regex - text extraction with sublime text - Stack Overflow] {{access | date=2014-09-26}}
* [http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters javascript - Regular expression to match non-english characters? - Stack Overflow]

== References ==
<references/>

{{Template: Data factory flow}}

[[Category: Regular expression]]
[[Category: Software]]
[[Category: Programming]]
[[Category: Data Science]]
[[Category: Search]]
[[Category: String manipulation]]
[[Category: Revised with LLMs]]

Latest revision as of 11:55, 11 December 2025

When processing text files through regular expressions, you can quickly search for or replace strings that match specific rules. Processing is done on a line-by-line basis for string manipulation. Regular expressions are also known as regex, regexp, or pattern matching expressions.

🌐 Switch language: English, 漢字


Raise_hand.png Need Help? You can use the provided explanatory online tools to try debugging yourself.


Quick Reference Table[edit]

Note: (1) Blue highlighted areas in samples represent text matching the rules, (2) The same text rule can have multiple representations

Each entry lists a text rule, its pattern, and a sample; where one exists, the opposite text rule follows.

Text rule: Any single character (including spaces, but not newline)
Pattern: .
Sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any character (including spaces), appearing 0 or 1 times
Pattern: .? = .{0,1}
Sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any characters (including spaces), any number of times (0 or more)
Pattern: .* = .{0,}
Sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any characters (including spaces), at least 1 occurrence
Pattern: .+ = .{1,}
Sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any number of spaces or newlines (at least 1 occurrence)
Pattern: \s+
Sample: What Does the Fox Say? 12 狐狸怎叫 34
Opposite text rule: Any number of characters (not including spaces or newlines)
Opposite pattern: [^\s]+ = [^\s]{1,} = [\S]+ = [^ ]+
Opposite sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any number of ASCII characters (including English letters, numbers and spaces)
Pattern: [\x00-\x80]+ or [[:ascii:]]+
Sample: What Does the Fox Say? 12 狐狸怎叫 34
Opposite text rule: Any number of non-ASCII characters, e.g., Chinese characters
Opposite pattern: [^\x00-\x80]+
Opposite sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any number of uppercase/lowercase English letters, numbers and underscore (_) (not including spaces)
Pattern: [\w]+ = [a-zA-Z0-9_]+
Note: In PHP, \w also matches Chinese characters when the u modifier is set
Sample: What Does the Fox Say? 12 狐狸怎叫 _34
Opposite text rule: Any number of characters that are not English letters, numbers or underscore (_)
Opposite pattern: \W+ = [^a-zA-Z0-9_]+

Text rule: Any number of digits (not including spaces)
Pattern: [\d]+ = [0-9]+
Sample: What Does the Fox Say? 12 狐狸怎叫 34
Opposite text rule: Any number of non-digit characters (including spaces)
Opposite pattern: [^\d]+ = [^0-9]+ = \D+
Opposite sample: What Does the Fox Say? 12 狐狸怎叫 34

Text rule: Any number of Chinese characters
Pattern: [\p{Han}]+
Sample: What Does the Fox Say? 12 狐狸怎叫 34
Opposite text rule: Any number of characters not including Chinese
Opposite pattern: [^\p{Han}]+

Text rule: Lines starting with “狐狸”
Pattern: ^狐狸.*$
Sample:
狐狸怎叫 34 What Does the Fox Say?
柴犬怎叫 What Does the shiba inu say?
Opposite text rule: Lines not starting with “狐狸”
Opposite pattern: ^(?!狐狸).*$
Opposite sample:
狐狸怎叫 34 What Does the Fox Say?
柴犬怎叫 What Does the shiba inu say?

Text rule: Lines ending with “怎叫”
Pattern: ^.*怎叫$
Sample:
What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫
Opposite text rule: Lines not ending with “怎叫”
Opposite pattern: .*(?<!怎叫)$
Opposite sample:
What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫

Text rule: Lines containing “狐狸”
Pattern: ^.*狐狸.*$ or (狐狸)
Sample:
What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫
Opposite text rule: Lines not containing “狐狸”
Opposite pattern: ^((?!狐狸).)*$
Opposite sample:
What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫
叫.*狐狸 What Does the Fox Say? 12 狐狸怎叫 34
What Does the Fox Say? 12 不叫狐狸 34
What Does the shiba inu say? 柴犬怎叫
叫).* What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫

What Does the shiba inu say? 柴犬怎了
柴犬).)*$ What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫
What Does the Husky say? 哈士奇怎叫
Boolean logic NOT: Lines not containing “狐狸” but containing “柴犬”
^(?!.*狐狸).*柴犬.*$
What Does the Fox Say? 12 狐狸怎叫 34
What Does the shiba inu say? 柴犬怎叫
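
The line-level rules in the table can also be tried outside an editor. The following is a minimal sketch in Python, assuming the standard `re` module; the helper name `matching_lines` and the two-line sample are illustrative only and not part of any tool mentioned on this page.

```python
import re

LINES = [
    "What Does the Fox Say? 12 狐狸怎叫 34",
    "What Does the shiba inu say? 柴犬怎叫",
]

def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Lines containing 狐狸
contains_fox = matching_lines(r"^.*狐狸.*$", LINES)
# Lines not containing 狐狸 (negative lookahead checked at every position)
without_fox = matching_lines(r"^((?!狐狸).)*$", LINES)
# Lines not ending with 怎叫 (negative lookbehind just before end of line)
not_ending = matching_lines(r".*(?<!怎叫)$", LINES)

for name, result in [("contains", contains_fox),
                     ("without", without_fox),
                     ("not ending", not_ending)]:
    print(name, result)
```

Python's built-in `re` module does not support `\p{Han}`, so this sketch sticks to the lookaround rules.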


Regular Expression Online Tools

Websites for testing regular expression syntax:


Common Use Cases

Replace Newlines with Commas

Converting email lists into a format usable by email software:

Original:
[email protected]
[email protected]
[email protected]

Convert to:
[email protected],[email protected],[email protected]

Method 1: Sublime Text, EmEditor

  1. Menu: Search -> Replace
  2. Check “Use Regular Expression”
    • Find: \n (newline character)
    • Replace with: ,
  3. Click “Replace all”


Method 2: Notepad++

  1. Menu: Find -> Replace
  2. Search mode: Check “Extended mode” (not “Regular expression”)
    • Find: \n
    • Replace with: ,
  3. Click “Replace All”


Method 3: Microsoft Word

  1. Menu: Edit -> Replace
  2. Check extended mode
    • Find: ^p (paragraph mark)
    • Replace with: ,
  3. Click “Replace All”

Method 4: Sed command for Linux

sed ':a;N;$!ba;s/\n/,/g' old.filename > new.filename
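
The same conversion can be scripted. This is a minimal sketch in Python, assuming the list arrives as one string with LF or CRLF line endings; the function name `join_lines` and the example.com addresses are hypothetical placeholders.

```python
import re

def join_lines(text, separator=","):
    """Replace each run of line breaks with the separator.

    Leading and trailing whitespace is stripped first so the result
    does not start or end with a separator.
    """
    return re.sub(r"(?:\r?\n)+", separator, text.strip())

# Hypothetical sample data
sample = "a@example.com\nb@example.com\r\nc@example.com\n"
print(join_lines(sample))
```

Passing `separator="; "` gives a semicolon-separated list for mail clients that expect that format.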

Find IP Addresses (IPv4)

For Notepad++ v.5.9.5:
  • Find: \d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?

For Sublime Text v. 3.2.21:
  • Find: (?:\d{1,3}\.){3}\d{1,3}
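
Both patterns accept any one to three digits per part, so a string such as 999.1.1.1 also matches. A minimal sketch in Python that reuses the second pattern and then checks that each part is at most 255; the `\b` word boundaries and the name `find_ipv4` are additions made for this sketch, not part of the editor recipes above.

```python
import re

# Same shape as the editor pattern, wrapped in word boundaries
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ipv4(text):
    """Return dotted-quad strings whose four parts are each in 0-255."""
    found = []
    for match in IPV4_CANDIDATE.finditer(text):
        candidate = match.group()
        if all(int(part) <= 255 for part in candidate.split(".")):
            found.append(candidate)
    return found

print(find_ipv4("ok 192.168.0.1 and 10.0.0.255, bad 999.1.1.1"))
```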

Remove Black Squares (UNIX Line Endings LF)

Using Notepad++:
  1. Menu: Find -> Replace
  2. Search mode: Check “Extended mode”
    • Find: \n\n (2 LF characters)
    • Replace with: \r\n (CR and LF)

Add Quotes Around Elements

Add Quotes Around Array Elements

Before: Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸
After: 'Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸'

Method 1: PHP

$users = array('Elmo', 'Emie', 'Granny Bird', 'Herry Monster', '喀喀獸');
// Single quotes around each element
$result = implode(", ", preg_replace('/^(.*?)$/', "'$1'", $users));
echo $result . PHP_EOL;
// Double quotes around each element
$result = implode(", ", preg_replace('/^(.*?)$/', "\"$1\"", $users));
echo $result . PHP_EOL;

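Method 2: Sublime Text or EmEditor
  • Find: ([^,\s][^,]*)
  • Replace with: '\1' (for single quotes) or "\1" (for double quotes)

Method 3: Notepad++ (enable “Regular expression” search mode)
  • Find: ([^,\s][^,]*)
  • Replace with: '$1' (for single quotes) or "$1" (for double quotes)

The pattern starts each element at the first character that is neither a comma nor whitespace and extends it to the next comma, so multi-word elements such as Granny Bird stay intact.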
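
For scripted use, here is a minimal sketch in Python. It assumes elements are separated by commas and may contain inner spaces; the pattern `[^,\s][^,]*` and the name `quote_elements` are choices made for this sketch.

```python
import re

def quote_elements(text, quote="'"):
    """Wrap each comma-separated element in the given quote character.

    An element starts at the first character that is neither a comma nor
    whitespace and runs up to the next comma; trailing spaces are trimmed.
    """
    return re.sub(
        r"[^,\s][^,]*",
        lambda m: quote + m.group().rstrip() + quote,
        text,
    )

print(quote_elements("Elmo, Emie, Granny Bird, Herry Monster, 喀喀獸"))
print(quote_elements("Elmo, Emie", quote='"'))
```

A replacement function is used instead of a replacement string so the quote character never needs escaping.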

Find Non-ASCII Characters (Chinese/Non-English Text)

In LibreOffice

[^\u0000-\u0080]+


Find Chinese Characters in Google Sheets

Example: If cell A2 contains any Chinese character, display “Chinese”, otherwise display “English”:

=IF(REGEXMATCH(A2, "[一-龥]"), "Chinese", "English")

Find Non-ASCII Characters in Google Sheets

Extract non-ASCII characters (such as Chinese, Japanese, emoji, etc.) from cell A2

=IF(ISERROR(REGEXEXTRACT(A2, "[^\x00-\x80]+")), "", REGEXEXTRACT(A2, "[^\x00-\x80]+"))

Explanation of regular expression [^\x00-\x80]+

  • [\x00-\x80]: Covers character codes 0-128. Standard ASCII is only 0-127 (0x00-0x7F, i.e., [\x00-\x7F]).[1] Character 128 (0x80) is the first character of the extended range and is not part of the original ASCII standard,[2][3] so [^\x00-\x7F]+ is the stricter form of this pattern.
  • [^...]: Means "not" these characters
  • +: Means one or more

Overall meaning: Matches one or more non-ASCII characters
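
The same extraction can be checked in a script. This is a minimal sketch in Python that uses the stricter range `[^\x00-\x7F]`; the function name `extract_non_ascii` is hypothetical.

```python
import re

# One or more consecutive characters outside standard ASCII (0x00-0x7F)
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

def extract_non_ascii(text):
    """Return every run of non-ASCII characters, in order of appearance."""
    return NON_ASCII_RUN.findall(text)

print(extract_non_ascii("Fox 狐狸 says 怎叫!"))
```

Unlike REGEXEXTRACT, which returns only the first match, `findall` returns all runs; an empty list plays the role of the ISERROR branch in the formula.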

Find Chinese Characters in MySQL

Find rows where column_name contains Chinese characters:

SELECT `column_name`
FROM `table_name`
WHERE HEX(`column_name`) REGEXP '^(..)*(E[4-9])';

Find rows where column_name contains only Chinese characters:

SELECT `column_name`
FROM `table_name`
WHERE `column_name` REGEXP '^[一-龯]+$';

Explanation:

  • [一-龯] - Character set that matches every character from "一" to "龯" in Unicode order
  • "一" has Unicode code point U+4E00[4]
  • "龯" has Unicode code point U+9FAF[5]
  • The range U+4E00-U+9FAF lies within the CJK Unified Ideographs block (U+4E00-U+9FFF), which already covers over 99% of daily Chinese usage requirements. Extension B and later blocks mainly contain ancient Chinese characters, variant characters, etc., which rarely appear in modern texts.
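
The "contains" and "contains only" checks can be compared side by side in a script. This is a minimal sketch in Python, assuming the whole CJK Unified Ideographs block U+4E00-U+9FFF is an acceptable approximation of "Chinese characters"; the function names are hypothetical, and Python's built-in `re` module has no `\p{Han}`, so an explicit range is used.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")

def contains_chinese(text):
    """True if at least one character falls in U+4E00-U+9FFF."""
    return CJK_CHAR.search(text) is not None

def is_all_chinese(text):
    """True if the text is non-empty and every character is in U+4E00-U+9FFF."""
    return CJK_ONLY.fullmatch(text) is not None

for sample in ["狐狸怎叫", "狐狸 fox", "fox"]:
    print(sample, contains_chinese(sample), is_all_chinese(sample))
```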

Find Non-ASCII Characters in MySQL

Find rows where column_name is not entirely ASCII characters:

SELECT `column_name`
FROM `table_name`
WHERE `column_name` <> CONVERT(`column_name` USING ASCII);

Find Chinese Characters in PHP

Exact match:

// Approach 1: explicit code point range U+4E00 to U+9FA5
$string = '繁體中文'; // sample input
if (preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}

// Approach 2: Unicode script property \p{Han}
if (preg_match('/^[\p{Han}]+$/u', $string)) {
    echo "All text is Chinese characters" . PHP_EOL;
} else {
    echo "Some text is not Chinese characters" . PHP_EOL;
}

Partial match:

// Approach 1
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

// Approach 2
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);

Find ASCII Characters in PHP

Code I:

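// Note: [^\x20-\x7f] also flags ASCII control characters such as tab and newline
$keyword = 'Hello World'; // sample input
if (preg_match('/[^\x20-\x7f]/', $keyword) === 0) {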
    echo "The keyword is ASCII only";
} else {
    echo "The keyword contains non-ASCII characters (like Chinese, Japanese, etc.)";
}

Code II:

$pattern = '/^[[:ascii:]]+$/';
$text = "Hello World"; // ASCII only
if (preg_match($pattern, $text)) {
    echo "Pure ASCII characters";
} else {
    echo "Contains non-ASCII characters";
}

Remove Empty Lines

Original:

Neo
Trinity

Morpheus


Smith
Oracle

After:

Neo
Trinity
Morpheus
Smith
Oracle

Using Sublime Text & EmEditor:
  • Find: ^[\s\t]*$\n
  • Replace with: (empty)

Using Notepad++ v7.8.7:
  • Menu: Edit -> Line Operations -> Remove Empty Lines (Including Blank Lines)
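
A scripted equivalent, as a minimal sketch in Python. It assumes a line counts as empty when it holds only spaces or tabs, and the name `remove_blank_lines` is hypothetical.

```python
import re

def remove_blank_lines(text):
    """Drop lines that are empty or contain only spaces and tabs.

    (?m) makes ^ match at the start of every line; each blank line is
    removed together with its line break.
    """
    return re.sub(r"(?m)^[ \t]*\r?\n", "", text)

sample = "Neo\nTrinity\n\nMorpheus\n\n\nSmith\nOracle\n"
print(remove_blank_lines(sample))
```

The character class is limited to spaces and tabs so that the pattern cannot swallow several line breaks in one match.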

Find Non-Whitespace Text

  • Find: [^\s]+

Convert Symbol-Separated Text to Line-by-Line Display

Example:

Before: 尼歐、莫斐斯、崔妮蒂、史密斯、祭師
After:
尼歐
莫斐斯
崔妮蒂
史密斯
祭師

Using Sublime Text or EmEditor:
  • Find: ([^、]+)([、]{1})
  • Replace with: \1\n
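
In a script the same result comes from splitting on the delimiter rather than replacing it. This is a minimal sketch in Python; the function name `split_to_lines` and the `delimiters` parameter are hypothetical.

```python
import re

def split_to_lines(text, delimiters="、"):
    """Split on any of the delimiter characters and put each item on its own line."""
    # re.escape keeps characters such as ] or - safe inside the character class
    parts = re.split("[" + re.escape(delimiters) + "]", text)
    return "\n".join(part.strip() for part in parts if part.strip())

print(split_to_lines("尼歐、莫斐斯、崔妮蒂、史密斯、祭師"))
```

Empty items, for example from a trailing delimiter, are dropped instead of producing blank lines.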

Replace Multiple Spaces with Tab Characters

Before: aaa bbb ccc
After: aaa\tbbb\tccc

Using Sublime Text:
  • Find: ([^\S\n]+) or ([^\S\r\n]+) or \s\s+
  • Replace with: \t
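
A minimal sketch in Python, assuming only runs of two or more horizontal whitespace characters should become a tab, so that single spaces inside a field survive. The quantifier `{2,}` and the name `spaces_to_tabs` are choices made for this sketch.

```python
import re

def spaces_to_tabs(text):
    """Replace each run of 2+ whitespace characters (excluding line breaks) with one tab."""
    # [^\S\r\n] means "whitespace, but not CR or LF"
    return re.sub(r"[^\S\r\n]{2,}", "\t", text)

print(repr(spaces_to_tabs("aaa   bbb  ccc\nddd    eee")))
```

Unlike `\s\s+`, this class never matches line breaks, so rows stay on separate lines.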


Remove Leading/Trailing Whitespace

Remove Leading Whitespace

  • Find: ^\s+
  • Replace with: (empty)


Remove Trailing Whitespace

  • Find: \s+$
  • Replace with: (empty)


Remove Both Leading and Trailing Whitespace

  • Find: (^\s+|\s+$)
  • Replace with: (empty)
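
Applied per line in a script, the combined pattern needs multiline mode. This is a minimal sketch in Python; it swaps `\s` for `[^\S\r\n]` so that line breaks themselves are never removed, and the name `trim_lines` is hypothetical.

```python
import re

def trim_lines(text):
    """Strip spaces and tabs from the start and end of every line."""
    # (?m) lets ^ and $ match at each line boundary, not only at the
    # start and end of the whole string
    return re.sub(r"(?m)^[^\S\r\n]+|[^\S\r\n]+$", "", text)

print(repr(trim_lines("  Neo  \n\tTrinity \n")))
```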


Text Editors Supporting Regular Expressions

Text editors that support regular expressions include:
  • Sublime Text
  • EmEditor
  • Notepad++
  • Visual Studio Code
  • Atom
  • Vim/Neovim


Syntax Reference

  • Newline character: \r\n (for Notepad++: Extended mode & Regular expression mode)
  • Tab character: \t (for Notepad++: Extended mode)
  • Digits: \d (for Notepad++: Regular expression mode only)
  • Non-whitespace: \S - Does not include half-width spaces and full-width spaces

Troubleshooting Regular Expressions

Tips:
  1. Use online tools like regex101 to understand your syntax
  2. Test with small data: prepare a small file to verify the syntax
  3. Highlight or output the matched text for debugging
  4. Simplify the syntax when you run into issues
  5. Try an alternative syntax when compatibility is the problem (e.g., replace \d with [0-9])


Alternative Solutions

  • Use Tab-separated data that can be easily pasted into Google Sheets or MS Excel
  • Copy multiple rows and paste between different applications (compatibility varies)

Further Reading

  • Regular-Expressions.info - Regex Tutorial, Examples and Reference
  • Unicode character properties documentation
  • Platform-specific regular expression documentation
