Regular expression: Difference between revisions

2,641 bytes added, 4 December 2019
m
(15 intermediate revisions by the same user not shown)
Line 1: Line 1:
When processing text files, regular expressions (正規表示法) let you quickly search for or replace strings that match specific rules. Strings are processed one line at a time<ref>[http://linux.vbird.org/linux_basic/0330regularex.php 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref>. In Chinese, regular expressions are also known as 正規表示式、正則表達式、正規表示法、正規運算式、規則運算式、常規表示法<ref>[https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F 正規表示式 - 維基百科,自由的百科全書]</ref>.
When processing text files, regular expressions (正規表示法) let you quickly search for or replace strings that match specific rules. Strings are processed one line at a time<ref>[http://linux.vbird.org/linux_basic/0330regularex.php 鳥哥的 Linux 私房菜 -- 正規表示法 (regular expression, RE) 與文件格式化處理]</ref>. In Chinese, regular expressions are also known as 正規表示式、正規表達式、正則表達式、正規表示法、正規運算式、規則運算式、常規表示法<ref>[https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F 正規表示式 - 維基百科,自由的百科全書]</ref>.


{{Raise hand | text = Have a question? Try debugging it yourself with the [[Regular_expression#Regular_expression_online_tools | online tools]] that explain the syntax. You can also ask on [http://www.ptt.cc/bbs/RegExp/index.html 看板 RegExp 文章列表 - 批踢踢實業坊] or another [[問答服務 | Q&A service]]. }}
Line 5: Line 5:
== Quick reference ==
Notes: (1) in the sample column, text with a blue background is the text that matches the rule; (2) the same text rule can be written in more than one way.
<table border="1" style="width:100%">
<table border="1" style="width:100%" class="wikitable">
<tr >
<tr >
<th style="background-color: #E0E0E0;"> Text rule </th>
Line 43: Line 43:
</tr>
</tr>
<tr>
<tr>
<td> One or more ASCII characters (including English letters, digits, and spaces) [http://regexr.com/3aom2 demo]<ref>[http://www.asciitable.com/ understand]</ref> <br /> {{kbd | key = <nowiki>[\x00-\x80]+</nowiki>}}</td>
<td> One or more ASCII characters (including English letters, digits, and spaces) [http://regexr.com/3aom2 demo]<ref>[http://www.asciitable.com/ Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion]</ref> <br /> {{kbd | key = <nowiki>[\x00-\x80]+</nowiki>}} or {{kbd | key = <nowiki>[[:ascii:]]+</nowiki>}}<ref>[https://stackoverflow.com/questions/24903140/regex-for-any-english-ascii-character-including-special-characters php - Regex for Any English ASCII Character Including Special Characters - Stack Overflow]</ref></td>
<td><span style="background:#C6E3FF">What Does the Fox Say? 12</span> 狐狸怎叫 34</td>
<td>One or more non-ASCII characters, for example Chinese text<br /> {{kbd | key = <nowiki>[^\x00-\x80]+</nowiki>}}</td>
Line 49: Line 49:
</tr>
</tr>
<tr>
<tr>
<td> One or more English letters, digits, and underscores ( _ ) (no whitespace) <br /> {{kbd | key = <nowiki>[\w]+</nowiki>}} = {{kbd | key = <nowiki>[a-zA-Z0-9_]+</nowiki>}} </td>
<td> One or more uppercase or lowercase English letters, digits, and underscores ( _ ) (no whitespace) <br /> {{kbd | key = <nowiki>[\w]+</nowiki>}} = {{kbd | key = <nowiki>[a-zA-Z0-9_]+</nowiki>}} </td>
<td><span style="background:#C6E3FF">What</span> Does the Fox Say? 12 狐狸怎叫 34</td>
<td> One or more characters that are not English letters, digits, or underscores ( _ ) <br /> {{kbd | key = <nowiki>\W+</nowiki>}} = {{kbd | key = <nowiki>[^a-zA-Z0-9_]+</nowiki>}}</td>
Line 59: Line 59:
<td>One or more non-digit characters (including whitespace) <br /> {{kbd | key = <nowiki>[^\d]+</nowiki>}} = {{kbd | key = <nowiki>[^0-9]+</nowiki>}} = {{kbd | key = <nowiki>\D+</nowiki>}} </td>
<td><span style="background:#C6E3FF">What Does the Fox Say? </span>12 狐狸怎叫 34</td>
</tr>
<tr>
<td> One or more Chinese (Han) characters <br /> {{kbd | key = <nowiki>[\p{Han}]+</nowiki>}} ([https://regex101.com/r/UYkdml/1 demo], [[Regular expression#尋找中文、非英文的文字 | details]])</td>
<td>What Does the Fox Say? 12 <span style="background:#C6E3FF">狐狸怎叫</span> 34</td>
<td>One or more characters that are not Chinese (Han) characters <br /> {{kbd | key = <nowiki>[^\p{Han}]+</nowiki>}} ([https://regex101.com/r/Nk9GdA/1 demo])</td>
<td></td>
</tr>
</tr>
<tr>
<tr>
Line 130: Line 136:


== Regular expression online tools ==
Websites for testing regular expression syntax
* {{Gd}} [http://regex101.com/ RegEx101] "Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript" ([http://regex101.com/r/tH1eT7/1 example]). Explains the syntax. Tutorial: [https://www.minwt.com/webdesign-dev/html/20352.html RegEx101正規表示法線上產生器,有沒有選到立馬告訴你|梅問題.教學網]
* {{Gd}} [http://gskinner.com/RegExr/ RegExr]: Learn, Build, & Test RegEx ([http://regexr.com/395t0 example]). Explains the syntax. Tutorial: [http://blog.hsdn.net/1426.html RegExr: 功能強大的正規式撰寫協助工具]
Line 243: Line 250:
* [https://www.hexdictionary.com/ Hex Dictionary | Convert Hex / Hexadecimal Numbers to Binary and Decimal]


=== Find IP address ===
=== Find IP address (IPv4) ===
Using [http://notepad-plus-plus.org/ Notepad++] v.5.9.5
Applies to [http://notepad-plus-plus.org/ Notepad++] v.5.9.5
# Menu: Search -> Replace
# Search mode: select "Regular expression" (「用類型表式」)
## Find what: \d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?
## Find what: {{kbd | key=<nowiki>\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?</nowiki>}}


Note: the {n} syntax is not supported.
Applies to [https://www.sublimetext.com/ Sublime Text] v. 3.2.21
# Find: {{kbd | key=<nowiki>(?:\d{1,3}\.){3}\d{1,3}</nowiki>}}
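
The two search patterns above can also be tried outside an editor. The following is a minimal sketch in Python (an assumption; this section only names Notepad++ and Sublime Text). The helper names <code>find_ipv4_candidates</code> and <code>is_valid_ipv4</code> and the sample text are hypothetical. Because <code>\d{1,3}</code> also accepts values such as 999, the sketch adds a separate 0-255 range check.

```python
import re

# Pattern from the Sublime Text step: three "1-3 digits plus dot" groups, then 1-3 digits.
IPV4_CANDIDATE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")

# Pattern from the Notepad++ step, written without the {n} syntax.
IPV4_CANDIDATE_NO_BRACES = re.compile(r"\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?")


def find_ipv4_candidates(text):
    """Return every substring that looks like a dotted quad (hypothetical helper)."""
    return IPV4_CANDIDATE.findall(text)


def is_valid_ipv4(candidate):
    """Check that each of the four parts is within 0-255 (hypothetical helper)."""
    return all(0 <= int(part) <= 255 for part in candidate.split("."))


# Hypothetical sample text, not taken from the article.
sample = "Gateway 192.168.1.1, bogus 999.10.10.10, DNS 8.8.8.8"

candidates = find_ipv4_candidates(sample)
candidates_no_braces = IPV4_CANDIDATE_NO_BRACES.findall(sample)
valid = [c for c in candidates if is_valid_ipv4(c)]

print(candidates)
print(valid)
```

Both patterns find the same three candidates in this sample; only the range check removes 999.10.10.10.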


References:
* [https://www.regular-expressions.info/ip.html How to Find or Validate an IP Address] {{access | date = 2019-06-05}}
* [http://sourceforge.net/projects/notepad-plus/forums/forum/331754/topic/4780602 SourceForge.net: Notepad++: Regular expression for IP addresses]
* [http://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses regex - Regular expression that matches valid IPv6 addresses - Stack Overflow] {{access | date = 2015-08-10}}
Line 349: Line 360:


=== 尋找中文、非英文的文字 ===
Applies to: the [https://support.google.com/docs/answer/3098245?hl=zh-Hant RegExReplace] function in Google Drive spreadsheets and the Notepad++ search
Applies to: the regular expression functions in Google Drive spreadsheets, for example [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH], [https://support.google.com/docs/answer/3098244?hl=en REGEXEXTRACT], and [https://support.google.com/docs/answer/3098245?hl=zh-Hant RegExReplace], and the Notepad++ search
<pre>
[^\x00-\x80]+
</pre>
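
The pattern above can be checked with a short script. This is a minimal sketch in Python (an assumption; the tools named here are Google spreadsheets and Notepad++), using the sample sentence from the quick reference table. Note that ASCII strictly ends at <code>\x7F</code>, so the range <code>\x00-\x80</code> also covers the control character U+0080; that does not affect this sample.

```python
import re

# Pattern from the article: one or more characters outside the range \x00-\x80.
NON_ASCII_RUN = re.compile(r"[^\x00-\x80]+")

# Complementary pattern from the quick reference table.
ASCII_RUN = re.compile(r"[\x00-\x80]+")

# Sample sentence used in the quick reference table.
sample = "What Does the Fox Say? 12 狐狸怎叫 34"

non_ascii_runs = NON_ASCII_RUN.findall(sample)
ascii_runs = ASCII_RUN.findall(sample)

print(non_ascii_runs)
print(ascii_runs)
```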


Applies to: the Multi-Rename tool in Total commander<ref>To replace non-English text while leaving the . character alone: <nowiki>[^\u0000-\u0080|.]+ </nowiki></ref><ref>[http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters javascript - Regular expression to match non-english characters? - Stack Overflow]</ref>
Applies to: the [https://zh-tw.libreoffice.org/ LibreOffice] [https://help.libreoffice.org/6.2/en-US/text/scalc/01/func_regex.html REGEX] function<ref>[https://help.libreoffice.org/6.2/en-US/text/shared/01/02100001.html?&DbPAR=WRITER&System=MAC List of Regular Expressions]</ref> and the Multi-Rename tool in Total commander<ref>To replace non-English text while leaving the . character alone: <nowiki>[^\u0000-\u0080|.]+ </nowiki></ref><ref>[http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters javascript - Regular expression to match non-english characters? - Stack Overflow]</ref>
<pre>
<pre>
[^\u0000-\u0080]+
[^\u0000-\u0080]+
Line 366: Line 377:
</pre>
</pre>


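Find field values that contain Chinese characters. Chinese characters here cover both Traditional and Simplified Chinese and exclude special symbols such as emoji: {{kbd | key = ⭐}}.
Find field values that contain Chinese characters. Chinese characters here cover both Traditional and Simplified Chinese and exclude punctuation (for example {{kbd | key = <nowiki>,</nowiki>}}), full-width punctuation (for example {{kbd | key = <nowiki>,</nowiki>}}), and special symbols such as emoji: {{kbd | key = ⭐}}.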
PHP:
PHP: exact match
<pre>
<pre>
// approach 1
// approach 1
Line 384: Line 395:
</pre>
</pre>


Troubleshooting:
partial match ([http://sandbox.onlinephpfunctions.com/code/d780845d20877c0fd2e693b28ed02a10d250d39e online demo] hosted by [http://sandbox.onlinephpfunctions.com/ PHP Sandbox])
* Error message:<pre>preg_match(): Compilation failed: character value in \x{} or \o{} is too large at offset 8</pre>
<pre>
// Approach 1: match runs of Han characters with the Unicode script property \p{Han} (needs the u modifier)
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\p{Han}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);

var_dump($matches);

// Approach 2: match an explicit code point range inside the CJK Unified Ideographs block
$string = '繁體中文-简体中文-English-12345-。,!-.,!-⭐';
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);

var_dump($matches);
</pre>
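
For comparison, approach 2 can be sketched in Python (an assumption; the article's own examples are PHP). The standard <code>re</code> module does not support <code>\p{Han}</code>, so only the explicit code point range is shown. The variable names are hypothetical. Python reports offsets in characters, whereas PHP's <code>PREG_OFFSET_CAPTURE</code> reports byte offsets, so the numbers are not expected to agree with the PHP output.

```python
import re

# Same code point range as approach 2: U+4E00 to U+9FA5.
HAN_RUN = re.compile(r"[\u4e00-\u9fa5]+")

text = "繁體中文-简体中文-English-12345-。,!-.,!-⭐"

# Partial match: collect each run of Han characters with its character offset.
matches = [(m.group(), m.start()) for m in HAN_RUN.finditer(text)]

# Exact match: the whole string must consist of Han characters only.
is_all_han = HAN_RUN.fullmatch(text) is not None

print(matches)
print(is_all_han)
```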
 
Troubleshooting: error message
<pre>preg_match(): Compilation failed: character value in \x{} or \o{} is too large at offset 8</pre>


Solution: the pattern passed to [http://php.net/manual/en/function.preg-match.php preg_match()] needs the {{kbd | key = u }} modifier<ref>[https://stackoverflow.com/questions/32375531/preg-match-compilation-failed-character-value-in-x-or-o-is-too-large-a php - preg_match(): Compilation failed: character value in \x{} or \o{} is too large at offset 27 on line number 25 - Stack Overflow]</ref>.
Line 495: Line 523:
</pre>
</pre>


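Explanation: \S matches any non-whitespace character, and \r\n are the line-break characters. [^\S\r\n] therefore matches a character that is neither non-whitespace nor a line break. In other words, it finds whitespace other than line breaks.
Explanation: \S matches any non-whitespace character, and \r\n are the [[Return symbol | line-break characters]]. [^\S\r\n] therefore matches a character that is neither non-whitespace nor a line break. In other words, it finds whitespace other than line breaks.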


Using Sublime Text (references<ref>[http://www.techrepublic.com/blog/microsoft-office/quickly-replace-multiple-space-characters-with-a-tab-character/ Quickly replace multiple space characters with a tab character - TechRepublic]</ref> <ref>[http://stackoverflow.com/questions/3469080/match-whitespace-but-not-newlines-perl regex - Match whitespace but not newlines (Perl) - Stack Overflow]</ref>)
# Menu: Search -> Replace
# click "Use Regular Expression"
## Find: {{kbd | key = <nowiki>([^\S\n]+)</nowiki>}} or {{kbd | key = <nowiki>([^\S\r\n]+)</nowiki>}} or {{kbd | key = <nowiki>_{1,}</nowiki>}} (replace _ with a half-width space yourself)
## Find: {{kbd | key = <nowiki>([^\S\n]+)</nowiki>}} or {{kbd | key = <nowiki>([^\S\r\n]+)</nowiki>}} or {{kbd | key = <nowiki>\s\s+</nowiki>}} or {{kbd | key = <nowiki>_{1,}</nowiki>}} (replace _ with a half-width space yourself) {{exclaim}} Because {{kbd | key = <nowiki>\s</nowiki>}} matches both spaces and line-break characters, {{kbd | key = <nowiki>\s+</nowiki>}} cannot be used directly as the search pattern
## Replace with: {{kbd | key = <nowiki>\t</nowiki>}}
# click "Replace all"
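
The same find-and-replace can be sketched as a script. This is a minimal Python example (an assumption; the steps above are for Sublime Text) that replaces each run of whitespace other than line breaks with a single tab, and contrasts it with <code>\s+</code>, which also swallows the line breaks. The sample text and variable names are hypothetical.

```python
import re

# Whitespace that is neither \r nor \n: "not non-whitespace, and not a line break".
HORIZONTAL_WS = re.compile(r"[^\S\r\n]+")

# Hypothetical sample with mixed spaces, a tab, and both \r\n and \n line endings.
text = "name   age\tcity\r\nAnn    30  Taipei\n"

# Each run of spaces or tabs becomes one tab; line breaks are left untouched.
result = HORIZONTAL_WS.sub("\t", text)

# For contrast: \s+ matches line breaks too, so the two lines are merged.
merged = re.sub(r"\s+", "\t", text)

print(repr(result))
print(repr(merged))
```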
Line 538: Line 566:
=== Find URLs in article text ===
[[Regular extract url from text]]
=== Find numbers ===
See [[Data cleaning#Numeric]]


=== Find long numbers in article text ===
