Extract all hashtags from text: Difference between revisions

Revision as of 19:35, 9 March 2021

Using a regular expression to extract hashtags from text

Data preprocessing

Links may contain the # character, so link text must be removed before hashtags are extracted. See "Extracting URLs from article text".
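The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the method of the linked page: `strip_urls()` is a hypothetical helper, its URL pattern is deliberately simplified, and the `\B#\w+` pattern only stands in for the fuller hashtag pattern.

```php
<?php
// Hypothetical helper (not from the original page): replaces http(s) URLs
// with a space so that fragments such as "/#intro" are not read as hashtags.
// The URL pattern is a simplification and will miss some link forms.
function strip_urls(string $text): string
{
    return preg_replace('~https?://\S+~i', ' ', $text) ?? $text;
}

$text = 'Docs: https://example.com/docs/#intro and #php rocks';

// Without preprocessing, the URL fragment is picked up as a hashtag.
preg_match_all('/\B#\w+/', $text, $raw);

// With preprocessing, only the real hashtag remains.
$clean = strip_urls($text);
preg_match_all('/\B#\w+/', $clean, $matches);

print_r($raw[0]);     // #intro and #php
print_r($matches[0]); // #php only
```

Replacing each URL with a space rather than an empty string keeps the words around it from being glued together.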

Extracting hashtags from text

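The pattern must account for "punctuation marks and special characters (for example, $ and %)"^[1], which cannot be part of a hashtag. A PHP code example that handles only some of these symbols^[2]: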

preg_match_all('/\B(#[^\s|\-|－|,|，|\[|\]|『|』|／|~|～|\'|"|`|\(|\)|（|）|【|】|《|》|「|」|\<|\>|\=|\{|\}|！|\!|、|？|\?|：|…|\:|;|#|\.|．|。|\$|%|&|\*|\+|,|@|\^|\||\/]+)/ui', $string, $matches);

Explanation

\B matches at a position that is not a word boundary^[3]^[4], which prevents #bad_hashtag from being extracted from text such as another#bad_hashtag.
\s matches a whitespace character.
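The effect of \B can be seen in a small comparison. This is only a sketch: it uses a simplified `#\w+` pattern in place of the long character class above, and the sample text is made up for illustration.

```php
<?php
$text = 'another#bad_hashtag but #good_hashtag';

// Without \B, a "#" glued to the end of a word is still matched.
preg_match_all('/#\w+/', $text, $without);

// With \B, the position before "#" must not be a word boundary.
// Between "r" and "#" there is a boundary, so "#bad_hashtag" is skipped.
// Between a space and "#" there is none, so "#good_hashtag" is kept.
preg_match_all('/\B#\w+/', $text, $with);

print_r($without[0]); // #bad_hashtag and #good_hashtag
print_r($with[0]);    // #good_hashtag only
```

A hashtag at the very start of the string is also matched with \B, because the start of the string followed by "#" is not a word boundary.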

The following cases still need to be handled:

illegal Hashtag
#12345
#___
#_
#__123

legal Hashtag
#1_abc
#_abc
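One possible way to handle these cases is to filter candidates after extraction. The sketch below assumes a rule inferred from the examples above, namely that a legal hashtag contains at least one letter in addition to any digits and underscores. The page itself does not state this rule, and `is_legal_hashtag()` is a hypothetical helper.

```php
<?php
// Assumed rule (inferred from the examples, not stated in the source):
// a hashtag is legal only if it contains at least one letter.
// \p{L} matches any Unicode letter; "_" and digits do not count.
function is_legal_hashtag(string $tag): bool
{
    return preg_match('/^#\w*\p{L}\w*$/u', $tag) === 1;
}

$candidates = ['#12345', '#___', '#_', '#__123', '#1_abc', '#_abc'];

// array_values() reindexes the filtered list from 0.
$legal = array_values(array_filter($candidates, 'is_legal_hashtag'));

print_r($legal); // #1_abc and #_abc
```

Filtering in a second pass keeps the extraction pattern unchanged. The same condition could likely be folded into the pattern with a lookahead, at the cost of readability.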

@@ Line 1: / Line 1: @@
-Using a [[Regular expression]]  to extract hashtags ( [https://zh.wikipedia.org/wiki/%E4%B8%BB%E9%A1%8C%E6%A8%99%E7%B1%A4 主題標籤]) from text
+Using a [[Regular expression]]  to extract hashtags ([https://zh.wikipedia.org/wiki/%E4%B8%BB%E9%A1%8C%E6%A8%99%E7%B1%A4 主題標籤]) from text
-* Links contain the # character, so they must be handled beforehand.
-* PHP code<ref>[http://stackoverflow.com/questions/3060601/retrieve-all-hashtags-from-a-tweet-in-a-php-function regex - Retrieve all hashtags from a tweet in a PHP function - Stack Overflow]</ref>:
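+== Data preprocessing ==
+* Links may contain the # character, so link text must be removed before hashtags are extracted. See [[Extract url from text | Extracting URLs from article text]].
+== Extracting hashtags from text ==
+* The pattern must account for "[https://zh.wikipedia.org/zh-tw/%E6%A0%87%E7%82%B9%E7%AC%A6%E5%8F%B7 punctuation marks] and special characters (for example, $ and %)"<ref>[https://www.facebook.com/help/587836257914341 如何使用主題標籤（Hashtag）？ | Facebook 使用說明]</ref>, which cannot be part of a hashtag. A PHP code example that handles only some of these symbols<ref>[http://stackoverflow.com/questions/3060601/retrieve-all-hashtags-from-a-tweet-in-a-php-function regex - Retrieve all hashtags from a tweet in a PHP function - Stack Overflow]</ref>: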
 <pre>
-preg_match_all("/(#[^\s|\-|,|\"|，|\[|\]|『|』|／|~|(|)|（|）|【|】|《|》|－|「|」|！|!|、|？|：|…|:|#|\.|．|。|?]+)/ui", $string, $matches);
+preg_match_all('/\B(#[^\s|\-|－|,|，|\[|\]|『|』|／|~|～|\'|"|`|\(|\)|（|）|【|】|《|》|「|」|\<|\>|\=|\{|\}|！|\!|、|？|\?|：|…|\:|;|#|\.|．|。|\$|%|&|\*|\+|,|@|\^|\||\/]+)/ui', $string, $matches);
 </pre>
-* The following cases still need to be handled
+Explanation
+* {{kbd| key = <nowiki>\B</nowiki>}} matches at a position that is not a word boundary<ref>[https://atedev.wordpress.com/2007/11/23/%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%A4%BA%E5%BC%8F-regular-expression/ 正規表示式 Regular Expression | 就是愛程式]</ref><ref>[http://stackoverflow.com/questions/6664151/difference-between-b-and-b-in-regex Difference between \b and \B in regex - Stack Overflow]</ref>, which prevents #bad_hashtag from being extracted from text such as another#bad_hashtag.
+* {{kbd| key = <nowiki>\s</nowiki>}} matches a whitespace character ([https://en.wikipedia.org/wiki/Whitespace_character Whitespace character])
+The following cases still need to be handled
 <pre>
-illegal
+illegal Hashtag
 #12345
 #___
@@ Line 13: / Line 22: @@
 #__123
-legal
+legal Hashtag
 #1_abc
 #_abc
@@ Line 21: / Line 30: @@
 <references />
-[[Category:RegExp]] [[Category:Data Science]] [[Category:Search]] [[Category:Text file processing]]
+Further reading
+# [http://input.foruto.com/source/source_01.htm 中文輸入法常用標點符號簡表]
+# [http://www.wfublog.com/2015/06/unicode-emoji-special-character-table.html Unicode 表情圖案(emoji ) + 特殊符號字元一覽表＠WFU BLOG]
+# [http://www.regular-expressions.info/unicode.html Regex Tutorial - Unicode Characters and Properties]
+# [http://unicode-table.com/cn/ Unicode®字符百科]
+[[Category:Regular expression]] [[Category:String manipulation]] [[Category:Data Science]] [[Category:Search]]
