Extract url from text: Difference between revisions
Jump to navigation
Jump to search
| (40 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
從文章內容中擷取網址 (又稱 [https://zh.wikipedia.org/zh-tw/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6 統一資源定位符], [https://en.wikipedia.org/wiki/Uniform_Resource_Locator Uniform Resource Locator]) 或[https://zh.wikipedia.org/zh-tw/%E5%9F%9F%E5%90%8D 網域] (domain name)。 | |||
== 擷取完整網址 == | == 從文章內容,擷取完整網址 == | ||
使用 Google | === 使用 Google sheet 擷取完整網址 === | ||
* (optional) Step1: [https://workspace.google.com/marketplace/app/extract_urls/143780651832 Extract URLs - Google Workspace Marketplace] "The application extracts links and converts them to the HYPERLINK formula" {{Gd}} | |||
* (optional) Step2: Using the [https://support.microsoft.com/zh-tw/office/formulatext-%E5%87%BD%E6%95%B8-0a786771-54fd-4ae2-96ee-09cda35439c8 FORMULATEXT 函數 - Microsoft 支援服務] | |||
* Step3: 使用 Google 試算表正規表示法 ([[Regular expression]]) 的 [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] 函數,從文章內容擷取第一個網址。 | |||
<pre> | <pre> | ||
=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)") | =REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)") | ||
</pre> | </pre> | ||
詳細操作說明:[https://errerrors.blogspot.com/2023/10/how-to-quickly-extract-links-from-google-sheets.html 如何從 Google 試算表,快速取出連結] | |||
=== 使用 Google sheet 刪除文章內網址 === | |||
Using [https://support.google.com/docs/answer/3098245?hl=zh-Hant REGEXREPLACE] function | |||
<pre> | <pre> | ||
=REGEXREPLACE(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)", "") | |||
</pre> | </pre> | ||
=== 使用 Sublime Text 擷取完整網址 === | |||
< | 使用 Sublime Text 等支援 regular expression 的文字編輯器 | ||
* 選單 Find --> Replace | |||
</ | * 啟用 Regular expression | ||
* Find What: {{kbd | key= <nowiki>.*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+).*</nowiki>}} {{exclaim}} 此處輸入資料是一行只有一個網址。如果有多個網址,將會截取靠近行尾的網址。 | |||
* Replace with: {{kbd | key= <nowiki>\1</nowiki>}} | |||
=== 使用 Microsoft Excel 擷取完整網址 === | |||
使用 Excel [https://support.office.com/en-us/article/find-findb-functions-c7912941-af2a-4bdf-a553-d0d89b0a0628?ui=en-US&rs=en-US&ad=US FIND], [https://support.office.com/en-us/article/len-lenb-functions-29236f94-cedc-429d-affd-b5e33d2c67cb?ui=en-US&rs=en-US&ad=US LEN] 與 [https://support.office.com/en-us/article/mid-midb-functions-d5f9e25c-d7d6-472e-b568-4ecb12433028?ui=en-US&rs=en-US&ad=US MID] 等函數。資料限制:網址前後需要間隔空白或換行符號。以下公式從 B2 儲存格擷取完整網址:(公式修改自 guitarthrower 提供的公式<ref>[https://stackoverflow.com/questions/25429211/extract-urls-from-a-cell-of-text-in-excel vba - Extract URL's from a Cell of Text in Excel - Stack Overflow]</ref>) | |||
<pre> | <pre> | ||
= | =IF(ISERROR(MID(SUBSTITUTE(B2, " | ||
", " "),FIND("http",SUBSTITUTE(B2, " | |||
", " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, " | |||
", " "),FIND("http",SUBSTITUTE(B2, " | |||
", " ")))-1,LEN(SUBSTITUTE(B2, " | |||
", " ")))-FIND("http",SUBSTITUTE(B2, " | |||
", " "))+1)), "", MID(SUBSTITUTE(B2, " | |||
", " "),FIND("http",SUBSTITUTE(B2, " | |||
", " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, " | |||
", " "),FIND("http",SUBSTITUTE(B2, " | |||
", " ")))-1,LEN(SUBSTITUTE(B2, " | |||
", " ")))-FIND("http",SUBSTITUTE(B2, " | |||
", " "))+1)) | |||
</pre> | </pre> | ||
=== 測試資料 === | |||
輸入資料: 不包含 HTML 語法的 [http://www.w3schools.com/tags/att_a_href.asp a href] 屬性標籤 | |||
<pre> | <pre> | ||
Yahoo! 新聞 https://tw.news.yahoo.com/abc | Yahoo! 新聞 https://tw.news.yahoo.com/abc | ||
</pre> | </pre> | ||
輸出資料: | |||
<pre> | <pre> | ||
https://tw.news.yahoo.com/ | https://tw.news.yahoo.com/abc | ||
</pre> | </pre> | ||
說明: | 說明: | ||
# | # 網址可能是 <nowiki>http://</nowiki> 或 <nowiki>https://</nowiki> 開頭,所以條件是 {{kbd | key = <nowiki>http[s]?://</nowiki>}} | ||
# 根據 [http://tools.ietf.org/html/rfc3986/ RFC 3986] 的 [http://tools.ietf.org/html/rfc3986#section-2 Section 2: Characters] 網址允許的文字有 {{kbd | key = <nowiki>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=</nowiki>}},其他文字則需要加上比例符號 % 編碼。 <ref>[http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid validation - Which characters make a URL invalid? - Stack Overflow]</ref> | |||
== 從 HTML 文字,擷取完整網址 == | |||
=== 使用 Google sheet 擷取完整網址 === | |||
# Using [https://extract-urls.contributor.pw/ EXTRACT URLs] to extracts links and converts them to the HYPERLINK formula. | |||
# Using [https://support.google.com/docs/answer/9365792?hl=en FORMULATEXT function - Google Docs Editors Help] | |||
# Using [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function to extract the Url from above cell | |||
<pre> | <pre> | ||
= | =REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)") | ||
</pre> | </pre> | ||
參考資料: | |||
* [https://support.google.com/docs/thread/34116680/extract-url-from-pasted-external-text-with-link-embedded?hl=en Extract URL from pasted external text with link embedded - Google Docs Editors Community] | |||
== 從文章內容,擷取網址中的網域部分 == | |||
[[Extract domain from text in Mandarin | 從文章擷取網址中的網域部分]] | |||
== 從文章內容,擷取特定檔案類型的網址 == | |||
=== 使用 Sublime Text 擷取特定檔案類型的網址 === | |||
以下語法適用於 [https://www.sublimetext.com/ Sublime Tex] | |||
步驟1: 擷取該網頁的全部網址 | |||
* {{Chrome}} 瀏覽器安裝 [https://chrome.google.com/webstore/detail/video-downloader-getthema/nbkekaeindpfpcoldfckljplboolgkfm Video Downloader GetThemAll] 擴充套件 | |||
* 安裝後,點選工具列上的 Video Downloader GetThemAll 按鈕 | |||
* 點選「save link in txt」 | |||
* 儲存網址清單為純文字檔 | |||
步驟2: 刪除不包含不包含檔案類型的行,下例是檔案類型 <span style="background-color: yellow;">.ttf</span> | |||
* 用 Sublime Tex 開啟網址清單,範例檔案如下: | |||
<pre> | <pre> | ||
Frequently Asked Questions http://www.clearchinese.com/faq.htm | |||
Contact Us http://www.clearchinese.com/contact.php | |||
HDZB_5 http://www.clearchinese.com/images/fonts/HDZB_5.TTF | |||
HDZB_6 http://www.clearchinese.com/images/fonts/HDZB_6.TTF | |||
</pre> | </pre> | ||
* 選單 Find --> Replace | |||
* 啟用 Regular expression | |||
* Find What: {{kbd | key= ^((?!\<span style="background-color: yellow;">.ttf</span>).)*$}} {{exclaim}} 此處語法是尋找不包含 .ttf 的行,可再修正為結尾不是 .ttf 的行。 | |||
* Replace with: (不需要輸入任何文字) | |||
步驟3: [[Regular replace blank lines | 刪除空白行]] | |||
* 選單 Find --> Replace | |||
* 啟用 Regular expression | |||
</ | * Find What: {{kbd | key= <nowiki>^[\s\t]*$\n</nowiki>}} | ||
* Replace with: (不需要輸入任何文字) | |||
步驟4: 只留下網址部分,刪除該行最前面的文字 | |||
* 選單 Find --> Replace | |||
* 啟用 Regular expression | |||
* Find What: {{kbd | key= .*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)(\<span style="background-color: yellow;">.ttf</span>$)}} | |||
* Replace with: {{kbd | key= <nowiki>\1\2</nowiki>}} | |||
* 儲存網址清單,範例檔案如下: | |||
<pre> | <pre> | ||
http://www.clearchinese.com/images/fonts/HDZB_5.TTF | |||
http://www.clearchinese.com/images/fonts/HDZB_6.TTF | |||
</pre> | </pre> | ||
步驟5: 下載檔案 | |||
* 安裝與執行 [http://www.orbitdownloader.com/ Orbit Downloader] | |||
* 選單: 檔案 --> 匯入下載清單 --> 選擇網址清單 | |||
* 啟動下載任務 | |||
== | == 資料驗證用:文章內容是否包含網址 == | ||
使用 Google 試算表 [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH] 函數,符合正規表示法的規則的話,回傳 TRUE。若不符合,則回傳 FALSE。 | |||
<pre> | <pre> | ||
=REGEXMATCH(A1, | =REGEXMATCH(A1, "http") | ||
</pre> | </pre> | ||
輸入1: | 輸入1: | ||
<pre> | <pre> | ||
Yahoo! 新聞 https://tw.news.yahoo.com/abc | |||
</pre> | </pre> | ||
| Line 98: | Line 152: | ||
<references /> | <references /> | ||
[[Category:Regular expression]] [[Category:Data Science]] [[Category: | [[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]] | ||
Latest revision as of 15:25, 28 March 2025
從文章內容中擷取網址 (又稱 統一資源定位符, Uniform Resource Locator) 或網域 (domain name)。
從文章內容,擷取完整網址[edit]
使用 Google sheet 擷取完整網址[edit]
- (optional) Step1: Extract URLs - Google Workspace Marketplace "The application extracts links and converts them to the HYPERLINK formula"

- (optional) Step2: Using the FORMULATEXT 函數 - Microsoft 支援服務
- Step3: 使用 Google 試算表正規表示法 (Regular expression) 的 REGEXEXTRACT 函數,從文章內容擷取第一個網址。
=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
詳細操作說明:如何從 Google 試算表,快速取出連結
使用 Google sheet 刪除文章內網址[edit]
Using REGEXREPLACE function
=REGEXREPLACE(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)", "")
使用 Sublime Text 擷取完整網址[edit]
使用 Sublime Text 等支援 regular expression 的文字編輯器
- 選單 Find --> Replace
- 啟用 Regular expression
- Find What: .*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+).*
此處輸入資料是一行只有一個網址。如果有多個網址,將會截取靠近行尾的網址。 - Replace with: \1
使用 Microsoft Excel 擷取完整網址[edit]
使用 Excel FIND, LEN 與 MID 等函數。資料限制:網址前後需要間隔空白或換行符號。以下公式從 B2 儲存格擷取完整網址:(公式修改自 guitarthrower 提供的公式[1])
=IF(ISERROR(MID(SUBSTITUTE(B2, "
", " "),FIND("http",SUBSTITUTE(B2, "
", " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, "
", " "),FIND("http",SUBSTITUTE(B2, "
", " ")))-1,LEN(SUBSTITUTE(B2, "
", " ")))-FIND("http",SUBSTITUTE(B2, "
", " "))+1)), "", MID(SUBSTITUTE(B2, "
", " "),FIND("http",SUBSTITUTE(B2, "
", " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, "
", " "),FIND("http",SUBSTITUTE(B2, "
", " ")))-1,LEN(SUBSTITUTE(B2, "
", " ")))-FIND("http",SUBSTITUTE(B2, "
", " "))+1))
測試資料[edit]
輸入資料: 不包含 HTML 語法的 a href 屬性標籤
Yahoo! 新聞 https://tw.news.yahoo.com/abc
輸出資料:
https://tw.news.yahoo.com/abc
說明:
- 網址可能是 http:// 或 https:// 開頭,所以條件是 http[s]?://
- 根據 RFC 3986 的 Section 2: Characters 網址允許的文字有 ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=,其他文字則需要加上比例符號 % 編碼。 [2]
從 HTML 文字,擷取完整網址[edit]
使用 Google sheet 擷取完整網址[edit]
- Using EXTRACT URLs to extracts links and converts them to the HYPERLINK formula.
- Using FORMULATEXT function - Google Docs Editors Help
- Using REGEXEXTRACT function to extract the Url from above cell
=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
參考資料:
從文章內容,擷取網址中的網域部分[edit]
從文章內容,擷取特定檔案類型的網址[edit]
使用 Sublime Text 擷取特定檔案類型的網址[edit]
以下語法適用於 Sublime Tex
步驟1: 擷取該網頁的全部網址
- Chrome
瀏覽器安裝 Video Downloader GetThemAll 擴充套件 - 安裝後,點選工具列上的 Video Downloader GetThemAll 按鈕
- 點選「save link in txt」
- 儲存網址清單為純文字檔
步驟2: 刪除不包含不包含檔案類型的行,下例是檔案類型 .ttf
- 用 Sublime Tex 開啟網址清單,範例檔案如下:
Frequently Asked Questions http://www.clearchinese.com/faq.htm Contact Us http://www.clearchinese.com/contact.php HDZB_5 http://www.clearchinese.com/images/fonts/HDZB_5.TTF HDZB_6 http://www.clearchinese.com/images/fonts/HDZB_6.TTF
- 選單 Find --> Replace
- 啟用 Regular expression
- Find What: ^((?!\.ttf).)*$
此處語法是尋找不包含 .ttf 的行,可再修正為結尾不是 .ttf 的行。 - Replace with: (不需要輸入任何文字)
步驟3: 刪除空白行
- 選單 Find --> Replace
- 啟用 Regular expression
- Find What: ^[\s\t]*$\n
- Replace with: (不需要輸入任何文字)
步驟4: 只留下網址部分,刪除該行最前面的文字
- 選單 Find --> Replace
- 啟用 Regular expression
- Find What: .*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)(\.ttf$)
- Replace with: \1\2
- 儲存網址清單,範例檔案如下:
http://www.clearchinese.com/images/fonts/HDZB_5.TTF http://www.clearchinese.com/images/fonts/HDZB_6.TTF
步驟5: 下載檔案
- 安裝與執行 Orbit Downloader
- 選單: 檔案 --> 匯入下載清單 --> 選擇網址清單
- 啟動下載任務
資料驗證用:文章內容是否包含網址[edit]
使用 Google 試算表 REGEXMATCH 函數,符合正規表示法的規則的話,回傳 TRUE。若不符合,則回傳 FALSE。
=REGEXMATCH(A1, "http")
輸入1:
Yahoo! 新聞 https://tw.news.yahoo.com/abc
輸出1:
TRUE
輸入2:
Yahoo! 新聞
輸出2:
FALSE