Extract url from text: Difference between revisions

Revision as of 15:25, 28 March 2025

143 bytes added , 28 March 2025

m

→ Extracting the domain part of URLs from article content

Planetoid

@@ Line 1: / Line 1: @@
-Extract URLs (also known as [https://zh.wikipedia.org/zh-tw/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6 統一資源定位符], [https://en.wikipedia.org/wiki/Uniform_Resource_Locator Uniform Resource Locator]) from article content.
+Extract URLs (also known as [https://zh.wikipedia.org/zh-tw/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6 統一資源定位符], [https://en.wikipedia.org/wiki/Uniform_Resource_Locator Uniform Resource Locator]) or [https://zh.wikipedia.org/zh-tw/%E5%9F%9F%E5%90%8D domain names] from article content.
 == Extracting full URLs from article content ==
 === Extracting full URLs with Google Sheets ===
-Use the [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function, which supports regular expressions ([[Regular expression]]) in Google Sheets, to extract the first URL from the article content.
+* (optional) Step 1: [https://workspace.google.com/marketplace/app/extract_urls/143780651832 Extract URLs - Google Workspace Marketplace] "The application extracts links and converts them to the HYPERLINK formula" {{Gd}}
+* (optional) Step 2: Use the [https://support.microsoft.com/zh-tw/office/formulatext-%E5%87%BD%E6%95%B8-0a786771-54fd-4ae2-96ee-09cda35439c8 FORMULATEXT function - Microsoft Support]
+* Step 3: Use the [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function, which supports regular expressions ([[Regular expression]]) in Google Sheets, to extract the first URL from the article content.
 <pre>
 =REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
+</pre>
+Detailed instructions: [https://errerrors.blogspot.com/2023/10/how-to-quickly-extract-links-from-google-sheets.html 如何從 Google 試算表，快速取出連結] (How to quickly extract links from Google Sheets)
+=== Removing URLs from article content with Google Sheets ===
+Use the [https://support.google.com/docs/answer/3098245?hl=zh-Hant REGEXREPLACE] function:
+<pre>
+=REGEXREPLACE(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)", "")
 </pre>
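For readers who want to try the same two operations outside a spreadsheet, here is a minimal Python sketch of the REGEXEXTRACT and REGEXREPLACE formulas above. The names <code>URL_PATTERN</code>, <code>extract_first_url</code> and <code>remove_urls</code> are hypothetical, and the character class is a simplified rendering of the one in the formulas (it leaves out the literal backslash), so treat it as an approximation rather than an exact port.

```python
import re

# Assumption: a simplified version of the character class used in the
# spreadsheet formulas (unreserved and reserved URL characters plus "%").
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def extract_first_url(text):
    """Return the first URL found in text, or None (like REGEXEXTRACT)."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


def remove_urls(text):
    """Return text with every URL deleted (like REGEXREPLACE with "")."""
    return URL_PATTERN.sub("", text)
```

Note that removing a URL leaves the surrounding whitespace untouched, so two spaces may remain where the URL used to be.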
@@ Line 49: / Line 61: @@
 # According to [http://tools.ietf.org/html/rfc3986#section-2 Section 2: Characters] of [http://tools.ietf.org/html/rfc3986/ RFC 3986], the characters allowed in a URL are {{kbd | key = <nowiki>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=</nowiki>}}; any other character must be percent-encoded using the percent sign %. <ref>[http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid validation - Which characters make a URL invalid? - Stack Overflow]</ref>
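The percent-encoding rule above can be illustrated with a short Python sketch using the standard library's <code>urllib.parse.quote</code>. The helper name <code>percent_encode</code> and the constant <code>RESERVED</code> are hypothetical; the sketch assumes UTF-8 encoding and is only meant to show which characters are left alone.

```python
from urllib.parse import quote

# Reserved characters from the allowed set listed above. Letters, digits
# and "-._~" are never encoded by quote(), so they need not be listed.
RESERVED = ":/?#[]@!$&'()*+,;="


def percent_encode(text):
    """Percent-encode every character outside the allowed set (UTF-8).

    "%" is not treated as safe here, so an already encoded string
    would be encoded a second time.
    """
    return quote(text, safe=RESERVED)
```

For example, <code>percent_encode("https://example.com/新聞")</code> keeps the ASCII part unchanged and turns the two Chinese characters into <code>%E6%96%B0%E8%81%9E</code>.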
-== Extracting the domain part of a URL ==
+== Extracting full URLs from HTML text ==
-=== Extracting the domain with Google Sheets ===
+=== Extracting full URLs with Google Sheets ===
-Use the Google Sheets [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function
+# Use [https://extract-urls.contributor.pw/ EXTRACT URLs] to extract links and convert them to HYPERLINK formulas.
-<pre>
+# Use the [https://support.google.com/docs/answer/9365792?hl=en FORMULATEXT function - Google Docs Editors Help]
-=REGEXEXTRACT(A1, "(http[s]?\://[^/]+)")
+# Use the [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function to extract the URL from the cell above
-</pre>
-Input:
 <pre>
-Yahoo! 新聞 https://tw.news.yahoo.com/abc
+=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
 </pre>
-Output:
+References:
-<pre>
+* [https://support.google.com/docs/thread/34116680/extract-url-from-pasted-external-text-with-link-embedded?hl=en Extract URL from pasted external text with link embedded - Google Docs Editors Community]
-https://tw.news.yahoo.com
-</pre>
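-Explanation:
+== Extracting the domain part of URLs from article content ==
-# The domain is taken to be the text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by one or more characters other than {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}.
+[[Extract domain from text in Mandarin | Extracting the domain part of URLs from article text]]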
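As a rough illustration of the domain rule (scheme, then everything up to the next slash), here is a minimal Python sketch. The names <code>ORIGIN_PATTERN</code> and <code>extract_origin</code> are hypothetical, and the sketch assumes the URL is followed by a path, as in the Yahoo! example.

```python
import re

# Scheme followed by one or more characters that are not "/".
ORIGIN_PATTERN = re.compile(r"https?://[^/]+")


def extract_origin(text):
    """Return the scheme and host of the first URL in text, or None."""
    match = ORIGIN_PATTERN.search(text)
    return match.group(0) if match else None
```

Because the pattern only stops at a slash, a URL with no path that is followed by more text would pull that text into the match; excluding whitespace as well (<code>[^/\s]+</code>) is one possible refinement.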
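-== Extracting URLs of specific file types ==
+== Extracting URLs of specific file types from article content ==
 === Extracting URLs of specific file types with Sublime Text ===
 The following syntax works in [https://www.sublimetext.com/ Sublime Text]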
@@ Line 114: / Line 122: @@
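 * Start the download task
-== For data validation  ==
+== Data validation: whether the article content contains a URL ==
-=== Whether the article content contains a URL ===
 Use the Google Sheets [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH] function: it returns TRUE if the text matches the regular expression, and FALSE otherwise.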
 <pre>
@@ Line 140: / Line 147: @@
 FALSE
 </pre>
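The TRUE/FALSE check described in this section can be sketched in Python as follows. The exact REGEXMATCH pattern is not visible in this diff, so the pattern below is an assumption borrowed from the REGEXEXTRACT example (again without the literal backslash), and <code>contains_url</code> is a hypothetical name.

```python
import re

# Assumption: same URL character class as the REGEXEXTRACT example;
# the actual REGEXMATCH formula is not shown in this diff.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def contains_url(text):
    """Return True if text contains at least one URL, else False."""
    return URL_PATTERN.search(text) is not None
```

Like REGEXMATCH, <code>search</code> looks for a match anywhere in the text rather than requiring the whole text to match.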
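-=== Whether the article content contains a domain ===
-The source data contains a domain that is not preceded by http, e.g. tw.news.yahoo.com or www.bbc.co.uk. Use the Google Sheets [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH] function: it returns TRUE if the text matches the regular expression, and FALSE otherwise. {{exclaim}} The following syntax does not handle domains in [https://zh.wikipedia.org/wiki/IPv4 IPv4] form. (If the domain is preceded by http, you can simply search for the keywords: regular expression extract host )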
-<pre>
-=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"))
-</pre>
-Input 1:
-<pre>
-www.bbc.co.uk
-</pre>
-Output 1:
-<pre>
-TRUE
-</pre>
-Input 2:
-<pre>
-.0.0.0
-</pre>
-Output 2:
-<pre>
-TRUE
-</pre>
-Input 3:
-<pre>
-Yahoo! 新聞
-</pre>
-Output 3:
-<pre>
-FALSE
-</pre>
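-Other approaches that are not recommended:
-* Checking whether the domain ends in .com, .tw, .net, or .org: there are too many endings to enumerate, so this approach is inefficient.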
 == References ==
@@ Line 184: / Line 152: @@
 <references />
-[[Category:Regular expression]] [[Category:Data Science]] [[Category:String manipulation]]
+[[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]]
