Extract url from text
Extract a URL ([https://zh.wikipedia.org/zh-tw/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6 統一資源定位符], [https://en.wikipedia.org/wiki/Uniform_Resource_Locator Uniform Resource Locator]) or a [https://zh.wikipedia.org/zh-tw/%E5%9F%9F%E5%90%8D domain name] from text.

== Extract the full URL from text ==

=== Extract the full URL with Google Sheets ===

* (optional) Step 1: [https://workspace.google.com/marketplace/app/extract_urls/143780651832 Extract URLs - Google Workspace Marketplace] "The application extracts links and converts them to the HYPERLINK formula" {{Gd}}
* (optional) Step 2: Use the [https://support.microsoft.com/zh-tw/office/formulatext-%E5%87%BD%E6%95%B8-0a786771-54fd-4ae2-96ee-09cda35439c8 FORMULATEXT function - Microsoft Support]
* Step 3: Use the Google Sheets regular expression ([[Regular expression]]) function [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] to extract the first URL from the text.

<pre>
=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
</pre>

Detailed instructions: [https://errerrors.blogspot.com/2023/10/how-to-quickly-extract-links-from-google-sheets.html How to quickly extract links from Google Sheets]

=== Remove URLs from text with Google Sheets ===

Use the [https://support.google.com/docs/answer/3098245?hl=zh-Hant REGEXREPLACE] function:

<pre>
=REGEXREPLACE(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)", "")
</pre>

=== Extract the full URL with Sublime Text ===

Use Sublime Text or another text editor that supports regular expressions.

* Menu: Find --> Replace
* Enable Regular expression
* Find What: {{kbd | key= <nowiki>.*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+).*</nowiki>}} {{exclaim}} This assumes each line contains only one URL. If a line contains several URLs, the one closest to the end of the line is extracted.
* Replace with: {{kbd | key= <nowiki>\1</nowiki>}}

=== Extract the full URL with Microsoft Excel ===

Use the Excel functions [https://support.office.com/en-us/article/find-findb-functions-c7912941-af2a-4bdf-a553-d0d89b0a0628?ui=en-US&rs=en-US&ad=US FIND], [https://support.office.com/en-us/article/len-lenb-functions-29236f94-cedc-429d-affd-b5e33d2c67cb?ui=en-US&rs=en-US&ad=US LEN] and [https://support.office.com/en-us/article/mid-midb-functions-d5f9e25c-d7d6-472e-b568-4ecb12433028?ui=en-US&rs=en-US&ad=US MID]. Data limitation: the URL must be separated from the surrounding text by spaces or line breaks. The following formula extracts the full URL from cell B2; it first replaces line breaks (CHAR(10)) with spaces. (The formula is adapted from one provided by guitarthrower<ref>[https://stackoverflow.com/questions/25429211/extract-urls-from-a-cell-of-text-in-excel vba - Extract URL's from a Cell of Text in Excel - Stack Overflow]</ref>)

<pre>
=IF(ISERROR(MID(SUBSTITUTE(B2, CHAR(10), " "),FIND("http",SUBSTITUTE(B2, CHAR(10), " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, CHAR(10), " "),FIND("http",SUBSTITUTE(B2, CHAR(10), " ")))-1,LEN(SUBSTITUTE(B2, CHAR(10), " ")))-FIND("http",SUBSTITUTE(B2, CHAR(10), " "))+1)), "", MID(SUBSTITUTE(B2, CHAR(10), " "),FIND("http",SUBSTITUTE(B2, CHAR(10), " ")),IFERROR(FIND(" ",SUBSTITUTE(B2, CHAR(10), " "),FIND("http",SUBSTITUTE(B2, CHAR(10), " ")))-1,LEN(SUBSTITUTE(B2, CHAR(10), " ")))-FIND("http",SUBSTITUTE(B2, CHAR(10), " "))+1))
</pre>

=== Test data ===

Input: text that does not contain the HTML [http://www.w3schools.com/tags/att_a_href.asp a href] attribute markup

<pre>
Yahoo! 新聞 https://tw.news.yahoo.com/abc
</pre>

Output:

<pre>
https://tw.news.yahoo.com/abc
</pre>

Explanation:
# A URL may start with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, so the pattern begins with {{kbd | key = <nowiki>http[s]?://</nowiki>}}
# According to [http://tools.ietf.org/html/rfc3986#section-2 Section 2: Characters] of [http://tools.ietf.org/html/rfc3986/ RFC 3986], the characters allowed in a URL are {{kbd | key = <nowiki>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=</nowiki>}}; any other character must be percent-encoded with the % sign. <ref>[http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid validation - Which characters make a URL invalid? - Stack Overflow]</ref>

== Extract the full URL from HTML text ==

=== Extract the full URL with Google Sheets ===

# Use [https://extract-urls.contributor.pw/ EXTRACT URLs] to extract links and convert them to HYPERLINK formulas.
# Use the [https://support.google.com/docs/answer/9365792?hl=en FORMULATEXT function - Google Docs Editors Help]
# Use the [https://support.google.com/docs/answer/3098244?hl=zh-Hant REGEXEXTRACT] function to extract the URL from the cell produced by the previous step

<pre>
=REGEXEXTRACT(A1, "(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)")
</pre>

Reference:
* [https://support.google.com/docs/thread/34116680/extract-url-from-pasted-external-text-with-link-embedded?hl=en Extract URL from pasted external text with link embedded - Google Docs Editors Community]

== Extract the domain part of a URL from text ==

[[Extract domain from text in Mandarin | Extract the domain part of a URL from text]]

== Extract URLs of a specific file type from text ==

=== Extract URLs of a specific file type with Sublime Text ===

The following syntax applies to [https://www.sublimetext.com/ Sublime Text].

Step 1: Collect all URLs on the web page
* Install the [https://chrome.google.com/webstore/detail/video-downloader-getthema/nbkekaeindpfpcoldfckljplboolgkfm Video Downloader GetThemAll] extension in the {{Chrome}} browser
* After installation, click the Video Downloader GetThemAll button on the toolbar
* Click "save link in txt"
* Save the URL list as a plain text file

Step 2: Delete the lines that do not contain the file type. The example below uses the file type <span style="background-color: yellow;">.ttf</span>
* Open the URL list with Sublime Text. Sample file:

<pre>
Frequently Asked Questions http://www.clearchinese.com/faq.htm
Contact Us http://www.clearchinese.com/contact.php
HDZB_5 http://www.clearchinese.com/images/fonts/HDZB_5.TTF
HDZB_6 http://www.clearchinese.com/images/fonts/HDZB_6.TTF
</pre>

* Menu: Find --> Replace
* Enable Regular expression
* Find What: {{kbd | key= ^((?!\<span style="background-color: yellow;">.ttf</span>).)*$}} {{exclaim}} This pattern finds lines that do not contain .ttf; it could be refined to find lines that do not end with .ttf.
* Replace with: (leave empty)

Step 3: [[Regular replace blank lines | Delete blank lines]]
* Menu: Find --> Replace
* Enable Regular expression
* Find What: {{kbd | key= <nowiki>^[\s\t]*$\n</nowiki>}}
* Replace with: (leave empty)

Step 4: Keep only the URL and delete the text at the start of each line
* Menu: Find --> Replace
* Enable Regular expression
* Find What: {{kbd | key= .*(http[s]?://[a-zA-Z0-9\-_\\._~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+)(\<span style="background-color: yellow;">.ttf</span>$)}}
* Replace with: {{kbd | key= <nowiki>\1\2</nowiki>}}
* Save the URL list. Sample file:

<pre>
http://www.clearchinese.com/images/fonts/HDZB_5.TTF
http://www.clearchinese.com/images/fonts/HDZB_6.TTF
</pre>

Step 5: Download the files
* Install and run [http://www.orbitdownloader.com/ Orbit Downloader]
* Menu: File --> Import download list --> select the URL list
* Start the download task

== Data validation: check whether text contains a URL ==

Use the Google Sheets [https://support.google.com/docs/answer/3098292?hl=zh-Hant REGEXMATCH] function. It returns TRUE if the text matches the regular expression and FALSE otherwise.

<pre>
=REGEXMATCH(A1, "http")
</pre>

Input 1:
<pre>
Yahoo! 新聞 https://tw.news.yahoo.com/abc
</pre>

Output 1:
<pre>
TRUE
</pre>

Input 2:
<pre>
Yahoo! 新聞
</pre>

Output 2:
<pre>
FALSE
</pre>

== References ==
<references />

[[Category: Regular expression]]
[[Category: Data Science]]
[[Category: String manipulation]]
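
== Example: the same regular expression in Python ==

The spreadsheet and editor recipes on this page all rely on one character class taken from RFC 3986. As a rough illustration only, the sketch below applies that idea with Python's standard <code>re</code> module. The function names (<code>extract_first_url</code>, <code>remove_urls</code>, <code>contains_url</code>, <code>urls_with_extension</code>) are hypothetical helpers invented for this example, and the pattern is a simplified form of the one used above (it does not allow a literal backslash).

```python
import re

# Assumption: a simplified version of the page's pattern. The character class
# lists the RFC 3986 characters quoted on this page, plus "%" for
# percent-encoded bytes.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def extract_first_url(text):
    """Return the first URL in text, or None (similar in spirit to REGEXEXTRACT)."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


def remove_urls(text):
    """Delete every URL from text (similar in spirit to REGEXREPLACE)."""
    return URL_PATTERN.sub("", text)


def contains_url(text):
    """Return True if text contains at least one URL."""
    return URL_PATTERN.search(text) is not None


def urls_with_extension(lines, extension):
    """Return the URLs in lines that end with extension, ignoring case."""
    wanted = extension.lower()
    found = []
    for line in lines:
        for url in URL_PATTERN.findall(line):
            if url.lower().endswith(wanted):
                found.append(url)
    return found


if __name__ == "__main__":
    sample = "Yahoo! 新聞 https://tw.news.yahoo.com/abc"
    print(extract_first_url(sample))
    print(contains_url("Yahoo! 新聞"))
    print(urls_with_extension(
        ["HDZB_5 http://www.clearchinese.com/images/fonts/HDZB_5.TTF",
         "Contact Us http://www.clearchinese.com/contact.php"],
        ".ttf",
    ))
```

Because punctuation such as <code>.</code>, <code>,</code> and <code>)</code> is valid inside a URL, a match can include punctuation that directly follows the URL in a sentence; the same limitation applies to the spreadsheet formulas.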