Extract domain from text: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
No edit summary
 
(One intermediate revision by the same user not shown)
Line 6: Line 6:
Use Google Spreadsheet [https://support.google.com/docs/answer/3098244?hl=en REGEXEXTRACT] function
Use Google Spreadsheet [https://support.google.com/docs/answer/3098244?hl=en REGEXEXTRACT] function
<pre>
<pre>
=REGEXEXTRACT(A1, "(http[s]?\://[^/]+)")
=REGEXEXTRACT(A2, "(http[s]?\://[^/]+)")
</pre>
</pre>


Input:
Input text:
<pre>
<pre>
Yahoo! News https://tw.news.yahoo.com/abc
Yahoo! News https://tw.news.yahoo.com/abc

Latest revision as of 14:50, 24 December 2025

Extract the domain part from URLs in article content

🌐 Switch language: English, 漢字


Using Google Sheets to extract domains[edit]

Use Google Spreadsheet REGEXEXTRACT function

=REGEXEXTRACT(A2, "(http[s]?\://[^/]+)")

Input text:

Yahoo! News https://tw.news.yahoo.com/abc

Output:

https://tw.news.yahoo.com

Explanation:

  • Domain refers to text that starts with http:// or https://, followed by multiple characters that are not the symbol /: [^/]+. =

Data Validation: Does the article content contain a domain[edit]

The original data includes domains, but the domains don't include http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. Using Google Spreadsheet REGEXMATCH function, if it matches the regular expression rules, it returns TRUE. If not, it returns FALSE. Icon_exclaim.gif The following syntax doesn't handle IPv4 format domains. (If the domain includes http prefix, you can directly search for: regular expression extract host)

=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"))

Input 1:

www.bbc.co.uk

Output 1:

TRUE

Input 2:

# 0.0.0

Output 2:

TRUE

Input 3:

Yahoo! News

Output 3:

FALSE

Other methods not recommended:

  • Checking if the domain ends with .com, .tw, .net, .org is inefficient because there are too many to enumerate.