Extract domain from text: Difference between revisions
Jump to navigation
Jump to search
No edit summary |
No edit summary |
||
| Line 21: | Line 21: | ||
Explanation: | Explanation: | ||
* Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by multiple characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}. = | * Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by multiple characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}. = | ||
== Data Validation: Does the article content contain a domain == | |||
The original data includes domains, but the domains don't include http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. Using Google Spreadsheet [https://support.google.com/docs/answer/3098292?hl=en REGEXMATCH] function, if it matches the regular expression rules, it returns TRUE. If not, it returns FALSE. {{exclaim}} The following syntax doesn't handle [https://en.wikipedia.org/wiki/IPv4 IPv4] format domains. (If the domain includes http prefix, you can directly search for: regular expression extract host) | |||
<pre> | |||
=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")) | |||
</pre> | |||
Input 1: | |||
<pre> | |||
www.bbc.co.uk | |||
</pre> | |||
Output 1: | |||
<pre> | |||
TRUE | |||
</pre> | |||
Input 2: | |||
<pre> | |||
# 0.0.0 | |||
</pre> | |||
Output 2: | |||
<pre> | |||
TRUE | |||
</pre> | |||
Input 3: | |||
<pre> | |||
Yahoo! News | |||
</pre> | |||
Output 3: | |||
<pre> | |||
FALSE | |||
</pre> | |||
Other methods not recommended: | |||
* Checking if the domain ends with .com, .tw, .net, .org is inefficient because there are too many to enumerate. | |||
[[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]] | [[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]] | ||
Revision as of 15:21, 28 March 2025
Extract the domain part from URLs in article content
Using Google Sheets to extract domains
Use Google Spreadsheet REGEXEXTRACT function
=REGEXEXTRACT(A1, "(http[s]?\://[^/]+)")
Input:
Yahoo! News https://tw.news.yahoo.com/abc
Output:
https://tw.news.yahoo.com
Explanation:
- Domain refers to text that starts with http:// or https://, followed by multiple characters that are not the symbol /: [^/]+. =
Data Validation: Does the article content contain a domain
The original data includes domains, but the domains don't include http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. Using Google Spreadsheet REGEXMATCH function, if it matches the regular expression rules, it returns TRUE. If not, it returns FALSE.
The following syntax doesn't handle IPv4 format domains. (If the domain includes http prefix, you can directly search for: regular expression extract host)
=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"))
Input 1:
www.bbc.co.uk
Output 1:
TRUE
Input 2:
# 0.0.0
Output 2:
TRUE
Input 3:
Yahoo! News
Output 3:
FALSE
Other methods not recommended:
- Checking if the domain ends with .com, .tw, .net, .org is inefficient because there are too many to enumerate.