Extract domain from text: Difference between revisions

Jump to navigation Jump to search
no edit summary
No edit summary
No edit summary
Line 21: Line 21:
Explanation:
Explanation:
* Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by multiple characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}. =
* Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by multiple characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}. =
== Data Validation: Does the article content contain a domain ==
The original data includes domains, but the domains don't include http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. Using Google Spreadsheet [https://support.google.com/docs/answer/3098292?hl=en REGEXMATCH] function, if it matches the regular expression rules, it returns TRUE. If not, it returns FALSE. {{exclaim}} The following syntax doesn't handle [https://en.wikipedia.org/wiki/IPv4 IPv4] format domains. (If the domain includes http prefix, you can directly search for: regular expression extract host)
<pre>
=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"))
</pre>
Input 1:
<pre>
www.bbc.co.uk
</pre>
Output 1:
<pre>
TRUE
</pre>
Input 2:
<pre>
# 0.0.0
</pre>
Output 2:
<pre>
TRUE
</pre>
Input 3:
<pre>
Yahoo! News
</pre>
Output 3:
<pre>
FALSE
</pre>
Other methods not recommended:
* Checking if the domain ends with .com, .tw, .net, .org is inefficient because there are too many to enumerate.


[[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]]
[[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]]

Navigation menu