Extract domain from text: Difference between revisions

Extract domain from text (edit)

Revision as of 15:21, 28 March 2025

1,201 bytes added , 28 March 2025

no edit summary

Planetoid

Bureaucrats, Administrators

15,050

edits

@@ Line 21: / Line 21: @@
 Explanation:
 * Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by multiple characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}. =
+== Data Validation: Does the article content contain a domain ==
+The original data includes domains, but the domains don't include http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. Using Google Spreadsheet [https://support.google.com/docs/answer/3098292?hl=en REGEXMATCH] function, if it matches the regular expression rules, it returns TRUE. If not, it returns FALSE. {{exclaim}} The following syntax doesn't handle [https://en.wikipedia.org/wiki/IPv4 IPv4] format domains. (If the domain includes http prefix, you can directly search for: regular expression extract host)
+<pre>
+=IF(ISERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")), FALSE, REGEXMATCH(A1, "([a-zA-Z0-9\-''\\.''~\:\/\?#\[\]@\!\$&'\(\)\''\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"))
+</pre>
+Input 1:
+<pre>
+www.bbc.co.uk
+</pre>
+Output 1:
+<pre>
+TRUE
+</pre>
+Input 2:
+<pre>
+# 0.0.0
+</pre>
+Output 2:
+<pre>
+TRUE
+</pre>
+Input 3:
+<pre>
+Yahoo! News
+</pre>
+Output 3:
+<pre>
+FALSE
+</pre>
+Other methods not recommended:
+* Checking if the domain ends with .com, .tw, .net, .org is inefficient because there are too many to enumerate.
 [[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]]

Extract domain from text: Difference between revisions

Extract domain from text (edit)

Revision as of 15:21, 28 March 2025

Navigation menu

Search