Explanation:
* Domain refers to text that starts with <nowiki>http://</nowiki> or <nowiki>https://</nowiki>, followed by one or more characters that are not the symbol {{kbd | key = <nowiki>/</nowiki>}}: {{kbd | key = <nowiki>[^/]+</nowiki>}}.
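The rule described in the explanation can be sketched in Python as follows. This is only an illustration under assumptions: the function name <code>extract_domain</code> and the capture group around <nowiki>[^/]+</nowiki> are hypothetical choices, not part of the original formula.

```python
import re

# Scheme (http:// or https://) followed by one or more characters that are not "/".
# The parentheses capture only the host part; this grouping is an assumption.
DOMAIN_PATTERN = re.compile(r"https?://([^/]+)")

def extract_domain(url):
    """Return the host part of an http(s) URL, or None when the text has no such prefix."""
    match = DOMAIN_PATTERN.match(url)
    return match.group(1) if match else None

if __name__ == "__main__":
    print(extract_domain("https://www.bbc.co.uk/news"))
    print(extract_domain("tw.news.yahoo.com"))
```

Text without the http prefix yields no match here, which is why a separate check is needed for bare domains.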
== Data Validation: Does the article content contain a domain ==
The original data contains domains without the http prefix, e.g., tw.news.yahoo.com or www.bbc.co.uk. The Google Sheets [https://support.google.com/docs/answer/3098292?hl=en REGEXMATCH] function returns TRUE if the text matches the regular expression and FALSE otherwise. {{exclaim}} The following formula does not strictly validate [https://en.wikipedia.org/wiki/IPv4 IPv4]-style addresses: any four dot-separated groups of one to three digits match, including out-of-range values such as 999.999.999.999. (If the data includes the http prefix, search instead for: regular expression extract host)
<pre>
=IFERROR(REGEXMATCH(A1, "([a-zA-Z0-9\-_\.~\:\/\?#\[\]@\!\$&'\(\)\*\+,;\=%]+\.[a-zA-Z]{2,}$|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"), FALSE)
</pre>
Input 1:
<pre>
www.bbc.co.uk
</pre>
Output 1:
<pre>
TRUE
</pre>
Input 2:
<pre>
1.0.0.0
</pre>
Output 2:
<pre>
TRUE
</pre>
Input 3:
<pre>
Yahoo! News
</pre>
Output 3:
<pre>
FALSE
</pre>
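For checking the same data outside a spreadsheet, a rough Python equivalent of the REGEXMATCH check might look like the sketch below. It rests on assumptions: the character class is taken to cover letters, digits, and common URL punctuation, and the name <code>contains_domain</code> is hypothetical. Like REGEXMATCH, <code>re.search</code> looks for a match anywhere in the text, and non-text values return False instead of raising an error.

```python
import re

# Either: URL-safe characters, a dot, and a trailing alphabetic TLD of 2+ letters;
# or: four dot-separated groups of 1-3 digits (a loose IPv4 shape, no 0-255 range check).
DOMAIN_OR_IPV4 = re.compile(
    r"[a-zA-Z0-9\-_.~:/?#\[\]@!$&'()*+,;=%]+\.[a-zA-Z]{2,}$"
    r"|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
)

def contains_domain(value):
    """Return True if the value looks like a domain or a dotted-quad address."""
    if not isinstance(value, str):
        # Mirrors the spreadsheet's error handling: non-text input counts as FALSE.
        return False
    return DOMAIN_OR_IPV4.search(value) is not None

if __name__ == "__main__":
    for sample in ["www.bbc.co.uk", "Yahoo! News"]:
        print(sample, contains_domain(sample))
```

Because the address branch only counts digits, a value such as 999.999.999.999 also passes; stricter validation would need a range check on each group.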
Other methods not recommended:
* Checking whether the domain ends with .com, .tw, .net, or .org does not scale, because there are too many top-level domains to enumerate.
[[Category: Regular expression]] [[Category: Data Science]] [[Category: String manipulation]]