Count occurrences of a word in string: Difference between revisions
m (→MySQL way) |
m (→References) |
||
(18 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Counting number of occurrences (or frequency) of a word in string | |||
== Excel == | == Excel == | ||
Using the function [https://support.office.com/zh-tw/article/substitute-%E5%87%BD%E6%95%B8-6434944e-a904-4336-a9b0-1e58df3bc332 SUBSTITUTE] & [https://support.office.com/zh-hk/article/len%E3%80%81lenb-%E5%87%BD%E6%95%B8-29236f94-cedc-429d-affd-b5e33d2c67cb LEN] functions. | |||
[https://docs.google.com/spreadsheets/d/1eg5XuGs8X7bS_RrFECx8ktuPgTwpV12UPfkdlpbHlMU/edit?usp=sharing demo] | If multiple values was allowed in a single cell | ||
* Using the function [https://support.office.com/zh-tw/article/substitute-%E5%87%BD%E6%95%B8-6434944e-a904-4336-a9b0-1e58df3bc332 SUBSTITUTE] & [https://support.office.com/zh-hk/article/len%E3%80%81lenb-%E5%87%BD%E6%95%B8-29236f94-cedc-429d-affd-b5e33d2c67cb LEN] functions. [https://docs.google.com/spreadsheets/d/1eg5XuGs8X7bS_RrFECx8ktuPgTwpV12UPfkdlpbHlMU/edit?usp=sharing demo]<ref>[https://errerrors.blogspot.com/2022/02/excel-calculate-text-occurrence.html Excel 計算文字出現次數]</ref>. Or | |||
If multiple values was '''NOT''' allowed in a single cell | |||
* Using the function [https://support.office.com/zh-hk/article/countif-%E5%87%BD%E6%95%B8-e0de10c6-f885-4e71-abb4-1f464816df34 COUNTIF] | |||
== MySQL way == | == MySQL way == | ||
Line 15: | Line 19: | ||
SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences; | SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences; | ||
</pre> | </pre> | ||
[http://sqlfiddle.com/#!9/6d06f/480/2 online example] | |||
<pre> | |||
-- Count occurrences of a string: . | |||
SET @input = "www.google.com"; | |||
SET @separator = "."; | |||
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; | |||
-- expected result: 2 | |||
= | -- Count occurrences of a string: og | ||
SET @input = "www.google.com"; | |||
SET @separator = "og"; | |||
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; | |||
-- expected result: 1 | |||
</pre> | |||
== PHP == | |||
Using the [https://www.php.net/manual/en/function.mb-substr-count.php mb_substr_count] ('''binary safe''') or [https://www.php.net/manual/en/function.substr-count.php substr_count] functions. See details on [http://sandbox.onlinephpfunctions.com/code/0b72da5f00701341c77623889d177750242cbb7a demo]. | |||
<pre> | <pre> | ||
<?php | |||
$input = 'an apple a day keeps the doctor away'; | $input = 'an apple a day keeps the doctor away'; | ||
$term = 'apple'; | $term = 'apple'; | ||
echo substr_count($input, $term); | echo substr_count($input, $term) . PHP_EOL; | ||
$input = '一天一蘋果,醫生遠離我'; | |||
$term = '蘋果'; | |||
echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL; | |||
</pre> | |||
== BASH == | |||
data preparation | |||
* (1) separate each string by [[Return symbol | return_symbol]] <ref>[https://www.unix.com/shell-programming-and-scripting/83076-replacing-commas-newlines-using-sed.html replacing comma's with newlines using sed]</ref> | |||
* (2) check the [https://www.computerhope.com/unix/uuniq.htm uniq command] is exists on Cygwin of {{Win}} or {{Linux}} | |||
* (3) (optional) input the command {{kbd | key=<nowiki>export LC_ALL='C'</nowiki>}} on Cygwin of {{Win}} if met the error message "Invalid or incomplete multibyte or wide character" after input the following {{kbd | key=<nowiki>sort</nowiki>}} command | |||
* (4) execute the following command {{kbd | key=<nowiki>sort <file.txt> | uniq -ic | sort -nr</nowiki>}}<ref>[https://unix.stackexchange.com/questions/134446/counting-the-occurrences-of-the-string text processing - Counting the occurrences of the string - Unix & Linux Stack Exchange]</ref><ref>[https://unix.stackexchange.com/questions/170043/sort-and-count-number-of-occurrence-of-lines Sort and count number of occurrence of lines - Unix & Linux Stack Exchange]</ref> | |||
* (5) Remove the leading whitespace in the file: Using the [[Text editor with support for regular expression | text editor]] with support for [[Regular expression|regular expression]] and replace {{kbd | key=<nowiki>^\s+(\d+)\s+</nowiki>}} with {{kbd | key=<nowiki>\1\t</nowiki>}} | |||
file: test.txt | |||
<pre> | |||
#apple | |||
#追劇 | |||
#電影 | |||
#綜藝 | |||
#Apple | |||
#藍芽 | |||
</pre> | |||
=== Output format I: occurrence & keyword === | |||
{{exclaim}} The term each line in the input file was allowed contains whitespaces. | |||
Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -ic | sort -nr</nowiki>}} {{exclaim}} case insensitive | |||
<pre> | |||
2 #Apple | |||
1 #電影 | |||
1 #追劇 | |||
1 #藍芽 | |||
1 #綜藝 | |||
</pre> | |||
Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -c</nowiki>}} {{exclaim}} case sensitive | |||
<pre> | |||
1 #Apple | |||
1 #apple | |||
1 #綜藝 | |||
1 #藍芽 | |||
1 #追劇 | |||
1 #電影 | |||
</pre> | |||
=== Output format II: keyword & occurrence === | |||
{{exclaim}} The term each line in the input file should '''not''' contains whitespaces. | |||
Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } '</nowiki>}} <ref>[https://stackoverflow.com/questions/11967776/swap-two-columns-awk-sed-python-perl Swap two columns - awk, sed, python, perl - Stack Overflow]</ref>{{exclaim}} case insensitive | |||
<pre> | |||
#Apple 2 | |||
#電影 1 | |||
#追劇 1 | |||
#藍芽 1 | |||
#綜藝 1 | |||
</pre> | </pre> | ||
[[Category:Software]] [[Category:Programming]] [[Category:Data Science]] [[Category: | Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } '</nowiki>}} {{exclaim}} case sensitive | ||
<pre> | |||
#電影 1 | |||
#追劇 1 | |||
#藍芽 1 | |||
#綜藝 1 | |||
#apple 1 | |||
#Apple 1 | |||
</pre> | |||
=== Verification of count occurrence === | |||
<pre> | |||
cat test.txt | grep -i "#apple$" | wc -l | |||
# or | |||
cat test.txt | grep -iw "#apple" | wc -l | |||
</pre> | |||
Options<ref>[https://en.wikibooks.org/wiki/Grep Grep - Wikibooks, open books for an open world]</ref> | |||
* {{kbd | key=<nowiki>-i</nowiki>}} means {{kbd | key=<nowiki>Ignore uppercase vs. lowercase.</nowiki>}} | |||
* {{kbd | key=<nowiki>-w</nowiki>}} means {{kbd | key=<nowiki>--word-regexp</nowiki>}} | |||
== Further reading == | |||
* [http://stackoverflow.com/questions/12344795/count-the-number-of-occurences-of-a-string-in-a-varchar-field mysql - Count the number of occurences of a string in a VARCHAR field? - Stack Overflow] {{access | date = 2016-09-21}} | |||
* [https://www.got-it.ai/solutions/excel-chat/excel-tutorial/conditional-formatting/most-frequently-occurring-text Excel formula: Most frequently occurring text - Excelchat] | |||
== References == | |||
<references /> | |||
[[Category:Software]] [[Category:Programming]] [[Category:Data Science]] [[Category:String manipulation]] [[Category:Data transformation]] [[Category:MySQL]] |
Latest revision as of 12:32, 15 February 2023
Counting number of occurrences (or frequency) of a word in string
Excel[edit]
If multiple values was allowed in a single cell
- Using the function SUBSTITUTE & LEN functions. demo[1]. Or
If multiple values was NOT allowed in a single cell
- Using the function COUNTIF
MySQL way[edit]
SET @paragraph := 'an apple a day keeps the doctor away'; SET @term := 'apple'; SELECT FLOOR((LENGTH(@paragraph) - LENGTH(REPLACE(@paragraph, @term, ''))) / LENGTH(@term)) AS occurrences; /* same with the following query */ SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences;
-- Count occurrences of a string: . SET @input = "www.google.com"; SET @separator = "."; SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; -- expected result: 2 -- Count occurrences of a string: og SET @input = "www.google.com"; SET @separator = "og"; SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; -- expected result: 1
PHP[edit]
Using the mb_substr_count (binary safe) or substr_count functions. See details on demo.
<?php $input = 'an apple a day keeps the doctor away'; $term = 'apple'; echo substr_count($input, $term) . PHP_EOL; $input = '一天一蘋果,醫生遠離我'; $term = '蘋果'; echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL;
BASH[edit]
data preparation
- (1) separate each string by return_symbol [2]
- (2) check the uniq command is exists on Cygwin of Win or Linux
- (3) (optional) input the command export LC_ALL='C' on Cygwin of Win if met the error message "Invalid or incomplete multibyte or wide character" after input the following sort command
- (4) execute the following command sort <file.txt> | uniq -ic | sort -nr[3][4]
- (5) Remove the leading whitespace in the file: Using the text editor with support for regular expression and replace ^\s+(\d+)\s+ with \1\t
file: test.txt
#apple #追劇 #電影 #綜藝 #Apple #藍芽
Output format I: occurrence & keyword[edit]
The term each line in the input file was allowed contains whitespaces.
Result of the execution of command: sort test.txt | uniq -ic | sort -nr case insensitive
2 #Apple 1 #電影 1 #追劇 1 #藍芽 1 #綜藝
Result of the execution of command: sort test.txt | uniq -c case sensitive
1 #Apple 1 #apple 1 #綜藝 1 #藍芽 1 #追劇 1 #電影
Output format II: keyword & occurrence[edit]
The term each line in the input file should not contains whitespaces.
Result of the execution of command: sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' [5] case insensitive
#Apple 2 #電影 1 #追劇 1 #藍芽 1 #綜藝 1
Result of the execution of command: sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' case sensitive
#電影 1 #追劇 1 #藍芽 1 #綜藝 1 #apple 1 #Apple 1
Verification of count occurrence[edit]
cat test.txt | grep -i "#apple$" | wc -l # or cat test.txt | grep -iw "#apple" | wc -l
Options[6]
- -i means Ignore uppercase vs. lowercase.
- -w means --word-regexp
Further reading[edit]
- mysql - Count the number of occurences of a string in a VARCHAR field? - Stack Overflow [Last visited: 2016-09-21]
- Excel formula: Most frequently occurring text - Excelchat
References[edit]
- ↑ Excel 計算文字出現次數
- ↑ replacing comma's with newlines using sed
- ↑ text processing - Counting the occurrences of the string - Unix & Linux Stack Exchange
- ↑ Sort and count number of occurrence of lines - Unix & Linux Stack Exchange
- ↑ Swap two columns - awk, sed, python, perl - Stack Overflow
- ↑ Grep - Wikibooks, open books for an open world