Editing Count occurrences of a word in string

Counting number of occurrences (or frequency) of a word in string

== Excel ==

If multiple values was allowed in a single cell
* Using the function [https://support.office.com/zh-tw/article/substitute-%E5%87%BD%E6%95%B8-6434944e-a904-4336-a9b0-1e58df3bc332 SUBSTITUTE] & [https://support.office.com/zh-hk/article/len%E3%80%81lenb-%E5%87%BD%E6%95%B8-29236f94-cedc-429d-affd-b5e33d2c67cb LEN] functions. [https://docs.google.com/spreadsheets/d/1eg5XuGs8X7bS_RrFECx8ktuPgTwpV12UPfkdlpbHlMU/edit?usp=sharing demo]<ref>[https://errerrors.blogspot.com/2022/02/excel-calculate-text-occurrence.html Excel 計算文字出現次數]</ref>. Or

If multiple values was '''NOT''' allowed in a single cell
* Using the function [https://support.office.com/zh-hk/article/countif-%E5%87%BD%E6%95%B8-e0de10c6-f885-4e71-abb4-1f464816df34 COUNTIF]

== MySQL way ==
<pre>
SET @paragraph := 'an apple a day keeps the doctor away';
SET @term := 'apple';

SELECT FLOOR((LENGTH(@paragraph) - LENGTH(REPLACE(@paragraph, @term, ''))) / LENGTH(@term)) AS occurrences;

/* same with the following query */
SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences;
</pre>
[http://sqlfiddle.com/#!9/6d06f/480/2 online example]

<pre>
-- Count occurrences of a string: .
SET @input = "www.google.com";
SET @separator = ".";
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator;
-- expected result: 2

-- Count occurrences of a string: og
SET @input = "www.google.com";
SET @separator = "og";
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator;
-- expected result: 1
</pre>


== PHP ==
Using the [https://www.php.net/manual/en/function.mb-substr-count.php mb_substr_count] ('''binary safe''') or [https://www.php.net/manual/en/function.substr-count.php substr_count] functions. See details on [http://sandbox.onlinephpfunctions.com/code/0b72da5f00701341c77623889d177750242cbb7a demo].
<pre>
<?php

$input = 'an apple a day keeps the doctor away';
$term = 'apple';

echo substr_count($input, $term) . PHP_EOL;

$input = '一天一蘋果，醫生遠離我';
$term = '蘋果';

echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL;
</pre>

== BASH ==
data preparation
* (1) separate each string by [[Return symbol | return_symbol]] <ref>[https://www.unix.com/shell-programming-and-scripting/83076-replacing-commas-newlines-using-sed.html replacing comma's with newlines using sed]</ref>
* (2) check the [https://www.computerhope.com/unix/uuniq.htm uniq command] is exists on Cygwin of {{Win}} or {{Linux}}
* (3) (optional) input the command {{kbd | key=<nowiki>export LC_ALL='C'</nowiki>}} on Cygwin of {{Win}} if met the error message "Invalid or incomplete multibyte or wide character" after input the following {{kbd | key=<nowiki>sort</nowiki>}} command
* (4) execute the following command {{kbd | key=<nowiki>sort <file.txt> | uniq -ic | sort -nr</nowiki>}}<ref>[https://unix.stackexchange.com/questions/134446/counting-the-occurrences-of-the-string text processing - Counting the occurrences of the string - Unix & Linux Stack Exchange]</ref><ref>[https://unix.stackexchange.com/questions/170043/sort-and-count-number-of-occurrence-of-lines Sort and count number of occurrence of lines - Unix & Linux Stack Exchange]</ref>
* (5) Remove the leading whitespace in the file: Using the [[Text editor with support for regular expression | text editor]] with support for [[Regular expression|regular expression]] and replace {{kbd | key=<nowiki>^\s+(\d+)\s+</nowiki>}} with {{kbd | key=<nowiki>\1\t</nowiki>}}

=== Input Format A: One term per line ===
{{exclaim}} Each line contains only one term/keyword

file: test.txt
<pre>
#apple
#追劇
#電影
#綜藝
#Apple
#藍芽
</pre>

==== Output format I: count followed by keyword ====
{{exclaim}} The term each line in the input file was allowed contains whitespaces.

Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -ic | sort -nr</nowiki>}} {{exclaim}} case insensitive
<pre>
   2 #Apple
   1 #電影
   1 #追劇
   1 #藍芽
   1 #綜藝
</pre>

Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -c</nowiki>}} {{exclaim}} case sensitive
<pre>
   1 #Apple
   1 #apple
   1 #綜藝
   1 #藍芽
   1 #追劇
   1 #電影
</pre>

==== Output format II: keyword followed by count ====
{{exclaim}} The term each line in the input file should '''not''' contains whitespaces.

Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } '</nowiki>}} <ref>[https://stackoverflow.com/questions/11967776/swap-two-columns-awk-sed-python-perl Swap two columns - awk, sed, python, perl - Stack Overflow]</ref>{{exclaim}} case insensitive
<pre>
#Apple 2
#電影 1
#追劇 1
#藍芽 1
#綜藝 1
</pre>

Result of the execution of command: {{kbd | key=<nowiki>sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } '</nowiki>}} {{exclaim}} case sensitive
<pre>
#電影 1
#追劇 1
#藍芽 1
#綜藝 1
#apple 1
#Apple 1
</pre>

=== Input Format B: Multiple terms per line ===
{{exclaim}} Each line contains multiple terms/keywords separated by spaces

file: input.txt
<pre>
電影 追劇 綜藝
藍芽 apple 電影
電影 綜藝
</pre>

==== Method using awk for word frequency counting ====

{{kbd | key=<nowiki>awk '{for(i=1;i<=NF;i++) count[$i]++} END {for(word in count) print count[word], word}' input.txt | sort -nr</nowiki>}}

Output:
<pre>
3 電影
2 綜藝
1 追劇
1 藍芽
1 apple
</pre>

How it works:
* {{kbd | key=<nowiki>{for(i=1;i<=NF;i++) count[$i]++}</nowiki>}} - Loop through each field (word) in each line and increment its count
* {{kbd | key=<nowiki>END {for(word in count) print count[word], word}</nowiki>}} - After processing all lines, print count and word for each unique word
* {{kbd | key=<nowiki>sort -nr</nowiki>}} - Sort numerically in descending order

=== Verification of count occurrence ===
<pre>
cat test.txt | grep -i "#apple$" | wc -l

# or
cat test.txt | grep -iw "#apple" | wc -l
</pre>

Options<ref>[https://en.wikibooks.org/wiki/Grep Grep - Wikibooks, open books for an open world]</ref>
* {{kbd | key=<nowiki>-i</nowiki>}} means {{kbd | key=<nowiki>Ignore uppercase vs. lowercase.</nowiki>}}
* {{kbd | key=<nowiki>-w</nowiki>}} means {{kbd | key=<nowiki>--word-regexp</nowiki>}}

== Further reading ==
* [http://stackoverflow.com/questions/12344795/count-the-number-of-occurences-of-a-string-in-a-varchar-field mysql - Count the number of occurences of a string in a VARCHAR field? - Stack Overflow] {{access | date = 2016-09-21}}
* [https://www.got-it.ai/solutions/excel-chat/excel-tutorial/conditional-formatting/most-frequently-occurring-text Excel formula: Most frequently occurring text - Excelchat]

== References ==
<references />

[[Category: Software]]
[[Category: Programming]]
[[Category: Data Science]] 
[[Category: String manipulation]] 
[[Category: Data transformation]]
[[Category: MySQL]]
[[Category: Revised with LLMs]]