Count occurrences of a word in string: Difference between revisions

Revision as of 19:25, 5 September 2025

Counting the number of occurrences (or frequency) of a word in a string

Excel

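If multiple values are allowed in a single cell

Use the SUBSTITUTE and LEN functions (demo^[1]): remove every occurrence of the word, then divide the drop in length by the length of the word. For example, assuming the text is in cell A1 and the word is apple: =(LEN(A1)-LEN(SUBSTITUTE(A1,"apple","")))/LEN("apple"). Note that SUBSTITUTE is case-sensitive.

If multiple values are NOT allowed in a single cell

Use the COUNTIF function.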

MySQL way

SET @paragraph := 'an apple a day keeps the doctor away';
SET @term := 'apple';

SELECT FLOOR((LENGTH(@paragraph) - LENGTH(REPLACE(@paragraph, @term, ''))) / LENGTH(@term)) AS occurrences;

/* equivalent query using character lengths instead of byte lengths */
SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences;

online example

-- Count occurrences of a string: .
SET @input = "www.google.com";
SET @separator = ".";
SELECT FLOOR((LENGTH(@input) - LENGTH(REPLACE(@input, @separator, ""))) / LENGTH(@separator)) AS count_of_separator;
-- expected result: 2

-- Count occurrences of a string: og
SET @input = "www.google.com";
SET @separator = "og";
SELECT FLOOR((LENGTH(@input) - LENGTH(REPLACE(@input, @separator, ""))) / LENGTH(@separator)) AS count_of_separator;
-- expected result: 1
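The length-difference arithmetic used in the queries above can also be sketched outside the database. The following is a minimal shell illustration, not part of the original page: count_by_length is a hypothetical helper name, and it assumes an awk is available and that the arguments contain no backslashes (awk -v interprets escape sequences).

```sh
# count_by_length TEXT TERM  (hypothetical helper, not from the original page)
# Mirrors the SQL arithmetic: (len(text) - len(text without term)) / len(term)
count_by_length() {
  awk -v text="$1" -v term="$2" 'BEGIN {
    # An empty term would cause a division by zero, so report 0.
    if (term == "") { print 0; exit }
    stripped = ""
    rest = text
    # Remove every literal (non-regex) occurrence of term, left to right.
    while ((i = index(rest, term)) > 0) {
      stripped = stripped substr(rest, 1, i - 1)
      rest = substr(rest, i + length(term))
    }
    stripped = stripped rest
    # The removed length is always a multiple of the term length.
    print (length(text) - length(stripped)) / length(term)
  }'
}

count_by_length 'www.google.com' '.'    # prints 2
count_by_length 'www.google.com' 'og'   # prints 1
```

As with REPLACE in the SQL version, occurrences are counted without overlap.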

PHP

Use the mb_substr_count function (multibyte-aware) or the substr_count function (byte-based). See the demo for details.

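<?php

// ASCII text: the byte-based substr_count is sufficient
$input = 'an apple a day keeps the doctor away';
$term = 'apple';

echo substr_count($input, $term) . PHP_EOL; // prints 1

// Multibyte text (a Chinese version of the same proverb)
$input = '一天一蘋果，醫生遠離我';
$term = '蘋果';

echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL; // prints 1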
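For comparison, a similar non-overlapping count can be sketched in shell with grep. This is an illustration only: count_matches is a hypothetical helper name, and the -o option it relies on is a common extension (GNU, BSD, BusyBox) rather than part of POSIX.

```sh
# count_matches TEXT TERM  (hypothetical helper, not from the original page)
count_matches() {
  # Guard against an empty search term.
  [ -n "$2" ] || { echo 0; return; }
  # -o prints each match on its own line, -F treats TERM as a fixed string;
  # tr removes the padding that some wc implementations add.
  printf '%s\n' "$1" | grep -oF -- "$2" | wc -l | tr -d ' '
}

count_matches 'an apple a day keeps the doctor away' 'apple'   # prints 1
count_matches 'aaa' 'aa'                                        # prints 1 (no overlap)
```

Like substr_count, this does not count overlapping occurrences.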

BASH

Data preparation

(1) Put each string on its own line, i.e. separate the strings with a newline character ^[2]
(2) Check that the uniq command exists (on Linux, or on Cygwin for Windows)
(3) (optional) On Cygwin, run export LC_ALL='C' if the sort command below fails with the error message "Invalid or incomplete multibyte or wide character"
(4) Run the command sort <file.txt> | uniq -ic | sort -nr^[3]^[4]. Note that uniq only merges adjacent lines; if case variants are not sorted next to each other (which can happen under LC_ALL='C'), use sort -f in the first step
(5) Remove the leading whitespace from the output: in a text editor that supports regular expressions, replace ^\s+(\d+)\s+ with \1\t
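The steps above can be sketched as one pipeline. This is a minimal illustration under stated assumptions, not the page's own script: count_terms is a hypothetical helper name, the input is assumed to be comma-separated with no spaces around the commas, tr stands in for the sed command of reference [2], sort -f is used so that case variants become adjacent, and the final sed replaces the manual editor step (5).

```sh
# count_terms: read comma-separated terms on stdin, print "count<TAB>term"
# (hypothetical helper, not from the original page)
count_terms() {
  tab=$(printf '\t')
  tr ',' '\n' |            # step 1: one term per line
    sed '/^$/d' |          # drop empty lines
    LC_ALL=C sort -f |     # steps 3-4: fold case so variants are adjacent
    LC_ALL=C uniq -ic |    # count, ignoring case
    LC_ALL=C sort -nr |    # highest count first
    sed "s/^ *\([0-9][0-9]*\) */\1${tab}/"   # step 5: strip the padding
}

printf 'apple,Banana,Apple,banana,apple\n' | count_terms
```

With LC_ALL=C, case folding in sort -f and uniq -i only applies to ASCII letters.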

Input Format A: One term per line

Each line contains only one term/keyword

file: test.txt

#apple
#追劇
#電影
#綜藝
#Apple
#藍芽

Output format I: count followed by keyword

Terms in the input file may contain whitespace.

Result of running sort test.txt | uniq -ic | sort -nr (case insensitive):

   2 #Apple
   1 #電影
   1 #追劇
   1 #藍芽
   1 #綜藝

Result of running sort test.txt | uniq -c (case sensitive):

   1 #Apple
   1 #apple
   1 #綜藝
   1 #藍芽
   1 #追劇
   1 #電影

Output format II: keyword followed by count

Terms in the input file must not contain whitespace, because the awk command swaps only the first two whitespace-separated fields.

Result of running sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' ^[5] (case insensitive):

#Apple 2
#電影 1
#追劇 1
#藍芽 1
#綜藝 1

Result of running sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' (case sensitive):

#電影 1
#追劇 1
#藍芽 1
#綜藝 1
#apple 1
#Apple 1
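If terms can contain whitespace, the field swap above would split them. One possible alternative, sketched here as an assumption rather than as the page's method, is to move the leading count to the end of the line with sed. keyword_then_count is a hypothetical helper name, and a tab is used as the separator so that multi-word terms stay readable.

```sh
# keyword_then_count: turn "   N term with spaces" into "term with spaces<TAB>N"
# (hypothetical helper, not from the original page)
keyword_then_count() {
  tab=$(printf '\t')
  # Group 1 is the count, group 2 is everything after the single separator space.
  sed "s/^ *\([0-9][0-9]*\) \(.*\)\$/\2${tab}\1/"
}

# Example with input shaped like the output of uniq -c:
printf '      2 ice cream\n      1 #apple\n' | keyword_then_count
```

It is meant to be appended to the pipeline in place of the awk command, for example sort test.txt | uniq -c | sort -nr | keyword_then_count.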

Input Format B: Multiple terms per line

Each line contains multiple terms/keywords separated by spaces

file: input.txt

電影 追劇 綜藝
藍芽 apple 電影
電影 綜藝

Method using awk for word frequency counting

awk '{for(i=1;i<=NF;i++) count[$i]++} END {for(word in count) print count[word], word}' input.txt | sort -nr

Output:

3 電影
2 綜藝
1 追劇
1 藍芽
1 apple

How it works:

{for(i=1;i<=NF;i++) count[$i]++} - Loop through each field (word) in each line and increment its count
END {for(word in count) print count[word], word} - After processing all lines, print count and word for each unique word
sort -nr - Sort numerically in descending order
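A variant of the same awk approach, assuming case-insensitive counting is wanted (as with uniq -i for Input Format A), could look like the sketch below. word_freq is a hypothetical helper name; the secondary sort key is added only to make the order of equal counts deterministic.

```sh
# word_freq [FILE...]: case-insensitive word frequency, highest count first
# (hypothetical helper, not from the original page)
word_freq() {
  awk '{ for (i = 1; i <= NF; i++) count[tolower($i)]++ }
       END { for (word in count) print count[word], word }' "$@" |
    LC_ALL=C sort -k1,1nr -k2   # by count descending, then by word
}

printf 'Apple pie apple\npie Apple\n' | word_freq
# prints:
# 3 apple
# 2 pie
```

Whether tolower affects non-ASCII letters depends on the awk implementation and locale.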

Verification of count occurrence

cat test.txt | grep -i "#apple$" | wc -l

# or
cat test.txt | grep -iw "#apple" | wc -l

Options^[6]

-i ignores case distinctions (uppercase vs. lowercase).
-w (--word-regexp) matches only whole words.
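The same check can be sketched without cat and wc by letting grep count the matching lines itself. count_exact_lines is a hypothetical helper name; it uses -x so that only lines consisting entirely of the term are counted, and -F so that the term is treated as a fixed string rather than a regular expression.

```sh
# count_exact_lines TERM FILE  (hypothetical helper, not from the original page)
# -i ignore case, -c print the number of matching lines,
# -x match whole lines only, -F fixed-string match
count_exact_lines() {
  grep -icxF -- "$1" "$2"
}

# Usage, assuming test.txt from Input Format A exists:
# count_exact_lines '#apple' test.txt
```

For the test.txt shown above, this should agree with the count reported by uniq -ic for #Apple.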

References

[1] Excel 計算文字出現次數 (Excel: counting the number of occurrences of text)

[2] replacing comma's with newlines using sed

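[3] text processing - Counting the occurrences of the string - Unix & Linux Stack Exchange, https://unix.stackexchange.com/questions/134446/counting-the-occurrences-of-the-string

[4] Sort and count number of occurrence of lines - Unix & Linux Stack Exchange, https://unix.stackexchange.com/questions/170043/sort-and-count-number-of-occurrence-of-lines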

[5] Swap two columns - awk, sed, python, perl - Stack Overflow

[6] Grep - Wikibooks, open books for an open world

