Count occurrences of a word in string: Difference between revisions
(→BASH) |
mNo edit summary |
||
Line 164: | Line 164: | ||
<references /> | <references /> | ||
[[Category:Software]] [[Category:Programming]] [[Category:Data Science]] [[Category:String manipulation]] [[Category:Data transformation]] [[Category:MySQL]] | [[Category: Software]] | ||
[[Category: Programming]] | |||
[[Category: Data Science]] | |||
[[Category: String manipulation]] | |||
[[Category: Data transformation]] | |||
[[Category: MySQL]] | |||
[[Category: Revised with LLMs]] |
Latest revision as of 19:26, 5 September 2025
Counting number of occurrences (or frequency) of a word in string
Excel[edit]
If multiple values was allowed in a single cell
- Using the function SUBSTITUTE & LEN functions. demo[1]. Or
If multiple values was NOT allowed in a single cell
- Using the function COUNTIF
MySQL way[edit]
SET @paragraph := 'an apple a day keeps the doctor away'; SET @term := 'apple'; SELECT FLOOR((LENGTH(@paragraph) - LENGTH(REPLACE(@paragraph, @term, ''))) / LENGTH(@term)) AS occurrences; /* same with the following query */ SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences;
-- Count occurrences of a string: . SET @input = "www.google.com"; SET @separator = "."; SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; -- expected result: 2 -- Count occurrences of a string: og SET @input = "www.google.com"; SET @separator = "og"; SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator; -- expected result: 1
PHP[edit]
Using the mb_substr_count (binary safe) or substr_count functions. See details on demo.
<?php $input = 'an apple a day keeps the doctor away'; $term = 'apple'; echo substr_count($input, $term) . PHP_EOL; $input = '一天一蘋果,醫生遠離我'; $term = '蘋果'; echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL;
BASH[edit]
data preparation
- (1) separate each string by return_symbol [2]
- (2) check the uniq command is exists on Cygwin of Win
or Linux
- (3) (optional) input the command export LC_ALL='C' on Cygwin of Win
if met the error message "Invalid or incomplete multibyte or wide character" after input the following sort command
- (4) execute the following command sort <file.txt> | uniq -ic | sort -nr[3][4]
- (5) Remove the leading whitespace in the file: Using the text editor with support for regular expression and replace ^\s+(\d+)\s+ with \1\t
Input Format A: One term per line[edit]
Each line contains only one term/keyword
file: test.txt
#apple #追劇 #電影 #綜藝 #Apple #藍芽
Output format I: count followed by keyword[edit]
The term each line in the input file was allowed contains whitespaces.
Result of the execution of command: sort test.txt | uniq -ic | sort -nr case insensitive
2 #Apple 1 #電影 1 #追劇 1 #藍芽 1 #綜藝
Result of the execution of command: sort test.txt | uniq -c case sensitive
1 #Apple 1 #apple 1 #綜藝 1 #藍芽 1 #追劇 1 #電影
Output format II: keyword followed by count[edit]
The term each line in the input file should not contains whitespaces.
Result of the execution of command: sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' [5] case insensitive
#Apple 2 #電影 1 #追劇 1 #藍芽 1 #綜藝 1
Result of the execution of command: sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' case sensitive
#電影 1 #追劇 1 #藍芽 1 #綜藝 1 #apple 1 #Apple 1
Input Format B: Multiple terms per line[edit]
Each line contains multiple terms/keywords separated by spaces
file: input.txt
電影 追劇 綜藝 藍芽 apple 電影 電影 綜藝
Method using awk for word frequency counting[edit]
awk '{for(i=1;i<=NF;i++) count[$i]++} END {for(word in count) print count[word], word}' input.txt | sort -nr
Output:
3 電影 2 綜藝 1 追劇 1 藍芽 1 apple
How it works:
- {for(i=1;i<=NF;i++) count[$i]++} - Loop through each field (word) in each line and increment its count
- END {for(word in count) print count[word], word} - After processing all lines, print count and word for each unique word
- sort -nr - Sort numerically in descending order
Verification of count occurrence[edit]
cat test.txt | grep -i "#apple$" | wc -l # or cat test.txt | grep -iw "#apple" | wc -l
Options[6]
- -i means Ignore uppercase vs. lowercase.
- -w means --word-regexp
Further reading[edit]
- mysql - Count the number of occurences of a string in a VARCHAR field? - Stack Overflow [Last visited: 2016-09-21]
- Excel formula: Most frequently occurring text - Excelchat
References[edit]
- ↑ Excel 計算文字出現次數
- ↑ replacing comma's with newlines using sed
- ↑ text processing - Counting the occurrences of the string - Unix & Linux Stack Exchange
- ↑ Sort and count number of occurrence of lines - Unix & Linux Stack Exchange
- ↑ Swap two columns - awk, sed, python, perl - Stack Overflow
- ↑ Grep - Wikibooks, open books for an open world