Count occurrences of a word in string: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
Line 3: Line 3:
== Excel ==
== Excel ==


If the cell value was allowed multiple values
If multiple values was allowed in a single cell
* Using the function [https://support.office.com/zh-tw/article/substitute-%E5%87%BD%E6%95%B8-6434944e-a904-4336-a9b0-1e58df3bc332 SUBSTITUTE] & [https://support.office.com/zh-hk/article/len%E3%80%81lenb-%E5%87%BD%E6%95%B8-29236f94-cedc-429d-affd-b5e33d2c67cb LEN] functions. [https://docs.google.com/spreadsheets/d/1eg5XuGs8X7bS_RrFECx8ktuPgTwpV12UPfkdlpbHlMU/edit?usp=sharing demo]<ref>[https://errerrors.blogspot.com/2022/02/excel-calculate-text-occurrence.html Excel 計算文字出現次數]</ref>. Or
* Using the function [https://support.office.com/zh-tw/article/substitute-%E5%87%BD%E6%95%B8-6434944e-a904-4336-a9b0-1e58df3bc332 SUBSTITUTE] & [https://support.office.com/zh-hk/article/len%E3%80%81lenb-%E5%87%BD%E6%95%B8-29236f94-cedc-429d-affd-b5e33d2c67cb LEN] functions. [https://docs.google.com/spreadsheets/d/1eg5XuGs8X7bS_RrFECx8ktuPgTwpV12UPfkdlpbHlMU/edit?usp=sharing demo]<ref>[https://errerrors.blogspot.com/2022/02/excel-calculate-text-occurrence.html Excel 計算文字出現次數]</ref>. Or


If the cell value was NOT allowed multiple values
If multiple values was '''NOT''' allowed in a single cell
* Using the function [https://support.office.com/zh-hk/article/countif-%E5%87%BD%E6%95%B8-e0de10c6-f885-4e71-abb4-1f464816df34 COUNTIF]
* Using the function [https://support.office.com/zh-hk/article/countif-%E5%87%BD%E6%95%B8-e0de10c6-f885-4e71-abb4-1f464816df34 COUNTIF]



Revision as of 17:58, 4 February 2022

Counting number of occurrences (or frequency) of a word in string

Excel

If multiple values was allowed in a single cell

If multiple values was NOT allowed in a single cell

MySQL way

SET @paragraph := 'an apple a day keeps the doctor away';
SET @term := 'apple';

SELECT FLOOR((LENGTH(@paragraph) - LENGTH(REPLACE(@paragraph, @term, ''))) / LENGTH(@term)) AS occurrences;

/* same with the following query */
SELECT FLOOR((CHAR_LENGTH(@paragraph) - CHAR_LENGTH(REPLACE(@paragraph, @term, ''))) / CHAR_LENGTH(@term)) AS occurrences;

online example

-- Count occurrences of a string: .
SET @input = "www.google.com";
SET @separator = ".";
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator;
-- expected result: 2

-- Count occurrences of a string: og
SET @input = "www.google.com";
SET @separator = "og";
SELECT (LENGTH(@input ) - LENGTH(REPLACE(@input , @separator, ""))) / LENGTH(@separator) AS count_of_separator;
-- expected result: 1


PHP

Using the mb_substr_count (binary safe) or substr_count functions. See details on demo.

<?php

$input = 'an apple a day keeps the doctor away';
$term = 'apple';

echo substr_count($input, $term) . PHP_EOL;

$input = '一天一蘋果,醫生遠離我';
$term = '蘋果';

echo mb_substr_count($input, $term, 'UTF-8') . PHP_EOL;

BASH

data preparation

  • (1) separate each string by return_symbol [2]
  • (2) check the uniq command is exists on Cygwin of Win Os windows.png or Linux Os linux.png
  • (3) (optional) input the command export LC_ALL='C' on Cygwin of Win Os windows.png if met the error message "Invalid or incomplete multibyte or wide character" after input the following sort command
  • (4) execute the following command sort <file.txt> | uniq -ic | sort -nr[3][4]
  • (5) Remove the leading whitespace in the file: Using the text editor with support for regular expression and replace ^\s+(\d+)\s+ with \1\t

file: test.txt

#apple
#追劇
#電影
#綜藝
#Apple
#藍芽

Output format I: occurrence & keyword

Icon_exclaim.gif The term each line in the input file was allowed contains whitespaces.

Result of the execution of command: sort test.txt | uniq -ic | sort -nr Icon_exclaim.gif case insensitive

   2 #Apple
   1 #電影
   1 #追劇
   1 #藍芽
   1 #綜藝

Result of the execution of command: sort test.txt | uniq -c Icon_exclaim.gif case sensitive

   1 #Apple
   1 #apple
   1 #綜藝
   1 #藍芽
   1 #追劇
   1 #電影


Output format II: keyword & occurrence

Icon_exclaim.gif The term each line in the input file should not contains whitespaces.

Result of the execution of command: sort test.txt | uniq -ic | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' [5]Icon_exclaim.gif case insensitive

#Apple 2
#電影 1
#追劇 1
#藍芽 1
#綜藝 1

Result of the execution of command: sort test.txt | uniq -c | sort -nr | awk ' { t = $1; $1 = $2; $2 = t; print; } ' Icon_exclaim.gif case sensitive

#電影 1
#追劇 1
#藍芽 1
#綜藝 1
#apple 1
#Apple 1


Verification of count occurrence

cat test.txt | grep -i "#apple$" | wc -l

# or
cat test.txt | grep -iw "#apple" | wc -l

Options[6]

  • -i means Ignore uppercase vs. lowercase.
  • -w means --word-regexp

References