Find and remove duplicates

From LemonWiki共筆
Jump to navigation Jump to search

Find and remove duplicates in Excel[edit]

Find duplicates in Excel[edit]

Finding duplicate rows that differ in one column

Finding duplicate rows that differ in multiple columns

Counging

Remove duplicates in Excel[edit]


Find and remove duplicates in Cygwin/BASH[edit]

Finding duplicate values in Cygwin/BASH[edit]

  • uniq command on Cygwin of Win Os windows.png or Linux Os linux.png : uniq -d <file.txt> > <duplicated_items.txt>[1]


Remove duplicate values in Cygwin/BASH[edit]

GNU Coreutils: sort invocation OS: Linux Os linux.png , cygwin of Win Os windows.png .

Data preparation:

case 1: To remove duplicate lines of entire paragraph

  • sort -us -o <output_unique.file> <input.file> in a large text file (GB)[2]
  • cat <input.file> | grep <pattern> | sort | uniq Processes text line by line and prints the unique lines which match a specified pattern. Equal to these steps: (1) cat <input.file> | grep <pattern> > <tmp.file> (2) sort <tmp.file> | uniq

Case 2: Ignore first n line(s) & remove duplicate lines[3][4][5]

  • (1) ignore first one line: (head -n 1 <file> && tail -n +2 <file> | sort -us) > newfile
  • (2) ignore first two lines: (head -n 2 <file> && tail -n +3 <file> | sort -us) > newfile

Find and remove duplicates in MySQL[edit]

Find duplicates in MySQL[edit]

Finding duplicate value that differ in one column[6]

-- Generate test data.
CREATE TABLE `table_name` (
  `id` int(11) NOT NULL,
  `content` varchar(5) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `table_name` (`id`, `content`) VALUES
(1, 'apple'),
(2, 'lemon'),
(3, 'apple');

ALTER TABLE `table_name`
  ADD PRIMARY KEY (`id`);

-- Find duplicated data
SELECT `content`, COUNT(*) count 
FROM `table_name` 
GROUP BY `content` 
HAVING count > 1;

SELECT tmp.* FROM 
( 
  SELECT `content`, count(*) count FROM `table_name` GROUP BY `content` 
) tmp 
WHERE tmp.count >1;

Finding duplicate rows that differ in multiple columns: Using CONCAT for multiple columns ex: column_1, column_2

SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
	FROM `table_name`
	GROUP BY CONCAT(  `column_1`, `column_2`  )
HAVING count > 1;

or

SELECT tmp.key FROM
(
	SELECT count(*) count, CONCAT(  `column_1`, `column_2`  ) 'key'
	FROM `table_name`
	GROUP BY CONCAT(  `column_1`, `column_2`  )
) tmp
WHERE tmp.count >=2

Select deduplicated records


For counting purpose: find the count of repeated id (type: int) between table_a and table_b

SELECT count(DISTINCT(id)) FROM table_a WHERE id IN
(
   SELECT DISTINCT(id) FROM table_b
)

Counting number of duplicate occurrence MySQL: find the number of duplicate occurrence between list_a & list_b which using the same primary key: column name id

  • SELECT count(DISTINCT(`id`)) FROM `list_a` WHERE `id` IN (SELECT DISTINCT(`id`) FROM `list_b`) ;


Remove duplicates in MySQL[edit]


Find and remove duplicates in Google Spreadsheet[edit]

Find and remove duplicates in PHP[edit]

Find duplicates in PHP[edit]

Remove duplicates in PHP[edit]

Find and remove duplicates in JavaScript[edit]

Remove duplicates in JavaScript[edit]


Other issues[edit]

  • symbol e.g. data-mining or data_mining

References[edit]