
how do I remove duplicate words on a text list?[solved]

Posted: Fri 01 Mar 2013, 09:19
by scsijon
I have a list of 24000+ 'words' in a text file, one per line.

However, 40% or more are duplicates and I'd like to delete them.

The 'words' can be basically anything that can be typed on a keyboard.

I have used 'return-newline' as the separator.

Does anyone have a simple script I can run to 'fix the problem'?

thanks

Re: how do I remove duplicate words on a text list?

Posted: Fri 01 Mar 2013, 10:57
by SFR
Hey Scsijon

To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...


and should look like this:
abc
def
blablabla
zxzx
something
...


right?

If you don't mind the lines also being sorted:

Code:

sort -u input_file
But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:

Code:

awk '!_[$0]++' input_file
Greetings!
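
For anyone wondering how the awk one-liner works: it uses each whole line ($0) as a key in an array and prints a line only the first time that key shows up, so the original order is kept. A minimal long-hand sketch of the same idea follows; the function name dedupe_keep_order and the array name seen are made up for illustration (the one-liner calls its array _).

```shell
#!/bin/sh
# Long-hand equivalent of: awk '!_[$0]++' input_file
# "dedupe_keep_order" and "seen" are hypothetical names used for illustration.
dedupe_keep_order() {
    awk '{
        if (!($0 in seen)) {   # first time this exact line appears
            print              # keep it, in its original position
        }
        seen[$0] = 1           # remember the line for later matches
    }' "$@"
}

# Demo with the sample list from the post above.
result=$(printf '%s\n' abc def abc abc blablabla zxzx something blablabla | dedupe_keep_order)
printf '%s\n' "$result"
```

With a real file it would be called as dedupe_keep_order input_file > output_file. Note that awk keeps every distinct line in memory, which should be no problem for a list of this size.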

Posted: Fri 01 Mar 2013, 12:49
by vovchik
Dear guys,

This is pretty easy too:

Code:

cat some.txt | sort | uniq
With kind regards,
vovchik

Posted: Fri 01 Mar 2013, 15:52
by amigo
I recently had the same problem and wrote something for it. But I can't find it right now, so I've re-created it:

Code:

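#!/bin/bash
# uniq_no-sort
# print out unique lines, but without sorting them

FILE=$1

OUT="$FILE.uniq"
: > "$OUT"

# IFS= and -r keep leading spaces and backslashes in each line intact
while IFS= read -r LINE ; do
	# -F fixed string, -x whole-line match, -q quiet: test grep's exit status directly
	if ! grep -Fxq -e "$LINE" "$OUT" ; then
		printf '%s\n' "$LINE" >> "$OUT"
	fi
done < "$FILE"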

Posted: Fri 01 Mar 2013, 16:28
by tallboy
Sorry, mistake, could not find a way to delete post.

Posted: Fri 01 Mar 2013, 18:50
by GustavoYz
vovchik wrote:

Code:

cat some.txt | sort | uniq
With kind regards,
vovchik
cat isn't really needed:

Code:

sort file.txt | uniq
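
The sort step matters because uniq only collapses repeats that sit on adjacent lines. A small sketch of the difference, using a made-up four-line sample rather than the real list:

```shell
#!/bin/sh
# uniq alone: only adjacent repeats collapse, so the later "abc" survives.
unsorted=$(printf '%s\n' abc def abc abc | uniq)

# Sorting first puts equal lines next to each other, so uniq catches them all.
sorted=$(printf '%s\n' abc def abc abc | sort | uniq)

# sort -u does both steps in one command.
sorted_u=$(printf '%s\n' abc def abc abc | sort -u)

printf '%s\n' "uniq only:" "$unsorted" "sort | uniq:" "$sorted" "sort -u:" "$sorted_u"
```

Here uniq on its own still leaves two abc lines, while both sorted variants leave one.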

Posted: Fri 01 Mar 2013, 22:17
by scsijon
Sorry folks, I wish it was that simple.

I have already sorted the list; that was when I realized how many duplicates were in it.


consider this:


aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p

and on we go.

I want to remove all the duplicates found.

It's what happens when you need to rebuild a crashed component list from backups.

Posted: Fri 01 Mar 2013, 23:15
by Keef

Code:

# cat list.txt 
aaa 
aba 
ada 
ad 
aea 
aea 
aea 
agd 
agd 
ased 
ased 
ased-ss 
ased-ss<p
# sort list.txt | uniq
aaa 
aba 
ad 
ada 
aea 
agd 
ased 
ased-ss 
ased-ss<p
# 
Seems to work for me....
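
The commands above only print to the terminal. To actually save the cleaned list and check how many lines went away, something like the following sketch might do. It works on a scratch copy in a temporary directory; the path in "list" is a hypothetical stand-in for the real file.

```shell
#!/bin/sh
# Work on a scratch copy so nothing real gets overwritten.
tmpdir=$(mktemp -d)
list="$tmpdir/list.txt"   # hypothetical path, stands in for the real list
printf '%s\n' aaa aba ada ad aea aea aea agd agd ased ased ased-ss 'ased-ss<p' > "$list"

# $(( )) strips the padding some wc implementations add around the number.
before=$(( $(wc -l < "$list") ))

# -o names the output file; sort reads all of its input before writing,
# so the output may be the same file as the input.
sort -u -o "$list" "$list"

after=$(( $(wc -l < "$list") ))

echo "lines before: $before, after: $after"
rm -rf "$tmpdir"
```

Do not try the same thing with a plain redirect (sort -u list.txt > list.txt): the shell empties the file before sort gets to read it.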

Posted: Fri 01 Mar 2013, 23:59
by scsijon
Sorry, vovchik, GustavoYz and Keef,

yes it does. I must have made a typo the first time I tried it.

thanks all