how do I remove duplicate words on a text list?[solved]
I have a list of 24000+ 'words' in a text file, in a vertical list format.
However, about 40+% are duplicates and I'd like to delete them.
The 'words' can contain basically anything that can be typed on a keyboard.
I have used 'return-newline' as the separator.
Does anyone have a simple script I can run to 'fix the problem'?
thanks
Last edited by scsijon on Sat 02 Mar 2013, 00:00, edited 1 time in total.
Re: how do I remove duplicate words on a text list?
Hey Scsijon
To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...
and should look like this:
abc
def
blablabla
zxzx
something
...
right?
If you don't mind that the lines will also be sorted:
Code: Select all
sort -u input_file
But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:
Code: Select all
awk '!_[$0]++' input_file
Greetings!
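To show the difference between the two on a tiny sample (the file name here is just an example): `sort -u` reorders the list alphabetically, while the awk one-liner keeps only the first occurrence of each line in its original position.

```shell
# build a small sample list with duplicates
printf '%s\n' def abc def abc xyz > sample.txt

# sort -u: deduplicated, but reordered alphabetically
sort -u sample.txt
# abc
# def
# xyz

# awk '!_[$0]++': prints a line only the first time it is seen,
# preserving the original order
awk '!_[$0]++' sample.txt
# def
# abc
# xyz
```

The awk trick works because `_[$0]++` counts how often each whole line has appeared; the `!` makes the expression true (and the line print) only when that count was still zero.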
[color=red][size=75][O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource[/size][/color]
[b][color=green]Omnia mea mecum porto.[/color][/b]
Dear guys,
This is pretty easy too:
Code: Select all
cat some.txt | sort | uniq
With kind regards,
vovchik
I recently had the same problem and wrote something for it, but I can't find it right now, so I've re-created it:
Code: Select all
#!/bin/bash
# uniq_no-sort
# print out unique lines, without sorting them
FILE=$1
OUT="$FILE.uniq"
: > "$OUT"
while IFS= read -r LINE ; do
	# grep -qxF: quiet, whole-line, fixed-string match
	if ! grep -qxF -- "$LINE" "$OUT" ; then
		echo "$LINE" >> "$OUT"
	fi
done < "$FILE"
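For a quick sanity check, the same read-and-append idea can be run directly in the shell (file names here are just examples). Note that `grep -qxF` is important: `-x` matches whole lines only and `-F` treats the pattern as a fixed string, so lines like `ased` and `ased-ss` don't match each other.

```shell
# build a small test list with duplicates
printf '%s\n' abc def abc abc > list.txt

# keep only the first occurrence of each line, preserving order
: > list.txt.uniq
while IFS= read -r LINE ; do
	grep -qxF -- "$LINE" list.txt.uniq || echo "$LINE" >> list.txt.uniq
done < list.txt

cat list.txt.uniq
# abc
# def
```

Be aware this is O(n^2) in the worst case (one grep over the output file per input line), so on a 24000-line list the awk one-liner will be much faster.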
vovchik wrote:
Code: Select all
cat some.txt | sort | uniq
Cat isn't really needed:
Code: Select all
sort file.txt | uniq
Sorry folks, I wish it were that simple.
I have already sorted the list; that was when I realized how many duplicates were in it.
consider this:
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
and on we go.
I want to remove all the duplicates found.
It's what happens when you need to rebuild a crashed component list from backups.
Code: Select all
# cat list.txt
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
# sort list.txt | uniq
aaa
aba
ad
ada
aea
agd
ased
ased-ss
ased-ss<p
#