scsijon
Joined: 23 May 2007 Posts: 1313 Location: the australian mallee
Posted: Fri 01 Mar 2013, 05:19 Post subject:
how do I remove duplicate words on a text list? [solved]
I have a list of 24,000+ 'words' in a text file. It's in a vertical list format.
However, about 40+% are duplicates and I'd like to delete them.
The 'words' can be basically anything that can be typed on a keyboard.
I have used 'return-newline' as the separator.
Has anyone got a simple script I can run to 'fix the problem'?
Thanks
Last edited by scsijon on Fri 01 Mar 2013, 20:00; edited 1 time in total
SFR

Joined: 26 Oct 2011 Posts: 1655
Posted: Fri 01 Mar 2013, 06:57 Post subject:
Re: how do I remove duplicate words on a text list?
Hey Scsijon
To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...
and should look like this:
abc
def
blablabla
zxzx
something
...
right?
If you don't mind that the lines will also end up sorted, a sort/uniq pipeline will do it (see the sketch after the one-liner below).
But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:
Code: | awk '!_[$0]++' input_file |
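(The _[$0]++ part counts how many times each whole line has been seen, and the leading ! lets a line through only on its first occurrence, so the original order is kept.)
And the sort-based route would look something like this; just a sketch, with input_file standing in for your own file name:
Code: | sort input_file | uniq |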
Greetings!
_________________ [O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource
Omnia mea mecum porto.
vovchik

Joined: 23 Oct 2006 Posts: 1447 Location: Ukraine
Posted: Fri 01 Mar 2013, 08:49 Post subject:
Dear guys,
This is pretty easy too:
Code: |
cat some.txt | sort | uniq
|
With kind regards,
vovchik
amigo
Joined: 02 Apr 2007 Posts: 2641
Posted: Fri 01 Mar 2013, 11:52 Post subject:
I recently had the same problem and wrote something for it, but I can't find it right now, so I've re-created it:
Code: | #!/bin/bash
# uniq_no-sort
# print out unique lines, but without sorting them
FILE=$1
OUT="$FILE.uniq"
: > "$OUT"
while IFS= read -r LINE ; do
# -x matches whole lines only; -q just sets the exit status
if ! fgrep -qx -- "$LINE" "$OUT" ; then
echo "$LINE" >> "$OUT"
fi
done < "$FILE"
|
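Usage would be something like this (just a sketch; list.txt is only an example name, and the de-duplicated result ends up in list.txt.uniq):
Code: |
chmod +x uniq_no-sort
./uniq_no-sort list.txt
|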
tallboy

Joined: 21 Sep 2010 Posts: 907 Location: Oslo, Norway
Posted: Fri 01 Mar 2013, 12:28 Post subject:
Sorry, mistake, could not find a way to delete post.
_________________ True freedom is a live Puppy on a multisession CD/DVD.
GustavoYz

Joined: 07 Jul 2010 Posts: 894 Location: .ar
Posted: Fri 01 Mar 2013, 14:50 Post subject:
vovchik wrote: |
Code: |
cat some.txt | sort | uniq
|
With kind regards,
vovchik |
Cat isn't really needed:
Code: | sort file.txt | uniq |
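For what it's worth, sort can also drop the duplicates by itself with its -u option, so the extra uniq isn't needed either:
Code: | sort -u file.txt |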
scsijon
Joined: 23 May 2007 Posts: 1313 Location: the australian mallee
Posted: Fri 01 Mar 2013, 18:17 Post subject:
Sorry folks, I wish it was that simple.
I have already sorted the list; that was when I realized how many duplicates were in it.
Consider this:
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
and on we go.
I want to remove all the duplicates found.
It's what happens when you need to rebuild a crashed component list from backups.
Keef

Joined: 20 Dec 2007 Posts: 893 Location: Staffordshire
Posted: Fri 01 Mar 2013, 19:15 Post subject:
Code: |
# cat list.txt
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
# sort list.txt | uniq
aaa
aba
ad
ada
aea
agd
ased
ased-ss
ased-ss<p
#
|
Seems to work for me....
scsijon
Joined: 23 May 2007 Posts: 1313 Location: the australian mallee
Posted: Fri 01 Mar 2013, 19:59 Post subject:
Sorry, vovchik, GustavoYz and Keef.
Yes it does; I must have made a typo the first time I tried it.
Thanks all