how do I remove duplicate words on a text list?[solved]
I have a list of 24000+ 'words' in a text file, in a vertical list format.
However, about 40+% are duplicates and I'd like to delete them.
The 'words' can contain basically anything that can be typed on a keyboard.
I have used 'return-newline' as the separator.
Does anyone have a simple script I can run to 'fix the problem'?
thanks
Last edited by scsijon on Sat 02 Mar 2013, 00:00, edited 1 time in total.
Re: how do I remove duplicate words on a text list?
Hey Scsijon
To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...
and should look like this:
abc
def
blablabla
zxzx
something
...
right?
If you don't mind that the lines will also be sorted:
Code: Select all
sort -u input_file
But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:
Code: Select all
awk '!_[$0]++' input_file
Greetings!
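To show the difference between the two on a tiny sample (the file name here is just an example): `sort -u` reorders the list alphabetically, while the awk one-liner keeps only the first occurrence of each line in its original position.

```shell
# build a small sample list with duplicates
printf '%s\n' def abc def abc xyz > sample.txt

# sort -u: deduplicated, but reordered alphabetically
sort -u sample.txt
# abc
# def
# xyz

# awk '!_[$0]++': prints a line only the first time it is seen,
# preserving the original order
awk '!_[$0]++' sample.txt
# def
# abc
# xyz
```

The awk trick works because `_[$0]++` counts how often each whole line has appeared; the `!` makes the expression true (and the line print) only when that count was still zero.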
[color=red][size=75][O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource[/size][/color]
[b][color=green]Omnia mea mecum porto.[/color][/b]
Dear guys,
This is pretty easy too:
Code: Select all
cat some.txt | sort | uniq
With kind regards,
vovchik
I recently had the same problem and wrote something for it, but I can't find it right now, so I've re-created it:
Code: Select all
#!/bin/bash
# uniq_no-sort
# print out unique lines, without sorting them
FILE=$1
OUT="$FILE.uniq"
: > "$OUT"
while IFS= read -r LINE ; do
	# grep -qxF: quiet, whole-line, fixed-string match
	if ! grep -qxF -- "$LINE" "$OUT" ; then
		echo "$LINE" >> "$OUT"
	fi
done < "$FILE"
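For a quick sanity check, the same read-and-append idea can be run directly in the shell (file names here are just examples). Note that `grep -qxF` is important: `-x` matches whole lines only and `-F` treats the pattern as a fixed string, so lines like `ased` and `ased-ss` don't match each other.

```shell
# build a small test list with duplicates
printf '%s\n' abc def abc abc > list.txt

# keep only the first occurrence of each line, preserving order
: > list.txt.uniq
while IFS= read -r LINE ; do
	grep -qxF -- "$LINE" list.txt.uniq || echo "$LINE" >> list.txt.uniq
done < list.txt

cat list.txt.uniq
# abc
# def
```

Be aware this is O(n^2) in the worst case (one grep over the output file per input line), so on a 24000-line list the awk one-liner will be much faster.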
vovchik wrote:
Code: Select all
cat some.txt | sort | uniq
Cat isn't really needed:
Code: Select all
sort file.txt | uniq
Sorry folks, I wish it were that simple.
I have already sorted the list; that was when I realized how many duplicates were in it.
consider this:
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
and on we go.
I want to remove all the duplicates found.
It's what happens when you need to rebuild a crashed component list from backups.
Code: Select all
# cat list.txt
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
# sort list.txt | uniq
aaa
aba
ad
ada
aea
agd
ased
ased-ss
ased-ss<p
#