How do I remove duplicate words in a text list? [solved]

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
scsijon
Posts: 1596
Joined: Thu 24 May 2007, 03:59
Location: the australian mallee
Contact:

How do I remove duplicate words in a text list? [solved]

#1 Post by scsijon »

I have a list of 24,000+ 'words' in a text file. It's in a vertical list format, one word per line.

However, 40+% of them are duplicates and I'd like to delete them.

The 'words' can be basically anything that can be typed on a keyboard.

I have used 'return-newline' as the separator.

Does anyone have a simple script I can run to 'fix the problem'?

Thanks
Last edited by scsijon on Sat 02 Mar 2013, 00:00, edited 1 time in total.

SFR
Posts: 1800
Joined: Wed 26 Oct 2011, 21:52

Re: How do I remove duplicate words in a text list?

#2 Post by SFR »

Hey Scsijon

To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...


and should look like this:
abc
def
blablabla
zxzx
something
...


right?

If you don't mind that the lines will also be sorted:

Code:

sort -u input_file
But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:

Code:

awk '!_[$0]++' input_file
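In case anyone wonders how that works: _ is just an awk array indexed by the whole line ($0), and a line is printed only while its count is still zero, so the first occurrence survives and the original order is kept. A quick sketch (the sample file path is just an example):

```shell
# Build a sample list with repeats, then dedupe it while keeping the original order
printf '%s\n' abc def abc abc blablabla zxzx something blablabla > /tmp/sample_list

# _[$0] counts how many times this exact line has been seen so far;
# !_[$0]++ is true (and the line prints) only on its first sighting.
awk '!_[$0]++' /tmp/sample_list
# prints: abc def blablabla zxzx something (one per line)
```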
Greetings!

vovchik
Posts: 1507
Joined: Tue 24 Oct 2006, 00:02
Location: Ukraine

#3 Post by vovchik »

Dear guys,

This is pretty easy too:

Code:

cat some.txt | sort | uniq
With kind regards,
vovchik

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#4 Post by amigo »

I recently had the same problem and wrote something for it, but I can't find it right now, so I've re-created it:

Code:

#!/bin/bash
# uniq_no-sort
# print unique lines, but without sorting them

FILE=$1

OUT="$FILE.uniq"
: > "$OUT"

while read -r LINE ; do
	# -q: quiet (exit status only), -x: match the whole line exactly
	if ! fgrep -qx "$LINE" "$OUT" ; then
		echo "$LINE" >> "$OUT"
	fi
done < "$FILE"

tallboy
Posts: 1760
Joined: Tue 21 Sep 2010, 21:56
Location: Drøbak, Norway

#5 Post by tallboy »

Sorry, mistake, could not find a way to delete post.
True freedom is a live Puppy on a multisession CD/DVD.

GustavoYz
Posts: 883
Joined: Wed 07 Jul 2010, 05:11
Location: .ar

#6 Post by GustavoYz »

vovchik wrote:

Code:

cat some.txt | sort | uniq
With kind regards,
vovchik
cat isn't really needed:

Code:

sort file.txt | uniq
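And if you want to see what's actually duplicated before deleting anything, uniq can report that too. A quick sketch (the sample list stands in for the real file):

```shell
# A small sample list with some repeats (stand-in for the real file)
printf '%s\n' aea aea agd agd ased aaa > /tmp/list.txt

# -d prints each duplicated line once
sort /tmp/list.txt | uniq -d
# prints: aea agd (one per line)

# -c prefixes every line with its count; sort -rn puts the most frequent first
sort /tmp/list.txt | uniq -c | sort -rn
```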

scsijon

#7 Post by scsijon »

Sorry folks, I wish it was that simple.

I have already sorted the list; that was when I realized how many duplicates were in it.

Consider this:
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p

and on we go.

I want to remove all the duplicates found.

It's what happens when you need to rebuild a crashed component list from backups.
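For the rebuilding-from-backups case, it may help that sort -u accepts several input files and merges and dedupes them in one pass. A sketch (the backup file names and contents are made up):

```shell
# Two partial backup copies of the same list (hypothetical names/contents)
printf '%s\n' aaa aba ada > /tmp/backup1.txt
printf '%s\n' aba ada aea > /tmp/backup2.txt

# Merge them and drop duplicates in one pass
sort -u /tmp/backup1.txt /tmp/backup2.txt > /tmp/rebuilt.txt
cat /tmp/rebuilt.txt
# prints: aaa aba ada aea (each exactly once)
```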

Keef
Posts: 987
Joined: Thu 20 Dec 2007, 22:12
Location: Staffordshire

#8 Post by Keef »

Code:

# cat list.txt 
aaa 
aba 
ada 
ad 
aea 
aea 
aea 
agd 
agd 
ased 
ased 
ased-ss 
ased-ss<p
# sort list.txt | uniq
aaa 
aba 
ad 
ada 
aea 
agd 
ased 
ased-ss 
ased-ss<p
# 
Seems to work for me....

scsijon

#9 Post by scsijon »

Sorry, vovchik, GustavoYz and Keef.

Yes, it does. I must have made a typo the first time I tried it.

Thanks all
