Bash: sort

zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway

Bash: sort

#1 Post by zigbert »

Let's say file1 looks like this:

Code: Select all

03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
...and file2 contains the correct order:

Code: Select all

002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
Yes, the sort order is correct even though 002 is above 001...


How can I sort file1 based on file2? ... at the speed of light ;)


Thank you
Sigmund

SFR
Posts: 1800
Joined: Wed 26 Oct 2011, 21:52

#2 Post by SFR »

Interesting problem...
Ok, here's the first attempt.

This one will work only if the numbers from file2 (001, 002 ...) correspond exactly to the line numbers in file1, as shown in your examples.

Code: Select all

#!/bin/bash

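# Pass the number in with -v instead of splicing it into the awk program,
# where gawk would read a constant with a leading zero (e.g. 010) as octal.
for i in $(awk '{print $1}' file2); do
  awk -v n="$i" 'NR == n+0' file1
done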
I don't know about "the speed of light"; it needs to be tested on something larger, I guess. :wink:
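
A quick way to check that precondition before relying on the loop. This is only a minimal sketch, assuming the layout from post #1 (the third "|"-separated field starts with the zero-padded number); the sample file and the check_numbering helper name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the number at the start of field 3
# equals the line number on every line of the given file.
check_numbering() {
  awk -F'|' '
    {
      split($3, part, " ")        # part[1] is e.g. "002"
      if (part[1] + 0 != NR) {    # "+ 0" forces a numeric comparison
        bad = 1
        exit                      # jump to END on the first mismatch
      }
    }
    END { exit bad }
  ' "$1"
}

# Sample data in the format from the first post
sample=$(mktemp)
printf '%s\n' \
  '03:50|Artist - Title|001 /path/Artist_Title.mp3' \
  '04:16|Bartist - Title|002 /path/Bartist_Title.mp3' \
  '03:32|Cartist - Title|003 /path/Cartist_Title.mp3' > "$sample"

if check_numbering "$sample"; then
  numbering_ok=yes
else
  numbering_ok=no
fi
echo "numbering matches line numbers: $numbering_ok"
rm -f "$sample"
```

If the check fails, the line-number loop would pick the wrong lines, and a lookup keyed on the field contents is the safer route.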

Greetings!
[color=red][size=75][O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource[/size][/color]
[b][color=green]Omnia mea mecum porto.[/color][/b]

L18L
Posts: 3479
Joined: Sat 19 Jun 2010, 18:56
Location: www.eussenheim.de/

sort

#3 Post by L18L »

# time sort -t '|' -k 2 file1
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real 0m0.030s
user 0m0.007s
sys 0m0.007s
#

# time for i in `awk '{print $1}' file2`; do
> awk 'NR=='$i'' file1
> done
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real 0m0.158s
user 0m0.053s
sys 0m0.013s
#

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#4 Post by technosaurus »

btw if you are using awk you can right justify text like this:

Code: Select all

echo -e "1 hello\n2 world\n256 last" |awk '{printf "%5s %s\n",$1, $2}'
In awk you can use associative arrays: fill the array while processing the first file, then use it while processing the second (the order matters, awk parses the first file first...). You can also do stuff before and/or after all files.

(By associative arrays I mean that you can name the keys arbitrarily, like a[filename]=number, b[filename]=time ...)
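
As a small illustration of that two-file pattern (a sketch with made-up file names and data, not the solution to the problem above): the first file fills an array keyed by name, and the second file looks names up in it.

```shell
#!/bin/bash
# Made-up sample files, just for this illustration
work=$(mktemp -d)
printf '%s\n' 'apple 3' 'pear 7' 'plum 5' > "$work/counts"
printf '%s\n' 'plum' 'apple'              > "$work/wanted"

# FNR == NR is only true while awk is still reading the first file,
# because FNR restarts at 1 for each file and NR keeps counting.
lookup=$(awk '
  FNR == NR { count[$1] = $2; next }   # first file: remember name -> number
  { print $1, count[$1] }              # second file: look each name up
' "$work/counts" "$work/wanted")

echo "$lookup"
rm -rf "$work"
```

The output follows the order of the second file, which is the property the original question is after.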

Here is the template I use for when I forget all the random features:

Code: Select all

#!/bin/awk -f
#FILENAME (name of current file) $0 (contents of current line)
#NF number of fields, $NF last field
#NR line number in all files		#FNR line number in current file
#ORS (default is "\n")				#RS  (default is "\n")
#OFS (default is " ")				#FS (default is " ": runs of blanks/tabs)
#system(command) run a command		#close(filename) close(command)
#ARGC, ARGV similar to C, but skips some stuff
#IGNORECASE (default is 0) set to non-0 or use toupper() or tolower()
#ENVIRON array of env vars ex. ENVIRON["SHELL"] (equivalent of $SHELL)
#getline var < file ... close file or command | getline var
#index(haystack, needle) find needle in haystack
#length(string)
#match(string, regexp) returns where the regex starts, or 0
#RLENGTH length of /match/ substring or -1
#RSTART position where the /match/ substring starts, or 0
#split(string, array, fieldsep) split string into an array separated by fieldsep
#printf(format, expression1,...) print format-ted replacing %* with expressions
#%{c,d/i,e,f,g,o,s,x,X,%} char, decimal int, exp notation, float, shortest of 
#	exp/float, octal, string, hex int, capitalized hex int, a '%' character
#sprintf(format, expression1,...) store printf in a variable
#sub(regexp, replacement, target) replace first regex with replacement in target
#gsub(regexp, replacement, target) like sub but for all regex in target
#substr(string, start, length)get substring of string from start to start+length
#print > /dev/stdin, /dev/stdout, /dev/stderr, /dev/fd/# or filename
#output can be piped like print $0 | command
#comparisons <,>,<=,>=,==,!=,~,!~,in use && for AND, || for OR, ! for NOT
#	(~ is for regexp and "in" looks for subscript in array)
#/word/{...} like if match(...) {...} equivalent of grep
#(condition) ? if-true-exp : if-false-exp or use if (condition){}
#math +,-,*,/,%,^ (or **),log(x),exp(x),sqrt(x),int(x),cos(x),sin(x),atan2(y,x),
#rand(),srand(x),systime(),strftime() (the last two are not POSIX; gawk has them)
#
#function name (parameter-list) {
#     body-of-function
#}

BEGIN {
#actions that happen before any files are read in
}
#
{
#actions to do on files
}
#
END {
#actions to do after all files are done
}

Last edited by technosaurus on Sat 20 Oct 2012, 18:21, edited 1 time in total.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway

#5 Post by zigbert »

I am thankful for all tips and input.
There are many ways to solve this, but I am still searching for brilliance :wink:


Sigmund

akash_rawal
Posts: 229
Joined: Wed 25 Aug 2010, 15:38
Location: ISM Dhanbad, Jharkhand, India

#6 Post by akash_rawal »

Here's my attempt:

Code: Select all

#!/bin/bash

#Utility
function endl()
{
	cat
	echo
}

#Our own private directory
tmp="/tmp/sort2"
mkdir -p "$tmp"

#Index file2
i=0
ifsbak="$IFS"
IFS=""
while read line; do
	echo "$i|$line"
	i=$(( $i+1 ))
done < "./file2" > "$tmp/file2_indexed"

#Sort both files alphabetically
sort -t '|' -k 3 -o "$tmp/file1_sorted" "./file1"
sort -t '|' -k 2 -o "$tmp/file2_sorted" "$tmp/file2_indexed"

#Load 'sorted' indices into array
IFS='|'
cut -d '|' -f 1 "$tmp/file2_sorted" | tr '
' '|' | endl | while read -a indices; do
	IFS='
'
	#Don't know why read -a doesn't work outside the loop
	
	#Attach indices to file1_sorted
	IFS=""
	i=0
	while read line; do
		echo "${indices[$i]}|$line"
		i=$(( $i+1 ))
	done < "$tmp/file1_sorted" > "$tmp/file1_indexed"
	#Sort it by attached index
	sort -t '|' -k 1 -n -o "$tmp/file1_sorted_final" "$tmp/file1_indexed"
	#Final output
	cut -d '|' -f 2- "$tmp/file1_sorted_final"
	break
done
For 100000 lines it takes 15 s.

I believe translating the script, or parts of it, into awk could speed it up, but I'm too lazy to learn awk :oops:
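
On the comment in the script about read -a not working outside the loop: in bash each part of a pipeline normally runs in a subshell, so an array filled at the end of a pipe is gone once the pipeline finishes. A minimal sketch of one way around it, assuming bash 4 or newer for mapfile; the sample data stands in for file2_sorted.

```shell
#!/bin/bash
# Stand-in for "$tmp/file2_sorted": index|rest, already sorted by the second field
sorted=$(mktemp)
printf '%s\n' '1|001 /path/Artist_Title.mp3' \
              '0|002 /path/Bartist_Title.mp3' \
              '2|003 /path/Cartist_Title.mp3' > "$sorted"

# Piping into read runs it in a subshell, so the array does not survive
# (default bash behaviour, i.e. without "shopt -s lastpipe").
unset piped
cut -d '|' -f 1 "$sorted" | paste -s -d ' ' - | read -r -a piped
echo "after pipeline: ${#piped[@]} elements"

# Process substitution keeps mapfile in the current shell instead.
mapfile -t indices < <(cut -d '|' -f 1 "$sorted")
echo "after mapfile:  ${#indices[@]} elements: ${indices[*]}"
rm -f "$sorted"
```

With the array available in the main shell, the outer "while read -a indices ... break" wrapper would no longer be needed.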

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

Re: Bash: sort

#7 Post by rcrsn51 »

zigbert wrote: How can I sort file1 based on file2? ... at the speed of light
That means coding it in C. See attached.

Code: Select all

# time ./zigsort file1 file2
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real	0m0.001s
user	0m0.000s
sys	0m0.000s
Attachments
zigsort-1.0.tar.gz
(3.3 KiB) Downloaded 290 times

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#8 Post by technosaurus »

I _was_ too lazy to learn awk; now I'm too lazy to write 100 lines of shell to do what 3 lines of awk can do.
The first arg is the unsorted file, the second arg is the sorted file.

Code: Select all

#!/bin/awk -f
BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}
in a shell script it would be:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
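
For readers who find the one-liner dense, here is the same logic spread out with comments and run on the three-line example from the first post. It is only an annotated sketch: the temporary file names are made up, and it relies on the same assumption as the original, namely that lines of the unsorted file have a non-empty third "|"-separated field while lines of the sorted file contain no "|" at all.

```shell
#!/bin/bash
work=$(mktemp -d)
printf '%s\n' \
  '03:50|Artist - Title|001 /path/Artist_Title.mp3' \
  '04:16|Bartist - Title|002 /path/Bartist_Title.mp3' \
  '03:32|Cartist - Title|003 /path/Cartist_Title.mp3' > "$work/unsorted_file"
printf '%s\n' \
  '002 /path/Bartist_Title.mp3' \
  '001 /path/Artist_Title.mp3' \
  '003 /path/Cartist_Title.mp3' > "$work/sorted_file"

reordered=$(awk '
  BEGIN { FS = "|" }        # split every line on "|"
  {
    if ($3) {               # a third field exists: line from the unsorted file
      d[$3] = $0            # remember the whole line, keyed by its last field
    } else {                # no "|" in the line: line from the sorted file
      print d[$1]           # $1 is the whole line, which is exactly the key
    }
  }
' "$work/unsorted_file" "$work/sorted_file")

echo "$reordered"
rm -rf "$work"
```

Each line is stored once and looked up once, so both files are read in a single pass.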

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#9 Post by amigo »

The suggested 'sort' command seemed to me the best:
"sort -t '|' -k 2 file1"
if that produces the desired result. The OP doesn't state how the order is pre-determined. If the order is completely arbitrary, then one of the other suggestions would be best.
Is the order arbitrary, or is it based on the data in column 2 of file1? If not, how do you *produce* file2?

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#10 Post by technosaurus »

To me it sounded as if the order is simply the order of the lines in the sorted file and has nothing to do with their contents (the sorted file is just the last column of the unsorted one, in a user-defined order?). AFAICT every solution that is faster than my awk one-liner (with the exception of the compiled C one, which is 40 times larger) sorts by the numeric values instead of by the order in which they appear in the file.
I was just solving the problem, not the underlying cause (padded zeroes, the sort category at the end of the line instead of the beginning, arbitrary fields, order and names...).

the time was:
  • real 0m0.009s
    user 0m0.004s
    sys 0m0.004s
and for small files the time shouldn't increase much with file length, since this is about the same time it takes awk just to run BEGIN{print "."}, i.e. it is mostly startup cost.

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

#11 Post by rcrsn51 »

technosaurus wrote:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
Clever. And surprisingly fast. With a test set of 999 records, it was 1/3 the speed of zigsort.

I wonder if there is any memory penalty for building an associative array that big?

zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway

#12 Post by zigbert »

technosaurus wrote:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
Now we're talking :D


Thanks a lot
Sigmund

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

#13 Post by rcrsn51 »

As another test, I generated a data set of 9999 records.

Code: Select all

# time ./zigsort file1 file2 > file3

real	0m0.031s
user	0m0.008s
sys	0m0.020s

# time ./technosort file1 file2 > file3

real	0m0.039s
user	0m0.036s
sys	0m0.000s
# 
Technosort has caught up. It's holding all its data in memory so it only needs one pass through the files. Zigsort's need to re-read file1 is slowing it down.

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

#14 Post by rcrsn51 »

But if I modify Zigsort to hold all its data internally, it yields

Code: Select all

# time ./zigsort file1 file2 > file3

real	0m0.011s
user	0m0.004s
sys	0m0.004s

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#15 Post by jamesbond »

My entry. It doesn't assume file1 is already sorted; it matches "012" from file1 exactly with "012" from file2.

Code: Select all

#!/bin/ash

ENTRIES=10000
FILE1=/tmp/file1
FILE2=/tmp/file2
OUTFILE=/tmp/outfile

generate_file1() {
	for a in $(seq 1 $ENTRIES); do
		printf "03:50|Artist - Title|%.3d /path/Artist_Title.mp3\n" $a
	done > $FILE1
}

generate_file2() {
	for a in $(seq 1 $ENTRIES); do
		printf "%.3d /path/Artist_Title.mp3\n" $a
	done | sort -R > $FILE2
}

# generate fake data for testing
generate_file1
generate_file2

time -p -- awk -F"|" '
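NR > FNR {
	# sort: the key is everything before the first space of the file2 line
	# (in gawk and per POSIX, a new FS would only apply from the next record on)
	key=$0
	sub(/ .*/,"",key)
	print file1[key]
	next
}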
{
	# scan
	line=$0
	sub(/ .*/,"",$3)
	file1[$3]=line
}
' $FILE1 $FILE2 > $OUTFILE
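
A quick sanity check for any of the solutions in this thread: the third field of the output, taken line by line, should be identical to file2. A minimal sketch using the three-line example from the first post; the temporary paths and the reorder function are placeholders for whichever solution is being tested.

```shell
#!/bin/bash
work=$(mktemp -d)
printf '%s\n' \
  '03:50|Artist - Title|001 /path/Artist_Title.mp3' \
  '04:16|Bartist - Title|002 /path/Bartist_Title.mp3' \
  '03:32|Cartist - Title|003 /path/Cartist_Title.mp3' > "$work/file1"
printf '%s\n' \
  '002 /path/Bartist_Title.mp3' \
  '001 /path/Artist_Title.mp3' \
  '003 /path/Cartist_Title.mp3' > "$work/file2"

# Placeholder for the solution under test; here the awk one-liner from post #8
reorder() {
  awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' "$1" "$2"
}
reorder "$work/file1" "$work/file2" > "$work/outfile"

# Field 3 of the output must reproduce file2 exactly, line for line
if cut -d '|' -f 3 "$work/outfile" | cmp -s - "$work/file2"; then
  verdict="order matches file2"
else
  verdict="order differs from file2"
fi
echo "$verdict"
rm -rf "$work"
```

A blank or missing line in the output, for example from a key that was never found, shows up as a difference here.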
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]
