how to grep multiple lines into 1? [SOLVED]

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#1 Post by sc0ttman »

I want to pull a few fields out of an XML file and print them on a single line, one line per item in the XML list. For example, I want only name, url and genre:

file:

Code: Select all

<item01>
<name>01n</name>
<url>01u</url>
<notinterested>01ni</notinterested>
<genre>01g</genre>
<unneeded>01un</unneeded>
</item01>

<item02>
<name>02n</name>
<url>02u</url>
<notinterested>02ni</notinterested>
<genre>02g</genre>
<unneeded>02un</unneeded>
</item02>
...

All I want in my output file is `name|url|genre` for each item, one item per line:

01n|01u|01g
02n|02u|02g
...

I can't seem to write a fast way of doing this... But then I am crap at this sort of thing...

EDIT:: SOLVED!
Last edited by sc0ttman on Fri 22 Feb 2013, 13:32, edited 1 time in total.
[b][url=https://bit.ly/2KjtxoD]Pkg[/url], [url=https://bit.ly/2U6dzxV]mdsh[/url], [url=https://bit.ly/2G49OE8]Woofy[/url], [url=http://goo.gl/bzBU1]Akita[/url], [url=http://goo.gl/SO5ug]VLC-GTK[/url], [url=https://tiny.cc/c2hnfz]Search[/url][/b]

Ibidem
Posts: 549
Joined: Wed 26 May 2010, 03:31
Location: State of Jefferson

#2 Post by Ibidem »

grep won't cut it: it doesn't transform text.
Off the top of my head:

Code: Select all

#!/bin/sh
while IFS= read -r line
  do
     case $line in
#This is for every tag before the last interesting one:
#outputs "content|"
#(the < and > must be quoted, or the shell reads them as redirections)
       (*"<name>"* | *"<url>"*)
        printf '%s|' "$(printf '%s\n' "$line" | sed -e 's/.*>\(..*\)<.*/\1/')"
        ;;
#This is for the last tag of interest:
#outputs "content\n"
       (*"<genre>"*)
        printf '%s\n' "$line" | sed -e 's/.*>\(..*\)<.*/\1/'
        ;;
        ;;
       (*) ;;
   esac
  done
This assumes that for every tag you're interested in, both opening and closing are on the same line.
You need to redirect the input to the XML file:

Code: Select all

./example.sh < sample.xml
#or
cat sample.xml | ./example.sh
No idea how fast it is, though.
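On the speed question: the loop above starts one sed process for every matching line, which tends to dominate the run time on big files. A minimal single-process sketch in awk, under the same assumption that each wanted tag opens and closes on one line (the function name `extract_fields` is made up for illustration):

```shell
#!/bin/sh
# extract_fields: hypothetical helper, reads XML on stdin, prints name|url|genre.
# Assumes one tag per line and that name and url come before genre in each item.
extract_fields() {
    awk '
    # Strip everything up to the first ">" and from the next "<" onward.
    function content(s) { sub(/^[^>]*>/, "", s); sub(/<.*$/, "", s); return s }
    /<name>/  { name = content($0) }
    /<url>/   { url = content($0) }
    /<genre>/ { print name "|" url "|" content($0); name = url = "" }
    '
}

# Demo on the first item from the question.
extract_fields <<'EOF'
<item01>
<name>01n</name>
<url>01u</url>
<notinterested>01ni</notinterested>
<genre>01g</genre>
<unneeded>01un</unneeded>
</item01>
EOF
```

With a real file it would be called as `extract_fields < sample.xml`.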
Last edited by Ibidem on Fri 22 Feb 2013, 01:24, edited 1 time in total.

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#3 Post by amigo »

If the xml files are small and you are *sure* that the xml is one-line-one-tag clean, then some variation of Ibidem's solution will work. Otherwise, you'll need to use a good xml-parser to retrieve tag values.

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#4 Post by jamesbond »

Code: Select all

#!/bin/dash
{ echo "<root>"; cat -; echo "</root>"; } | xml2 | awk -F= '
$1~/name$/ || $1~/url$/ { printf "%s|", $2 }
$1~/genre$/ { print $2 }'
Get xml2 from http://www.ofb.net/~egnor/xml2/

Timing on my lousy machine for 1 million records:

Code: Select all

 time ./x.sh < y.xml  > /dev/null

real	0m21.785s
user	0m34.710s
sys	0m2.857s
where y.xml is this

Code: Select all

<item01>
<name>01n</name>
<url>http://host1.com/url1</url>
<notinterested>01ni</notinterested>
<genre>01g</genre>
<unneeded>01un</unneeded>
</item01>

<item02>
<name>02n</name>
<url>http://host2.com/url2</url>
<notinterested>02ni</notinterested>
<genre>02g</genre>
<unneeded>02un</unneeded>
</item02> 
repeated 500 thousand times (thus a million records).

Output:

Code: Select all

01n|http://host1.com/url1|01g
02n|http://host2.com/url2|02g
Limitation: make sure none of the fields you want to extract contains an equals sign (=), or things will break, because awk splits each line on it and $2 gets cut short.
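One possible way around that limitation is to split each line on the first equals sign only, instead of using `-F=`. A hedged sketch follows; it assumes xml2 prints one `path=value` line per element (for example `/root/item01/url=...`), and `flat_to_pipes` is a made-up name. It also anchors the tag names on the preceding slash, which is slightly stricter than the original patterns.

```shell
#!/bin/sh
# flat_to_pipes: hypothetical helper, reads xml2-style "path=value" lines on stdin.
flat_to_pipes() {
    awk '
    {
        eq = index($0, "=")          # position of the first "=" only
        if (eq == 0) next            # bare path lines carry no value
        path  = substr($0, 1, eq - 1)
        value = substr($0, eq + 1)   # the rest of the line, later "=" included
    }
    path ~ /\/(name|url)$/ { printf "%s|", value }
    path ~ /\/genre$/      { print value }
    '
}

# Demo with hand-written lines in the assumed xml2 format.
printf '%s\n' \
    '/root/item01/name=01n' \
    '/root/item01/url=http://host1.com/url1?a=1&b=2' \
    '/root/item01/genre=01g' | flat_to_pipes
```

In the script above it would replace the awk stage: `{ echo "<root>"; cat -; echo "</root>"; } | xml2 | flat_to_pipes`.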
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#5 Post by seaside »

sc0ttman,

As everyone mentioned, this could be simple if you can count on the xml being in the format you've shown.

Code: Select all

while read -r line
do
case $line in
*name*)  nline=${line%<*}; name=${nline#*>}; name="$name|" ;;
*url*)   nline=${line%<*}; url=${nline#*>}; url="$url|" ;;
*genre*) nline=${line%<*}; genre=${nline#*>}
         echo "$name$url$genre" >>pipe-sep-file ;;
esac
done <xmlfile
Regards,
s

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#6 Post by amigo »

Here's a more complete example:

Code: Select all

#!/bin/bash

while read -r LINE ; do
	# if both NAME and GENRE are set, then the output is ready
	# otherwise, we are just beginning or still composing output
	case $NAME in
		'') : ;;
		*)	if [[ $GENRE ]] ; then
				echo "$NAME|$URL|$GENRE"
				NAME= URL= GENRE=
			fi
		;;
	esac
	
	case $LINE in
		'<name'*) NAME=${LINE#*>} ; NAME=${NAME%%<*}
		;;
		
		'<url'*) URL=${LINE#*>} ; URL=${URL%%<*}
		;;
		'<genre'*) GENRE=${LINE#*>} ; GENRE=${GENRE%%<*}
		;;
	esac
done <test.xml
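One edge case in the loop above: a record is only printed when the next line is read, so if a `<genre>` line happens to be the very last line of the input, that record never gets printed. A sketch of one way around it, printing as soon as the genre line is parsed. It assumes name and url always come before genre, as in the sample, and `parse_items` is just an illustrative name:

```shell
#!/bin/sh
# parse_items: hypothetical variant of the loop above, reads XML on stdin.
parse_items() {
    NAME= URL= GENRE=
    while read -r LINE; do
        case $LINE in
            '<name'*)  NAME=${LINE#*>};  NAME=${NAME%%<*} ;;
            '<url'*)   URL=${LINE#*>};   URL=${URL%%<*} ;;
            '<genre'*) GENRE=${LINE#*>}; GENRE=${GENRE%%<*}
                       # genre is the last wanted field, so the record is complete here
                       printf '%s|%s|%s\n' "$NAME" "$URL" "$GENRE"
                       NAME= URL= GENRE= ;;
        esac
    done
}

# Demo: the input ends right after the genre line.
printf '%s\n' '<item01>' '<name>01n</name>' '<url>01u</url>' '<genre>01g</genre>' | parse_items
```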

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#7 Post by sc0ttman »

Thanks for all the responses. So far I've gone with the snippet that's easiest to understand, being the lazy soul I am... I used amigo's snippet, which fitted my purposes most closely, and so far I have this (it'll end up in vlc-gtk when it's done):

Code: Select all

#!/bin/bash
get_icecasts () {
	rm -f /tmp/icecastlist
	rm -f /tmp/icecast.xml
	wget -4 -O /tmp/icecast.xml "http://dir.xiph.org/yp.xml"
	if [ -f /tmp/icecast.xml ];then
		while read -r LINE ; do 
		   # if both NAME and GENRE are set, then the output is ready 
		   # otherwise, we are just beginning or still composing output 
		   case $NAME in 
			  '') : ;; 
			  *) if [[ $GENRE ]];then 
					echo "IceCast Radio ($GENRE): $NAME|$URL" >> /tmp/icecastlist
					NAME= URL= GENRE= 
				 fi 
			  ;; 
		   esac 
		   case $LINE in 
			  '<server_name'*) NAME="${LINE#*>}" ; NAME="${NAME%%<*}" ;; 
			  '<listen_url'*) URL="${LINE#*>}" ; URL="${URL%%<*}" ;; 
			  '<genre'*) GENRE="${LINE#*>}" ; GENRE="${GENRE%%<*}" ;; 
		   esac 
		done </tmp/icecast.xml
		icecastlist="`cat /tmp/icecastlist | sort | sed 's/&\#039;//g' | uniq`" #clean up a bit
		echo "$icecastlist"
		rm /tmp/icecast.xml
	fi
}
LIST="`get_icecasts`"
echo "$LIST"
(Note: the real sed expression has no backslash; I added one here to show that I'm removing the actual HTML entity.)

Anyway, ideally (just because I think I can get away with it) I'd like to drop the write to /tmp/icecastlist and build the list in a variable instead, while keeping the `sort | sed` stuff. Something like changing:

Code: Select all

echo "IceCast Radio ($GENRE): $NAME|$URL" >> /tmp/icecastlist
...
LIST=`cat /tmp/icecastlist | sort | ...`
to

Code: Select all

LIST="$LIST
IceCast Radio ($GENRE): $NAME|$URL"
...
LIST_CLEANED=`echo "$LIST" | sort | ...`

I tried various things that I expected to work (I can see it shouldn't be hard!) but even that has stumped me: I get a frozen script and no output.

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#8 Post by amigo »

Everything should be as simple as possible, but no simpler than it is.

Code: Select all

get_icecasts () {
while read -r LINE ; do
	# if both NAME and GENRE are set, then the output is ready
	# otherwise, we are just beginning or still composing output
	case $NAME in
		'') : ;;
		*)	if [[ $GENRE ]] ; then
				#echo "$NAME|$URL|$GENRE"
				echo "IceCast Radio ($GENRE): $NAME|$URL"
				NAME= URL= GENRE=
			fi
		;;
	esac
	
	case $LINE in
		'<name'*) NAME=${LINE#*>} ; NAME=${NAME%%<*}
		;;
		
		'<url'*) URL=${LINE#*>} ; URL=${URL%%<*}
		;;
		'<genre'*) GENRE=${LINE#*>} ; GENRE=${GENRE%%<*}
		;;
	esac
done <test.xml
}

#get_icecasts
get_icecasts | sort | sed 's/&\#039;//g' | uniq

Although, do you really need to sort and uniq it? Are the entries in the input file out of order, and do they contain duplicates?
Why do you need echo? As above, the function already echoes each line. If you need the output in a file or another program, simply redirect or pipe it.
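If the sorted list is wanted in a variable rather than in a temp file, command substitution around the whole pipeline should be enough. A minimal sketch, using a stand-in function (`print_stations` is hypothetical and prints three fixed lines in place of the parsing loop, and the URLs are placeholders):

```shell
#!/bin/sh
# print_stations: hypothetical stand-in for get_icecasts, prints unsorted lines.
print_stations() {
    printf '%s\n' \
        'IceCast Radio (rock): B station|http://example.invalid/b' \
        'IceCast Radio (jazz): A station|http://example.invalid/a' \
        'IceCast Radio (rock): A station|http://example.invalid/c'
}

# Capture the sorted output of the whole pipeline in one variable.
LIST=$(print_stations | sort)

# Quote the variable when printing, or the newlines collapse into spaces.
printf '%s\n' "$LIST"
```

The same shape applies to the real function: `LIST=$(get_icecasts | sort | sed ...)`.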

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#9 Post by sc0ttman »

Thanks amigo, yep that'll do it... I do need to sort it, as I want genres grouped together.. but I didn't need uniq, force of habit..

The last echo was just so I can test the output... there was another stray one in there as well..

I'll mark the thread SOLVED. Thanks everyone.

Just for completeness:

Code: Select all

get_icecasts () { 
	wget -4 -O /tmp/icecast.xml "http://dir.xiph.org/yp.xml"
	while read -r LINE ; do 
	   # if both NAME and GENRE are set, then the output is ready 
	   # otherwise, we are just beginning or still composing output 
	   case $NAME in 
		  '') : ;; 
		  *)   if [[ $GENRE ]] ; then 
				#echo "$NAME|$URL|$GENRE" 
				echo "IceCast Radio ($GENRE): $NAME|$URL" 
				NAME= URL= GENRE= 
			 fi 
		  ;; 
	   esac 
		
	   case $LINE in 
		  '<server_name'*) NAME="${LINE#*>}" ; NAME="${NAME%%<*}"
		  ;; 
		   
		  '<listen_url'*) URL="${LINE#*>}" ; URL="${URL%%<*}"
		  ;; 
		  '<genre'*) GENRE="${LINE#*>}" ; GENRE="${GENRE%%<*}"
		  ;; 
	   esac 
	done </tmp/icecast.xml
} 

get_icecasts | sort | sed 's/&\#039;//g'
The above will get all Icecast radio stations, and build a sorted list, in this format:

IceCast Radio ($GENRE): $NAME|$URL
