Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
All times are UTC - 4
Transforming html special characters to text
Karl Godt


Joined: 20 Jun 2010
Posts: 2686
Location: Kiel,Germany

Posted: Fri 16 Mar 2012, 12:42    Post subject: Transforming html special characters to text

I am trying to convert html pages back to plain text files , so that grepping them for words returns less garbage .

I am stuck with the special characters .

I am using these files as references :
http://www.mediaevent.de/tutorial/sonderzeichen.html for the special characters

and
http://www.w3schools.com/tags/default.asp for the tags .

The code transforms the special char file into something like this :
Code:

" "| |\&\#160\;
¡|¡|\&\#161\;
¢|¢|\&\#162\;
£|£|\&\#163\;
¤|¤|\&\#164\;
¥|¥|\&\#165\;
¦|¦|\&\#166\;

NOTE : the backslashes in the third column are added for rendering purposes, they do not occur in the created file .

This is achieved by
Code:

echo "Preparing html special char database .."
TMP_SPECIAL_CHAR_FILE="${TMP_DIR}/special_chars.txt"
cp "${SPECIAL_CHAR_FILE}" "${TMP_SPECIAL_CHAR_FILE}"
tmp_specials="${TMP_DIR}/special"
# keep the lines that carry two entities (named and numeric) ...
grep '&.*\;.*&.*\;' "${TMP_SPECIAL_CHAR_FILE}" >"${tmp_specials}".0.txt
# ... plus the lines where a dash takes the place of the named entity
grep -e '[[:blank:]]*–[[:blank:]]*&.*\;' "${TMP_SPECIAL_CHAR_FILE}" >>"${tmp_specials}".0.txt
# remove blanks inside the descriptions so they do not become separators
sed 's#, #,#g' "${tmp_specials}".0.txt >"${tmp_specials}".1.txt
sed 's# (Programmierung)#(Programmierung)#g' "${tmp_specials}".1.txt >"${tmp_specials}".2.txt
sed 's# Symbol#Symbol#g' "${tmp_specials}".2.txt >"${tmp_specials}".2.0.txt
# blanks and tabs become the separator | , the quoted blank "|" is put back to " "
sed 's, ,|,g' "${tmp_specials}".2.0.txt >"${tmp_specials}".3.txt
sed 's,"|"," ",g' "${tmp_specials}".3.txt >"${tmp_specials}".4.txt
sed 's,\t,|,g' "${tmp_specials}".4.txt >"${tmp_specials}".6.txt
# squeeze repeated separators and keep columns 2-4
tr -s '|' <"${tmp_specials}".6.txt >"${tmp_specials}".8.txt
cut -f2-4 -d'|' "${tmp_specials}".8.txt >"${tmp_specials}".9.txt

Completely OK until now !

But reading that file and substituting the html code with the real-visible char does not work further down :
Code:

cat "${tmp_specials}".9.txt |while IFS= read -r line;do
#echo "$line"
# skip lines without any letter or digit
echo "$line" | grep -q '[[:alnum:]]' || continue
#SIGN=`echo "$line" |cut -f1 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
SIGN=`echo "$line" |cut -f1 -d'|'`

# put a backslash in front of every punctuation character for sed
HTML=`echo "$line" |cut -f2 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
NUMERIC=`echo "$line" |cut -f3 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
echo "'$SIGN' '$HTML' '$NUMERIC'"
# use if here : with [ test ] && sed a failing test leaves $? at 1 ,
# and the exit check that followed ended the loop at the first dash row
if [ "$HTML" != "–" ];then
#echo "sed 1"
sed -i "s/$HTML/$SIGN/g" "${file_tmp}" || exit
fi
#echo "sed 2"
sed -i "s/$NUMERIC/$SIGN/g" "${file_tmp}" || exit
done
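
A possible alternative to calling sed -i twice per table row is to turn the whole table into one sed script and run it once per file. The following is only a sketch under assumptions: build_sed_script is a hypothetical helper that is not part of the posted script, and it expects the three-column layout shown above (character, named entity, numeric entity, separated by | ). It also escapes the replacement side, since an unescaped & or / in the character column would otherwise be interpreted by sed.

```shell
# Sketch only: build_sed_script is a hypothetical helper, not part of the posted script.
# It reads a table with lines like  §|&sect;|&#167;  and prints one sed
# substitution per entity, so an HTML file can be rewritten in a single pass.
build_sed_script() {
    while IFS='|' read -r sign html numeric; do
        [ -n "$sign" ] || continue
        # Escape \ / and & , which are special on the right-hand side of s///.
        rep=$(printf '%s\n' "$sign" | sed 's|[\\/&]|\\&|g')
        for ent in "$html" "$numeric"; do
            # Accept only plain entities such as &sect; or &#167; ; this also
            # skips the rows where a dash marks a missing named entity.
            printf '%s\n' "$ent" | grep -q '^&[#A-Za-z0-9]*;$' || continue
            printf 's/%s/%s/g\n' "$ent" "$rep"
        done
    done <"$1"
}
```

With the variables from the script above, it could be used as build_sed_script "${tmp_specials}".9.txt >"${TMP_DIR}/entities.sed" followed by sed -i -f "${TMP_DIR}/entities.sed" "${file_tmp}" ( entities.sed is an assumed file name ) .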


When I open the converted file in geany it looks like this :

Quote:
AAÜG - Einzelnorm


§ 18
" "
Bußgeldvorschriften
(1) Ordnungswidrig handelt, wer vorsÀtzlich oder leichtfertig
1.
entgegen § 8 Abs. 1 Satz 5 Nr. 1 eine Auskunft nicht, nicht richtig, nicht vollstÀndig oder nicht rechtzeitig erteilt oder
2.
entgegen § 8 Abs. 1 Satz 5 Nr. 2 die erforderlichen Unterlagen nicht, nicht vollstÀndig oder nicht rechtzeitig vorlegt.
(2) Die Ordnungswidrigkeit kann mit einer Geldbuße bis zu zweitausendfÃŒnfhundert Euro geahndet werden.
(3) Verwaltungsbehörde im Sinne des § 36 Abs. 1 Nr. 1 des Gesetzes Ìber Ordnungswidrigkeiten ist der VersorgungstrÀger. Abweichend von Satz 1 ist fÌr den VersorgungstrÀger nach § 8 Abs. 4 Nr. 3 Verwaltungsbehörde im Sinne des § 36 Abs. 1 des Gesetzes Ìber Ordnungswidrigkeiten das Bundesversicherungsamt.
(4) Die Geldbußen fließen in die Kasse der Deutschen Rentenversicherung Bund, wenn sie als VersorgungstrÀger den Bußgeldbescheid erlassen hat. § 66 des Zehnten Buches Sozialgesetzbuch gilt entsprechend. Diese Kasse trÀgt abweichend von § 105 Abs. 2 des Gesetzes ÃŒber Ordnungswidrigkeiten die notwendigen Auslagen; sie ist auch ersatzpflichtig im Sinne des § 110 Abs. 4 des Gesetzes ÃŒber Ordnungswidrigkeiten.


This is a federal law site of around 1.2 GB that I downloaded using wget -r
[ http://www.gesetze-im-internet.de/ ] .

Now converting the special chars

§|§|\&\#167\;
Ä|Ä|\&\#196\;
Ö|Ö|\&\#214\;

and a few more that are specific to my country turns out to be difficult .

I suspect the bash builtins while and/or read of corrupting the characters .

Any Ideas ?
Attachment : html2txt.big.tar.bz2 (701.14 KB)
Description : heavy script, goes to my-applications/sbin and my-applications/lib, with several special char and tag files saved from the internet .
amigo

Joined: 02 Apr 2007
Posts: 1759

Posted: Fri 16 Mar 2012, 13:50

Try using 'read -r -n1' to read the stream in binary mode one character at a time.
Karl Godt


Joined: 20 Jun 2010
Posts: 2686
Location: Kiel,Germany

Posted: Fri 16 Mar 2012, 16:09

No luck until now . read -n1 would need some kind of filter to work , and that filter would take some time to create ...

read -r does not do it either .

Tried to replace " sed -i " with " cat | sed " .. also nothing ..

Tried to replace the builtin echo ( type -a echo : "echo is a shell builtin" ) with the external one ( "echo is /bin/echo" ) .. also nothing . :cry:
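
One way to test the suspicion that while or read alters the characters is a round trip: read a file line by line, print it again, and compare the bytes with the original. The following is only a sketch. roundtrip_ok is a hypothetical helper, and it assumes the file ends with a newline and contains no NUL bytes.

```shell
# Sketch only: roundtrip_ok is a hypothetical helper.
# Returns 0 when a while/read loop reproduces the file byte for byte.
roundtrip_ok() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done <"$1" | cmp -s - "$1"
}

sample=$(mktemp)
# one single-byte (ISO-8859-1) umlaut and one two-byte (UTF-8) umlaut
printf 'vors\344tzlich f\303\274r\n' >"$sample"
if roundtrip_ok "$sample"; then
    echo "the read loop left the bytes alone"
else
    echo "the read loop changed the bytes"
fi
rm -f "$sample"
```

If the round trip comes back identical, the loop can be ruled out and the encoding of the input files becomes the more likely suspect.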
Karl Godt


Joined: 20 Jun 2010
Posts: 2686
Location: Kiel,Germany

Posted: Fri 16 Mar 2012, 18:00

After reading around on the net I also suspected something about encoding ...

Downloaded chardet from the debian squeeze repo , used xarchive to extract it and copied the pyshared/ files into /usr/lib/python-x.x/site-packages , and this is what I got :

A few files had converted OK , and these are
bash-3.2# chardet /mnt/www.gesetze-im-internet.de.iso.22845/aabg/art_2.htm
/mnt/www.gesetze-im-internet.de.iso.22845/aabg/art_2.htm: ascii (confidence: 1.00)


The bulk , which does not convert OK , seems to be
bash-3.2# chardet /mnt/www.gesetze-im-internet.de.iso.22845/aa_g_000/art_10.htm
/mnt/www.gesetze-im-internet.de.iso.22845/aa_g_000/art_10.htm: ISO-8859-2 (confidence: 0.79)

Will investigate further to a) transform them to ascii encoding and b) modprobe -v nls_iso8859-2 ..
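
A rough cross-check that needs no Python is to let iconv validate a file against ASCII and UTF-8. This is only a sketch under assumptions: guess_encoding is a hypothetical helper, it relies on iconv knowing the names ASCII and UTF-8, and unlike chardet it cannot tell ISO-8859-1 from ISO-8859-2. It only separates "plain ASCII", "valid UTF-8" and "some other 8-bit encoding".

```shell
# Sketch only: guess_encoding is a hypothetical helper. It can tell plain
# ASCII and valid UTF-8 apart from "some other 8-bit encoding", nothing more.
guess_encoding() {
    if iconv -f ASCII -t ASCII "$1" >/dev/null 2>&1; then
        echo ascii
    elif iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo utf-8
    else
        # also reached when the file cannot be read at all
        echo other-8bit
    fi
}
```

For example, guess_encoding /mnt/www.gesetze-im-internet.de.iso.22845/aabg/art_2.htm should print ascii if the chardet result above is right.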
Karl Godt


Joined: 20 Jun 2010
Posts: 2686
Location: Kiel,Germany

Posted: Fri 16 Mar 2012, 19:03

So ... I have 8859-1 compiled into the kernel , and additionally modprobing nls_iso8859-2 did not work .

Code:
 iconv --verbose -c -f ISO_8859-2:1987 -t ASCII -o /tmp/iconv_file.htm /mnt/www.gesetze-im-internet.de.iso.22845/1_dm_gol/__4.htm


seems to do it ...

6000 directories with their dozen files to convert ... :lol:
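
For that many directories the iconv call can be wrapped in a loop over find. The following is only a sketch under assumptions: convert_tree is a hypothetical helper, it only picks up files named *.htm , it does not cope with newlines in file names, and it converts to UTF-8 instead of ASCII so that umlauts are kept rather than dropped by -c . The default source encoding ISO-8859-1 is also an assumption. chardet reported ISO-8859-2 with a confidence of 0.79 , and the German letters ä ö ü Ä Ö Ü ß and the § sign sit at the same code points in both encodings, so for plain German text either name gives the same result.

```shell
# Sketch only: convert_tree is a hypothetical helper.
# $1: source tree, $2: destination tree, $3: source encoding (assumed default)
convert_tree() {
    src=${1%/} dst=${2%/} enc=${3:-ISO-8859-1}
    find "$src" -type f -name '*.htm' | while IFS= read -r f; do
        rel=${f#"$src"/}                      # path below the source tree
        mkdir -p "$dst/$(dirname "$rel")"
        iconv -f "$enc" -t UTF-8 "$f" >"$dst/$rel" || echo "failed: $f" >&2
    done
}
```

A call could look like convert_tree /mnt/www.gesetze-im-internet.de.iso.22845 /tmp/utf8 , where /tmp/utf8 is an assumed output directory. The originals stay untouched, which makes it easy to compare a few files before trusting the result.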
don570


Joined: 10 Mar 2010
Posts: 2476
Location: Ontario

Posted: Sat 17 Mar 2012, 15:12

Have you compared the results to what you can get
with the converter in Scottman's Akita?

I put a right click option to convert htm files to txt in the
home folder using elinks to do the converting.

Karl Godt


Joined: 20 Jun 2010
Posts: 2686
Location: Kiel,Germany

Posted: Mon 19 Mar 2012, 00:59

Quote:
1.
Schmelzanlagen f�r Aluminium und Magnesium;
2.
Fabriken oder Fabrikationsanlagen, in denen folgende Stoffe
hergestellt werden:
a)
anorganische S�uren, Laugen, Salze,
b)
organische L�semittel,
c)
Farb- und Anstrichmittel
d)
K�ltemittel,


I have compiled lynx and that is the output .
I had to compile it with a plain ./configure , because one of the many --with- options that my configureall.sh detected made configure choke on some missing NLS .

elinks seems quite a bit more interesting . It also needed a plain ./configure , because lzma.h was not found . The rendering is OK , but there is no copy and paste from it ?

But elinks -dump file.htm >file.txt works great ! :) :D
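
To run the same dump over a whole tree, find can hand the files to a small inline script. The following is only a sketch under assumptions: dump_tree and the HTML_DUMP variable are hypothetical names, the text file is written next to each *.htm file, and elinks is assumed to be in the PATH. Setting HTML_DUMP to another command allows the loop to be tried without elinks installed.

```shell
# Sketch only: dump_tree and HTML_DUMP are hypothetical names.
# $1: directory tree to walk; writes FILE.txt next to each FILE.htm
dump_tree() {
    HTML_DUMP=${HTML_DUMP:-elinks -dump} find "$1" -type f -name '*.htm' -exec sh -c '
        for f do
            # $HTML_DUMP is left unquoted on purpose so "elinks -dump" splits
            # into command and option; stdin is closed off to be safe
            $HTML_DUMP "$f" >"${f%.htm}.txt" </dev/null || echo "failed: $f" >&2
        done' sh {} +
}
```

After dump_tree /path/to/site , a plain grep -r over the .txt files should return far less markup than grepping the html directly.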
don570


Joined: 10 Mar 2010
Posts: 2476
Location: Ontario

Posted: Sat 24 Mar 2012, 15:52

You should ask Scottman about elinks. He's the expert
when it comes to elinks. :lol:

He's compiled it so that it's very small.
