| Author |
Message |
Karl Godt

Joined: 20 Jun 2010 Posts: 2682 Location: Kiel,Germany
|
Posted: Fri 16 Mar 2012, 12:42 Post subject:
Transforming html special characters to text |
|
I am trying to convert html pages back to plain text files, so that grepping them for words returns less garbage.
I am stuck on the special characters.
I am using these files as references:
http://www.mediaevent.de/tutorial/sonderzeichen.html for the special characters
and
http://www.w3schools.com/tags/default.asp for the tags .
The code transforms the special char file into something like this :
| Code: |
" "| |\&\#160\;
¡|¡|\&\#161\;
¢|¢|\&\#162\;
£|£|\&\#163\;
¤|¤|\&\#164\;
¥|¥|\&\#165\;
¦|¦|\&\#166\;
|
NOTE : the backslashes in the third column are added for rendering purposes, they do not occur in the created file .
This is achieved by
| Code: |
echo "Preparing html special char database .."
TMP_SPECIAL_CHAR_FILE="${TMP_DIR}/special_chars.txt"
cp "${SPECIAL_CHAR_FILE}" "${TMP_SPECIAL_CHAR_FILE}"
tmp_specials="${TMP_DIR}/special"
# keep lines that hold two entities, plus lines with "–" followed by one entity
grep '&.*\;.*&.*\;' "${TMP_SPECIAL_CHAR_FILE}" >"${tmp_specials}".0.txt
grep -e '[[:blank:]]*–[[:blank:]]*&.*\;' "${TMP_SPECIAL_CHAR_FILE}" >>"${tmp_specials}".0.txt
# glue multi-word descriptions together before blanks become separators
sed 's#, #,#g' "${tmp_specials}".0.txt >"${tmp_specials}".1.txt
sed 's# (Programmierung)#(Programmierung)#g' "${tmp_specials}".1.txt >"${tmp_specials}".2.txt
sed 's# Symbol#Symbol#g' "${tmp_specials}".2.txt >"${tmp_specials}".2.0.txt
# blanks and tabs become the column separator, the quoted blank is restored
sed 's, ,|,g' "${tmp_specials}".2.0.txt >"${tmp_specials}".3.txt
sed 's,"|"," ",g' "${tmp_specials}".3.txt >"${tmp_specials}".4.txt
sed 's,\t,|,g' "${tmp_specials}".4.txt >"${tmp_specials}".6.txt
# squeeze repeated separators and keep columns 2-4
tr -s '|' <"${tmp_specials}".6.txt >"${tmp_specials}".8.txt
cut -f2-4 -d'|' "${tmp_specials}".8.txt >"${tmp_specials}".9.txt
|
Completely OK up to here!
But reading that file and substituting the html codes with the visible characters, further down, does not work:
| Code: |
cat "${tmp_specials}".9.txt |while read line;do
#echo "$line"
# skip lines without any letter or digit
[ ! "`echo "$line" | grep '[[:alnum:]]'`" ] && continue
#SIGN=`echo "$line" |cut -f1 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
SIGN=`echo "$line" |cut -f1 -d'|'`
# escape punctuation so sed takes the entities literally
HTML=`echo "$line" |cut -f2 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
NUMERIC=`echo "$line" |cut -f3 -d'|' |sed 's/\([[:punct:]]\)/\\\\\1/g'`
echo "'$SIGN' '$HTML' '$NUMERIC'"
#echo "sed 1"
# rows with "–" in the html column are skipped here; testing sed directly
# avoids leaving the loop just because the [ ] test itself was false
if [ "$HTML" != "–" ];then
sed -i "s/$HTML/$SIGN/g" "${file_tmp}" || exit
fi
#echo "sed 2"
sed -i "s/$NUMERIC/$SIGN/g" "${file_tmp}" || exit
done
|
When I open the converted file in geany it looks like this:
| Quote: | AAÜG - Einzelnorm
§ 18
" "
BuÃgeldvorschriften
(1) Ordnungswidrig handelt, wer vorsÀtzlich oder leichtfertig
1.
entgegen § 8 Abs. 1 Satz 5 Nr. 1 eine Auskunft nicht, nicht richtig, nicht vollstÀndig oder nicht rechtzeitig erteilt oder
2.
entgegen § 8 Abs. 1 Satz 5 Nr. 2 die erforderlichen Unterlagen nicht, nicht vollstÀndig oder nicht rechtzeitig vorlegt.
(2) Die Ordnungswidrigkeit kann mit einer GeldbuÃe bis zu zweitausendfÃŒnfhundert Euro geahndet werden.
(3) Verwaltungsbehörde im Sinne des § 36 Abs. 1 Nr. 1 des Gesetzes Ìber Ordnungswidrigkeiten ist der VersorgungstrÀger. Abweichend von Satz 1 ist fÌr den VersorgungstrÀger nach § 8 Abs. 4 Nr. 3 Verwaltungsbehörde im Sinne des § 36 Abs. 1 des Gesetzes Ìber Ordnungswidrigkeiten das Bundesversicherungsamt.
(4) Die GeldbuÃen flieÃen in die Kasse der Deutschen Rentenversicherung Bund, wenn sie als VersorgungstrÀger den BuÃgeldbescheid erlassen hat. § 66 des Zehnten Buches Sozialgesetzbuch gilt entsprechend. Diese Kasse trÀgt abweichend von § 105 Abs. 2 des Gesetzes ÃŒber Ordnungswidrigkeiten die notwendigen Auslagen; sie ist auch ersatzpflichtig im Sinne des § 110 Abs. 4 des Gesetzes ÃŒber Ordnungswidrigkeiten.
|
This is a federal law site of around 1.2 GB that I downloaded using wget -r
[ http://www.gesetze-im-internet.de/ ] .
Now converting the special chars
§|§|\&\#167\;
Ä|Ä|\&\#196\;
Ö|Ö|\&\#214\;
and a few more that are specific to my country turns out to be difficult.
I suspect the bash builtins while and/or read of corrupting the signs.
Any ideas?
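One way to take the `while read` loop and the sed escaping out of the picture is to let awk load the table once and do plain string replacements on each page. This is only a minimal sketch: it assumes the table has the `SIGN|HTML|NUMERIC` layout shown above (without the rendering backslashes) and is stored in the same encoding as the page, and `decode_entities` is a made-up name.

```shell
#!/bin/sh
# Sketch: decode entities in one awk pass per file instead of running
# sed twice per table row. decode_entities is a hypothetical helper.

decode_entities() {
    # $1 = table file (SIGN|&name;|&#number;), $2 = page; result on stdout
    awk -F'|' '
        # literal (non-regex) replacement, so nothing has to be escaped
        function replace_all(s, from, to,    out, i) {
            out = ""
            while ((i = index(s, from)) > 0) {
                out = out substr(s, 1, i - 1) to
                s = substr(s, i + length(from))
            }
            return out s
        }
        # first file: remember every entity and the sign it stands for
        NR == FNR {
            for (c = 2; c <= 3; c++)
                if ($c ~ /^&.+;$/) { n++; ent[n] = $c; sign[n] = $1 }
            next
        }
        # second file: apply the replacements in table order
        {
            line = $0
            for (k = 1; k <= n; k++)
                line = replace_all(line, ent[k], sign[k])
            print line
        }
    ' "$1" "$2"
}
```

It would be called as `decode_entities table.txt page.htm >page.txt`. Replacements run in table order, so a row for `&amp;` probably belongs at the end of the table; otherwise `&amp;sect;` would end up as `§`.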
| Description |
Heavy script; goes to my-applications/sbin and my-applications/lib, with several special char and tag files saved from the internet.
|
| Filename |
html2txt.big.tar.bz2 |
| Filesize |
701.14 KB |
|
|
amigo
Joined: 02 Apr 2007 Posts: 1759
|
Posted: Fri 16 Mar 2012, 13:50 Post subject:
|
|
Try using 'read -r -n1' to read the stream in binary mode one character at a time.
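Before changing how the stream is read, it may be worth checking whether a plain line-wise `read` alters the bytes at all. A small sketch, assuming the file ends with a newline and contains no NUL bytes; `roundtrip_ok` is a hypothetical name. `IFS=` keeps leading and trailing blanks, and `-r` stops backslash processing:

```shell
#!/bin/sh
# Sketch: copy a file line by line through read and compare the bytes.
roundtrip_ok() {
    # $1 = file to test; returns 0 if the copy is byte-identical
    copy=$(mktemp) || return 2
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done < "$1" > "$copy"
    cmp -s "$1" "$copy"
    status=$?
    rm -f "$copy"
    return "$status"
}
```

If `roundtrip_ok some_page.htm` succeeds, the loop itself is not what mangles the characters, and the encoding of the files is the next thing to look at.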
|
|
Karl Godt

Joined: 20 Jun 2010 Posts: 2682 Location: Kiel,Germany
|
Posted: Fri 16 Mar 2012, 16:09 Post subject:
|
|
No luck so far. read -n1 would need some kind of filter to work, and that filter would take some time to write ...
read -r does not do it either.
Tried replacing "sed -i" with "cat | sed": also nothing.
Tried replacing the shell builtin echo
type -a echo
echo is a shell builtin
with
echo is /bin/echo
also no change.
|
|
Karl Godt

Joined: 20 Jun 2010 Posts: 2682 Location: Kiel,Germany
|
Posted: Fri 16 Mar 2012, 18:00 Post subject:
|
|
After reading around on the net I also suspected something about encoding ...
Downloaded chardet from the Debian squeeze repo, used xarchive to extract it and copied the pyshared/ files into /usr/lib/python-x.x/site-packages, and this is what I got:
A few files had been converting OK, and these are
bash-3.2# chardet /mnt/www.gesetze-im-internet.de.iso.22845/aabg/art_2.htm
/mnt/www.gesetze-im-internet.de.iso.22845/aabg/art_2.htm: ascii (confidence: 1.00)
The bulk not converting ok seems to be
bash-3.2# chardet /mnt/www.gesetze-im-internet.de.iso.22845/aa_g_000/art_10.htm
/mnt/www.gesetze-im-internet.de.iso.22845/aa_g_000/art_10.htm: ISO-8859-2 (confidence: 0.79)
will investigate further to a) transform them to ascii encoding and b) modprobe -v nls_iso8859-2 ..
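chardet reports a guess with a confidence value, so a stricter yes/no check can be a useful complement: a file that converts from UTF-8 to UTF-8 without error is valid UTF-8 (pure ASCII passes too). A sketch, assuming an `iconv` that exits non-zero on invalid input, as the glibc one does; `is_utf8` and `report_encoding` are hypothetical names.

```shell
#!/bin/sh
# Sketch: separate valid UTF-8/ASCII files from other 8-bit encodings.
is_utf8() {
    # $1 = file; succeeds only if every byte sequence is valid UTF-8
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

report_encoding() {
    # prints one line per file given as argument
    for f in "$@"; do
        if is_utf8 "$f"; then
            echo "$f: utf-8 or ascii"
        else
            echo "$f: some other 8-bit encoding"
        fi
    done
}
```

This cannot tell ISO-8859-1 from ISO-8859-2; it only shows which files need converting at all.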
|
|
Karl Godt

Joined: 20 Jun 2010 Posts: 2682 Location: Kiel,Germany
|
Posted: Fri 16 Mar 2012, 19:03 Post subject:
|
|
So ... I have 8859-1 compiled into the kernel, and additionally modprobing the -2 module did not help.
| Code: |
iconv --verbose -c -f ISO_8859-2:1987 -t ASCII -o /tmp/iconv_file.htm /mnt/www.gesetze-im-internet.de.iso.22845/1_dm_gol/__4.htm
|
seems to do it ...
6000 directories with their dozen files to convert ...
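For the whole tree, the single iconv call above could be wrapped in a `find` loop that mirrors the directory layout into a second location. A sketch with made-up names (`convert_tree` and its arguments); it assumes the source path is given without a trailing slash and that no file name contains a newline. With `-c` and an ASCII target, iconv drops every character ASCII cannot represent, so the usage example targets UTF-8 instead, which is a swap from the command in the post.

```shell
#!/bin/sh
# Sketch: convert every .htm file below a directory into a mirror tree.
convert_tree() {
    # $1 = source dir, $2 = target dir, $3 = from-encoding, $4 = to-encoding
    src=$1 dst=$2 from=$3 to=$4
    find "$src" -type f -name '*.htm' | while IFS= read -r f; do
        out="$dst/${f#"$src"/}"          # same relative path below $dst
        mkdir -p "$(dirname "$out")"
        iconv -f "$from" -t "$to" "$f" > "$out" || echo "failed: $f" >&2
    done
}
```

For example `convert_tree /mnt/www.gesetze-im-internet.de.iso.22845 /tmp/converted ISO-8859-1 UTF-8`, where the target directory and the choice of ISO-8859-1 are only assumptions for illustration.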
|
|
don570

Joined: 10 Mar 2010 Posts: 2473 Location: Ontario
|
Posted: Sat 17 Mar 2012, 15:12 Post subject:
|
|
Have you compared the results to what you can get
with the converter in Scottman's Akita?
I put a right click option to convert htm files to txt in the
home folder using elinks to do the converting.
|
|
Karl Godt

Joined: 20 Jun 2010 Posts: 2682 Location: Kiel,Germany
|
Posted: Mon 19 Mar 2012, 00:59 Post subject:
|
|
| Quote: | 1.
Schmelzanlagen f�r Aluminium und Magnesium;
2.
Fabriken oder Fabrikationsanlagen, in denen folgende Stoffe
hergestellt werden:
a)
anorganische S�uren, Laugen, Salze,
b)
organische L�semittel,
c)
Farb- und Anstrichmittel
d)
K�ltemittel, |
I have compiled lynx, and that is the output.
I had to compile with a plain ./configure, because one of the many --with- options that my configureall.sh detected made configure choke on some missing NLS.
elinks seems much more interesting; it also needed a plain ./configure because lzma.h was not found. The rendering is OK, but there is no copy and paste from it?
But elinks -dump file.htm >file.txt works great!
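Since the goal was to grep the pages with less garbage, the dump step can also feed grep directly without writing text files first. A sketch only: `grep_html` and the `HTML_DUMP` variable are hypothetical names, and the default renderer is the `elinks -dump` call from above.

```shell
#!/bin/sh
# Sketch: search the rendered text of html files instead of the markup.
grep_html() {
    # $1 = pattern, remaining arguments = html files
    pattern=$1
    shift
    for f in "$@"; do
        # $HTML_DUMP is left unquoted on purpose so it can carry options
        ${HTML_DUMP:-elinks -dump} "$f" | grep -e "$pattern" |
        while IFS= read -r hit; do
            printf '%s: %s\n' "$f" "$hit"
        done
    done
}
```

Something like `grep_html 'Ordnungswidrig' aabg/*.htm` would then print each matching rendered line prefixed with its file name.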
|
|
don570

Joined: 10 Mar 2010 Posts: 2473 Location: Ontario
|
Posted: Sat 24 Mar 2012, 15:52 Post subject:
|
|
You should ask Scottman about elinks. He's the expert
when it comes to elinks.
He's compiled it so that it's very small.
|
|
|