Decoding ROX XML Files

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#1 Post by some1 »

Hi
The decoder has been doing fine at home for a week or so,
but my turf is only a very small area in the realm of Unicode.
So, for the common good, please give it a spin, wherever you are.
On scary/dubious sightings, tell some1.

Copy the code block into a terminal or a script and run it.
Output goes to /root

You may want to have some ROX bookmarks with non-ASCII characters.

Code: Select all


function decode_roxxml() {
#ovo===========================================================
#tool to decode xml 1.0 encoded unicode content into
#     a system- and human-friendly form.
#     input:file "$1" (any file of relevance/interest)
#     output:out-echoed decoded input
#    
# 2014/09/21 by some1 at http://www.murga-linux.com/puppy
#===========================================================o-o
/usr/bin/printf "$(LANG=C;
awk --posix 'BEGIN {a["&amp;"] = "\\046";a["&apos;"] = "\047";
a["&quot;"] = "\042";a["&gt;"] = "\076";a["&lt;"] = "\074"}
{t=$0;
{while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){\
s=substr(t,RSTART,RLENGTH);
if (!a[s]){\
y=s;sub(/&#x/,"",y);sub(/;/,"",y);
if (length(y) > 4){a[s]="\\U" sprintf("%08s", y)} else a[s]="\\u" sprintf("%04s", y);
};
gsub(s,a[s],t);
}
}print t}' "$1")"
#===========================================================o^o
}


todecode="/root/.config/rox.sourceforge.net/ROX-Filer/Bookmarks.xml"
decode_roxxml "$todecode" >/root/wysiwyg_from_"${todecode##*'/'}"

geany "$todecode" /root/wysiwyg_from_"${todecode##*'/'}"


A pic showing some specs, no bling:
Attachments
decode_roxxml_runtime_pic.tar.gz
(21.37 KiB) Downloaded 283 times

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#2 Post by MochiMoppel »

Produced error

Code: Select all

awk: cmd. line:3: {while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){\ 
awk: cmd. line:3:                                                                  ^ backslash not last character on line


------------------
(program exited with code: 0)
Press return to continue
After removing 2 line breaks this works for me:

Code: Select all

#!/bin/sh
function decode_roxxml() { 
#ovo=========================================================== 
#tool to decode xml 1.0 encoded unicode content into 
#     a system- and human-friendly form. 
#     input:file "$1" (any file of relevance/interest) 
#     output:out-echoed decoded input 
#    
# 2014/09/21 by some1 at http://www.murga-linux.com/puppy 
#===========================================================o-o 
/usr/bin/printf "$(LANG=C; 
awk --posix 'BEGIN {a["&amp;"] = "\\046";a["&apos;"] = "\047"; 
a["&quot;"] = "\042";a["&gt;"] = "\076";a["&lt;"] = "\074"} 
{t=$0; 
{while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){s=substr(t,RSTART,RLENGTH); 
if (!a[s]){y=s;sub(/&#x/,"",y);sub(/;/,"",y); 
if (length(y) > 4){a[s]="\\U" sprintf("%08s", y)} else a[s]="\\u" sprintf("%04s", y); 
}; 
gsub(s,a[s],t); 
} 
}print t}' "$1")" 
#===========================================================o^o 
} 


todecode="/root/.config/rox.sourceforge.net/ROX-Filer/Bookmarks.xml" 
decode_roxxml "$todecode" >/root/wysiwyg_from_"${todecode##*'/'}" 

geany "$todecode" /root/wysiwyg_from_"${todecode##*'/'}"
The script changed the XML character references for the Japanese katakana word for folder (フォルダ) from

Code: Select all

&#x30D5;&#x30A9;&#x30EB;&#x30C0;
to Unicode

Code: Select all

\u30D5\u30A9\u30EB\u30C0
But this is still not human-readable and I don't know how it could be decoded further.
What I would need is percent-encoded UTF-8:

Code: Select all

%E3%83%95%E3%82%A9%E3%83%AB%E3%83%80
This I could turn into human readable text...
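
For reference, a minimal sketch of that last step in bash: turning a percent-encoded UTF-8 string back into readable text. The function name `pct_decode` is made up for this illustration, and the approach assumes bash's built-in printf, whose `%b` format expands `\xHH` byte escapes. Any literal backslash in the input would also be interpreted as an escape, so this is only suitable for trusted input.

```bash
#!/bin/bash
# pct_decode is a hypothetical helper name, not part of the thread's scripts.
# Turn every "%" into "\x", then let printf's %b expand the \xHH byte escapes.
pct_decode() {
    printf '%b\n' "${1//%/\\x}"
}

# Prints the UTF-8 bytes of the katakana word from the post above.
pct_decode '%E3%83%95%E3%82%A9%E3%83%AB%E3%83%80'
```

Whether the result shows up as glyphs still depends on the terminal's locale and font.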

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#3 Post by some1 »

@MochiMoppel - yeah, I imagined the possibility of minor dragons
in East Asia/the higher Unicode ranges/the "non-alphabeticals".
Life is simpler around here: just ASCII plus a few weirdos.

I will look into it. It is good to have some real-life meat to chew on.
I am rather confident that we can solve it, one way or the other.

I don't have much time now, so just a few words
on the "backslash not last character on line" error:
The "\" is an awk continuation token: it joins a line with the next one.
awk only accepts it as the very last character on a line,
so make sure nothing follows it, i.e. no trailing space etc.
I will give it a spin later on and upload a tar.
The code is really just a BIG "one-liner",
but to make it readable I rolled it out using the continuation chars.
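
A small sketch of how a pasted script could be checked for this problem. The function names `find_bad_continuations` and `strip_trailing_ws` are invented for the illustration; only standard `grep` and `sed` features are assumed.

```bash
#!/bin/bash
# Hypothetical helpers for scripts pasted from a browser.

# List (with line numbers) every line where a backslash is followed
# only by whitespace up to the end of the line.
find_bad_continuations() {
    grep -n '\\[[:space:]]\{1,\}$' "$1"
}

# Print the file with all trailing whitespace removed, so that each
# continuation backslash is the last character on its line again.
strip_trailing_ws() {
    sed 's/[[:space:]]*$//' "$1"
}

# Example use (the file name is an assumption):
#   find_bad_continuations pasted_script.sh
#   strip_trailing_ws pasted_script.sh > cleaned_script.sh
```

Since `[[:space:]]` also matches a carriage return, the same cleanup covers scripts saved with DOS line endings.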
----

Have a nice night/day, wherever you are.

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#4 Post by some1 »

@MochiMoppel


I ran the posted code with your katakana example
and AFAIK the result seems OK to me :wink:

Put another way: I get what I expect, conceptually.
But is the output human-readable at your place with your font?
Attachments
katakanaFILE.tar.gz (164 Bytes): Contains the output; it should be human-readable with your font. To me the output looks like wysiwyg.png.
katakana.png (9.49 KiB): I made a file with this content.
wysiwyg.png (4.11 KiB): I fed it to the decoder and got this. It tells me that I do not have the font to render the code points, but that the code points are in the wysiwyg file.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#5 Post by MochiMoppel »

Perfect! :D
But why wasn't it converted this way in the previously generated wysiwyg_from_Bookmarks.xml?
I tried your original script again and it now runs without errors. The problem with my first attempt was caused by an old Opera bug: Opera puts a space in front of text lines copied from a webpage. I fixed the leading spaces but didn't know that Opera also adds trailing spaces :cry:

I then tried the katakana example, but again this only results in \u30D5\u30A9\u30EB\u30C0
What did you do to produce actual katakana?
Attachments
wysiwyg_from_katakana.png
(14.9 KiB) Downloaded 391 times

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#6 Post by some1 »

@MochiMoppel
Perfect! :)
Yeah - it's a rather amazing critter.
A few hundred bytes, 30 milliseconds, and the world as we know it is
freed from eyesores and broken filesystem calls. :)
----
Your descriptions are not really of much use - I need the files ex ante and ex post, info on tools etc. - but wait a minute....

KATAKANA is not a problem for the decoder.
I have decoded the whole KATAKANA Unicode subset.
Every item is decoded and shows up as a "glyph".
I will PM you the set of encoded code points so you can replicate.
Furthermore, you get the result of the decoding done at my place.
If you still have issues, they are at your place: in your implementation, in your code, in your data.
It's not really a brush-off on my part, but I need relevant data from you
to be of any help.

In my experience the decoder always works.

I have tarred the code block from the top post.
Use it as a placeholder for the function, or as a script; set the executable bit.
Attachments
decode_roxxml.tar.gz
(728 Bytes) Downloaded 271 times

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#7 Post by MochiMoppel »

some1 wrote:@MochiMoppel
Perfect! :)
Yeah -its a rather amazing critter.
I wish I could say that :cry:
"Perfect" related to your attached file katakanaFILE.tar.gz, which I loaded into Geany to show you that it indeed contains perfect katakana characters. It does not mean that my attempts to reproduce your feat were successful.
Your descriptions are not really of much use
I'm sorry to hear that. All I did is use your original script and your input file and described the output. What else do you expect me to do? I also used the katakana table you mailed me and I even tried with Lucid 5.2.8 and awk 3.1.6 (I normally use Slacko 5.6 and awk 3.1.8.). Same result. Your script just replaces HTML markups (&#x30D5; => \u30D5). There must be other factors involved, but I'm not good at awk and don't know where the trouble starts.

Flash
Official Dog Handler
Posts: 13071
Joined: Wed 04 May 2005, 16:04
Location: Arizona USA

#8 Post by Flash »

I can't figure out what it is you guys are talking about. Are you trying to get ROX to display Japanese characters?

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#9 Post by some1 »

Nice to have the dog handler passing by :)
No - KATAKANA, that's MochiMoppel.

I want to be able to show anything.
A kind of internationalization, but not like po, mo, gettext etc.

It's a solution to get rid of some annoyances which occur when
"raw"/encoded Unicode characters found in ROX XML files are used outside
the realm of ROX, i.e. in GUIs and menus, e.g. "places" menus.

More generally, with the ability to decode the ROX XML content,
the door is open for fluid, reliable usage of content from all the
ROX XML files. Creative minds just have to get accustomed to the fact
that things no longer break in scripting because of sudden encoded stuff.
English-rooted folks may not really realise that everyone else gets eyesores and broken filesystem calls when the "raw"/encoded stuff is used.
Musher0 and MochiMoppel know.
With that said, in the grander scheme of things it's really a small thing,
but it has been very hard to solve efficiently.
AFAIK it's a done deed now.

Most code points are straightforward, like an A is an A, but *some* code points,
especially in East Asia/the higher end of Unicode, have other functions like "binding", "formatting", "combinatorial" etc.
It's no problem to decode these literally, but to me it's an unknown how the effect will be in "real life".
Imagine: the decoder decodes literally, but if you saw all the A's in
Arizona show up lopsided, it might be interesting, but also confusing, over there.
I don't *know* what happens visually in HANGUL, CJK, KATAKANA,
DEVANAGARI etc. My "fears of the unknown", likely/perhaps.
I am on Latin-1, so by sheer inference things will be well in associated spheres, i.e. most of the world geographically. But most people don't live there.
And Puppy is everywhere, right?

Hope that helps.

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#10 Post by some1 »

@MochiMoppel
what's your printf?

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#11 Post by jamesbond »

1. make sure that #!/bin/sh is #!/bin/bash (if you are sure that /bin/sh is a symlink to bash then it's ok no need to change).
2. change "/usr/bin/printf" to "printf" - ie use bash's internal printf.
MochiMoppel's /usr/bin/printf probably points to busybox, which does not recognise "\u" escape sequence and thus will print it as is. some1's printf is probably the "full" version of printf from coreutils.
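
One way to check this directly, rather than guessing from version numbers, is to probe each printf with a `\u` escape and look at what comes back. This is only a sketch: the function name `printf_knows_u` is invented here, and the probe uses U+0040 (`@`) on the assumption that GNU printf accepts that code point while rejecting `\u` escapes for most other ASCII characters. An implementation without `\u` support prints the escape unchanged, so the comparison fails.

```bash
#!/bin/bash
# printf_knows_u is a hypothetical name. Pass it the printf command to probe.
# Succeeds only if the command turns the escape \u0040 into "@".
printf_knows_u() {
    [ "$("$@" '\u0040' 2>/dev/null)" = "@" ]
}

# "printf" is the shell builtin; "/usr/bin/printf" is whatever binary
# (coreutils or a busybox link) is installed at that path.
for p in printf /usr/bin/printf; do
    if printf_knows_u "$p"; then
        printf '%s: \\u escapes work\n' "$p"
    else
        printf '%s: no \\u support\n' "$p"
    fi
done
```
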
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#12 Post by MochiMoppel »

BusyBox v1.21. ...if that's what you mean.

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#13 Post by jamesbond »

MochiMoppel wrote:BusyBox v1.21. ...if that's what you mean.
Yup, that explains it. busybox printf doesn't understand \u. Have you tried my suggestion above, which will make use of bash's built-in printf and does it work now?

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#14 Post by some1 »

@jamesbond :)
Thanks! You are right: one possible reason at MochiMoppel's could be that the active printf doesn't know \u. Only the real GNU tools do.

Just to be sure I understand this:
The code calls /usr/bin/printf. (Added/edited: the runtime pic in the top post
specifies the GNU printf.)
Is it so that busybox can masquerade as /usr/bin/printf???
If so, how can the code guarantee that the GNU printf is called?

Thanks for your attention.

FWIW: The decoder works on Lucid 5.28

With respect to the tools used:
My take is that if the tools work on Lucid 5.25, the tools on newer
distros will also work. Is that a plausible inference/assumption?

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#15 Post by seaside »

some1 wrote:@jamesbond :)

specifies the GNU-printf)
Is it so - that busybox can masquerade as the /usr/bin/printf ???
If so - how can the code guarantee - that the GNU-printf is called?
Here's a test for bash

Code: Select all

testinfo=$(file `which printf`)
[ "${testinfo/busybox//}" != "${testinfo}" ] && echo 'Using busybox'

This checks if printf is a link to busybox. (Most likely)
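
A variation on the same idea, sketched with `readlink -f` instead of `file`, in case `file` is not installed. The function name `is_busybox_link` is made up for the example, and it assumes a `readlink` that supports `-f` (GNU coreutils and busybox both do).

```bash
#!/bin/bash
# is_busybox_link is a hypothetical helper: it succeeds if the given path
# resolves, through any chain of symlinks, to a file named "busybox".
is_busybox_link() {
    local target
    target=$(readlink -f -- "$1") || return 1
    [ "${target##*/}" = "busybox" ]
}

# Report on the external printf, if there is one at this path.
if [ -e /usr/bin/printf ] && is_busybox_link /usr/bin/printf; then
    echo '/usr/bin/printf is provided by busybox'
fi
```

Note that this only inspects a file on disk; inside bash, a bare `printf` is the shell builtin regardless of what the path points to.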

Cheers,
seaside

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#16 Post by jamesbond »

some1, you can always use bash's built-in printf instead: just remove "/usr/bin" from "/usr/bin/printf" (that is, call it as "printf" instead of "/usr/bin/printf"). I have bash 4.2 and bash's internal printf works better than busybox printf. Of course, to be sure, change #!/bin/sh to #!/bin/bash too (you're using the "function" keyword, so your script is a bash script anyway; it won't work on a plain shell).

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#17 Post by some1 »

Hi
Thanks Guys :)
No, I was not fully aware of the busybox, shell, bash, printf thing.
I will have to think about it.

Just now, when I saw your posts, I did this:
replaced /usr/bin/printf with printf in the code and (still) #!/bin/sh -> the wrong printf, i.e. the glyph is NOT resolved
then
printf and changed to #!/bin/bash -> same result as above

I *do* have a printf in /usr/bin

Just an immediate observation, probably not the whole truth.
I have to think about it to understand what's what.

Very nice to have you around. :)

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#18 Post by jamesbond »

Oops, sorry, bad advice. Bash's printf supports "\u" only in bash 4.2 and newer. That practically excludes many puppies except the newest slacko :(
That means in older puppies where bash version < 4.2 and /usr/bin/printf is symlinked to busybox, there is no way to get your script working :(
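
A script could guard against this by checking the bash version before relying on `\u`. The sketch below uses bash's standard `BASH_VERSINFO` array; the helper name `bash_at_least` is invented for the example.

```bash
#!/bin/bash
# bash_at_least is a hypothetical helper: it succeeds if the running bash
# is at least version MAJOR.MINOR (arguments 1 and 2).
bash_at_least() {
    (( BASH_VERSINFO[0] > $1 || ( BASH_VERSINFO[0] == $1 && BASH_VERSINFO[1] >= $2 ) ))
}

if bash_at_least 4 2; then
    printf 'bash %s: builtin printf should understand \\u\n' "$BASH_VERSION"
else
    printf 'bash %s: too old for \\u in the builtin printf\n' "$BASH_VERSION"
fi
```
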

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#19 Post by MochiMoppel »

jamesbond wrote:Bash printf's supports "\u" only in bash 4.2 and newer.
That's not my biggest concern. I've never used awk or printf for the conversion and even my old Slacko 5.6 comes with bash 4.2, so the \u switch is well supported in Slacko.

The problem for me is the limitations of the \u switch, and my hope was that some1's more sophisticated script (which I have trouble reading :cry: ) would somehow overcome them. My problem: \u can convert 2-digit, 3-digit and 4-digit Unicode values, but not when they appear in the same string.

An example might help. The following script takes the content of a ROX bookmark file as input: 3 bookmarked directories with names in Japanese (4-digit), Greek (3-digit) and Spanish (2-digit). In this combination only the accented "a" in Malaga will be correctly decoded. However, if you remove the third bookmark, Japanese and Greek are OK. And if this is not confusing enough: when the result of the first decoding (3 bookmarks, only Spanish readable) is exported to gtkdialog, Japanese and Greek are readable, Spanish is not. If I can find a reliable conversion method I would gladly implement it in my SpeedDials tool.

Code: Select all

#!/bin/sh
todecode='<?xml version="1.0"?>
<bookmarks>
  <bookmark title="Yokohama">/root/&#x6A2A;&#x6D5C;</bookmark>
  <bookmark title="Athens">/root/&#x391;&#x3B8;&#x3AE;&#x3BD;&#x3B1;</bookmark>
  <bookmark title="Malaga">/root/M&#xE1;laga</bookmark>
</bookmarks>'

T=$(sed 's/&#x\([0-9A-F]*\)/\\u\1/g' <<< "$todecode")
gxmessage "$(echo -e "$T" | sed 's/;//g')"
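
One conversion route that does not depend on `\u` support at all is to compute the UTF-8 bytes of each code point with shell arithmetic and emit them as octal escapes, which bash, coreutils and busybox printf all accept. The following is a sketch under stated assumptions, not a tested replacement for either script in this thread: the names `cp_to_octal` and `decode_refs` are invented, bash is required for the `[[ =~ ]]` matching, only hexadecimal references of the form `&#x...;` are handled (named entities such as `&amp;` are left alone), and a reference to a newline or NUL would be lost in the command substitution.

```bash
#!/bin/bash
# cp_to_octal and decode_refs are hypothetical names for this sketch.

# Print the UTF-8 encoding of one hex code point (e.g. 30D5) as
# printf-style octal escapes such as \343\203\225.
cp_to_octal() {
    local cp=$(( 16#$1 ))
    local -a b
    if   (( cp < 0x80 ));    then b=( "$cp" )
    elif (( cp < 0x800 ));   then b=( $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) )) )
    elif (( cp < 0x10000 )); then b=( $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
    else                          b=( $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
    fi
    printf '\\%03o' "${b[@]}"      # the format is reused for every byte
}

# Read text on stdin and replace every &#xHEX; reference with its UTF-8 bytes.
decode_refs() {
    local line rest
    local re='^(.*)&#x([0-9A-Fa-f]{1,6});(.*)$'
    while IFS= read -r line || [ -n "$line" ]; do
        rest=''
        # The greedy (.*) finds the LAST reference, so peel them off right to left.
        while [[ $line =~ $re ]]; do
            rest="$(printf "$(cp_to_octal "${BASH_REMATCH[2]}")")${BASH_REMATCH[3]}$rest"
            line=${BASH_REMATCH[1]}
        done
        printf '%s\n' "$line$rest"
    done
}

# The three bookmarks from the example above, mixing 4-, 3- and 2-digit values.
decode_refs <<'EOF'
  <bookmark title="Yokohama">/root/&#x6A2A;&#x6D5C;</bookmark>
  <bookmark title="Athens">/root/&#x391;&#x3B8;&#x3AE;&#x3BD;&#x3B1;</bookmark>
  <bookmark title="Malaga">/root/M&#xE1;laga</bookmark>
EOF
```

Because the output is always raw UTF-8 bytes, the result does not change with the number of hex digits or with the locale the script runs in; whether the glyphs display correctly is then up to the viewer and its font.
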
Attachments
unicode_decoded.png
(18.17 KiB) Downloaded 233 times

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#20 Post by some1 »

@MochiMoppel
You may not find this helpful,
but I could not resist. :)
Attachments
W1412697733.png
(12.95 KiB) Downloaded 131 times
W1412697715.png
(17.14 KiB) Downloaded 258 times
