Decoding ROX XML Files


some1 · #1 Post by **some1** » Thu 02 Oct 2014, 15:22

Hi
The decoder has been doing fine at home for a week or so -
but my turf is only a very small area in the realm of unicode.
So - for the common good - please give it a spin, wherever you are.
On scary/dubious sightings - tell some1.

Copy the codeblock to a terminal or into a script - and run.
Output goes to /root

You may want to have some ROX bookmarks with non-ASCII chars.

Code: Select all


function decode_roxxml() {
#ovo===========================================================
#tool to decode xml 1.0 encoded unicode content into
#     a system- and human-friendly form.
#     input:file "$1" (any file of relevance/interest)
#     output:out-echoed decoded input
#    
# 2014/09/21 by some1 at http://www.murga-linux.com/puppy
#===========================================================o-o
/usr/bin/printf "$(LANG=C;
awk --posix 'BEGIN {a["&amp;"] = "\\046";a["&apos;"] = "\047";
a["&quot;"] = "\042";a["&gt;"] = "\076";a["&lt;"] = "\074"}
{t=$0;
{while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){\
s=substr(t,RSTART,RLENGTH);
if (!a[s]){\
y=s;sub(/&#x/,"",y);sub(/;/,"",y);
if (length(y) > 4){a[s]="\\U" sprintf("%08s", y)} else a[s]="\\u" sprintf("%04s", y);
};
gsub(s,a[s],t);
}
}print t}' "$1")"
#===========================================================o^o
}


todecode="/root/.config/rox.sourceforge.net/ROX-Filer/Bookmarks.xml"
decode_roxxml "$todecode" >/root/wysiwyg_from_"${todecode##*'/'}"

geany "$todecode" /root/wysiwyg_from_"${todecode##*'/'}"

A pic showing some specs,no bling:

MochiMoppel · #2 Post by **MochiMoppel** » Fri 03 Oct 2014, 06:27

Produced error

Code: Select all

awk: cmd. line:3: {while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){\ 
awk: cmd. line:3:                                                                  ^ backslash not last character on line


------------------
(program exited with code: 0)
Press return to continue

After removing 2 line breaks this works for me:

Code: Select all

#!/bin/sh
function decode_roxxml() { 
#ovo=========================================================== 
#tool to decode xml 1.0 encoded unicode content into 
#     a system- and human-friendly form. 
#     input:file "$1" (any file of relevance/interest) 
#     output:out-echoed decoded input 
#    
# 2014/09/21 by some1 at http://www.murga-linux.com/puppy 
#===========================================================o-o 
/usr/bin/printf "$(LANG=C; 
awk --posix 'BEGIN {a["&amp;"] = "\\046";a["&apos;"] = "\047";
a["&quot;"] = "\042";a["&gt;"] = "\076";a["&lt;"] = "\074"}
{t=$0; 
{while (match(t,/&((#x[0-9A-F]{2,5})||(amp|gt|lt|quot|apos));/)){s=substr(t,RSTART,RLENGTH); 
if (!a[s]){y=s;sub(/&#x/,"",y);sub(/;/,"",y); 
if (length(y) > 4){a[s]="\\U" sprintf("%08s", y)} else a[s]="\\u" sprintf("%04s", y); 
}; 
gsub(s,a[s],t); 
} 
}print t}' "$1")" 
#===========================================================o^o 
} 


todecode="/root/.config/rox.sourceforge.net/ROX-Filer/Bookmarks.xml" 
decode_roxxml "$todecode" >/root/wysiwyg_from_"${todecode##*'/'}" 

geany "$todecode" /root/wysiwyg_from_"${todecode##*'/'}"

The script changed the HTML entities for the Japanese katakana characters for folder (フォルダ) from

Code: Select all

&#x30D5;&#x30A9;&#x30EB;&#x30C0;

to Unicode

Code: Select all

\u30D5\u30A9\u30EB\u30C0

But this is still not human readable and I don't know how this could be decoded further.
What I would need is UTF-8:

Code: Select all

%E3%83%95%E3%82%A9%E3%83%AB%E3%83%80

This I could turn into human readable text...
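
For reference, the percent-encoded form above follows directly from the code points. A minimal sketch, assuming a shell with POSIX arithmetic and a printf that knows %X; the function name cp_to_percent is made up for illustration and is not part of the decoder. It percent-encodes every byte, plain ASCII included:

```sh
# Sketch: print the percent-encoded UTF-8 bytes for hex code points.
# cp_to_percent is a hypothetical helper name, not part of the thread's decoder.
cp_to_percent() (
  out=
  for hex in "$@"; do
    cp=$((0x$hex))                      # hex code point -> decimal
    if [ "$cp" -lt 128 ]; then          # 1 byte
      bytes="$cp"
    elif [ "$cp" -lt 2048 ]; then       # 2 bytes
      bytes="$((192 + $cp / 64)) $((128 + $cp % 64))"
    elif [ "$cp" -lt 65536 ]; then      # 3 bytes
      bytes="$((224 + $cp / 4096)) $((128 + $cp / 64 % 64)) $((128 + $cp % 64))"
    else                                # 4 bytes
      bytes="$((240 + $cp / 262144)) $((128 + $cp / 4096 % 64)) $((128 + $cp / 64 % 64)) $((128 + $cp % 64))"
    fi
    for b in $bytes; do
      out="$out$(printf '%%%02X' "$b")" # one %XX per byte
    done
  done
  printf '%s\n' "$out"
)
```

Called as `cp_to_percent 30D5 30A9 30EB 30C0`, it should give the string shown above. It needs no \u support at all, only integer arithmetic.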

some1 · #3 Post by **some1** » Fri 03 Oct 2014, 07:25

@MochiMoppel -yeah - I imagined the possibility of minor dragons
in EastAsia/the higher unicodes/the "non-alphabeticals".
Life is simpler around here -just ascii plus a few weirdos.

I will look into it - but good with some real-life meat to chew on.
I am rather confident - that we can solve it - one way or the other.

I don't have much time now - so just a few words:
The "not last char" error:
The "\" is an awk continuation token:
it lets the statement carry on to the next line.
Make sure that "\" is the last char on the line - i.e. no trailing space etc.
I will give it a spin later-on/upload a tar.
The code is really just a BIG "one-liner" -
but to make it readable - I rolled it out using the continuation-chars.
----

Have a nice night/day - wherever you are.

some1 · #4 Post by **some1** » Fri 03 Oct 2014, 10:31

@MochiMoppel

I ran the code posted with your katakana example
- and AFAIK the result seems ok to me

Put another way: I get what I expect - conceptually.
But is the output human-readable at your place with your FONT?

MochiMoppel · #5 Post by **MochiMoppel** » Fri 03 Oct 2014, 11:19

Perfect!

But why wasn't it converted this way in the previously generated wysiwyg_from_Bookmarks.xml?
I tried your original script again and it now runs without errors. The problem with my first attempt was caused by an old Opera bug: Opera puts a space in front of text lines copied from a webpage. I fixed the preceding spaces but didn't know that Opera also adds trailing spaces.

I then tried the katakana example, but again this only results in \u30D5\u30A9\u30EB\u30C0
What did you do to produce actual katakana?

some1 · #6 Post by **some1** » Sat 04 Oct 2014, 18:26

@MochiMoppel

Perfect!

Yeah - it's a rather amazing critter.
A few hundred bytes, 30 milliseconds - and the world as we know it is
freed from eyesores and broken filesystem calls.

----
Your descriptions are not really of much use - I need files ex-ante and ex-post, info on tools etc - but wait a minute....

KATAKANA is not a problem for the decoder.
I have decoded the whole KATAKANA unicode subset.
Every item is decoded - and shows up as a "glyph".
I PM you the set of encoded codepoints - so you can replicate.
Furthermore - you get the result of the decoding done at my place.
If you still have issues they are at your place, in your implementation, in your code, in your data.
It's not really a brush-off on my part - but I need relevant data from you
to be of any help.

In my experience - the decoder always works.

I have tarred the codeblock in the top post.
Use it as a placeholder for the function - or as a script; set the executable bit.

MochiMoppel · #7 Post by **MochiMoppel** » Mon 06 Oct 2014, 02:49

some1 wrote:@MochiMoppel
Perfect!
Yeah - it's a rather amazing critter.

I wish I could say that

"Perfect" related to your attached file katakanaFILE.tar.gz, which I loaded into Geany to show you that it indeed contains perfect katakana characters. It does not mean that my attempts to reproduce your feat were successful.

Your descriptions are not really of much use

I'm sorry to hear that. All I did was use your original script and your input file and describe the output. What else do you expect me to do? I also used the katakana table you mailed me and I even tried with Lucid 5.2.8 and awk 3.1.6 (I normally use Slacko 5.6 and awk 3.1.8). Same result. Your script just replaces HTML markup (&#x30D5; => \u30D5). There must be other factors involved, but I'm not good at awk and don't know where the trouble starts.

Flash · #8 Post by **Flash** » Mon 06 Oct 2014, 02:55

I can't figure out what it is you guys are talking about. Are you trying to get ROX to display Japanese characters?

some1 · #9 Post by **some1** » Mon 06 Oct 2014, 05:07

Nice - to have the doghandler passing by

No - KATAKANA, that's MochiMoppel.

I want to be able to show anything.
A kind of internationalization - but not like po, mo, gettext etc.

It's a solution to get rid of some annoyances which occur when
"raw"/encoded unicodes found in rox-xml-files are used outside
the realm of rox - i.e. in GUIs/menus, e.g. "places" menus etc.

More generally - with the ability to decode the rox-xml-content -
the door is open for fluid, reliable usage of content from all the
rox-xml-files. Creative minds just have to get accustomed to the fact
that things no longer break in scripting because of sudden encoded stuff.
English-rooted folks may not really realise that everyone else will get eyesores and broken filesystem calls when the "raw"/encoded stuff is used.
Musher0 and MochiMoppel know.
With that said - in the grander scheme of things it's really a small thing -
but it has been very hard to solve efficiently.
FWIK: It's a done deed - now.

Most unicodes are straightforward, like an A is an A - but *some* unicodes -
especially in East Asia/the higher end of the unicodes - have other functionality like "binding", "formatting", "combinatorial" etc.
It's no problem to decode these literally - but to me it's an unknown how the effect will be in "real life".
Imagine - the decoder decodes literally - but if you saw all the A's in
Arizona show up lopsided, it might be interesting, but also confusing, over there.
I don't *know* what happens visually in HANGUL, CJK, KATAKANA,
DEVANAGARI etc. My "fears of the unknown", likely/perhaps.
I am on latin1 - so by sheer inference things will be well in associated spheres/most of the world geographically. But most people don't live there.
And puppy is everywhere, right?

Hope that helps.

some1 · #10 Post by **some1** » Mon 06 Oct 2014, 06:26

@MochiMoppel
what's your printf?

jamesbond · #11 Post by **jamesbond** » Mon 06 Oct 2014, 11:32

1. make sure that #!/bin/sh is #!/bin/bash (if you are sure that /bin/sh is a symlink to bash then it's ok no need to change).
2. change "/usr/bin/printf" to "printf" - ie use bash's internal printf.
MochiMoppel's /usr/bin/printf probably points to busybox, which does not recognise the "\u" escape sequence and thus prints it as is. some1's printf is probably the "full" version of printf from coreutils.

MochiMoppel · #12 Post by **MochiMoppel** » Mon 06 Oct 2014, 11:33

BusyBox v1.21. ...if that's what you mean.

jamesbond · #13 Post by **jamesbond** » Mon 06 Oct 2014, 11:48

MochiMoppel wrote:BusyBox v1.21. ...if that's what you mean.

Yup, that explains it. busybox printf doesn't understand \u. Have you tried my suggestion above, which makes use of bash's built-in printf? Does it work now?

some1 · #14 Post by **some1** » Mon 06 Oct 2014, 14:03

@jamesbond

Thanks! You are right - one possible reason at MochiMoppel's could be that the active printf doesn't know \u. Only the real GNU tools do.

Just to be sure I understand this:
The code calls /usr/bin/printf. (Added/Edited: the runtime pic in the top post
specifies the GNU-printf)
Is it so - that busybox can masquerade as the /usr/bin/printf ???
If so - how can the code guarantee that the GNU-printf is called?

Thanks for your attention.

FWIW: The decoder works on Lucid 5.28

With respect to the tools used:
My take is that - if the tools work on Lucid 5.25 - the tools on newer
distros will also work. Is that a plausible inference/assumption?

seaside · #15 Post by **seaside** » Mon 06 Oct 2014, 16:24

some1 wrote:@jamesbond

specifies the GNU-printf)
Is it so - that busybox can masquerade as the /usr/bin/printf ???
If so - how can the code guarantee - that the GNU-printf is called?

Here's a test for bash

Code: Select all

testinfo=$(file `which printf`)
[ "${testinfo/busybox//}" != "${testinfo}" ] && echo 'Using busybox'

This checks if printf is a link to busybox. (Most likely)

Cheers,
seaside
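
Whether printf is a busybox link matters less than whether it actually handles \u. A hedged sketch of a behavioural probe (printf_knows_u is a hypothetical helper name): it feeds '\u00e9' to the command under test and compares the result with the UTF-8 bytes C3 A9. The outcome also depends on the active locale, so a failure only means "not usable here", not "definitely busybox".

```sh
# Sketch: succeed only if the given printf command expands \u00e9
# to the two UTF-8 bytes C3 A9 (written here as octal escapes,
# which every printf variant mentioned in the thread understands).
# printf_knows_u is a hypothetical helper name.
printf_knows_u() {
  [ "$("$@" '\u00e9' 2>/dev/null)" = "$(printf '\303\251')" ]
}

# Possible usage:
#   printf_knows_u printf          && echo "shell builtin handles \\u"
#   printf_knows_u /usr/bin/printf && echo "/usr/bin/printf handles \\u"
```

The probe takes the command as its arguments, so the same function can check the builtin, the binary in /usr/bin, or `busybox printf`.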

jamesbond · #16 Post by **jamesbond** » Mon 06 Oct 2014, 17:20

some1, you can always use bash's "built-in" printf instead: just remove "/usr/bin" from "/usr/bin/printf" (that is, just call it as "printf" instead of "/usr/bin/printf"). I have bash 4.2 and bash's internal printf works better than busybox printf. Of course, to be sure, change #!/bin/sh to #!/bin/bash too (you're using the "function" keyword - so your script is a bash script anyway, it won't work on a plain shell).

some1 · #17 Post by **some1** » Tue 07 Oct 2014, 11:17

Hi
Thanks Guys

No - I was not fully aware of the busybox, shell, bash, printf thing.
I will have to think about it:

Just now - when I saw your posts - I did this:
replaced /usr/bin/printf with printf in the code and (still) /bin/sh -> the wrong printf, i.e. the glyph is NOT resolved
then
printf - and I changed to #!/bin/bash -> same result as above

I *do* have a printf in /usr/bin

Just an immediate observation, probably not the whole truth.
I have to think about it, to understand what's what

Very nice to have you around.

jamesbond · #18 Post by **jamesbond** » Tue 07 Oct 2014, 12:09

Oops, sorry, bad advice. Bash's printf supports "\u" only in bash 4.2 and newer. That practically excludes many puppies except the newest slacko.

That means in older puppies where the bash version is < 4.2 and /usr/bin/printf is symlinked to busybox, there is no way to get your script working.
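
The version condition can be checked from a script. A small sketch, assuming the version string has the usual major.minor shape of $BASH_VERSION; bash_supports_u is a hypothetical name, and the 4.2 threshold is the one stated above:

```sh
# Sketch: succeed if the given version (default: $BASH_VERSION) is >= 4.2,
# the first bash release whose printf understands \u.
# bash_supports_u is a hypothetical helper name.
bash_supports_u() (
  v=${1:-${BASH_VERSION:-0.0}}   # e.g. "4.2.45(1)-release"; 0.0 if not bash
  major=${v%%.*}                 # text before the first dot
  rest=${v#*.}                   # text after the first dot
  minor=${rest%%[!0-9]*}         # leading digits of that
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 2 ]; }
)
```

A script could call `bash_supports_u` with no argument before relying on the builtin printf, and fall back to another method otherwise.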

MochiMoppel · #19 Post by **MochiMoppel** » Tue 07 Oct 2014, 15:20

jamesbond wrote:Bash's printf supports "\u" only in bash 4.2 and newer.

That's not my biggest concern. I've never used awk or printf for the conversion and even my old Slacko 5.6 comes with bash 4.2, so the \u switch is well supported in Slacko.

The problem for me is the limitations of the \u switch, and my hope was that some1's more sophisticated script (which I have trouble reading) would somehow overcome the limitations. My problem: \u can convert 2-digit, 3-digit and 4-digit unicode values, but not when they appear in the same string.

An example might help. The following script takes the content of a ROX bookmark file as input: 3 bookmarked directories with names in Japanese (4-digit), Greek (3-digit) and Spanish (2-digit). In this combination only the accented "a" in Malaga will be correctly decoded. However, if you remove the third bookmark, Japanese and Greek are OK. And if this is not confusing enough: when the result of the first decoding (3 bookmarks, only Spanish readable) is exported to gtkdialog, Japanese and Greek are readable, Spanish is not. If I can find a reliable conversion method I would gladly implement it in my SpeedDials tool.

Code: Select all

#!/bin/sh
todecode='<?xml version="1.0"?>
<bookmarks>
  <bookmark title="Yokohama">/root/&#x6A2A;&#x6D5C;</bookmark>
  <bookmark title="Athens">/root/&#x391;&#x3B8;&#x3AE;&#x3BD;&#x3B1;</bookmark>
  <bookmark title="Malaga">/root/M&#xE1;laga</bookmark>
</bookmarks>'

T=$(sed 's/&#x\([0-9A-F]*\)/\\u\1/g' <<< "$todecode")
gxmessage "$(echo -e "$T" | sed 's/;//g')"
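
One way to sidestep \u altogether is to compute the UTF-8 bytes directly. The following is only a sketch under stated assumptions, not some1's decoder: decode_charrefs is a hypothetical name, only hexadecimal references (&#x...;) and the five named XML entities are handled, and it assumes awk's %c emits a single raw byte when run with LC_ALL=C.

```sh
# Sketch: decode &#xHHHH; references and the five named XML entities
# into raw UTF-8 bytes, without relying on printf's \u support.
# decode_charrefs is a hypothetical name; input is a file argument or stdin.
decode_charrefs() {
  LC_ALL=C awk '
    # Hex string -> number (plain awk has no hex conversion).
    function hex2dec(h,    i, n) {
      h = toupper(h); n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
      return n
    }
    # Code point -> UTF-8 byte sequence (1 to 4 bytes).
    function utf8(cp) {
      if (cp < 128)   return sprintf("%c", cp)
      if (cp < 2048)  return sprintf("%c%c", 192 + int(cp / 64), 128 + cp % 64)
      if (cp < 65536) return sprintf("%c%c%c", 224 + int(cp / 4096),
                                     128 + int(cp / 64) % 64, 128 + cp % 64)
      return sprintf("%c%c%c%c", 240 + int(cp / 262144),
                     128 + int(cp / 4096) % 64,
                     128 + int(cp / 64) % 64, 128 + cp % 64)
    }
    BEGIN {
      named["amp"] = "&"; named["lt"] = "<"; named["gt"] = ">"
      named["quot"] = "\""; named["apos"] = "\047"
    }
    {
      out = ""; rest = $0
      # Consume the line left to right so decoded text is never rescanned.
      while (match(rest, /&(#x[0-9A-Fa-f]+|amp|lt|gt|quot|apos);/)) {
        ref = substr(rest, RSTART + 1, RLENGTH - 2)
        out = out substr(rest, 1, RSTART - 1)
        out = out (substr(ref, 1, 2) == "#x" ? utf8(hex2dec(substr(ref, 3))) : named[ref])
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$@"
}
```

With the bookmark text above, something like `printf '%s\n' "$todecode" | decode_charrefs` should print all three paths in readable form on a UTF-8 terminal, since 2-, 3- and 4-digit references all go through the same arithmetic.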

some1 · #20 Post by **some1** » Tue 07 Oct 2014, 17:55

@MochiMoppel
You may not find this helpful -
but I could not resist.
