SMS backup messages - strip text from .vmg files [SOLVED]

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#61 Post by MochiMoppel »

greengeek wrote:MochiBash:
23.07.2015 19:32:59 not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
llow one with me (for later sharing). Paul
Looks perfect to me.
It seems that your phone cannot handle text encoded as quoted-printable and therefore cannot store a decoded message. I would regard decoding as out of scope for any backup script. If you are eager to know what you are missing, you can use online tools to decode your weird messages:
Attachments
quoted-printable.png
(21.34 KiB) Downloaded 348 times

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#62 Post by 6502coder »

Hmm....the "weird" message has "TEXT" followed by a SEMICOLON this time -- before it's always been a COLON. Okay, easy enough to fix:

Code: Select all

$ cat dt4.awk
# This version handles multi-line TEXT sections
BEGIN { RS="\r\n" }
/^Date:/    {   printf( "%s\t%s", substr($1,6), $2) }
/^TEXT(:|;)/    {   text = sprintf( "\t%s\n", substr($0,6))
                getline y
                while (y != "END:VBODY")
                {
                    text = text y "\n"
                    getline y
                }
                printf( "%s", text )
            }
This tested okay on the "weird" message as well as all the previous ones. ("Okay" meaning, aside from the funky encoding stuff, of course. I agree with Mochi that that's somebody else's job...)
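One edge case the loop above does not cover is a file that is cut off before END:VBODY: getline then fails at end of input, y never changes, and the loop would not terminate. Below is a hedged sketch of a guarded loop, fed with a made-up sample message. As an assumption for portability, carriage returns are removed with tr first so that the default record separator works in any awk; this is a variation for illustration, not the dt4.awk that was tested in this thread.

```shell
#!/bin/bash
# Hypothetical truncated message: there is no END:VBODY line.
sample=$'Date:23.07.2015 18:58:58\r\nTEXT:I doubt it\r\nsecond line\r\n'

result=$(printf '%s' "$sample" | tr -d '\r' | awk '
/^Date:/     { printf("%s\t%s", substr($1, 6), $2) }
/^TEXT(:|;)/ {
    text = "\t" substr($0, 6) "\n"
    # stop at END:VBODY, but also when getline reaches end of input (returns 0)
    while ((getline y) > 0 && y != "END:VBODY")
        text = text y "\n"
    printf("%s", text)
}')
printf '%s\n' "$result"
```

The only functional change is the `(getline y) > 0` test in the loop condition; well-formed files that do contain END:VBODY take the same path as before.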

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#63 Post by MochiMoppel »

Well, greengeek mentioned it earlier:
On July 6 greengeek wrote:this text was formatted something like the following:

Code: Select all

TEXT;encoding non-standardUTF8:Wow that seems like a lively thing to say :-) %^$%#@$
Note the semicolon after the word TEXT; rather than the colon previously seen.
The question remains whether the stuff after the semicolon up to the first colon should be saved as well. I don't regard this as part of the text message (the sender didn't write it, the phone did) and therefore I didn't include it, but ultimately this will be for greengeek to decide.
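To make the two options concrete, here is a minimal bash sketch of both. The variable names and the sample line are made up for illustration; they are not taken from either script in this thread.

```shell
#!/bin/bash
# Hypothetical sample line, modelled on the "weird" message format.
line='TEXT;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:can=E2=80=99t afford an iwatch'

# Option 1: keep the phone-generated parameters, drop only the "TEXT;" or "TEXT:" tag.
with_header=${line#TEXT[:;]}

# Option 2: drop everything up to and including the first colon.
text_only=${line#*:}

printf '%s\n' "$with_header"
printf '%s\n' "$text_only"
```

Option 2 relies on the first colon being the one that ends the TEXT property, which holds for the samples posted here; a message format with a colon inside the parameters would need a stricter pattern.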

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#64 Post by greengeek »

Fantastic. I am going to mark the thread totally SOLVED. I now have two methods of handling vast quantities of texts in a very short time. There are slight differences between the methods, and each will probably be more appropriate at different times, depending on the balance of text content and which sender/phone the messages came from.

I just processed a small bunch of texts from 2 senders/phones and the results are both perfectly useable for my needs:

dt4.awk method:

Code: Select all

23.07.2015	17:17:26	I just have 2 keep drowning it in gaviscon and coffee. You comn past 2nite?
23.07.2015	18:19:54	Wat time r y gettn here?
23.07.2015	18:58:58	I doubt it
23.07.2015	19:32:59	CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
llow one with me (for later sharing). Paul
23.07.2015	19:49:15	CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:can=E2=80=99t afford an iwatch - Oh no - now i can\, of course :)
24.07.2015	12:20:53	Dnt know you mean mesh in the grille? Is it up 4 sale?
25.07.2015	11:44:14	Man i got the depo and my whole knee has swelled up\, its affecting the knee movement
MochiBash method:

Code: Select all

23.07.2015 17:17:26 	I just have 2 keep drowning it in gaviscon and coffee. You comn past 2nite?
23.07.2015 18:19:54 	Wat time r y gettn here?
23.07.2015 18:58:58 	I doubt it
23.07.2015 19:32:59 	not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
                    	llow one with me (for later sharing). Paul
23.07.2015 19:49:15 	can=E2=80=99t afford an iwatch - Oh no - now i can\, of course :)
24.07.2015 12:20:53 	Dnt know you mean mesh in the grille? Is it up 4 sale?
25.07.2015 11:44:14 	Man i got the depo and my whole knee has swelled up\, its affecting the knee movement
Thank you to all contributors, and especially Mochi and 6502coder, for the final results. This is a wonderful timesaver for me. And I have learnt plenty about bash and awk (regex and other stuff too) on the way. I started out hoping I could automate some small steps using geany and ended up with a vastly quicker, more accurate and almost totally automated process. Thank you!
musher0 wrote:When will you be handing 6502coder the "Oscar for Best Coder"? :)
Do we simple mortals get consolation prizes?
Maybe a nomination? :lol:
Yep, you all get a nomination for "best contributing coders of 2015". If I could, I would make some animated gif Gold, Silver and Bronze badges to award. As it is, I hope the following congratulatory images will suffice. Many thanks all for the assistance!
:)
.
(ps: thanks to "sevenoaksart" and "Heathers animations" for the gifs)
Attachments
sharingspin.gif
(76.87 KiB) Downloaded 234 times
4thuly.gif
(56.44 KiB) Downloaded 239 times
fireworks5.gif
(13.02 KiB) Downloaded 233 times
Last edited by greengeek on Wed 29 Jul 2015, 09:43, edited 1 time in total.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#65 Post by greengeek »

By the way Mochi - thanks for the info and links regarding quoted-printable text. I had never heard of it before and seeing it decoded like that was really helpful.
:!:

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#66 Post by some1 »

E.g. the '=E2=80=9C' is the quoted-printable representation of the
UTF-8 bytes of a LEFT DOUBLE QUOTATION MARK (U+201C).
So in sed we could do something like this:

Code: Select all

   echo -e "$( sed 's/=\([0-9A-F][0-9A-F]\)/\\x\1/g'<<< "$parsedTEXT")"
This decodes the =XX sequences back into the UTF-8 bytes they represent.
Something similar should be possible in awk, probably using gsub.

But the scheme also uses sequences with a special meaning,
e.g. a "=" at the end of a line marks a soft line break (the line was split only to limit its length),
so we have to augment the sed/awk preparation code to handle these as well
in order to come up with a fully-fledged decoder.

The scheme is probably well known and well documented - so....
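As a rough illustration of where that could lead, here is a sketch of a small decoder that combines the sed substitution with soft-line-break handling. It assumes bash (whose printf %b understands \xHH escapes) and uppercase hex digits, as in the samples in this thread; qp_decode is a hypothetical helper name, not part of any script posted here.

```shell
#!/bin/bash
# Sketch of a quoted-printable decoder; qp_decode is a made-up name.
qp_decode() {
    local s=$1
    s=${s//$'\r'/}       # drop carriage returns left over from CRLF line endings
    s=${s//=$'\n'/}      # "=" at end of line is a soft line break: rejoin the line
    s=${s//\\/\\\\}      # protect literal backslashes (e.g. "\,") from printf %b
    # turn each =XX (two uppercase hex digits) into a \xXX escape
    s=$(sed 's/=\([0-9A-F][0-9A-F]\)/\\x\1/g' <<< "$s")
    printf '%b\n' "$s"
}

qp_decode $'not sure on =E2=80=9Clexmark=E2=80=9D but the ye=\r\nllow one'
```

The soft line breaks are removed before the hex sequences are decoded, so an encoded "=" (=3D) cannot be mistaken for a line continuation afterwards.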

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#67 Post by MochiMoppel »

Good to see the end of the tunnel, but let me add just one line of code. I assume that you can live with a few weird characters here and there, so there is no need for perfect decoding, but 9 in a row is a bit too much. I suggest replacing the ones you will probably encounter most, encoded quotation marks, with something more readable. This will change 6 different types of quotation marks into single quotes:
for f in *.vmg;do
    fmsg=$(<"$f")
    fbody=${fmsg#*BEGIN:VBODY[[:cntrl:]]}
    fbody=${fbody%[[:cntrl:]]END:VBODY*}
    fdate=${fbody#*Date:}
    fdate=${fdate%%[[:cntrl:]]*}
    space=${fdate//?/ }
    ftext=${fbody#*TEXT*:}
    ftext=${ftext//=E2=80=9[BCDF89]/\'}
    ftext=${ftext//$'\n'/$'\n'$space $'\t'}
    echo -e "$fdate \t$ftext" >> result.txt
done
Original message:

Code: Select all

Date:23.07.2015 19:32:59
TEXT;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
llow one with me (for later sharing). Can=E2=80=99t afford an iwatch. Paul
Result:

Code: Select all

23.07.2015 19:32:59 	not sure on 'lexmark' but the red one is with him and the ye=
                    	llow one with me (for later sharing). Can't afford an iwatch. Paul

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#68 Post by 6502coder »

For the record, I consider that MochiMoppel kicked my butt with his concise solutions.

However, since some1 keeps prodding me to uphold the honor of the AWK crowd, here's one last go, incorporating MochiMoppel's mapping into single quotes and (optionally) the stripping out of the charset-encoding headers.

Code: Select all

$ cat dt5.awk
# This version handles multi-line TEXT sections
# If stripEncoding is set to 1, we will delete "CHARSET..." stuff
# preceding the actual text.
# To keep that stuff, just change stripEncoding to zero in the BEGIN section.
# Assists from MochiMoppel and some1

BEGIN { RS="\r\n"; stripEncoding=1 }

/^Date:/        {   printf( "%s\t%s", substr($1,6), $2) }
/^TEXT(:|;)/    {   text = "\t" substr($0,6) "\n"
                    getline y
                    while (y != "END:VBODY")
                    {
                        text = text y "\n"
                        getline y
                    }

                    gsub( /=E2=80=9[BCDF89]/, "'", text )

                    if (stripEncoding && index( text, "CHARSET" ))
                    {
                        cn = index( text, ":" ) + 1
                        text = "\t" substr( text, cn )
                    }
                    printf( "%s", text )
                }
It's been fun!

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#69 Post by greengeek »

Righto, I have selected three versions that I am likely to use. I have attached pets as I prefer to add such functionality temporarily in RAM rather than grafting it into my main sfs (I have no savefile). Attached pets are:

- smsvmgstrip_raw_awk which leaves Quoted-Printable characters untouched and also leaves the "CHARSET" warning intact. I would use this in circumstances where I have a desire to be alerted to any potential oddities.

- smsvmgstrip_clean_awk which substitutes the common quote marks and discards the CHARSET warnings.

- smsvmgstripper_bash which also substitutes the common quote marks and discards the CHARSET warnings. This is likely to be my most frequent choice for simplicity of operation.

The only requirement is that the SMS messages (.vmg suffix) must be inside the directory /root/Message. Each pet installs its relevant script in /root and each method exports a differently named text file, so all three can co-exist. The scripts can be relocated if desired - I just happen to like working in /root as I run in RAM without a savefile.

I will give these three a good thrash over the next week and see how things go...
cheers!
Attachments
smsvmgstrip_raw_awk-0.1.pet
(1.3 KiB) Downloaded 142 times
smsvmgstrip_clean_awk-0.1.pet
(1.56 KiB) Downloaded 125 times
smsvmgstripper_bash-0.1.pet
(1.18 KiB) Downloaded 139 times
