SMS backup messages - strip text from .vmg files [SOLVED]

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#41 Post by MochiMoppel »

Sorry.

Change

Code: Select all

sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
to

Code: Select all

sed -n '/^Date/ N;s/Date://;s/[\n\r]\+TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Using [\n\r] instead of [^0-9] should now match only a "TEXT:" that is preceded by Unix or Microsoft style line endings. Changing '*' to '\+' shouldn't make a difference, but it explicitly stipulates that the first "TEXT:" encountered must be preceded by at least one line-end character. The '\+' is a GNU extension and might only work in GNU sed. An unescaped '+' has that meaning only with extended regular expressions (sed -E, or -r in older GNU sed); in a basic regular expression it is a literal plus sign.
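
For a sed that lacks the GNU escapes, one possible workaround is to delete the carriage returns with tr first, so that only a plain newline has to be matched, and to pass the tab in through a shell variable. This is only a sketch and has not been run against the real phone backups; the helper name date_text and the sample file are made up for illustration.

```shell
# Sketch: the same Date/TEXT join without \+, \r or \t inside sed.
# date_text is a hypothetical helper; pass it one or more .vmg files.
date_text() {
    tab=$(printf '\t')
    for f in "$@"; do
        # tr strips every CR, so sed only has to deal with the newline
        tr -d '\r' < "$f" | sed -n "/^Date:/{
N
s/^Date://
s/\\nTEXT:/$tab/p
}"
    done
}

# Made-up sample with CRLF line endings, like the examples in this thread
sample=$(mktemp)
printf 'BEGIN:VBODY\r\nDate:02.07.2015 19:00:04\r\nTEXT:Hello there\r\nEND:VBODY\r\n' > "$sample"
date_text "$sample"
rm -f "$sample"
```

As a side effect the output no longer carries a trailing CR, which the original command leaves in place.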

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#42 Post by MochiMoppel »

@seaside
Wouldn't fdate=${fdate%% *} remove the time strings?

If you want to get rid of the trailing CR you could do something like fdate=${fdate%?} or fdate=${fdate::-1} (the latter needs bash 4.2 or newer), but this would only work when lines end with CRLF (i.e. it would work with greengeek's examples).
To get rid of any type of trailing control characters you probably would be better off with something like fdate=${fdate%[[:cntrl:]]*}
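
To make the difference between these expansions visible, here is a small sketch with a made-up date string. The variant that names the CR explicitly is an addition for illustration, not something proposed in the thread; it removes a trailing CR when there is one and leaves the string alone otherwise.

```shell
cr=$(printf '\r')                  # a literal carriage return
fdate="02.07.2015 19:00:04$cr"     # made-up value with a trailing CR

date_only=${fdate%% *}             # longest suffix starting at a space: the time is gone too
no_cr=${fdate%"$cr"}               # removes exactly one trailing CR, if present

plain='02.07.2015 19:00:04'        # same value from a file with Unix line endings
unchanged=${plain%"$cr"}           # nothing to remove, string stays as it is

printf '%s\n' "$date_only" "$no_cr" "$unchanged"
```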

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#43 Post by some1 »

Well --

sed:
seaside gave us the concatenation of the lines -
after that we just use grouping on what we want as output - interspersed
with tabs ad libitum:

Code: Select all

sed -n '/^Date/ N;s/Date:\(.*\)\.2015 \(.*\)\r\nTEXT:\(.*\)\r/\1\t\2\t\3/p' /pathtosmsdir/*.vmg >yourextracts

bash:
I learned a lot from seaside's postings through the years - so if you think that the code below resembles his take on things - that's no coincidence.
However - I am not a seasoned bashscripter - so any mistakes,brainfarting are mine.

Code: Select all

tb=$'\t'
cr=$'\r'
FILES=/path2smsdir/*.vmg
for f in $FILES; do
[ ! -f "$f" ] && continue #optionally
while IFS=": " read -r tkn rest || [ "$tkn" ]; do
rest="${rest/$cr/}"
case $tkn in
Date) first="${rest/.2015 /$tb}";;
TEXT) printf "%s\t\t%s\n" "$first" "$rest";;
esac 
done < "$f" >>yourextracts
done
EDITED:
Just back from a swim,its very hot and humid around here -
but I think the fog cleared a bit.
The code works - but is really a mindfuck.
There is no point in splitting on space in this.
And we probably have to have 2 stringsops
So more or less :wink: - same thing as seaside.

Code: Select all

tb=$'\t'
cr=$'\r'
FILES=/path2smsdir/*.vmg
for f in $FILES; do
[ ! -f "$f" ] && continue #optionally
while IFS=":" read -r tkn rest || [ "$tkn" ]; do

case $tkn in
Date) rest="${rest/$cr/}";first="${rest/.2015 /$tb}";;
TEXT) printf "%s\t\t%s\n" "$first" "${rest/$cr/}";;
esac 

done < "$f" >>yourextracts
done
----
I will let my question to Technosaurus stand -
just in case he has some gems to share.
---------------------------------------------------------
@Technosaurus:
I really dont want this

Code: Select all

rest="${rest/$cr/}"
I wanted/hoped for - doing all the splitting
in this line:
while IFS=": " read -r tkn rest || [ "$tkn" ]; do -line

I tried the -d switch with $'\r\n',
tried adapting IFS - to no avail.
Any suggestions?
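
One possible way around the per-line substitution, offered as a sketch and not as an answer from the thread: strip the CRs once per file with tr before read ever sees the lines. As far as read -d goes, bash only uses the first character of the delimiter, so $'\r\n' cannot work as a two-character terminator. The function name extract_sms and the sample data are made up.

```shell
# extract_sms is a hypothetical name; it takes .vmg files as arguments.
extract_sms() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # tr removes every CR, so tkn and rest never contain one
        tr -d '\r' < "$f" | while IFS=: read -r tkn rest || [ -n "$tkn" ]; do
            case $tkn in
                Date) first=$rest ;;
                TEXT) printf '%s\t%s\n' "$first" "$rest" ;;
            esac
        done
    done
}

# Made-up sample with CRLF line endings
sample=$(mktemp)
printf 'BEGIN:VBODY\r\nDate:02.07.2015 19:00:04\r\nTEXT:Hello there\r\nEND:VBODY\r\n' > "$sample"
extract_sms "$sample"
rm -f "$sample"
```

The price is one extra tr process per file, which should not matter for files this small.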
-----

My reason for doing the timings mentioned previously -
occurred - because initially we did not know the size of strings
and filebytes.
We just had a few example lines of data to go by.

Bash scales badly- I wanted a grip on when/if to drop a bash-solution.
We now know - that files are smallish-that bash wont crawl.
An awk-call IS costly - BUT....
Timings will probably show - from fastest to slowest:
1) awk
2) bash
3) sed

--
So yes - greengeek - please run my codepieces - so we get the
typos/bugs squashed.
Last edited by some1 on Sat 04 Jul 2015, 20:51, edited 1 time in total.

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#44 Post by some1 »

As MochiMoppel indicated:

Hitherto - (all?) codepieces are based on the assumption that
the input-data are NOT jumbled i.e:
We ASSUME that
1) a Date: -line is immediately followed by
2) a TEXT:-line - which is NOT a multiline-field.

Can we/you rely on that assumption,greengeek?

I don't know the original data-format - but the "Yes \,"-sighting might indicate,
that things originally -in the phone- are stored message by message in
a simple comma-delimited format. If so - we just have to replace
"\," with "," e.g. s/\\,/,/ or whatever the tool allows.

Lets see the other "anomalies".
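
For the "\," artefact mentioned above, a one-line sed filter is probably enough. A minimal sketch, with a made-up function name and sample text; it assumes that a backslash directly before a comma never belongs to the message itself.

```shell
# unescape_commas is a hypothetical name; it filters stdin to stdout.
unescape_commas() {
    sed 's/\\,/,/g'      # backslash-comma becomes a plain comma, everywhere on the line
}

printf '%s\n' 'Yes \, that works\, thanks' | unescape_commas
```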

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#45 Post by seaside »

This is a very engaging discussion.

MochiMoppel mentioned-
@seaside
Wouldn't fdate=${fdate%% *} remove the time strings?
Yes, starting from the end of the string remove all chars up to and including the furthermost space.

Greengeek, this is probably not a good time to bring this up, but isn't this sms data in a sql database? Couldn't you pull whatever data you wished with sqlite3 queries?

Cheers,
s

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#46 Post by greengeek »

seaside wrote: this is probably not a good time to bring this up, but isn't this sms data in a sql database? Couldn't you pull whatever data you wished with sqlite3 queries?
Sorry for the delay, I'm busy babysitting my granddaughter for a couple of days and not getting enough computer time :-)
I don't know anything about the format used to store the messages - the phone is an old 'button style' Samsung phone and it offers various options such as "move to folders", "move to phone" and "backup".
I use the "backup multiple" option and it copies the selected messages onto the internal microSD card and places them into the directory /ss-Backup-0001/Message with the .vmg suffix

I then put the microSD into a card reader to transfer the files to the PC. As far as I know there is no way to hook this particular phone up to the PC directly.

I don't know if the messages could be part of a database - I have no experience in that area. Maybe they are in database form when inside the phone (or inside the SIM?) but just get pulled out as the .vmg format during the backup.

I hope to get more testing done later today (more babysitting on the agenda for now..)

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#47 Post by MochiMoppel »

@greengeek
AFAIK sms messages (the stuff after TEXT:) may contain line breaks, your examples don't. If this would have to be considered, some solutions (incl. mine) might have to be revised.

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#48 Post by 6502coder »

some1 wrote:As MochiMoppel indicated:

Hitherto - (all?) codepieces is based on the assumption that
the input-data are NOT jumbled i.e:
We ASSUME that
1) a Date: -line is immediately followed by
2) a TEXT:-line - which is NOT a multiline-field.
My awk solutions (as well as some of the other solutions) do not require that the TEXT line immediately follows the Date line, although they DO assume that the Date line occurs before its corresponding TEXT line, and that Date-TEXT pairs aren't interwoven (i.e. once a Date line occurs, it is assumed that the matching TEXT line will occur before any other Date line occurs).

If TEXT records can have internal hard EOLs, then evidently we'd have to infer the end of the TEXT by looking for the next record header string ("Date:", "TEXT:", etc.--which would therefore have to be a known list of strings) or EOF.
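
A rough sketch of that idea, wrapped in a shell function: keep printing lines after TEXT until a line starts with one of the known headers, or the file ends. The header list (BEGIN:, END:, Date:) is an assumption based on the samples posted so far, and the function name is made up. A message line that itself starts with "Date:" would end the text early.

```shell
# text_until_header is a hypothetical name; it takes one .vmg file.
text_until_header() {
    tr -d '\r' < "$1" | awk '
        /^(BEGIN|END|Date):/ { intext = 0 }    # assumed list of record headers
        /^TEXT[;:]/          { intext = 1; sub(/^TEXT[^:]*:/, ""); print; next }
        intext               { print }         # continuation line of the message
    '
}

# Made-up sample with a two-line message
sample=$(mktemp)
printf 'BEGIN:VBODY\r\nDate:02.07.2015 19:00:04\r\nTEXT:line one\r\nline two\r\nEND:VBODY\r\n' > "$sample"
text_until_header "$sample"
rm -f "$sample"
```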

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#49 Post by seaside »

Another approach would be to grab the data between "BEGIN:VBODY" and "END:VBODY" (that's assuming all messages only go there)

Code: Select all

sed  '/^BEGIN:VBODY/,/^END:VBODY/!d;//d' msgfile
Then any multi-line data after "TEXT" could be handled.

Cheers,
s

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#50 Post by some1 »

@seaside

YES - VBODY-tag is the safe approach.
Like sed, awk has the ability to do ranges.
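
In awk the same extract could look roughly like this. It is a sketch with a made-up function name; a flag variable is used instead of a range pattern so that the two tag lines themselves are left out.

```shell
# vbody is a hypothetical name; it prints the lines strictly between the tags.
vbody() {
    awk '/^END:VBODY/   { inside = 0 }
         inside         { print }
         /^BEGIN:VBODY/ { inside = 1 }' "$@"
}

# Made-up sample
sample=$(mktemp)
printf 'BEGIN:VENV\r\nBEGIN:VBODY\r\nDate:02.07.2015 19:00:04\r\nTEXT:Hello there\r\nEND:VBODY\r\nEND:VENV\r\n' > "$sample"
vbody "$sample"
rm -f "$sample"
```

The CRs are passed through untouched here, so the output still has DOS line endings.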

If we look at the 4 data-examples provided by greengeek -
it would be sufficient to just do a direct/dedicated extract of lines 13,14
- which is also quite similar in sed,awk.

BUT - I don't believe that the output-datastructure is static/stable.
Tagging is a tree-thing - to allow for flexibility with respect to
views,speed and storage.
That's why - I raised the aspect of possibly jumbled data and multi-lines.
So - until more is known about the datastructure - the safe approach
will probably be like you do - covering the possibilities inside the VBODY.


Hitherto we have -somewhat focused on TEXT: and Date: as if they are KEYWORDS,FIELDNAMES - but that may not necessarily be the case.

I thought of this silly thing - to get a grip on the datastructure:
Grab the BEGIN|END-VBODY-TAGS-part
1.count the lines
2.register the Date: and TEXT: lines position inside the TAGS-extract.
Let greengeek run it on ALL files.
Then - based on the statistics - the questions about the reliability of
the structure would likely be answered - and with that the final choice
of grabbing method for production.
But - in effect - that would be writing the "solution" -just to get to the solution :) .Much like writing a test-program to test if the real program is sane.
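
Such a check could be sketched as below. It is not the script attached later in the thread, only an illustration of the two steps listed above; the function name and the output format are made up.

```shell
# vbody_stats is a hypothetical name; it prints one line of statistics per file.
vbody_stats() {
    for f in "$@"; do
        tr -d '\r' < "$f" | awk -v name="$f" '
            /^END:VBODY/   { inside = 0 }
            inside         { n++                       # 1. count the lines in VBODY
                             if (/^Date:/) d = n       # 2. remember where Date: sits
                             if (/^TEXT/)  t = n }     #    and where TEXT sits
            /^BEGIN:VBODY/ { inside = 1 }
            END { printf "%s lines=%d Date=%d TEXT=%d\n", name, n, d, t }
        '
    done
}

# Made-up sample
sample=$(mktemp)
printf 'BEGIN:VBODY\r\nDate:02.07.2015 19:00:04\r\nTEXT:Hello there\r\nEND:VBODY\r\n' > "$sample"
vbody_stats "$sample"
rm -f "$sample"
```

If every file reports lines=2 Date=1 TEXT=2, the simple Date-then-TEXT assumption holds; anything else points at a file worth eyeballing.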

Better let greengeek do some eyeballing :)

Besides the reliability of the structure - greengeeks "anomalies" can
refer to formatting/escape - issues.

I will probably retire to lurking.

But as you observed :A rich thread.

Edited:
VBODY - not vbox
Last edited by some1 on Mon 06 Jul 2015, 14:32, edited 1 time in total.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#51 Post by greengeek »

some1 wrote:Let greengeek run it on ALL files.
Then - based on the statistics - the questions about the reliability of
the structure would likely be answered
There are a couple of anomalies that I have seen which I may need to handle manually if they are too difficult to fix programmatically. One is the \, that you mentioned - the strange thing is that this odd way of presenting a "comma" only occurs from some of the cellphones that text me, not all of them. Maybe it is caused by a difference in language selection or encoding in those phones? I will try to find an explanation for this later on. In the meantime I plan to use Geany to resolve these artifacts with a simple manual replace, after the main script has stripped the text from the messages.
The second issue relates to one (only one out of several hundred) messages that I have processed recently - I thought I had kept this odd text as an example of the strange format but when I view it now it does not contain the anomalous format. I don't know why it is now correct as I do not remember manually reformatting it. However, from memory, this text was formatted something like the following:

Code: Select all

TEXT;encoding non-standardUTF8:Wow that seems like a lively thing to say :-) %^$%#@$
Note the semicolon after the word TEXT; rather than the colon previously seen. I do wonder if one of the special characters the sender used may have been a character my phone did not recognise.
I am hoping the sender may be able to retrieve this text and resend it to me so that I can work out if this was a genuinely weird text format or some residue of a mistake I may have made with cutting/pasting (I don't think so).

Other than these two anomalies the texts I have posted so far appear to be standard examples from the one cellphone I most need to record messages from.

As soon as I get time I need to try a couple more of the code sample improvements posted above then I will launch into using the scripts to harvest as many as possible of the 900 approx messages in my inbox. Then I should be able to offer a bit more info in terms of how variable the message format is.

Why do I have so many messages and why do I want to save them? Well, my son is on strong psychiatric medications and needs to be constantly monitored for signs of deterioration. The content of his texts offers a very clear indication of the state of his thinking and makes it possible to predict when things are not good and when he is likely to be needing hospital time. The texts can range from friendly and understandable through to aggressive, frightening and threatening so it is valuable to be able to assess the situation without the emotional tension that can accompany face to face communication. This text record also allows me to show his health workers and doctors that there are cycles in his behaviour that mirror his medication regime, and also allows me to demonstrate that he is capable of normal human interaction when the pressure of interaction with the medical and judicial system is not confronting him.

The help you guys have given me in putting these script options together means that already I can process many texts in the time it has previously been taking me to manually process a handful. This is clearly going to be a massive timesaver for me and I am very grateful for all of the contributions. Even last night, while I was distracted with babysitting I was able to backup and harvest over 50 texts in the time that might normally have only allowed me to type 5 or so of them. Thank you all for the help, and I will be reporting back with more tests/info as soon as I can.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#52 Post by MochiMoppel »

greengeek wrote:

Code: Select all

TEXT;encoding non-standardUTF8:Wow that seems like a lively thing to say :-) %^$%#@$
See 2nd example.

Pure bash:

Code: Select all

for f in *.vmg;do
	fmsg=$(<"$f")
	fbody=${fmsg#*BEGIN:VBODY[[:cntrl:]]}
	fbody=${fbody%[[:cntrl:]]END:VBODY*}
	fdate=${fbody#*Date:}
	fdate=${fdate%%[[:cntrl:]]*}
	space=${fdate//?/ }
	ftext=${fbody#*TEXT*:}
	ftext=${ftext//$'\n'/$'\n'$space $'\t'}
	echo -e "$fdate \t$ftext"  >> result.txt
done
Attachments
MultilineMessages.png
(70.36 KiB) Downloaded 538 times

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#53 Post by some1 »

Hi
To save the eyeballing with respect to the structure inside VBODY,
run the attached script on your SMS-dir.

Formatting problems are different issues.
Attachments
somecheck.tar.gz
Will check the STRUCTURE inside VBODY
Read the instructions.
(918 Bytes) Downloaded 205 times

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#54 Post by 6502coder »

Here's yet another version of the awk script. This one handles multi-line TEXT sections by assuming that the TEXT section is immediately followed by an END:VBODY record.

Code: Select all

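$ cat dt3.awk
# This version handles multi-line TEXT sections
# A multi-character RS is an extension (gawk and mawk support it)
BEGIN { RS="\r\n" }
/^Date:/ { printf( "%s\t%s", substr($1,6), $2 ) }
/^TEXT:/ {
    text = sprintf( "\t%s\n", substr($0,6) )
    # append continuation lines until END:VBODY; stop at EOF as well,
    # so a file without END:VBODY cannot loop forever
    while ((getline y) > 0 && y != "END:VBODY")
        text = text y "\n"
    printf( "%s", text )
}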

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#55 Post by greengeek »

Sorry for the delay - I haven't seen my 3 year old granddaughter for over a year and she is full of beans and takes up every spare minute. Damn she's a little cutie. Easier being a grandparent than a parent :-)
some1 wrote: tb=$'\t'
cr=$'\r'
FILES=/path2smsdir/*.vmg
for f in $FILES; do
[ ! -f "$f" ] && continue #optionally
while IFS=":" read -r tkn rest || [ "$tkn" ]; do

case $tkn in
Date) rest="${rest/$cr/}";first="${rest/.2015 /$tb}";;
TEXT) printf "%s\t\t%s\n" "$first" "${rest/$cr/}";;
esac

done < "$f" >>yourextracts
done
This works well, but does not include the .2015 portion of the date (which is in line with my original request but I now feel I should include the whole date for clarity)
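
Keeping the year only needs the first space turned into a tab, instead of replacing the ".2015 " string. A minimal sketch with a made-up value; in bash the same thing could be written as first="${rest/ /$tb}" inside the loop above.

```shell
tab=$(printf '\t')
rest='02.07.2015 19:00:04'                 # made-up Date value, CR already removed

# everything before the first space, a tab, everything after the first space
first="${rest%% *}${tab}${rest#* }"
printf '%s\n' "$first"
```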
some1 wrote:Hitherto - (all?) codepieces is based on the assumption that
the input-data are NOT jumbled i.e:
We ASSUME that
1) a Date: -line is immediately followed by
2) a TEXT:-line - which is NOT a multiline-field.

Can we/you rely on that assumption,greengeek?
Based on what I have seen so far, yes that is a fair assumption - however I have seen a couple of weird texts which - although they did follow this pattern - were not quite the same as the others. Unfortunately I accidentally erased these texts and they have also been deleted from the senders outbox (quite annoying) so I cannot get them re-sent for further testing. I will post any future oddities that I find.
MochiMoppel wrote:#!/bin/bash
for f in /root/Message/*.vmg;do
fmsg=$(<"$f")
fbody=${fmsg#*BEGIN:VBODY[[:cntrl:]]}
fbody=${fbody%[[:cntrl:]]END:VBODY*}
fdate=${fbody#*Date:}
fdate=${fdate%%[[:cntrl:]]*}
space=${fdate//?/ }
ftext=${fbody#*TEXT*:}
ftext=${ftext//$'\n'/$'\n'$space $'\t'}
echo -e "$fdate \t$ftext" >> result.txt
done
This works perfectly. Thanks.
6502coder wrote:$ cat dt3.awk
# This version handles multi-line TEXT sections
BEGIN { RS="\r\n" }
/^Date:/ { printf( "%s\t%s", substr($1,6), $2) }
/^TEXT:/ { text = sprintf( "\t%s\n", substr($0,6))
getline y
while (y != "END:VBODY")
{
text = text y "\n"
getline y
}
printf( "%s", text )
}
This works perfectly - although I have no multiline texts to run against it at the moment. I'm not sure how to generate some for testing - I will have to give this a go in my next lot of tests.
some1 wrote: # if contents of SEEN = POSSIBLES = EXPECTED
# - we have no jumbling,no multilines-TEXT.
I ran the script against my Messages directory and came up with 3 identical files as follows (a cluster of 40 messages):
(Therefore "no jumbling,no multilines-TEXT" I guess?)

Code: Select all

/root/Message/20150702190004_SendersPhNumber.vmg
/root/Message/20150703142734_SendersPhNumber.vmg
/root/Message/20150703143608_SendersPhNumber.vmg
/root/Message/20150703145102_SendersPhNumber.vmg
/root/Message/20150703150341_SendersPhNumber.vmg
/root/Message/20150703151149_SendersPhNumber.vmg
/root/Message/20150703151302_SendersPhNumber.vmg
/root/Message/20150704033934_SendersPhNumber.vmg
/root/Message/20150704034204_SendersPhNumber.vmg
/root/Message/20150704034738_SendersPhNumber.vmg
/root/Message/20150704101034_SendersPhNumber.vmg
/root/Message/20150704102452_SendersPhNumber.vmg
/root/Message/20150704104226_SendersPhNumber.vmg
/root/Message/20150704105157_SendersPhNumber.vmg
/root/Message/20150704110739_SendersPhNumber.vmg
/root/Message/20150704111553_SendersPhNumber.vmg
/root/Message/20150704173318_SendersPhNumber.vmg
/root/Message/20150704173519_SendersPhNumber.vmg
/root/Message/20150704173618_SendersPhNumber.vmg
/root/Message/20150704173903_SendersPhNumber.vmg
/root/Message/20150704174047_SendersPhNumber.vmg
/root/Message/20150704174235_SendersPhNumber.vmg
/root/Message/20150704174715_SendersPhNumber.vmg
/root/Message/20150705113110_SendersPhNumber.vmg
/root/Message/20150705113658_SendersPhNumber.vmg
/root/Message/20150705191705_SendersPhNumber.vmg
/root/Message/20150705192812_SendersPhNumber.vmg
/root/Message/20150705193532_SendersPhNumber.vmg
/root/Message/20150705194401_SendersPhNumber.vmg
/root/Message/20150705195247_SendersPhNumber.vmg
/root/Message/20150705195554_SendersPhNumber.vmg
/root/Message/20150705195911_SendersPhNumber.vmg
/root/Message/20150705200221_SendersPhNumber.vmg
/root/Message/20150705200538_SendersPhNumber.vmg
/root/Message/20150706170304_SendersPhNumber.vmg
/root/Message/20150706170656_SendersPhNumber.vmg
/root/Message/20150706173741_SendersPhNumber.vmg
/root/Message/20150706175602_SendersPhNumber.vmg
/root/Message/20150706184930_SendersPhNumber.vmg
/root/Message/20150706222731_SendersPhNumber.vmg

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#56 Post by some1 »

@greengeek:
I ran the script against my Messages directory and came up with 3 identical files as follows (a cluster of 40 messages):
(Therefore "no jumbling,no multilines-TEXT" I guess?)
Yes - if you dont see any ERRTYPE-files - and the 3 files mentioned above
are identical -then the structure inside the VBODY-tags is:
-----
Date: somedate
TEXT: one line of text
-----

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#57 Post by technosaurus »

Just thought I would mention that as of 1.25 geany (released this month) has a checkbox to allow multiline regex or otherwise uses sed-style matching.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#58 Post by greengeek »

technosaurus wrote:Just thought I would mention that as of 1.25 geany (released this month) has a checkbox to allow multiline regex or otherwise uses sed-style matching.
Thanks, I have also added your comment to the previous thread here for reference. Sounds like a handy new function.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#59 Post by greengeek »

Attached is an example of an oddball text that doesn't fit the usual pattern. It was sent from an iPhone:

Code: Select all

BEGIN:VMSG
VERSION:1.1
X-IRMC-STATUS:READ
X-IRMC-BOX:INBOX
BEGIN:VCARD
VERSION:2.1
N;CHARSET=UTF-8:倀愀甀氀 挀攀氀氀;;;;
FN;CHARSET=UTF-8:
TEL:+64212xxxxxx
END:VCARD
BEGIN:VENV
BEGIN:VBODY
Date:23.07.2015 19:32:59
TEXT;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
llow one with me (for later sharing). Paul
END:VBODY
END:VENV
END:VMSG
The "=E2=80=9C" and "=E2=80=9D" look as if they should be quotation marks, but on my phone they display as small solid circles that look like circular bullet points. Maybe the "=" in the middle of "yellow" is some sort of line feed but I'm not sure.

I think the text should read as follows:
not sure on "lexmark" but the red one is with him and the yellow one with me (for later sharing). Paul
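
The header of that TEXT line names the encoding: quoted-printable. There "=XX" stands for the byte with hex value XX, and a "=" at the very end of a line is a soft line break that joins the next line. The bytes E2 80 9C and E2 80 9D are the UTF-8 forms of the curly quotes “ and ”. A decoder can be sketched with awk as below; the function name qp_decode is made up, and the sketch has only been thought through for input like the sample above, not for every corner of the format.

```shell
# qp_decode is a hypothetical name; it filters stdin to stdout.
qp_decode() {
    # LC_ALL=C so that %c produces single bytes, not multibyte characters
    LC_ALL=C awk '
        BEGIN {
            # map "01".."FF" to the corresponding single byte
            for (i = 1; i < 256; i++) byte[sprintf("%02X", i)] = sprintf("%c", i)
        }
        {
            line = $0
            sub(/\r$/, "", line)              # drop the CR of a CRLF ending
            soft = sub(/[=]$/, "", line)      # trailing "=" means: no newline here
            out = ""
            while (match(line, /[=][0-9A-Fa-f][0-9A-Fa-f]/)) {
                out = out substr(line, 1, RSTART - 1) byte[toupper(substr(line, RSTART + 1, 2))]
                line = substr(line, RSTART + 3)
            }
            printf "%s%s", (out line), (soft ? "" : "\n")
        }'
}

# The message body from the sample above, shortened
printf 'not sure on =E2=80=9Clexmark=E2=80=9D but the ye=\r\nllow one with me\r\n' | qp_decode
```

The text after "TEXT;...:" would have to be cut out first, for example with one of the extraction scripts in this thread, and then piped through the decoder.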

I am hoping I can spot these weird things and handle them manually. Looks too complex and potentially variable to compensate for programmatically. I haven't run this text through the different extraction methods yet but will post back with anything notable.

EDIT : I had to use Leafpad to redact the phone number in the following vmg so hopefully I have not stuffed up the encoding or format.
.
Attachments
WEIRDTEXT20150723193259_Paul cell.vmg.gz
Ignore and remove the false .gz suffix
(I had to use Leafpad to redact the senders ph number so hopefully this has not stuffed anything up)
(436 Bytes) Downloaded 194 times
Last edited by greengeek on Tue 28 Jul 2015, 18:49, edited 1 time in total.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#60 Post by greengeek »

I tried running some of the extraction scripts against this one weird message (no other sms files in the directory) and got the following results:

MochiBash:
23.07.2015 19:32:59 not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=
llow one with me (for later sharing). Paul

some1 (LANG=C method):
23.07.2015 19:32:59 TEXT;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:not sure on =E2=80=9Clexmark=E2=80=9D but the red one is with him and the ye=

6502coder dt3.awk:
23.07.2015 19:32:59

seaside July 4 method:
./seasideJuly4method: line 4: $f: ambiguous redirect

mushers July 4th method and some1's July 5th method both captured nothing of the text body or the date/time.

Unfortunately my brain has gone off the boil a bit and I feel I have missed testing some of the contributions. In particular I couldn't be sure I was correctly combining the technosaurus and seaside IFS methods from July 4th 14.29ish.
