Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
All times are UTC - 4
SMS backup messages - strip text from .vmg files [SOLVED]
Page 1 of 5 [69 Posts]
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 06:23    Post subject: SMS backup messages - strip text from .vmg files [SOLVED]
Subject description: Save data from .vmg backup sms files - Samsung, Nokia, Blackberry
 

I needed to strip the date, time and text from SMS messages backed up from my old Samsung phone (which I can't connect via USB cable, so I have to rely on backing up messages to a micro SD card). Originally I extracted the data from the messages manually, then I realised I could get Geany to do some of the work, and then I realised there were other ways (awk, sed, etc.) to process an entire directory of backed-up SMS messages - refer to this thread here for the background.

That thread was about finding ways to tidy up text files that I had partially formatted by hand from the original SMS files, but it became clear that the data could be extracted directly from a directory of the raw text backups. Attached are samples of the original text files:
(The .gz suffix is false - just remove it)

EDIT: The end result I would like to achieve is a single text file containing the date, time and message text from all of the individual backed-up SMS files. The fields should be separated by tabs for readability, similar to this:
Code:
02.07.2015   19:56:17   Test message number 1
02.07.2015   19:56:42   Test message number two
02.07.2015   20:00:05   Yes\, this is another text message and I will make it a long one to see what happens if I go beyond the standard 160 characters and force it to send a very long text
02.07.2015   20:04:17   And here is a text containing other characters: "%&$#*@"


In the posts below I will evaluate methods for extracting the information I need
Attachments (the .gz suffix is false - just remove it):
20150702195617_Greenphone.vmg.gz (300 Bytes)
20150702195642_Greenphone.vmg.gz (302 Bytes)
20150702200005_Greenphone.vmg.gz (444 Bytes)
20150702200417_Greenphone.vmg.gz (335 Bytes)

Last edited by greengeek on Fri 31 Jul 2015, 17:10; edited 5 times in total
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 06:23

Quote from previous thread:

Quote:
I am currently trying 6502coder's method, using the dt.awk file in combination with the following script to process a whole directory of files containing the raw sms data such as you can see in the image. I just need to fine-tune the deletion of the TEXT: prefix and the addition of an extra backspace to bring that line up onto the previous line.
(The script is working for me but leaves the text data on the line below the date/time.)

Code:
for i in /root/sms/*
do
awk -f dt.awk "$i" >> data2.txt
done


dt.awk is:
Code:
{   if (substr($1, 1, 5) == "Date:")
    {
        printf( "%s\t%s", substr($1,6), $2);
    }
    else if (substr($1, 1, 5) == "TEXT:")
    {
        printf( "%s\n", substr($0,6));
    }
   
}


Next step (tomorrow): try 6502coder's suggestion for overcoming the \r\n line endings used in Microsoft-formatted files.
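One common way to cope with \r\n line endings is to strip the trailing carriage return from every line before the existing dt.awk rules run. This is a sketch of that general technique, not necessarily what 6502coder suggested. The function name extract_sms and the tab placed in front of the text are illustrative choices, and the Date:/TEXT: line layout is assumed from the sample files.

```shell
# extract_sms FILE...: print "date<TAB>time<TAB>text" for each message.
# Assumes each file has a "Date:DD.MM.YYYY HH:MM:SS" line followed by a
# "TEXT:..." line, as in the sample .vmg files.
extract_sms() {
    awk '
        { sub(/\r$/, "") }                       # drop the CR of a CRLF ending
        /^Date:/ { printf("%s\t%s", substr($1, 6), $2) }
        /^TEXT:/ { printf("\t%s\n", substr($0, 6)) }
    ' "$@"
}
```

Because awk accepts several files at once, something like extract_sms /root/sms/* > data2.txt would cover the whole directory without a for loop.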

Last edited by greengeek on Thu 02 Jul 2015, 06:39; edited 1 time in total
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 06:27

reserved
MochiMoppel


Joined: 26 Jan 2011
Posts: 754
Location: Japan

Posted: Thu 02 Jul 2015, 07:41

Code:
sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
ExtractTextFrom MultipleFilesInFolder.png
musher0


Joined: 04 Jan 2009
Posts: 5767
Location: Gatineau (Qc), Canada

Posted: Thu 02 Jul 2015, 11:40

Hi, greengeek.

Interesting little problem. Do you want each message re-formatted individually,
or grouped in a list?

BFN.

musher0

_________________
"Logical entities must not be multiplied needlessly." / "Il ne faut pas multiplier les êtres logiques inutilement." (Ockham)
musher0


Joined: 04 Jan 2009
Posts: 5767
Location: Gatineau (Qc), Canada

Posted: Thu 02 Jul 2015, 12:05

Hello again!

If you don't mind a list, here you go:

Code:
#!/bin/sh
# greengeek_msgs3a.sh
# musher0, July 2nd 2015.
####
cd ~/my-documents
MRM="/tmp/more_readable_msgs.txt"
FST="/tmp/1st-field";SND="/tmp/2nd-field"
#
echo -e "\tDate" > $FST;echo -e "\t\tTEXT" > $SND
for i in `ls -1 *.vmg`
do
awk -F":" '$1=="TEXT" { $1="";print "\n\t"$0 }' $i >> $SND
awk -F":" '$1=="Date" { print "\n"$2 }' $i >> $FST
done
paste $FST $SND > $MRM
clear;cat $MRM;echo # You can comment out this line.


You can add a "t" to the ls command on the "for" line, like so:
Code:
for i in `ls -1t *.vmg`

, if you want your messages in reverse chronological order (most recent
messages first).
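Note that ls -t sorts by file modification time, which can differ from the message time if the files have been copied around. Since the sample filenames start with a YYYYMMDDhhmmss timestamp, a reverse sort on the names is another option. A minimal sketch, with the function name list_vmg_reverse made up for illustration; it also avoids parsing ls output, so names containing spaces survive (names containing newlines are assumed not to occur):

```shell
# list_vmg_reverse DIR: print the .vmg paths in DIR, latest filename first.
# Assumes filenames begin with a sortable timestamp, as in the samples.
list_vmg_reverse() {
    for f in "$1"/*.vmg; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern if no match
        printf '%s\n' "$f"
    done | sort -r
}
```

The result can be consumed line by line, for example with list_vmg_reverse ~/my-documents | while IFS= read -r i; do ...; done.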

I hope that fits the bill. BFN

musher0
reformatted_msgs_2015-07-02.jpg
some1

Joined: 17 Jan 2013
Posts: 48

Posted: Thu 02 Jul 2015, 12:18

Well - a very deceptive picture, MochiMoppel -
but there are trailing CRLFs, don't you think? :roll:


6502coder wrote some simple, fast and flexible code,
and gave sound advice.
Go with that, greengeek, and learn some awk.


Home alone - real men - might ponder this:
Code:

LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'NR==2{print $1 "\t" $2}' /pathtosmsdir/* >yourextract
musher0


Joined: 04 Jan 2009
Posts: 5767
Location: Gatineau (Qc), Canada

Posted: Thu 02 Jul 2015, 12:56

@greengeek:

Second take.

Hours and seconds didn't show correctly in my first take. Sorry about that.

Even here, because of the way the line is structured, I had to substitute the
usual colons with h and m (hours & minutes) in the date line. And we know
that seconds are seconds because of their position. I don't think this way of
noting time is kosher according to the ISO (Int'l Standards Organization), but
it's sort of clear, and ok for casual use, I'd say.

For the "TEXT" line we can get away with
Code:
{ $1="";print "\n\t"$0 }'

which only removes the first field defined by a colon, and we print the rest
as is. With this awk form, colons in the message itself shouldn't matter.
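One caveat: with -F":", assigning to $1 makes awk rebuild $0 by joining the fields with the output field separator (a space by default), so any colon inside the message text is replaced by a space. A sketch of an alternative that removes only the TEXT: prefix and leaves the rest of the line untouched; the function name text_only is made up for illustration, and CRLF line endings are assumed:

```shell
# text_only FILE...: print the message body of each TEXT: line unchanged.
text_only() {
    awk '
        { sub(/\r$/, "") }       # assumption: files use CRLF line endings
        sub(/^TEXT:/, "")        # true only on TEXT: lines; the default action prints the line
    ' "$@"
}
```

With this form no field splitting touches the message, so colons in the text are preserved.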

For the date field, colons do matter. It's trickier. So I had to find a workaround
for these "internationally recognized" colons used to note time.

In any case, it's getting better all the time! :)

Code:
#!/bin/sh
# greengeek_msgs3b.sh
# musher0, July 2nd 2015.
####
cd ~/my-documents
MRM="/tmp/more_readable_msgs.txt"
FST="/tmp/1st-field";SND="/tmp/2nd-field"
#
echo -e "\tDate" > $FST;echo -e "\t\tTEXT" > $SND
for i in `ls -1 *.vmg`
do
awk -F":" '$1=="TEXT" { $1="";print "\n\t"$0 }' $i >> $SND
awk -F":" '$1=="Date" { print "\n"$2"h"$3"m"$4"\t" }' $i >> $FST
done
paste $FST $SND > $MRM
clear;cat $MRM;echo # You can comment out this line.


@some1:
This real man will ponder later, with a rested head! ;)

For now, you remind me of BK :) with your "LANG=C;" prefix. Would this
work in an internationalized context? Won't this filter out accented characters
in messages written, e.g., in Latin or Germanic languages?
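On the accented-character question: in the C locale awk works on plain bytes and does not filter anything out, so UTF-8 text should pass through unchanged. A quick check, using a made-up sample message and an illustrative function name (passthrough_check):

```shell
# passthrough_check: feed a UTF-8 message through awk under LANG=C and
# print what comes out. The sample text is made up for illustration.
passthrough_check() {
    printf 'TEXT:d\303\251j\303\240 vu\n' |
        LANG=C awk '{ sub(/^TEXT:/, ""); print }'
}
```

Keep in mind that LC_ALL, if set in the environment, takes precedence over LANG.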

Also, just to be a grouch ;), I think that you have too many " ' " for an awk
command on that line. Two's company, three's a crowd, you know! :)

Nice touch, this, though:
Code:
'NR==2{print $1 "\t" $2}'


Now where's the Aspirin bottle...

BFN.

musher0
reformatted_msgs_2015-07-02(1).jpg
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 14:24

some1 wrote:
6502coder wrote some simple, fast and flexible code, gave sound advice. Go with that greengeek,learn some awk..
I am pleased with the variety of answers proposed. I am picking up a bit of awk, a bit of sed, learning about regex and in general learning to be more accurate with syntax (a very weak point for me). One of the factors that will come into play in terms of my final script format is "can I tailor this to my future needs if something changes".

I have noticed one sms message which contains a non-standard format (I may post that later) and there are occasional oddities that pop up in a number of places (such as \, instead of , in message 3 above) so a script/syntax that is easily modified and also readable to my untrained eye may win out over compactness or processing speed. I am also hoping some of these answers may be useful to others too - and their sms backup formats may be different.

I aim to try every proposed method if I can. I'm learning a lot from this.
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 14:36

MochiMoppel wrote:
Code:
sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Thanks, this works. Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
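One way to get a tab between the date and the time is to replace the first space on the joined Date/TEXT line. The sketch below assumes GNU sed (which understands \t, \r and \n in expressions) and CRLF line endings; the function name sms_to_tsv is made up for illustration. This variant also strips the trailing carriage return.

```shell
# sms_to_tsv FILE...: date<TAB>time<TAB>text, one line per message.
# Assumes GNU sed and files with CRLF line endings.
sms_to_tsv() {
    sed -n '/^Date:/{
        N
        s/^Date://
        s/ /\t/
        s/\r\nTEXT:/\t/
        s/\r$//
        p
    }' "$@"
}
```

It could be called in the same way as the original one-liner, for example sms_to_tsv /mnt/home/test/*.vmg > /mnt/home/test/result.txt.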
Last edited by greengeek on Thu 02 Jul 2015, 15:24; edited 1 time in total
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 14:49

some1 wrote:
Home alone - real men - might ponder this:
Code:

LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'NR==2{print $1 "\t" $2}' /pathtosmsdir/* >yourextract
Thanks. I gave this a try and it works against a single message but then I tried to make it scan the whole directory by using this code:
Code:
for i in /root/sms/*
do
LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'NR==2{print $1 "\t" $2}' /root/sms/* >>smsout
done
but I got the following output:
Code:
02.07.2015 19:56:17   Test message number 1
02.07.2015 19:56:17   Test message number 1
02.07.2015 19:56:17   Test message number 1
02.07.2015 19:56:17   Test message number 1

I will have a closer look tonight and see what I am doing wrong.
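A likely explanation for the repeated line, offered as a reading of the code rather than something confirmed in the thread: the loop body passes /root/sms/* to awk instead of "$i", so every pass reads every file, and because NR keeps counting across files, NR==2 only ever matches the first message. The loop runs four times, hence four copies. One possible fix drops the loop and tests FNR, which restarts at 1 for each file. The function name extract_all is illustrative, and the field separator is written as \r\n(TEXT:)? which should match the same text as the original.

```shell
# extract_all DIR: one "date time<TAB>text" line per file in DIR.
# Assumes an awk that accepts a multi-character RS (gawk, mawk, busybox awk).
extract_all() {
    LANG=C awk -F'\r\n(TEXT:)?' -v RS='Date:' \
        'FNR == 2 { print $1 "\t" $2 }' "$1"/*
}
```

For example: extract_all /root/sms > smsout. Keeping the loop and passing "$i" to awk with the original NR==2 test should also work, since NR then starts fresh on each run.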
greengeek

Joined: 20 Jul 2010
Posts: 3218
Location: New Zealand

Posted: Thu 02 Jul 2015, 15:15

musher0 wrote:
For the date field, colons do matter. It's trickier. So I had to find a workaround
for these "internationally recognized" colons used to note time.
Hah, funny. I didn't even notice the colons were there till you mentioned them :)
The eye sees what it wants to see...
musher0


Joined: 04 Jan 2009
Posts: 5767
Location: Gatineau (Qc), Canada

Posted: Thu 02 Jul 2015, 16:09

Hi, greengeek & all!

I think this is it. With some fancy footwork! And with the help of my old
buddy replaceit!

Fancy footwork at line 12, where we change "TEXT:" to "TEXT¤", using the much rarer
character ¤, to avoid disturbing any colons that may be in the message.
The ¤ becomes the field delimiter for awk in the next line, so if there are any
colons in the message that follows, we don't need to worry about it! :)

(There's probably the odd leprechaun (or leprechaun-ess!) ;) with a mobile
phone out there who texts msgs with lots of "¤" in them on purpose, just to
complicate greengeek's life :), as in "¤¤¤000" :), but those would be a tiny
minority! Enough joking!)

Also at lines 16-18 where we restore the full ISO 8601 time notation. Please
see -- https://en.wikipedia.org/wiki/ISO_8601 -- for details.

According to this date standard, the date should be written yyyy-mm-dd, not
dd.mm.yyyy. I have to double-check, but I think the dd.mm.yyyy date
notation is a U.S. or a U.K. standard (meaning: it's OK to use it, lots of
people understand it, but it's a "regional" standard, not a worldwide one).

Code:
#!/bin/sh
# greengeek_msgs3c.sh # Dependency: replaceit
# musher0, July 2nd 2015.
#### set -xe
cd ~/my-documents
MRM="/tmp/more_readable_msgs.txt";FST="/tmp/1st-field"
SND="/tmp/2nd-field";MOD="/tmp/2nd-field.mod"
#
echo -e "\tDate" > $FST;echo -e "\t\tTEXT" > $SND
for i in `ls -1 *.vmg`;do
   awk '$1 ~ /TEXT/ { print $0 }' $i > $MOD
   replaceit --input=$MOD "TEXT:" "TEXT¤"
   awk -F"¤" '$1=="TEXT" { $1="";print "\n\t"$0 }' $MOD >> $SND
#
   awk -F":" '$1 ~ /Date/ { print "\n"$2"h"$3"m"$4"\t" }' $i >> $FST
   replaceit --input=$FST . - # We restore the ISO 8601 Standard.
   replaceit --input=$FST h : # Same.
   replaceit --input=$FST m : # Same.
done
paste $FST $SND > $MRM
clear;cat $MRM;echo # You can comment out this line. # set +xe


I'm sure MochiMoppel or another Puppyist will want to replace my replaceit lines
with some sed code? :) Go ahead! replaceit I know, sed I don't!
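A note on the date step: the sample files store dates as dd.mm.yyyy (02.07.2015 for 2 July 2015), so swapping the dots for hyphens gives 02-07-2015, which is still not the yyyy-mm-dd order described above. A sketch of one way to reorder the parts with awk instead; the function name iso_dates is made up for illustration, and the Date: line layout and CRLF endings are assumed from the samples:

```shell
# iso_dates FILE...: print "YYYY-MM-DD HH:MM:SS" for each Date: line.
# Assumes the DD.MM.YYYY HH:MM:SS layout seen in the sample files.
iso_dates() {
    awk '
        { sub(/\r$/, "") }                     # drop the CR of a CRLF ending
        /^Date:/ {
            split(substr($1, 6), d, /\./)      # d[1]=day d[2]=month d[3]=year
            printf("%s-%s-%s %s\n", d[3], d[2], d[1], $2)
        }
    ' "$@"
}
```

Because the time is taken from the second whitespace-separated field, its colons never need to be swapped out and restored.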

Illustration of results included below.

BFN.

musher0
reformatted_msgs_2015-07-02(2).jpg
some1

Joined: 17 Jan 2013
Posts: 48

Posted: Thu 02 Jul 2015, 16:50

Are we having fun? :lol:

I will edit my post above ASAP
6502coder

Joined: 23 Mar 2009
Posts: 165
Location: Western United States

Posted: Thu 02 Jul 2015, 17:09

Here's a more elegant version of my awk script. It includes the fix for the Microsoft-style line endings. Here it is in action on your 4 test files.

Code:
$ ls *.vmg
20150702195617_Greenphone.vmg*  20150702200005_Greenphone.vmg*
20150702195642_Greenphone.vmg*  20150702200417_Greenphone.vmg*

$ cat dt2.awk
BEGIN { RS="\r\n" }
/^Date:/    {   printf( "%s\t%s", substr($1,6), $2); }
/^TEXT:/    {   printf( "\t%s\n", substr($0,6)); }

$ awk  -f  dt2.awk  *.vmg
02.07.2015      19:56:17        Test message number 1
02.07.2015      19:56:42        Test message number two
02.07.2015      20:00:05        Yes\, this is another text message and I will make it a long one to see what happens if I go beyond the standard 160 characters and force it to send a very long text
02.07.2015      20:04:17        And here is a text containing other characters: "%&$#*@"


If you want column headings these are easily added with a printf statement added to the BEGIN section.
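A sketch of what that could look like, wrapping the same three rules in a shell function. The heading names (Date, Time, Text) and the function name extract_with_header are illustrative choices, not part of the original script:

```shell
# extract_with_header FILE...: same output as dt2.awk, preceded by a header row.
# Assumes an awk that accepts a multi-character RS (gawk, mawk, busybox awk).
extract_with_header() {
    awk '
        BEGIN    { RS = "\r\n"; printf("Date\tTime\tText\n") }
        /^Date:/ { printf("%s\t%s", substr($1, 6), $2) }
        /^TEXT:/ { printf("\t%s\n", substr($0, 6)) }
    ' "$@"
}
```

The header is printed once, before any file is read, so it appears only at the top even when many files are passed, as in extract_with_header *.vmg.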