SMS backup messages - strip text from .vmg files [SOLVED]

MochiMoppel · #21 Post by **MochiMoppel** » Fri 03 Jul 2015, 00:17

greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).

sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt

some1 · #22 Post by **some1** » Fri 03 Jul 2015, 01:01

MochiMoppel wrote:
greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt

@MochiMoppel:
Your code produces trailing CRLF's

Before opening /mnt/home/test/result.txt in geany,
run

Code: Select all

od -c /mnt/home/test/result.txt

and you will see \r\n.

MochiMoppel · #23 Post by **MochiMoppel** » Fri 03 Jul 2015, 02:30

some1 wrote:Your code produces trailing CRLF's

The code preserves the line endings used by the source text.
If the source text used Unix LF line endings, the code would preserve those as well. Unless the user finds CRLFs disturbing or unbearable, I see no reason to take the trouble to change them.
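
If the carriage returns do turn out to be unwanted, one option is to filter the output through tr. This is only a minimal sketch: strip_cr is a hypothetical helper name, and the sample line is made up to stand in for a line of result.txt.

```shell
#!/bin/sh
# strip_cr (hypothetical name) deletes every carriage return from standard
# input, which turns CRLF line endings into plain LF.
strip_cr() {
    tr -d '\r'
}

# Made-up sample line with a DOS-style ending, standing in for result.txt.
cleaned=$(printf '04.07.2015 \t07:49:04\tHello\r\n' | strip_cr)

# Dump the cleaned text with od -c to inspect the remaining control characters.
printf '%s\n' "$cleaned" | od -c
```

In the setting of this thread the same filter could sit between sed and the redirection, for example `sed -n '...' *.vmg | tr -d '\r' > result.txt`. Note that it also removes any carriage return that happens to occur inside a message.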

musher0 · #24 Post by **musher0** » Fri 03 Jul 2015, 07:06

Who needs awk ?!

Who needs sed ?!

Who needs replaceit ?!

(Edited, about an hour later, from here... )
But we need to run the Date and TEXT lines through < tr '\r' '\n' >. Otherwise
the CRLF line ending will eat the first character of the next line in the
Linux text format. That's why we have to do it.

Code: Select all

#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h 'Date:' $i | cut -d':' -f2-5 | tr '\r' '\n'`"
	TXT="`grep -h 'TEXT:' $i | cut -d':' -f2-10 | tr '\r' '\n'`"
	echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR

(... to here.)
(16 h 30, Friday July 3rd: DAT and TEXT variables above re-edited to
incorporate a comment below by 6502coder. musher0)
Enjoy!

musher0

greengeek · #25 Post by **greengeek** » Fri 03 Jul 2015, 07:41

Holy Moly, there are enough ideas here to keep me going till Christmas!
My keyboard's melting.

musher0 · #26 Post by **musher0** » Fri 03 Jul 2015, 08:07

I just edited my script two posts up.

greengeek · #27 Post by **greengeek** » Fri 03 Jul 2015, 08:08

MochiMoppel wrote:sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt

Excellent, that gives just the right format. I will probably differentiate between inbox messages and sentbox messages by adding another extra tab after the time field for sentbox messages so they get indented for clarity. I will post a comparison of your old and new syntax so I can figure out where to add the second tab:

Code: Select all

Puts a "space" after date/time:
sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Puts a "tab" after date/time:
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt

greengeek · #28 Post by **greengeek** » Fri 03 Jul 2015, 08:23

musher0 wrote:Who needs awk ?!

Who needs sed ?! :

Code: Select all

#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h Date $i | cut -d':' -f2-5 | tr '\r' '\n'`"
	TXT="`grep -h TEXT $i | cut -d':' -f2-10 | tr '\r' '\n'`"
	echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR

Tres bon! It took me a while to realise where the output was.

The double spacing will be handy when I graft the sentbox messages in between the inbox messages.
Merci!

greengeek · #29 Post by **greengeek** » Fri 03 Jul 2015, 08:42

6502coder wrote:Here's a more elegant version of my awk script. It includes the fix for the Microsoft-style line endings. Here it is in action on your 4 test files.
Code: Select all
$ cat dt2.awk
BEGIN { RS="\r\n" }
/^Date:/    {   printf( "%s\t%s", substr($1,6), $2); }
/^TEXT:/    {   printf( "\t%s\n", substr($0,6)); }
Code: Select all
awk  -f  dt2.awk  *.vmg

Great - that's another one that works. Thanks, I'm spoilt for choice now.

I should do some tests with larger numbers of files next. Once I have confirmed reliability I will post an SMS that seems to have come through with a different format of text line (something to do with encoding, I think). I don't know if it will be possible to account for texts of that format or whether I will just have to visually inspect the output to trap them.

Anyway, on with the testing...

greengeek · #30 Post by **greengeek** » Fri 03 Jul 2015, 08:55

some1 wrote:therefore the code shall be:
Code: Select all
LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t" $2}' /pathtosmsdir/* >yourextract 

Thanks, that works well too. I managed to figure out how to add an extra tab between the time field and the text field by using this syntax:

Code: Select all

LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t\t" $2}' /root/sms/* >yourextract

technosaurus · #31 Post by **technosaurus** » Fri 03 Jul 2015, 09:10

Fwiw you can do similar stuff in bash by setting IFS=". " and using read x y z etc and a case statement.

greengeek · #32 Post by **greengeek** » Fri 03 Jul 2015, 09:52

technosaurus wrote:Fwiw you can do similar stuff in bash by setting IFS=". " and using read x y z etc and a case statement.

Google offers this link which I shall come back to:
http://mindspill.net/computing/linux-no ... haracters/

6502coder · #33 Post by **6502coder** » Fri 03 Jul 2015, 18:56

@some1

I enjoyed your clever idea. I haven't done any timings; a very wise man (Jon Bentley, author of "Programming Pearls", a classic book on programming tricks) taught me not to waste time optimizing what already runs fast enough. I have used awk scripts on files with hundreds of thousands of records, and speed has never been a problem.

@greengeek

A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.

If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.
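
The difference between an anchored and an unanchored match can be seen on a small made-up file. The sketch below assumes a hypothetical sample.vmg in a temporary directory; its two-line layout is only a guess based on the outputs shown in this thread, not the full .vmg format.

```shell
#!/bin/sh
# Build a hypothetical sample file; real .vmg files contain more fields.
tmpdir=$(mktemp -d)
printf 'Date:04.07.2015 07:56:10\r\nTEXT:Date: Text: DATE: TEXT: fox\r\n' > "$tmpdir/sample.vmg"

# Unanchored: counts every line containing "Date:", including the message body.
loose=$(grep -c 'Date:' "$tmpdir/sample.vmg")

# Anchored with ^: counts only lines that begin with "Date:".
strict=$(grep -c '^Date:' "$tmpdir/sample.vmg")

echo "unanchored matches: $loose, anchored matches: $strict"
rm -rf "$tmpdir"
```

The awk patterns /^Date:/ and /^TEXT:/ quoted earlier in the thread rely on the same ^ anchor.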

musher0 · #34 Post by **musher0** » Fri 03 Jul 2015, 19:58

6502coder wrote:(...)
@greengeek

A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.

If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.

Hi, 6502coder.

Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, may it be a character or a word. The programmer
is not a diviner, we can't anticipate what people will write.

In a grep context, we can reduce the probability by including the colon, as in:

Code: Select all

grep -h "TEXT:" *.vmg

Reducing the probability is the only thing we can do. What's the probability that
someone is going to type exactly "TEXT:", in capitals with a colon tacked onto it,
in the text field?

Still, awk is the only sure-fire solution, with its -- $1=="TEXT" -- statement.
This frees the other fields.
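
The awk comparison mentioned above could look roughly like this. It is a sketch rather than any poster's actual script: it assumes ':' as the field separator and uses a made-up two-line sample in place of the real files.

```shell
#!/bin/sh
# Hypothetical two-line message; real .vmg files contain more fields.
sample='Date:04.07.2015 07:53:02
TEXT:No we are not going on a "Date": it is just carpooling.'

# With -F: the first field is whatever precedes the first colon, so
# $1 == "TEXT" is true only for lines that start with "TEXT:".
# substr($0, 6) prints everything after the 5-character "TEXT:" prefix,
# so colons inside the message are kept.
text=$(printf '%s\n' "$sample" | awk -F: '$1 == "TEXT" { print substr($0, 6) }')

echo "$text"
```

Because the test is made on the first field only, a "Date:" or "TEXT:" that appears later in the message does not trigger a match.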

BFN.

musher0

greengeek · #35 Post by **greengeek** » Fri 03 Jul 2015, 20:36

6502coder wrote: A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.

Fair point. I have just sent a new bunch of messages to my phone and attached them. Here are the results of each of the scripts:

EDIT : The actual contents stripped out of the last text should be:

Code: Select all

Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.

6502:

Code: Select all

# awk  -f  dt2.awk  /root/sms/*.vmg
04.07.2015	07:49:04	I will look back at the last text and see if it contains the word date
04.07.2015	07:50:27	Yes it did contain the word date: is that a bad thing?
04.07.2015	07:51:19	I said dont call me just send a TEXT
04.07.2015	07:53:02	No we are not going on a "Date": it is just carpooling.
04.07.2015	07:56:10	Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
#

Mochi:

Code: Select all

04.07.2015 	07:49:04	I will look back at the last text and see if it contains the word date
04.07.2015 	07:50:27	Yes it did contain the word date: is that a bad thing?
04.07.2015 	07:51:19	I said dont call me just send a TEXT
04.07.2015 	07:53:02	No we are not going on a "Date": it is just carpooling.
04.07.2015 	07:56:10	 TEXTED: Data: Date. Quick brown fox.

some1:

Code: Select all

04.07.2015 07:49:04		I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27		Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19		I said dont call me just send a TEXT
04.07.2015 07:53:02		No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10		Date text DATE TEXT date Text

musher:

Code: Select all

	Date 			TEXT
04.07.2015 07:49:04 	I will look back at the last text and see if it contains the word date

04.07.2015 07:50:27 	Yes it did contain the word date: is that a bad thing?

04.07.2015 07:51:19 	I said dont call me just send a TEXT

04.07.2015 07:53:02

No we are not going on a "Date": it is just carpooling. 	No we are not going on a "Date": it is just carpooling.

04.07.2015 07:56:10

Date text DATE TEXT date Text Date: Text: DATE: TEXT 	Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.

6502coder · #36 Post by **6502coder** » Fri 03 Jul 2015, 21:29

musher0 wrote: Hi, 6502coder.

Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, may it be a character or a word. The programmer
is not a diviner, we can't anticipate what people will write.

Absolutely true, I agree. However--and this relates to the recent poll in the Really Off-topic area about what makes a "good" answer--in this case the possibility of a word like "Date" turning up occurred to me almost immediately, and so having anticipated a possible problem, I felt I shouldn't simply ignore it in my solution. I thought other posters, looking at my test data file and the output from the awk script, would notice what I was up to; but I guess I should have pointed out what I was worried about explicitly.

musher0 · #37 Post by **musher0** » Fri 03 Jul 2015, 21:58

@greengeek:
When will you be handing 6502coder the "Oscar for Best Coder"?

Do we simple mortals get consolation prizes?

Maybe a nomination?

Congrats to 6502coder, BTW!

musher0

musher0 · #38 Post by **musher0** » Fri 03 Jul 2015, 23:02

Hi, greengeek and all.

Spent the last hour trying to figure out why my script produces sort of a
double of my $TXT variable in the last vmg file.

I still can't understand why... but I have to face the reality that it does!

Edit: Found it! Phew.
It's the $DAT variable that picks up any mention of a "Date:". Not the $TXT
variable. So we limit it with a ${DAT:0:19} statement at "echo-ing" time.
Anything additional goes to la-la land. (The original "Date:" field in the vmg
files has a length of 19 characters.)

Code: Select all

#!/bin/ash
# greengeek_msgs4d.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h 'Date:' $i | cut -d':' -f2-4 | tr '\r' '\n'`"
	TXT="`grep -E -h 'TEXT:' $i | cut -d':' -f2-9 | tr '\r' '\n'`"
	echo -e "${DAT:0:19} \t$TXT\n" >> $MR
done
clear;cat $MR

TWYL.

musher0

seaside · #39 Post by **seaside** » Sat 04 Jul 2015, 01:29

Code: Select all

for f in /path/to/msgs/* ; do
while read line || [ "$line" ]; do
case $line in
	Date:*) fdate=${line#*:} fdate=${fdate%% *}  ;;
	TEXT:*) ftext=${line#*:} result="$fdate\t$ftext" ;;
esac	
done <$f
echo -e "$result"	

done

I think the technosaurus technique would handle it as well.

Cheers,
s

technosaurus · #40 Post by **technosaurus** » Sat 04 Jul 2015, 02:06

To just format it without removing the year, it's pretty easy to adapt seaside's solution to use IFS:

Code: Select all

IFS=":"
while read data_type data  || [ "$data_type" ]; do
case "$data_type" in
   *Date*)printf "$data ";;
   *TEXT*)echo "$data";;
esac
done

If you want to also use "." as a separator (for removing the year), you can just use IFS=":. " (the extra space is the separator after the year) and read additional named fields. Note that the last field will consume all data up to the end of the line.
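
A minimal sketch of the IFS=":. " idea follows. It is a variation, not the code posted above: each line is read whole first and only the Date line is split, because splitting the TEXT line on the same separators would break up the message. The variable names and the sample input are hypothetical, and carriage returns are assumed to have been removed already.

```shell
#!/bin/sh
# Made-up input; CR characters are assumed to be stripped beforehand.
sample='Date:04.07.2015 07:49:04
TEXT:Call me: today. Thanks'

out=$(printf '%s\n' "$sample" | while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
        Date:*)
            # Split only the Date line on ":", "." and space. The last name
            # (clock) receives the rest of the line, so 07:49:04 stays whole.
            # The year is read into its own variable and simply not printed.
            IFS=':. ' read -r tag day month year clock <<EOF
$line
EOF
            printf '%s.%s %s\t' "$day" "$month" "$clock" ;;
        TEXT:*)
            # The message is left unsplit so its own dots and colons survive.
            printf '%s\n' "${line#TEXT:}" ;;
    esac
done)

echo "$out"
```

Setting IFS only as a prefix to read keeps the change local to that one command, so the rest of the script still sees the default separators.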

(old)Puppy Linux Discussion Forum