SMS backup messages - strip text from .vmg files [SOLVED]

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#21 Post by MochiMoppel »

greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt

some1
Posts: 117
Joined: Thu 17 Jan 2013, 11:07

#22 Post by some1 »

MochiMoppel wrote:
greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
@MochiMoppel:
Your code produces trailing CRLFs

Before opening /mnt/home/test/result.txt in geany,
run

Code: Select all

od -c /mnt/home/test/result.txt
and you will see \r\n.
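
For a yes/no answer instead of reading the `od` dump, something like the following should do. It is only a sketch, and `has_cr` is a helper name made up for this example.

```shell
# has_cr FILE: succeed if FILE contains at least one carriage return.
# (has_cr is an invented helper name, not something from the thread.)
has_cr() {
    cr=$(printf '\r')      # a literal CR character
    grep -q "$cr" "$1"
}
```

Possible use: `has_cr /mnt/home/test/result.txt && echo "CR characters present"`.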

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#23 Post by MochiMoppel »

some1 wrote:Your code produces trailing CRLFs
The code preserves the line endings used by the source text.
If the source text used Unix LF line endings, the code would preserve those as well. Unless the user finds CRLFs disturbing or unbearable, I see no reason to go to the trouble of changing them.
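
For anyone who does want Unix line endings in the result, one option is to drop the carriage returns at the end of the pipeline. A minimal sketch follows; `to_unix` is a hypothetical helper name.

```shell
# to_unix: read text on stdin, write it to stdout without carriage returns.
# (to_unix is a hypothetical helper name.)
to_unix() {
    tr -d '\r'
}

# Possible use with the sed one-liner from this thread (paths as posted above):
# sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg | to_unix > /mnt/home/test/result.txt
```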


Last edited by MochiMoppel on Fri 03 Jul 2015, 07:08, edited 2 times in total.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#24 Post by musher0 »

Who needs awk ?! :lol: Who needs sed ?! :lol: Who needs replaceit ?! :lol:

(Edited, about an hour later, from here... )
But we need to run the Date and TEXT lines through < tr '\r' '\n' >.
Otherwise the CRLF line ending will eat the first character of the next line
in the Linux text format. That's why we have to do it.

Code: Select all

#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h 'Date:' $i | cut -d':' -f2-5 | tr '\r' '\n'`"
	TXT="`grep -h 'TEXT:' $i | cut -d':' -f2-10 | tr '\r' '\n'`"
	echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR
(... to here.)
(16 h 30, Friday July 3rd: DAT and TEXT variables above re-edited to
incorporate a comment below by 6502coder. musher0)

Enjoy!

musher0
Last edited by musher0 on Fri 03 Jul 2015, 20:33, edited 3 times in total.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#25 Post by greengeek »

Holy Moly, there are enough ideas here to keep me going till Christmas!
My keyboard's melting. :-)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#26 Post by musher0 »

I just edited my script two posts up.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#27 Post by greengeek »

MochiMoppel wrote:sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Excellent, that gives just the right format. I will probably differentiate between inbox messages and sentbox messages by adding another extra tab after the time field for sentbox messages so they get indented for clarity. I will post a comparison of your old and new syntax so I can figure out where to add the second tab:

Code: Select all

Keeps the "space" between date and time:
sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Adds a "tab" after that space:
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
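
The tab between the time and the message text comes from the replacement in the last `s` command, so that is where a second tab would go. Below is a rough sketch that passes the separator in as an argument. The name `sms_lines`, its `SEP` argument and the idea of keeping inbox and sentbox files in separate directories are all assumptions made for this example.

```shell
tab=$(printf '\t')   # literal tab character

# sms_lines SEP FILE...: the sed recipe from this thread, with the string that
# goes between the time and the message text passed in as SEP.
# SEP must not contain "/" or "&", which are special to sed here.
# (sms_lines and SEP are names invented for this sketch.)
sms_lines() {
    sep=$1
    shift
    sed -n "/^Date/ N;s/Date://;s/ / $tab/;s/[^0-9]*TEXT:/$sep/p" "$@"
}
```

Hypothetical usage: `sms_lines "$tab" inbox/*.vmg` for received messages and `sms_lines "$tab$tab" sentbox/*.vmg` for sent ones.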

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#28 Post by greengeek »

musher0 wrote:Who needs awk ?! :lol: Who needs sed ?! :

Code: Select all

#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h Date $i | cut -d':' -f2-5 | tr '\r' '\n'`"
	TXT="`grep -h TEXT $i | cut -d':' -f2-10 | tr '\r' '\n'`"
	echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR
Tres bon! It took me a while to realise where the output was :-)
The double spacing will be handy when I graft the sentbox messages in between the inbox messages.
Merci!

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#29 Post by greengeek »

6502coder wrote:Here's a more elegant version of my awk script. It includes the fix for the Microsoft-style line endings. Here it is in action on your 4 test files.

Code: Select all

$ cat dt2.awk
BEGIN { RS="\r\n" }
/^Date:/    {   printf( "%s\t%s", substr($1,6), $2); }
/^TEXT:/    {   printf( "\t%s\n", substr($0,6)); }

Code: Select all

awk  -f  dt2.awk  *.vmg
Great - that's another one that works. Thanks, I'm spoilt for choice now.
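
One caveat about `RS="\r\n"`: POSIX leaves the behaviour of a multi-character `RS` unspecified, although gawk, mawk and BusyBox awk treat it as a regular expression. A variant that should not depend on that strips the carriage return from each record instead. This is a sketch only, and `dt2_portable` is an invented name.

```shell
# dt2_portable FILE...: same output as dt2.awk, but removes a trailing CR
# from each record instead of setting RS to a two-character string.
# (dt2_portable is a name invented for this sketch.)
dt2_portable() {
    awk '
        { sub(/\r$/, "") }                                # drop the CR of a CRLF ending
        /^Date:/ { printf("%s\t%s", substr($1, 6), $2) }  # date, tab, time
        /^TEXT:/ { printf("\t%s\n", substr($0, 6)) }      # tab, message text
    ' "$@"
}
```

Usage would be along the lines of `dt2_portable *.vmg`.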


I should do some tests with larger numbers of files next. Once I have confirmed reliability, I will post an SMS that seems to have come through with a different format of text line (something to do with encoding, I think). I don't know if it will be possible to account for texts of that format or whether I will just have to visually inspect the output to trap them.

Anyway, on with the testing...

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#30 Post by greengeek »

some1 wrote:therefore the code shall be:

Code: Select all

LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t" $2}' /pathtosmsdir/* >yourextract 
Thanks, that works well too. I managed to figure out how to add an extra tab between the time field and the text field by using this syntax:

Code: Select all

LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t\t" $2}' /root/sms/* >yourextract

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#31 Post by technosaurus »

Fwiw you can do similar stuff in bash by setting IFS=". " and using read x y z etc and a case statement.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#32 Post by greengeek »

technosaurus wrote:Fwiw you can do similar stuff in bash by setting IFS=". " and using read x y z etc and a case statement.
Google offers this link which I shall come back to:
http://mindspill.net/computing/linux-no ... haracters/

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#33 Post by 6502coder »

@some1

I enjoyed your clever idea. I haven't done any timings; a very wise man (Jon Bentley, author of "Programming Pearls", a classic book on programming tricks) taught me not to waste time optimizing what already runs fast enough. I have used awk scripts on files with hundreds of thousands of records, and speed has never been a problem.

@greengeek

A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.

If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.
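
A quick way to see the difference is to count matching lines with and without the `^` anchor. The two function names below are invented purely for illustration.

```shell
# Count lines that merely contain "Date:" versus lines that start with it.
# (Both function names are made up for this illustration.)
count_anywhere() { grep -c 'Date:' "$1"; }
count_at_start() { grep -c '^Date:' "$1"; }
```

If the two counts differ for a given file, an unanchored pattern is picking up "Date:" from somewhere other than the header line, most likely from the message text.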

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#34 Post by musher0 »

6502coder wrote:(...)
@greengeek

A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.

If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.
Hi, 6502coder.

Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, be it a character or a word. The programmer
is not a diviner; we can't anticipate what people will write.

In a grep context, we can reduce the probability by including the colon, as in:

Code: Select all

grep -h "TEXT:" *.vmg
Reducing the probability is the only thing we can do. What's the probability
that someone is going to type exactly "TEXT:", in capitals with a colon tacked
onto it, in the text field?

Still, awk is the only sure-fire solution, with its -- $1=="TEXT" -- statement.
This frees the other fields.

BFN.

musher0

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#35 Post by greengeek »

6502coder wrote: A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.
Fair point. I have just sent a new bunch of messages to my phone and attached them. Here are the results of each of the scripts:

EDIT : The actual contents stripped out of the last text should be:

Code: Select all

Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
6502:

Code: Select all

# awk  -f  dt2.awk  /root/sms/*.vmg
04.07.2015	07:49:04	I will look back at the last text and see if it contains the word date
04.07.2015	07:50:27	Yes it did contain the word date: is that a bad thing?
04.07.2015	07:51:19	I said dont call me just send a TEXT
04.07.2015	07:53:02	No we are not going on a "Date": it is just carpooling.
04.07.2015	07:56:10	Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
# 
Mochi:

Code: Select all

04.07.2015 	07:49:04	I will look back at the last text and see if it contains the word date
04.07.2015 	07:50:27	Yes it did contain the word date: is that a bad thing?
04.07.2015 	07:51:19	I said dont call me just send a TEXT
04.07.2015 	07:53:02	No we are not going on a "Date": it is just carpooling.
04.07.2015 	07:56:10	 TEXTED: Data: Date. Quick brown fox.
some1:

Code: Select all

04.07.2015 07:49:04		I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27		Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19		I said dont call me just send a TEXT
04.07.2015 07:53:02		No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10		Date text DATE TEXT date Text 
musher:

Code: Select all

	Date 			TEXT
04.07.2015 07:49:04 	I will look back at the last text and see if it contains the word date

04.07.2015 07:50:27 	Yes it did contain the word date: is that a bad thing?

04.07.2015 07:51:19 	I said dont call me just send a TEXT

04.07.2015 07:53:02

No we are not going on a "Date": it is just carpooling. 	No we are not going on a "Date": it is just carpooling.

04.07.2015 07:56:10

Date text DATE TEXT date Text Date: Text: DATE: TEXT 	Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
Attachments
20150704074904_Greenphone.vmg.gz
(349 Bytes) Downloaded 165 times
20150704075027_Greenphone.vmg.gz
(333 Bytes) Downloaded 177 times
20150704075119_Greenphone.vmg.gz
(315 Bytes) Downloaded 165 times
20150704075302_Greenphone.vmg.gz
(334 Bytes) Downloaded 147 times
20150704075610_Greenphone.vmg.gz
(369 Bytes) Downloaded 139 times
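
The shortened last line in the Mochi output looks consistent with greedy matching: the message contains no digits, so `[^0-9]*TEXT:` can run from the end of the time all the way to the last "TEXT:" inside the message. One possible way around it, offered as a sketch rather than a tested replacement, is to strip the carriage returns first and then match only the newline that `N` inserts, followed by "TEXT:". The name `extract_pairs` is made up.

```shell
tab=$(printf '\t')   # literal tab, so the sketch does not rely on sed understanding \t

# extract_pairs FILE...: print "date time<TAB>text" for each Date/TEXT pair.
# (extract_pairs is a name made up for this sketch.)
extract_pairs() {
    for f in "$@"; do
        # Remove CRs, then on each Date: line pull in the next line (N) and
        # replace only the embedded newline plus "TEXT:" with a tab.
        tr -d '\r' < "$f" |
            sed -n "/^Date:/{N;s/^Date://;s/\nTEXT:/$tab/p;}"
    done
}
```

Since the pattern space holds only two lines after `N`, there is a single newline for `\nTEXT:` to match, so any later "TEXT:" in the message is left alone. Like the original, this assumes the TEXT line directly follows the Date line.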

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#36 Post by 6502coder »

musher0 wrote: Hi, 6502coder.

Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, be it a character or a word. The programmer
is not a diviner; we can't anticipate what people will write.
Absolutely true, I agree. However--and this relates to the recent poll in the Really Off-topic area about what makes a "good" answer--in this case the possibility of a word like "Date" turning up occurred to me almost immediately, and so having anticipated a possible problem, I felt I shouldn't simply ignore it in my solution. I thought other posters, looking at my test data file and the output from the awk script, would notice what I was up to; but I guess I should have pointed out what I was worried about explicitly.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#37 Post by musher0 »

@greengeek:
When will you be handing 6502coder the "Oscar for Best Coder"? :)
Do we simple mortals get consolation prizes? :(
Maybe a nomination? :lol:

Congrats to 6502coder, BTW!

musher0

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#38 Post by musher0 »

Hi, greengeek and all.

Spent the last hour trying to figure out why my script produces sort of a
double of my $TXT variable in the last vmg file.

I still can't understand why... but I have to face the reality that it does!

Edit: Found it! :idea: Phew.
It's the $DAT variable that picks up any mention of a "Date:". Not the $TXT
variable. So we limit it with a ${DAT:0:19} statement at "echo-ing" time.
Anything additional goes to la-la land. (The original "Date:" field in the vmg
files has a length of 19 characters.)

Code: Select all

#!/bin/ash
# greengeek_msgs4d.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
	DAT="`grep -h 'Date:' $i | cut -d':' -f2-4 | tr '\r' '\n'`"
	TXT="`grep -E -h 'TEXT:' $i | cut -d':' -f2-9 | tr '\r' '\n'`"
	echo -e "${DAT:0:19} \t$TXT\n" >> $MR
done
clear;cat $MR
TWYL.

musher0
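
An alternative to trimming with `${DAT:0:19}` might be to anchor both grep patterns at the start of the line and let `cut` keep every field after the first colon. The sketch below keeps the grep/cut style of the script above; `readable_list` is an invented name, and it assumes each file has exactly one line starting with "Date:" and one starting with "TEXT:".

```shell
# readable_list FILE...: one "date time<TAB>text" line per message file.
# Assumes one line starting with Date: and one starting with TEXT: per file.
# (readable_list is a name invented for this sketch.)
readable_list() {
    for f in "$@"; do
        # ^ anchors the match to the start of the line; -f2- keeps everything
        # after the first colon, however many colons the text contains.
        DAT=$(grep -h '^Date:' "$f" | cut -d: -f2- | tr -d '\r')
        TXT=$(grep -h '^TEXT:' "$f" | cut -d: -f2- | tr -d '\r')
        printf '%s\t%s\n' "$DAT" "$TXT"
    done
}
```

Hypothetical usage: `readable_list ~/my-documents/*.vmg > More-readable`. Passing the glob directly also avoids parsing the output of `ls`.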
Attachments
capture20464.jpg
(42.98 KiB) Downloaded 228 times

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#39 Post by seaside »

Code: Select all

for f in /path/to/msgs/* ; do
while read line || [ "$line" ]; do
case $line in
	Date:*) fdate=${line#*:} fdate=${fdate%% *}  ;;
	TEXT:*) ftext=${line#*:} result="$fdate\t$ftext" ;;
esac	
done <$f
echo -e "$result"	

done
I think the technosaurus technique would handle it as well.

Cheers,
s
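
A variation on the loop above that also copes with the CRLF line endings discussed earlier in the thread, as a sketch. It differs from the original in three ways that are my own choices: it keeps the time as well as the date, it strips the trailing carriage return, and it prints with `printf` so backslashes in a message are not interpreted. `date_and_text` is an invented name.

```shell
cr=$(printf '\r')   # literal carriage return

# date_and_text FILE...: pure-shell loop in the style above.
# (date_and_text is a name invented for this sketch.)
date_and_text() {
    for f in "$@"; do
        fdate= ftext=
        while IFS= read -r line || [ -n "$line" ]; do
            line=${line%"$cr"}            # drop the CR of a CRLF ending, if any
            case $line in
                Date:*) fdate=${line#Date:} ;;
                TEXT:*) ftext=${line#TEXT:} ;;
            esac
        done < "$f"
        printf '%s\t%s\n' "$fdate" "$ftext"
    done
}
```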

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#40 Post by technosaurus »

To just format it without removing the year, it's pretty easy to adapt seaside's solution to use IFS:

Code: Select all

IFS=":"
while read data_type data  || [ "$data_type" ]; do
case "$data_type" in
   *Date*)printf '%s ' "$data";;
   *TEXT*)echo "$data";;
esac
done
If you want to also use "." as a separator (for removing the year), you can just use IFS=":. " (the extra space is the separator after the year) and read additional named fields. Note that the last field will consume all data till the end of the line.
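
A sketch of how that could look. To keep the message text intact, it applies the extra separators only to the Date line and reads every other line whole; that is my own adjustment, not part of the suggestion above. `no_year` and the field names are invented.

```shell
# no_year: read vmg text on stdin, print "dd.mm hh:mm:ss<TAB>text" per message.
# (no_year and the field names are invented for this sketch.)
no_year() {
    cr=$(printf '\r')
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%"$cr"}                # drop the CR of a CRLF ending, if any
        case $line in
            Date:*)
                # Split only the Date line on ":", "." and space; the last
                # name (clock) receives the rest of the line unsplit.
                IFS=':. ' read -r tag day month year clock <<EOF
$line
EOF
                printf '%s.%s %s\t' "$day" "$month" "$clock" ;;
            TEXT:*)
                printf '%s\n' "${line#TEXT:}" ;;
        esac
    done
}
```

Possible use: `no_year < message.vmg`, where `message.vmg` stands for any one of the backup files.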
