greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
Code: Select all
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
SMS backup messages - strip text from .vmg files [SOLVED]
- MochiMoppel
- Posts: 2084
- Joined: Wed 26 Jan 2011, 09:06
- Location: Japan
@MochiMoppel:
MochiMoppel wrote:
greengeek wrote:Would it be easy to replace the space between date and time fields with a tab? (EDIT : or to add a tab, rather than replace the space).
Code: Select all
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Your code produces trailing CRLFs.
Before opening /mnt/home/test/result.txt in geany, run:
Code: Select all
od -c /mnt/home/test/result.txt
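For anyone who has not used od before, here is a minimal sketch of what that check reveals. The sample file and its contents are made up for illustration; it is not one of the real .vmg files.

```shell
# Build a tiny sample with DOS-style CRLF line endings (hypothetical content).
sample=$(mktemp)
printf 'Date:04.07.2015 07:49:04\r\nTEXT:hello\r\n' > "$sample"

# od -c prints every byte; a CRLF line ending shows up as "\r" followed by "\n".
od -c "$sample"

# Count the lines that end in a carriage return.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" "$sample")
echo "lines ending in CR: $crlf_lines"

# tr -d '\r' deletes the carriage returns and leaves plain LF endings.
lf_only=$(tr -d '\r' < "$sample")
printf '%s\n' "$lf_only" | od -c

rm -f "$sample"
```

If the CRs are unwanted, piping the result through tr -d '\r' as in the last step is probably the least intrusive fix.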
- MochiMoppel
- Posts: 2084
- Joined: Wed 26 Jan 2011, 09:06
- Location: Japan
some1 wrote:Your code produces trailing CRLF's
The code preserves the line-ending encoding used by the source text.
If the source text used Unix LF line endings, the code would preserve them as well. Unless the user finds CRLFs disturbing or unbearable, I see no reason to take the trouble to change them.
Last edited by MochiMoppel on Fri 03 Jul 2015, 07:08, edited 2 times in total.
Who needs awk ?! Who needs sed ?! Who needs replaceit ?!
(Edited, about an hour later, from here... )
But we need to run the Date and TEXT lines through < tr '\r' '\n' >. Otherwise
the CRLF line ending will eat the first character of the next line in the
Linux text format. That's why we have to do it. (... to here.)
Code: Select all
#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
DAT="`grep -h 'Date:' $i | cut -d':' -f2-5 | tr '\r' '\n'`"
TXT="`grep -h 'TEXT:' $i | cut -d':' -f2-10 | tr '\r' '\n'`"
echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR
(16 h 30, Friday July 3rd: DAT and TXT variables above re-edited to
incorporate a comment below by 6502coder. musher0)
Enjoy!
musher0
Last edited by musher0 on Fri 03 Jul 2015, 20:33, edited 3 times in total.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)
MochiMoppel wrote:sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Excellent, that gives just the right format. I will probably differentiate between inbox messages and sentbox messages by adding an extra tab after the time field for sentbox messages so they get indented for clarity. I will post a comparison of your old and new syntax so I can figure out where to add the second tab:
Code: Select all
Puts a "space" after date/time:
sed -n '/^Date/ N;s/Date://;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
Puts a "tab" after date/time:
sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t/p' /mnt/home/test/*.vmg > /mnt/home/test/result.txt
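As a rough sketch of where the second tab could go: doubling the \t in the last substitution puts two tabs between the time and the message text. This assumes GNU sed (for \t in the replacement) and that the TEXT: line directly follows the Date: line, as in the commands above. The sample file is made up for illustration.

```shell
# Hypothetical one-message sample with CRLF line endings.
sample=$(mktemp)
printf 'Date:04.07.2015 07:49:04\r\nTEXT:hello\r\n' > "$sample"

# Same command as the "tab" version, but with \t\t in the final substitution.
# tr -d '\r' is only there to drop the trailing carriage return.
indented=$(sed -n '/^Date/ N;s/Date://;s/ / \t/;s/[^0-9]*TEXT:/\t\t/p' "$sample" | tr -d '\r')

# Show the result byte by byte so the tabs are visible.
printf '%s\n' "$indented" | od -c

rm -f "$sample"
```

Running this variant only on the sentbox files, and the single-tab version on the inbox files, would give the indentation described above.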
musher0 wrote:Who needs awk ?! Who needs sed ?!
Code: Select all
#!/bin/ash
# greengeek_msgs4c.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
DAT="`grep -h Date $i | cut -d':' -f2-5 | tr '\r' '\n'`"
TXT="`grep -h TEXT $i | cut -d':' -f2-10 | tr '\r' '\n'`"
echo -e "$DAT \t$TXT\n" >> $MR;done
clear;cat $MR
Tres bon! It took me a while to realise where the output was.
The double spacing will be handy when I graft the sentbox messages in between the inbox messages.
Merci!
6502coder wrote:Here's a more elegant version of my awk script. It includes the fix for the Microsoft-style line endings. Here it is in action on your 4 test files.
Code: Select all
$ cat dt2.awk
BEGIN { RS="\r\n" }
/^Date:/ { printf( "%s\t%s", substr($1,6), $2); }
/^TEXT:/ { printf( "\t%s\n", substr($0,6)); }
Code: Select all
awk -f dt2.awk *.vmg
Great - that's another one that works. Thanks, I'm spoilt for choice now.
I should do some tests with larger numbers of files next - then once I have confirmed reliability I will post an SMS that seems to have come through with a different format of text line (something to do with encoding, I think). I don't know if it will be possible to account for texts of that format or whether I will just have to visually inspect the output to trap them.
Anyway, on with the testing...
some1 wrote:therefore the code shall be:
Code: Select all
LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t" $2}' /pathtosmsdir/* >yourextract
Thanks, that works well too. I managed to figure out how to add an extra tab between the time field and the text field by using this syntax:
Code: Select all
LANG=C awk -F'(\r\n(TEXT:|))' -v RS="Date:" 'FNR==2{print $1 "\t\t" $2}' /root/sms/* >yourextract
- technosaurus
- Posts: 4853
- Joined: Mon 19 May 2008, 01:24
- Location: Blue Springs, MO
technosaurus wrote:Fwiw you can do similar stuff in bash by setting IFS=". " and using read x y z etc and a case statement.
Google offers this link which I shall come back to:
http://mindspill.net/computing/linux-no ... haracters/
@some1
I enjoyed your clever idea. I haven't done any timings; a very wise man (Jon Bentley, author of "Programming Pearls", a classic book on programming tricks) taught me not to waste time optimizing what already runs fast enough. I have used awk scripts on files with hundreds of thousands of records, and speed has never been a problem.
@greengeek
A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.
If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.
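To make the concern concrete, here is a small sketch comparing an unanchored match with one anchored to the start of the line. The sample message is made up for illustration, modelled on the kind of text that could trip up a pattern.

```shell
# Hypothetical message whose text happens to contain the word "Date:".
sample=$(mktemp)
printf 'Date:04.07.2015 07:50:27\r\nTEXT:Yes it did contain the word Date: is that bad?\r\n' > "$sample"

# Unanchored: matches both lines, because "Date" also occurs inside the message text.
loose=$(grep -c 'Date' "$sample")

# Anchored with ^: matches only the header line that starts with "Date:".
strict=$(grep -c '^Date:' "$sample")

echo "unanchored matches: $loose, anchored matches: $strict"

rm -f "$sample"
```

The anchored pattern still cannot help if a message contains an embedded line break followed by "Date:", but that seems a much less likely case.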
6502coder wrote:(...)
@greengeek
A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.
If you look back at my very first post, you will notice that my test file explicitly included a couple of lines where "Date" and "TEXT" occur, but not at the start of a line, in order to check that the awk script correctly IGNORED these. Of course there's no problem as long as you can guarantee that the words "Date" and "TEXT" can never occur--even by chance--in any other place in an SMS file, but I thought it was worth playing it safe.
Hi, 6502coder.
Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, may it be a character or a word. The programmer
is not a diviner, we can't anticipate what people will write.
In a grep context, we can reduce the probability by including the colon, as in:
Code: Select all
grep -h "TEXT:" *.vmg
It is unlikely that someone is going to type exactly "TEXT:" in capitals, with a colon tacked on to it,
in the text field.
Still, awk is the only sure-fire solution, with its -- $1=="TEXT" -- statement.
This frees the other fields.
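A minimal sketch of that awk test, assuming the colon is used as the field separator so that $1 is whatever comes before the first colon. The sample message is invented for illustration.

```shell
# Hypothetical message whose text itself contains "TEXT:".
sample=$(mktemp)
printf 'Date:04.07.2015 07:51:19\r\nTEXT:I said dont call me just send a TEXT: ok\r\n' > "$sample"

# -F: splits each line on colons, so $1 is the keyword before the first colon.
# $1 == "TEXT" is only true when the line starts with exactly "TEXT:".
# The two sub() calls strip the keyword and the trailing carriage return.
text=$(awk -F: '$1 == "TEXT" { sub(/^TEXT:/, ""); sub(/\r$/, ""); print }' "$sample")

echo "$text"

rm -f "$sample"
```

Because the keyword is removed from the whole line rather than by picking fields, any further colons in the message are left untouched.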
BFN.
musher0
6502coder wrote: A word of caution: I have not examined all of the different (and ingenious) solutions in detail, but it seems to me that at least some of them will be tripped up if the strings "Date" or "TEXT" happen to occur in places other than AT THE START OF A LINE.
Fair point. I have just sent a new bunch of messages to my phone and attached them. Here are the results of each of the scripts:
EDIT : The actual contents stripped out of the last text should be:
Code: Select all
Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
Code: Select all
# awk -f dt2.awk /root/sms/*.vmg
04.07.2015 07:49:04 I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27 Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19 I said dont call me just send a TEXT
04.07.2015 07:53:02 No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10 Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
#
Code: Select all
04.07.2015 07:49:04 I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27 Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19 I said dont call me just send a TEXT
04.07.2015 07:53:02 No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10 TEXTED: Data: Date. Quick brown fox.
Code: Select all
04.07.2015 07:49:04 I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27 Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19 I said dont call me just send a TEXT
04.07.2015 07:53:02 No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10 Date text DATE TEXT date Text
Code: Select all
Date TEXT
04.07.2015 07:49:04 I will look back at the last text and see if it contains the word date
04.07.2015 07:50:27 Yes it did contain the word date: is that a bad thing?
04.07.2015 07:51:19 I said dont call me just send a TEXT
04.07.2015 07:53:02
No we are not going on a "Date": it is just carpooling. No we are not going on a "Date": it is just carpooling.
04.07.2015 07:56:10
Date text DATE TEXT date Text Date: Text: DATE: TEXT Date text DATE TEXT date Text Date: Text: DATE: TEXT: TEXTED: Data: Date. Quick brown fox.
- Attachments
- 20150704074904_Greenphone.vmg.gz (349 Bytes)
- 20150704075027_Greenphone.vmg.gz (333 Bytes)
- 20150704075119_Greenphone.vmg.gz (315 Bytes)
- 20150704075302_Greenphone.vmg.gz (334 Bytes)
- 20150704075610_Greenphone.vmg.gz (369 Bytes)
musher0 wrote: Hi, 6502coder.
Well, in this case we are processing text files, so the same can theoretically
be said for any delimiter, may it be a character or a word. The programmer
is not a diviner, we can't anticipate what people will write.
Absolutely true, I agree. However--and this relates to the recent poll in the Really Off-topic area about what makes a "good" answer--in this case the possibility of a word like "Date" turning up occurred to me almost immediately, and so having anticipated a possible problem, I felt I shouldn't simply ignore it in my solution. I thought other posters, looking at my test data file and the output from the awk script, would notice what I was up to; but I guess I should have pointed out what I was worried about explicitly.
Hi, greengeek and all.
Spent the last hour trying to figure out why my script produces sort of a
double of my $TXT variable in the last vmg file.
I still can't understand why... but I have to face the reality that it does!
Edit: Found it! Phew.
It's the $DAT variable that picks up any mention of a "Date:". Not the $TXT
variable. So we limit it with a ${DAT:0:19} statement at "echo-ing" time.
Anything additional goes to la-la land. (The original "Date:" field in the vmg
files has a length of 19 characters.)
Code: Select all
#!/bin/ash
# greengeek_msgs4d.sh # musher0, July 3rd 2015.
####
cd ~/my-documents;MR="More-readable"
echo -e "\tDate \t\t\tTEXT" > $MR
for i in `ls -1 *.vmg`;do
DAT="`grep -h 'Date:' $i | cut -d':' -f2-4 | tr '\r' '\n'`"
TXT="`grep -E -h 'TEXT:' $i | cut -d':' -f2-9 | tr '\r' '\n'`"
echo -e "${DAT:0:19} \t$TXT\n" >> $MR
done
clear;cat $MR
musher0
- Attachments
- capture20464.jpg (42.98 KiB)
Code: Select all
for f in /path/to/msgs/* ; do
    while read line || [ "$line" ]; do
        case $line in
            Date:*) fdate=${line#*:} fdate=${fdate%% *} ;;
            TEXT:*) ftext=${line#*:} result="$fdate\t$ftext" ;;
        esac
    done <$f
    echo -e "$result"
done
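Since the .vmg files discussed in this thread use CRLF line endings, a loop like this will carry a trailing carriage return in each line it reads. The following is only a sketch of one way to drop it, using the same parameter expansions; the sample file and its contents are assumptions for illustration.

```shell
# Hypothetical sample with CRLF line endings.
cr=$(printf '\r')
sample=$(mktemp)
printf 'Date:04.07.2015 07:49:04\r\nTEXT:hello there\r\n' > "$sample"

while IFS= read -r line || [ "$line" ]; do
    line=${line%"$cr"}    # drop one trailing carriage return, if present
    case $line in
        Date:*) fdate=${line#*:} fdate=${fdate%% *} ;;   # keep the date, drop the time
        TEXT:*) ftext=${line#*:} ;;
    esac
done < "$sample"

# Join date and text with a tab; printf avoids relying on echo -e.
result=$(printf '%s\t%s' "$fdate" "$ftext")
printf '%s\n' "$result"

rm -f "$sample"
```

Using read -r with an empty IFS keeps backslashes and leading spaces in the message text as they are.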
Cheers,
s
- technosaurus
- Posts: 4853
- Joined: Mon 19 May 2008, 01:24
- Location: Blue Springs, MO
To just format it without removing the year, it's pretty easy to adapt seaside's solution to use IFS.
If you also want to use "." as a separator (for removing the year), you can just use IFS=":. " (the extra space is for the separator after the year) and read additional named fields (note that the last field will consume all data to the end of the line).
Code: Select all
IFS=":"
while read data_type data || [ "$data_type" ]; do
    case "$data_type" in
        *Date*)printf "$data ";;
        *TEXT*)echo "$data";;
    esac
done
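As a sketch of the IFS=":. " variant for dropping the year: the field names day, month, year and rest are made up here, and the sample line is hard-coded for illustration rather than read from a real file.

```shell
# Hypothetical Date line, without the trailing carriage return.
line='Date:04.07.2015 07:49:04'

# IFS=':. ' splits on colon, dot and space. The last name (rest) receives
# everything that is left over, so the time keeps its own colons.
IFS=':. ' read -r data_type day month year rest <<EOF
$line
EOF

case $data_type in
    Date) short_date="$day.$month $rest" ;;   # year is read but not used
esac

printf '%s\n' "$short_date"
```

A TEXT line would still need the plain IFS=":" split, since its words would otherwise be spread across the named fields.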
Check out my github repositories (https://github.com/technosaurus). I may eventually get around to updating my blogspot (http://bashismal.blogspot.com).