Geany - automate "replace" with regular expressions??

technosaurus · #21 Post by **technosaurus** » Mon 29 Jun 2015, 22:33

greengeek wrote: I just tried it and it works! Although I could do with another tab just before the text field.

... just use \1 \2\t

And I simply cannot figure out how you managed to dump the .2015

([.0-9]*)[.][0-9]*

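the [.][0-9]* after the parenthesis: it matches the last dot and the digits that follow it, and because it sits outside the capture group, \1 no longer includes the .2015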
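
For anyone who wants to try that pattern outside Geany, here is a minimal sketch using sed -E, assuming a line that starts with a dd.mm.yyyy date as in the sample messages. The function name strip_year is made up for illustration and does not come from the thread.

```shell
# strip_year: hypothetical helper, not from the thread.
# The group ([.0-9]*) ends up holding "26.06"; the trailing [.][0-9]*
# matches ".2015" outside the group, so \1 leaves the year behind.
strip_year() {
    sed -E 's/([.0-9]*)[.][0-9]*/\1/'
}

printf '26.06.2015 15:00:00\n' | strip_year
# prints: 26.06 15:00:00
```

Only the first match on each line is replaced, so the time field is left alone.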

MochiMoppel · #22 Post by **MochiMoppel** » Tue 30 Jun 2015, 01:55

greengeek wrote:how did you end up with a space between the end of the date field and the beginning of the text field in your image above?

Sorry, fooled again by a bug in my Opera browser. The browser adds spaces to text lines copied from the screen, so my sample text ended up being different from your original text.

greengeek · #23 Post by **greengeek** » Tue 30 Jun 2015, 10:05

technosaurus wrote:
And I simply cannot figure out how you managed to dump the .2015

([.0-9]*)[.][0-9]*
the [.][0-9]* after the parenthesis

Thanks - this works well. I will probably keep the .2015 for the messages I save from the inbox (for accurate timestamping) but delete it for the messages from my sentbox (as the inbox messages already provide the context), so this gives me the syntax I need for both options.

Still a few more trials to do...

greengeek · #24 Post by **greengeek** » Wed 01 Jul 2015, 16:35

Thank you all for the different approaches you have suggested. I am using them all in different ways. Currently I am trying to extend the scripts to operate on the original SMS backup files rather than the partially extracted ones that I have collated manually. The image below shows the format of the original raw "backup" of the full text message. When I run the various scripts you have given me against one of these files they behave differently: there appears to be a character at the end of each line that disrupts the deletion of the "\nTEXT:" field. (I'm not explaining this very well...)
EDIT: Maybe the \n that works when I search the data in Geany is actually a different character in the original file - maybe a CR or LF?

Anyway, I did some tests to determine if instead of searching for the string "\nTEXT:" I could just search for "TEXT:" and replace it with as many backspaces as I needed. What I tried (manually) was to use Geany's replace function to replace "TEXT:" with two backspaces. I tried to use the regular expression of \\ in the hope it represented two backspaces but of course it doesn't - it represents a backslash. So how do I create two backspaces?

After a bit of googling I tried to replace TEXT: with \u0008 or \b but that gives me the strange character you can see in the picture. So is there a way to form a regular expression that Geany understands, that will allow me to apply multiple backspaces sequentially (followed by one or two tabs)?

EDIT 2: Maybe I need to be searching for \RTEXT: as \R apparently matches any type of line break, whereas \n matches only a line feed.

I am currently trying 6502coder's method, using the dt.awk file in combination with the following script to process a whole directory of files containing the raw SMS data such as you can see in the image. I just need to fine-tune the deletion of the TEXT: and add an extra backspace to bring that line up onto the previous line.
(The script is working for me but leaves the text data on the next line down from the date/time.)

Code: Select all

for i in /root/sms/*
do
awk -f dt.awk "$i" >> data2.txt
done

dt.awk is:

Code: Select all

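# Print the date and time from a "Date:" line, then the message from the
# following "TEXT:" line, so that both end up on one output line.
{   if (substr($1, 1, 5) == "Date:")
    {
        # $1 is e.g. "Date:26.06.2015", $2 is the time; no newline yet
        printf( "%s\t%s", substr($1,6), $2);
    }
    else if (substr($1, 1, 5) == "TEXT:")
    {
        # everything after "TEXT:" completes the output line
        printf( "%s\n", substr($0,6));
    }
}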

technosaurus · #25 Post by **technosaurus** » Wed 01 Jul 2015, 18:28

sed only does 1 line at a time so \n will never match... same thing for awk unless you change RS (the record separator) in BEGIN{}

to fix sed, you can pipe it through tr before and after to change newline characters to something sed can handle and then back
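
A rough sketch of that tr/sed/tr round trip, assuming GNU sed (for the \a and \t escapes) and assuming the BEL character never appears in the messages. The helper name join_with_tr is invented for this example.

```shell
# join_with_tr: hypothetical helper name, not from the thread.
# Steps, in pipeline order:
#   1. drop the CR of each CRLF pair
#   2. turn every newline into BEL so sed sees one long line
#   3. replace "BEL + TEXT:" (the former line break) with a single tab
#   4. turn the remaining BELs back into newlines
join_with_tr() {
    tr -d '\r' |
    tr '\n' '\a' |
    sed 's/\aTEXT:/\t/g' |
    tr '\a' '\n'
}

printf 'Date:26.06.2015 15:00:00\r\nTEXT:hello\r\n' | join_with_tr
```

Any placeholder works as long as it cannot occur in the data; BEL is just one that is unlikely to turn up in an SMS.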

little nuances can sometimes make a big difference

musher0 · #26 Post by **musher0** » Wed 01 Jul 2015, 19:07

Hi, greengeek.

Ever tried replaceit? Perfect for
this, I would think.

BFN.

musher0

greengeek · #27 Post by **greengeek** » Wed 01 Jul 2015, 19:16

musher0 wrote:Ever tried replaceit?

Thanks musher - I just had a look at your link and it sounds as if it would suit a number of uses, but there is a note that may render it unsuitable for the way I am working:

NOTE- ReplaceIt is a line-by-line file parser, so, it cannot (at this point) process across multiple \n[\r] terminated lines.

I am still unsure whether or not Geany has this ability - it appears to be able to work around \n and \R boundaries in searching, but not in replacing (or at least not when replacing something with a backspace).

6502coder · #28 Post by **6502coder** » Wed 01 Jul 2015, 19:39

It would help if you'd run "od" on one of your files so we can see exactly what the "end-of-line" sequence is:

Code: Select all

od  -c  oneOfYourMsgs.txt

Probably the message files are using the Microsoft "end-of-line" convention, which is "\r\n" as opposed to the Linux convention of just "\n". If so, as technosaurus says, you can adjust the awk script by inserting the following line at the start of dt.awk:

Code: Select all

BEGIN { RS="\r\n" }

BTW you don't need a for-loop to process all the files in a directory. If you give awk a list of files, it processes them one after another; and of course you can use the shell's wildcarding. So you can simply say

Code: Select all

awk  -f  dt.awk  /root/sms/*   > data2.txt
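
Putting those pieces together, a minimal sketch might look like the following. It inlines a condensed copy of dt.awk with the RS line added, wrapped in a shell function called extract_sms (a made-up name). Two assumptions: a multi-character RS is handled by gawk, mawk and busybox awk but is left unspecified by strict POSIX, and the extra "\t" before the message text is an addition that is not in the original script.

```shell
# extract_sms: hypothetical wrapper around an inline copy of dt.awk.
# RS = "\r\n" makes awk split records on CRLF, so no stray CR is left
# at the end of the time field.
# The "\t" before the message text is an addition, not in the original.
extract_sms() {
    awk 'BEGIN { RS = "\r\n" }
         substr($1, 1, 5) == "Date:" { printf("%s\t%s", substr($1, 6), $2) }
         substr($1, 1, 5) == "TEXT:" { printf("\t%s\n", substr($0, 6)) }' "$@"
}

printf 'Date:26.06.2015 15:00:00\r\nTEXT:There has just got 2 be a better way\r\n' | extract_sms
```

With file arguments, e.g. extract_sms /root/sms/* > data2.txt, it would process the whole directory in one go, as described above.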

greengeek · #29 Post by **greengeek** » Wed 01 Jul 2015, 19:46

6502coder wrote:It would help if you'd run "od" on one of your files so we can see exactly what the "end-of-line" sequence is:
Code: Select all
od  -c  oneOfYourMsgs.txt

Thanks - here is the result:

Code: Select all

# od -c sms.txt
0000000   B   E   G   I   N   :   V   M   S   G  \r  \n   V   E   R   S
0000020   I   O   N   :   1   .   1  \r  \n   X   -   I   R   M   C   -
0000040   S   T   A   T   U   S   :   R   E   A   D  \r  \n   X   -   I
0000060   R   M   C   -   B   O   X   :   I   N   B   O   X  \r  \n   B
0000100   E   G   I   N   :   V   C   A   R   D  \r  \n   V   E   R   S
0000120   I   O   N   :   2   .   1  \r  \n   N   ;   C   H   A   R   S
0000140   E   T   =   U   T   F   -   8   : 344 204 200 346 204 200 347
0000160 210 200 346 274 200 346 270 200 344 234 200 345 210 200 343 240
0000200 200 344 260 200 344 204 200 345 234 200 344 270 200   ;   ;   ;
0000220   ;  \r  \n   F   N   ;   C   H   A   R   S   E   T   =   U   T
0000240   F   -   8   :  \r  \n   T   E   L   :   +   6   4   2   2   4
0000260   7   8   x   x   x   x  \r  \n   E   N   D   :   V   C   A   R
0000300   D  \r  \n   B   E   G   I   N   :   V   E   N   V  \r  \n   B
0000320   E   G   I   N   :   V   B   O   D   Y  \r  \n   D   a   t   e
0000340   :   2   6   .   0   6   .   2   0   1   5       1   5   :   0
0000360   0   :   0   0  \r  \n   T   E   X   T   :   T   h   e   r   e
0000400       h   a   s       j   u   s   t       g   o   t       2    
0000420   b   e       a       b   e   t   t   e   r       w   a   y  \r
0000440  \n   E   N   D   :   V   B   O   D   Y  \r  \n   E   N   D   :
0000460   V   E   N   V  \r  \n   E   N   D   :   V   M   S   G  \r  \n
0000500
#

I hope to try your code suggestions after work tonight.

6502coder · #30 Post by **6502coder** » Wed 01 Jul 2015, 19:50

Yup, good old Microsoft \r\n just as I suspected.
The fix in my previous post should take care of this.

MochiMoppel · #31 Post by **MochiMoppel** » Thu 02 Jul 2015, 02:43

technosaurus wrote:sed only does 1 line at a time so \n will never match...

...but it can combine 2 lines, using the N command, and then it can match \n, or better - if the character before TEXT is unknown - it can match whatever character precedes TEXT.
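
As an illustration only, the N approach could be sketched like this, assuming GNU sed (for \t and \r) and CRLF line endings as in the od dump earlier in the thread. The name join_with_N is hypothetical.

```shell
# join_with_N: hypothetical helper name, not from the thread.
# For each line starting with "Date:", N appends the next input line to
# the pattern space, which makes the embedded \n matchable. The "."
# swallows whatever single character precedes the newline (here the CR).
join_with_N() {
    sed -n '/^Date:/{
        N
        s/.\nTEXT:/\t/
        s/\r$//
        p
    }' "$@"
}

printf 'Date:26.06.2015 15:00:00\r\nTEXT:hello\r\n' | join_with_N
```

Note that on a file with plain \n line endings the "." would eat the last digit of the time instead, so the pattern needs adjusting to the actual end-of-line sequence.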

@greengeek: You should attach an example of your raw messages, otherwise we don't know what kind of characters you are dealing with.

You should also clarify if you are still searching for a way to do it with Geany's Find & Replace or without Geany. Ironically Geany's regex patterns can be more powerful than what you can use with sed, but since you can't chain these commands as you would with sed, Geany is limited to one command - not enough for complex tasks.

musher0 · #32 Post by **musher0** » Thu 02 Jul 2015, 05:54

Hi, greengeek.

You're right, replaceit could do only part of the job. So here's my take on it,
double-spaced!

Code: Select all

#!/bin/ash
# greengeek_msgs2.sh
# musher0, July 2nd 2015.
####
cd ~/my-documents
GRGK="essai-greengeek.txt";NH="/tmp/no-header"
MRM="/tmp/more_readable_msgs.txt"
FST="/tmp/1st-field";SND="/tmp/2nd-field"
#
awk -F":" '{ print $2 }' "$GRGK" > "$NH"
awk '$1 !~ /02/ { print "\n\t" $0 }' "$NH" > "$SND"
awk '$1 ~ /02/ { print "\n" $0 }' "$NH" > "$FST"
paste "$FST" "$SND" > "$MRM"
clear;cat "$MRM";echo # You can comment out this line.

Illustrated below. BFN.

musher0

greengeek · #33 Post by **greengeek** » Thu 02 Jul 2015, 09:49

musher0 wrote:So here's my take on it,
double-spaced!

Hi musher0,
double spacing is good. I can interleave the "sentbox" messages in the gaps.
Please excuse the Babelfish grammar!
Thanks a lot.

EDIT: There is also another challenge:
the complete text file contains more data that needs to be ignored.

My apologies!

greengeek · #34 Post by **greengeek** » Thu 02 Jul 2015, 10:29

MochiMoppel wrote:You should attach an example of your raw messages, otherwise we don't know what kind of characters you are dealing with.

You should also clarify if you are still searching for a way to do it with Geany's Find & Replace or without Geany. Ironically Geany's regex patterns can be more powerful than what you can use with sed, but since you can't chain these commands as you would with sed, Geany is limited to one command - not enough for complex tasks.

Well, I started with manual extraction of the data from each raw SMS text file, then I realised the search/replace functions of Geany could make the job so much easier. Then I read about regex functions and thought they might help me get the most out of Geany. Then, as a result of the various comments on this thread, I realised that I don't have to process one file at a time - I can handle a whole directory of SMS files. It is unlikely, though, that Geany will be the best way to handle an entire directory of raw SMS backups (although it does appear to have the ability to handle an entire "session" rather than just a single document).

So I am at the point where I would like to continue this thread for the purpose of learning more about the power of regex functions within Geany (I'm sure I will need a better understanding of this), but I think I should also start a new thread to look at various methods of handling the specific SMS backup files that were the original source of the text files I started with in this thread.

The new thread evaluating methods of extracting the required data from the backed up sms files is here

6502coder · #35 Post by **6502coder** » Thu 02 Jul 2015, 21:12

I have a new version of the dt.awk script. See my post in your new thread.

greengeek · #36 Post by **greengeek** » Wed 22 Jul 2015, 19:27

Here is an important note from technosaurus:

technosaurus wrote:Just thought I would mention that as of 1.25 geany (released this month) has a checkbox to allow multiline regex or otherwise uses sed-style matching.

Quoted from other thread here

MochiMoppel · #37 Post by **MochiMoppel** » Mon 10 Aug 2015, 15:37

Today I took a first look at this new "multiline regex" thing. I'm not impressed. Unless I did something wrong, performing multiline searches can get terribly complicated. In normal regex patterns something like foo.*bar will match foo and bar and any characters between them, provided that foo and bar are on the same line. The multiline option doesn't remove this limitation: the '.' wildcard still doesn't match linefeeds.

(old)Puppy Linux Discussion Forum
