Geany - automate "replace" with regular expressions??

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#21 Post by technosaurus »

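greengeek wrote:I just tried it and it works! Although I could do with another tab just before the text field.
Just use \1 \2\t as the replacement.
greengeek wrote:And I simply cannot figure out how you managed to dump the .2015
([.0-9]*)[.][0-9]*
It's the [.][0-9]* after the parentheses: it matches the trailing .2015 but sits outside the capture group, so the year is not part of \1.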
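The effect of leaving the year outside the parentheses can also be checked outside Geany. The following is a minimal sketch, assuming a date of the form 26.06.2015 followed by a time, and using sed -E as a stand-in for Geany's replace dialog. The sample string and the variable names are made up for illustration.

```shell
# Sample line in the assumed "DD.MM.YYYY HH:MM:SS" form (illustrative data).
line='26.06.2015 15:00:00'

# ([.0-9]*) captures day and month; [.][0-9]* consumes the year,
# but it is outside the parentheses, so it is not part of \1.
result=$(printf '%s\n' "$line" | sed -E 's/([.0-9]*)[.][0-9]*/\1/')
printf '%s\n' "$result"    # prints: 26.06 15:00:00
```

To keep the year, move the closing parenthesis to the end of the pattern so the whole date is captured.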
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#22 Post by MochiMoppel »

greengeek wrote:how did you end up with a space between the end of the date field and the beginning of the text field in your image above?
Sorry, fooled again by a bug in my Opera browser. The browser adds spaces to text lines copied from the screen, so my sample text ended up being different from your original text.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#23 Post by greengeek »

technosaurus wrote:
And I simply cannot figure out how you managed to dump the .2015
([.0-9]*)[.][0-9]*
the [.][0-9]* after the parenthesis
Thanks - this works well. I will probably keep the .2015 for the messages I save from the inbox (for accurate timestamping) but delete it for the messages from my sentbox (as the inbox messages already provide the context), so this gives me the syntax I need for both options.

Still a few more trials to do...

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#24 Post by greengeek »

Thank you all for the different approaches you have suggested. I am using them all in different ways. Currently I am trying to extend the scripts to operate on the original sms backup files rather than the partially extracted ones that I have collated manually. The image below shows the format of the original raw "backup" of the full text message. If I use the various scripts you all have given me, they seem to behave differently when run against one of these files, as there appears to be a character at the end of the line that is disrupting the deletion of the "\nTEXT:" field. (I'm not explaining this very well...)
EDIT : Maybe the \n that works when I search the data in Geany is actually a different character in the original file - maybe a CR or LF??

Anyway, I did some tests to determine if instead of searching for the string "\nTEXT:" I could just search for "TEXT:" and replace it with as many backspaces as I needed. What I tried (manually) was to use Geany's replace function to replace "TEXT:" with two backspaces. I tried to use the regular expression of \\ in the hope it represented two backspaces but of course it doesn't - it represents a backslash. So how do I create two backspaces?

After a bit of googling I tried to replace TEXT: with \u0008 or \b but that gives me the strange character you can see in the picture. So is there a way to form a regular expression that Geany understands, that will allow me to apply multiple backspaces sequentially (followed by one or two tabs)?

EDIT 2 : Maybe I need to be searching for \RTEXT: as \R apparently matches any type of line feed, rather than \n which is more limited.

I am currently trying 6502coder's method, using the dt.awk file in combination with the following script to process a whole directory of files containing the raw sms data such as you can see in the image, but I just need to fine-tune the deletion of the TEXT: and add an extra backspace to bring that line up onto the previous line.
(The script is working for me but leaves the text data on the next line down from the date/time.)

Code: Select all

for i in /root/sms/*
do
    awk -f dt.awk "$i" >> data2.txt
done
dt.awk is:

Code: Select all

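# Print date and time from "Date:" lines, then the message from "TEXT:" lines.
{
    if (substr($1, 1, 5) == "Date:")
    {
        # $1 is "Date:" plus the date, $2 is the time; no newline is printed yet
        printf("%s\t%s", substr($1, 6), $2);
    }
    else if (substr($1, 1, 5) == "TEXT:")
    {
        # everything after "TEXT:" completes the output line
        printf("%s\n", substr($0, 6));
    }
}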
Attachments
backspace_character.jpg
(45.34 KiB) Downloaded 174 times

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#25 Post by technosaurus »

sed only processes one line at a time, so \n will never match. The same goes for awk unless you change RS (the record separator) in BEGIN{}.

to fix sed, you can pipe it through tr before and after to change newline characters to something sed can handle and then back
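The tr round trip described above could look roughly like this. It is only a sketch under stated assumptions: the input is a made-up two-line record with CR LF line endings, and the byte \001 is assumed never to occur in the messages, so it can stand in for the newline while sed works on one long line. The variable names sep, tab and result are illustrative.

```shell
sep=$(printf '\001')   # placeholder byte, assumed absent from the data
tab=$(printf '\t')

result=$(printf 'Date:26.06.2015 15:00:00\r\nTEXT:There has just got 2 be a better way\r\n' |
    tr -d '\r' |                      # drop the carriage returns
    tr '\n' '\001' |                  # newlines become the placeholder: one long line for sed
    sed "s/${sep}TEXT:/${tab}/g" |    # join each TEXT: line onto the line before it
    tr '\001' '\n')                   # turn the remaining placeholders back into newlines
printf '%s\n' "$result"
```

The date, the time and the message end up on one line, with a tab where the line break and "TEXT:" used to be.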

little nuances can sometimes make a big difference

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#26 Post by musher0 »

Hi, greengeek.

Ever tried replaceit? Perfect for
this, I would think.

BFN.

musher0
Attachments
replaceit-1.0.0.pet
(7.54 KiB) Downloaded 120 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#27 Post by greengeek »

musher0 wrote:Ever tried replaceit?
Thanks musher - I just had a look at your link and it sounds as if it would suit a number of uses but there is a note that may render it unsuitable for the way I am working
NOTE- ReplaceIt is a line-by-line file parser, so, it cannot (at this point) process across multiple \n[\r] terminated lines.
I am still unsure whether or not Geany has this ability - it appears to be able to work around \n and \R boundaries in searching, but not in replacing (or at least not when replacing something with a backspace).

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#28 Post by 6502coder »

It would help if you'd run "od" on one of your files so we can see exactly what the "end-of-line" sequence is:

Code: Select all

od  -c  oneOfYourMsgs.txt
Probably the message files are using the Microsoft "end-of-line" convention, which is "\r\n" as opposed to the Linux convention of just "\n". If so, as technosaurus says, you can adjust the awk script by inserting the following line at the start of dt.awk

Code: Select all

BEGIN { RS="\r\n" }
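One caveat: as far as I know, a multi-character RS is honoured by gawk, mawk and BusyBox awk, but POSIX leaves the behaviour unspecified. A more portable alternative is to strip the trailing carriage return from each record instead. The sketch below is not the dt.awk from this thread but a condensed variant written for illustration. The input is made up, and the tab printed before the message text is an addition that the original script does not have.

```shell
result=$(printf 'Date:26.06.2015 15:00:00\r\nTEXT:There has just got 2 be a better way\r\n' |
    awk '
        { sub(/\r$/, "") }    # strip a trailing carriage return; fields are re-split
        substr($1, 1, 5) == "Date:" { printf("%s\t%s", substr($1, 6), $2) }
        # the leading tab below is an illustrative addition
        substr($1, 1, 5) == "TEXT:" { printf("\t%s\n", substr($0, 6)) }
    ')
printf '%s\n' "$result"
```

Because the carriage return is removed before the fields are used, the time in $2 no longer carries a stray \r that an editor may display as a line break.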
BTW you don't need a for-loop to process all the files in a directory. If you give awk a list of files, it processes them one after another; and of course you can use the shell's wildcarding. So you can simply say

Code: Select all

awk  -f  dt.awk  /root/sms/*   > data2.txt

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#29 Post by greengeek »

6502coder wrote:It would help if you'd run "od" on one of your files so we can see exactly what the "end-of-line" sequence is:

Code: Select all

od  -c  oneOfYourMsgs.txt
Thanks - here is the result:

Code: Select all

# od -c sms.txt
0000000   B   E   G   I   N   :   V   M   S   G  \r  \n   V   E   R   S
0000020   I   O   N   :   1   .   1  \r  \n   X   -   I   R   M   C   -
0000040   S   T   A   T   U   S   :   R   E   A   D  \r  \n   X   -   I
0000060   R   M   C   -   B   O   X   :   I   N   B   O   X  \r  \n   B
0000100   E   G   I   N   :   V   C   A   R   D  \r  \n   V   E   R   S
0000120   I   O   N   :   2   .   1  \r  \n   N   ;   C   H   A   R   S
0000140   E   T   =   U   T   F   -   8   : 344 204 200 346 204 200 347
0000160 210 200 346 274 200 346 270 200 344 234 200 345 210 200 343 240
0000200 200 344 260 200 344 204 200 345 234 200 344 270 200   ;   ;   ;
0000220   ;  \r  \n   F   N   ;   C   H   A   R   S   E   T   =   U   T
0000240   F   -   8   :  \r  \n   T   E   L   :   +   6   4   2   2   4
0000260   7   8   x   x   x   x  \r  \n   E   N   D   :   V   C   A   R
0000300   D  \r  \n   B   E   G   I   N   :   V   E   N   V  \r  \n   B
0000320   E   G   I   N   :   V   B   O   D   Y  \r  \n   D   a   t   e
0000340   :   2   6   .   0   6   .   2   0   1   5       1   5   :   0
0000360   0   :   0   0  \r  \n   T   E   X   T   :   T   h   e   r   e
0000400       h   a   s       j   u   s   t       g   o   t       2    
0000420   b   e       a       b   e   t   t   e   r       w   a   y  \r
0000440  \n   E   N   D   :   V   B   O   D   Y  \r  \n   E   N   D   :
0000460   V   E   N   V  \r  \n   E   N   D   :   V   M   S   G  \r  \n
0000500
# 
I hope to try your code suggestions after work tonight.
Last edited by greengeek on Thu 02 Jul 2015, 17:43, edited 1 time in total.

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#30 Post by 6502coder »

Yup, good old Microsoft \r\n just as I suspected.
The fix in my previous post should take care of this.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#31 Post by MochiMoppel »

technosaurus wrote:sed only does 1 line at a time so \n will never match...
...but it can combine 2 lines, using the N command, and then it can match \n, or better - if the character before TEXT is unknown - it can match whatever character precedes TEXT.
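A minimal sketch of the N approach, assuming each Date: line is immediately followed by its TEXT: line and that the file may use CR LF line endings. The sample input and the variable names cr, tab and result are made up for illustration. The carriage return and tab are passed in through shell variables because \r and \t inside a sed script are not portable.

```shell
cr=$(printf '\r')
tab=$(printf '\t')

result=$(printf 'Date:26.06.2015 15:00:00\r\nTEXT:There has just got 2 be a better way\r\n' |
    sed "/^Date:/{
        N
        s/${cr}\{0,1\}\nTEXT:/${tab}/
        s/${cr}\$//
    }")
printf '%s\n' "$result"
```

N appends the next line to the pattern space, so the first substitution can see the embedded newline and replace it, together with an optional carriage return and the TEXT: label, by a tab. The second substitution removes the carriage return left at the end of the joined line.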

@greengeek: You should attach an example of your raw messages, otherwise we don't know what kind of characters you are dealing with.

You should also clarify if you are still searching for a way to do it with Geany's Find & Replace or without Geany. Ironically Geany's regex patterns can be more powerful than what you can use with sed, but since you can't chain these commands as you would with sed, Geany is limited to one command - not enough for complex tasks.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#32 Post by musher0 »

Hi, greengeek.

You're right, replaceit could do only part of the job. So here's my take on it,
double-spaced! :)

Code: Select all

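#!/bin/ash
# greengeek_msgs2.sh
# musher0, July 2nd 2015.
####
cd ~/my-documents || exit 1
GRGK="essai-greengeek.txt"; NH="/tmp/no-header"
MRM="/tmp/more_readable_msgs.txt"
FST="/tmp/1st-field"; SND="/tmp/2nd-field"
#
# Keep only the second colon-separated field of every line.
awk -F":" '{ print $2 }' "$GRGK" > "$NH"
# Lines whose first word does not contain "02": blank line, tab, then the line.
awk '$1 !~ /02/ { print "\n\t" $0 }' "$NH" > "$SND"
# Lines whose first word contains "02": blank line, then the line.
awk '$1 ~ /02/ { print "\n" $0 }' "$NH" > "$FST"
# Merge the two files side by side.
paste "$FST" "$SND" > "$MRM"
clear; cat "$MRM"; echo # You can comment out this line.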
Illustrated below. BFN.

musher0
Attachments
capture31318.jpg
(12.46 KiB) Downloaded 183 times

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#33 Post by greengeek »

musher0 wrote:So here's my take on it,
double-spaced! :)
Hi musher0
Double spacing is good. I can interleave the "sentbox" messages in the space.
Please excuse the Babelfish grammar!
Thanks very much.

EDIT : There is also another challenge:
The complete text file contains more data to ignore.

My apologies!
Attachments
context.jpg
(27.33 KiB) Downloaded 153 times
double-espace.jpg
(9.19 KiB) Downloaded 145 times

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#34 Post by greengeek »

MochiMoppel wrote:You should attach an example of your raw messages, otherwise we don't know what kind of characters you are dealing with.

You should also clarify if you are still searching for a way to do it with Geany's Find & Replace or without Geany. Ironically Geany's regex patterns can be more powerful than what you can use with sed, but since you can't chain these commands as you would with sed, Geany is limited to one command - not enough for complex tasks.
Well, I started with manual extraction of the data from each raw sms text file, then I realised the search/replace functions of Geany could make the job so much easier. Then I read about regex functions and thought that might help me get the most out of Geany. Then, as a result of the various comments on this thread, I realised that I don't have to process one file at a time: I can handle a whole directory of sms files. It is unlikely that Geany will be the best way to handle an entire directory of raw sms backups (although it does appear to be able to handle an entire "session" rather than just a single document).

So I am at the point where I would like to continue this thread for the purposes of learning more about the power of regex functions within Geany (I'm sure I will need a better understanding of this), but I think I should also start a new thread to look at various methods of handling the specific sms backup files that were the original source of the text files I started with in this thread.

The new thread evaluating methods of extracting the required data from the backed up sms files is here

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#35 Post by 6502coder »

I have a new version of the dt.awk script. See my post in your new thread.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#36 Post by greengeek »

Here is an important note from technosaurus:
technosaurus wrote:Just thought I would mention that as of 1.25 geany (released this month) has a checkbox to allow multiline regex or otherwise uses sed-style matching.
Quoted from other thread here

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#37 Post by MochiMoppel »

Today I could take a first look at this new "multiline regex" thing. I'm not impressed. Unless I did something wrong, multiline searches can get terribly complicated. In normal regex patterns something like foo.*bar will match foo and bar and any characters between them, provided that foo and bar are on the same line. The multiline option doesn't remove this limitation: the '.' wildcard still doesn't match linefeeds.
