How to extract a sub-string in the middle of a main string?

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#1 Post by musher0 »

Hello all.

As the title says.
I am creating a new thread on this subject because I cannot find BK's original.
(# Edit, an hour later: please see below, MochiMoppel found it.)

Barry asked an interesting question, because we often need to fish out some info
in the middle of a main string, and it seems there is no straightforward way to do it
in Bash.

I did provide a tentative solution at the time, based on the delimiter in Barry's
example, which was a double-colon, IIRC.
(# Edit, an hour later: Some other developers did as well.)

The following is NOT based on a delimiter, but please see the concluding note in
the example below.

Feedback welcome.

BFN

~~~~~~~~~~~~~~~~~~~~~

Code:

###############
# -- Trying to mimic "Intersection" in Boolean logic --
#
# (Please pardon my French, "Intersection" may have another name in English.
# I mean the C part when circles A and B overlap.)

x=2 ########################## How many char. we want (approx.) out of the
Z="ba;be;bi;bo;bu;by" ######## middle of this string. (Acts as an "accordion".)
############################## Note 1) : x should not be 1. 
############################## Note 2) : not considering delimiters. The
############################## string could as well be "Fish and potatoes",
############################## which has the same length.

a="${#Z}" #################### We fetch the length of the string.
echo $a

b="`echo "($a/$x)-1" | bc`" ## Will give us the length of the string divided
############################## by the number of characters that we want minus
echo $b ###################### one because position one of the string in human
############################## terms is position zero as bash understands it.

c="`echo "$a-($x*$b)" | bc`" # This is the actual number of characters that
echo $c ###################### we will get from the middle of the string.

echo "${Z:$b:$c}" ############ If x=1 above, we get the last character 

# b="`expr $b + $c`" ######### Same intersection starting from the end of the 
# echo "${Z: -$b:$c}" ######## string, with the new $b to compensate for the
############################## backwards calculation.

# Results:
# 17
# 7 (char. 8 in human terms)
# 3 (size of sub-string)
# i;b (the contents in the middle of the string)

############################## About delimiters: to provide a more general
############################## solution, this script should contain a "delimiter 
############################## detector". This was the idea in my 1st example;
############################## is the delimiter "b" or ";"?
############################## (With a view of confusing the script and the 
############################## script writer!) 
#
############################## Once we have this "delimiter selector" (ideally),
############################## the script could use awk or cut.
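
For comparison, here is a minimal sketch of a centred extraction that asks directly for the number of characters wanted instead of deriving it from a divisor. It relies only on bash substring expansion. The function name `middle_chars` is hypothetical, not something from this thread, and the choice to leave any odd leftover character on the right is an assumption.

```bash
#!/bin/bash
# Hypothetical helper (not from the thread): print the N characters
# centred in a string, using only bash parameter expansion.
middle_chars() {
    local str=$1 want=$2
    local len=${#str}
    # Never ask for more characters than the string holds.
    (( want > len )) && want=$len
    # Integer division: when the leftover is odd, the extra character
    # stays on the right-hand side of the window.
    local start=$(( (len - want) / 2 ))
    printf '%s\n' "${str:start:want}"
}

middle_chars "ba;be;bi;bo;bu;by" 3   # prints: i;b
middle_chars "ba;be;bi;bo;bu;by" 5   # prints: bi;bo
```

With 3 characters requested this gives the same `i;b` as the x=2 run above, because the window starts at offset 7 in both cases.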
Last edited by musher0 on Sun 04 Mar 2018, 00:38, edited 1 time in total.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#2 Post by MochiMoppel »



#3 Post by musher0 »

Thanks for the reference, MochiMoppel.

#4 Post by MochiMoppel »

Not sure if I comprehend what you are trying to achieve, but when I try x=3 it extracts
ba;be;bi;bo;bu;by

Shouldn't it be
ba;be;bi;bo;bu;by ?

And if your intention is to use bash, then there is no point in using bc:

Code:

b="`echo "($a/$x)-1" | bc`"
should be the same as the pure bash arithmetic

Code:

b=$((a/x-1))
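
To make both points concrete, here is a sketch of the whole calculation from the first post with bc replaced by bash arithmetic throughout. The wrapper function `middle_by_x` is a hypothetical name added for illustration; the formulas for a, b and c are the ones from the first post. Run with x=3, it returns `e;bi;`, which sits left of centre in the 17-character string.

```bash
#!/bin/bash
# Same arithmetic as in the first post, with bc replaced by $(( )).
# middle_by_x is a hypothetical wrapper, not part of the original script.
middle_by_x() {
    local Z=$1 x=$2
    local a=${#Z}              # length of the string
    local b=$(( a / x - 1 ))   # start offset (integer division, like bc at scale 0)
    local c=$(( a - x * b ))   # number of characters to take
    printf '%s\n' "${Z:b:c}"
}

middle_by_x "ba;be;bi;bo;bu;by" 2   # prints: i;b
middle_by_x "ba;be;bi;bo;bu;by" 3   # prints: e;bi;
```

For x=3 the numbers are a=17, b=17/3-1=4 and c=17-3*4=5, so the window covers offsets 4 to 8, while a centred 5-character window would start at offset 6.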


#5 Post by musher0 »

Thanks, MochiMoppel.

I'll give it another look.

~~~~~~~~~~~
@all:

The following could be a complement to the above, or be used separately. I think it is
fairly well commented, but if you have questions, please ask them.

I spent the night on this, so I am tired and I will not do much introduction at this point. It
will have to wait until I am rested.

Here goes.

Code:

#!/bin/bash
# /opt/local/bin/FishPotatoes.sh # Alpha version.
# (Or place this script in any "/bin" IN YOUR $PATH.)
#
# Goal: Find the delimiter in a text or csv file.
#
# Uses Puppy-provided utilities: awk, paste, seq, sort. (I.e., no outside dependencies.)
#
# Usage: 1) copy your *.txt or *.csv file to, or paste it into,
# a general file called "text" in /root;
# 2) run this script from terminal.
#
# Example: open a terminal, copy your *.txt or *.csv file to "text" and type
# < FishPotatoes.sh >. # (That's it!)
#
# IMPORTANT -- Maximum size for the input text or csv file : 2Mb.
# CAUTION -- This script looks somewhat complete, and it is, but it is not perfect.
# E.g., I still do not know why a "|" delimiter shows up in the list when there is none
# in the originating text or string.  It does the job; however I feel that it has not
# been tested enough. So... NO guarantees whatsoever.  You are more than
# welcome to do some tests with it and report back on the thread: TIA.
#
# © Christian L'Écuyer, Gatineau (Qc), Canada, 2018-03-04. GPL3.
# (Alias musher0 [forum Puppy].) #
#################   # https://opensource.org/licenses/GPL-3.0
#    This program is free software: you can redistribute it and/or modify it under the
#    terms of the GNU General Public License as published by the Free Software
#    Foundation, either version 3 of the License, or  (at your option) any later version.
#         This program is distributed in the hope that it will be useful, but WITHOUT ANY
#    WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
#    A PARTICULAR PURPOSE. See the GNU General Public License for more details.
#         You should have received a copy of the GNU General Public License along with
#   this program. If not, see <http://www.gnu.org/licenses/>.
##########
#   Ce programme est libre : vous pouvez le redistribuer ou modifier selon les termes
#   de la Licence Publique Générale GNU publiée par la Free Software Foundation (v. 3
#   ou toute version ultérieure choisie par vous).
#       Ce programme est distribué dans l'espoir qu'il sera utile, mais SANS AUCUNE
#   GARANTIE, ni explicite ni implicite, y compris des garanties de commercialisation
#   ou d'adaptation à un but spécifique. Pour plus de détails, veuillez vous reporter
#   au texte officiel de cette licence à https://opensource.org/licenses/GPL-3.0, à
#   http://www.linux-france.org/article/these/gpl.html pour une traduction et, pour une
#   explication en français, à https://fr.wikipedia.org/wiki/Licence_publique_générale_GNU.
################ # set -xe
# Input # Getting the text externally is more practical.
Text="";Text="`paste -sd'\0' ~/text`"  # Text="ba;be;bi;bo;bu;by" # For test.
echo -e "The text is:\n`cat ~/text`\n"  # echo -e "The text is:\n$Text" # For test.

# Process
delim="";delim=(';' ',' '|' '\t' ' ' ':' '-') ## Standard delimiters that one comes across in csv files.
Sommaire="";champ=0;fois=0
for i in `seq ${#delim[@]}`;do
     fois="`echo -e "$Text" | tr "${delim[$champ]}" "\n" | wc -l`" # Gives us N occurrences.
     fois="`expr $fois - 1`" # Removing the LF.
     Sommaire="$Sommaire$fois -${delim[$champ]}-\n" # Gathers data for the report.
     champ="`expr $champ + 1`" # We prepare to query the next delimiter.
done

# Report
echo -e "~~~~~~~~~~~~~~~~~~~~~~\n\nDelimiter statistics:\n$Sommaire"

Several="`echo -e "$Sommaire" | awk '$1 > 0 { print }' | wc -l`"
if [ "$Several" -gt "1" ];then # In case there is more than one delimiter in this text.
     echo -e "~~~~~~~~~~~~~~~~~~~~~~\n\nThere are several standard delimiters in this text:"
     echo -e "$Sommaire" | awk '$1 > 0 { print }' | sort -n -k 1 -r
fi

echo -e "\n~~~~~~~~~~~~~~~~~~~~~~\n\nThe main delimiter is:"
MainDlmtr="`echo -e $Sommaire | sort -n -k 1 -r | head -1 | cut -d" " -f2`"
if [ "${MainDlmtr}" = "-" ];then
     echo "$MainDlmtr -" # To make obvious the < space > delimiter.
else
     echo "$MainDlmtr"
fi
echo -e "\n~~~~~~~~~~~~~~~~~~~~~~" # set +xe
exit
### 30 ###

#######################################
# To bring a chuckle out of you (hopefully!), here is the text
# that I used for my main test.  The apparent "confusion" between
# punctuation marks and standard csv delimiters is intentional.
#
# In case you are wondering, this part does not need to be
# commented out because of the exit command above.
#
   -- MUSHER0'S DINER --

--- Today's Dinner Menu ---

Appetizer -- Your choice of
soup, or salad, or vegetable juice;

Main course -- Your choice of
Dad's generous steak and mashed potatoes
with pepper gravy, or Aunt Audette's
beautiful whitefish and carrots in lemon
sauce;

Dessert -- Your choice of
Cousin Norman's "Drip" (vanilla ice cream
with caramel sauce), a slice of Mom's tasty
fruit cake, or one of my succulent donuts in
maple syrup;

Beverage -- Your choice of
coffee, tea, or home-made spruce beer.

Please ask your waitress or waiter if you wish
to see our full menu.
#
# @Sailor Enceladus: I had fun writing it! ;)
#
# BTW, first dinner will be free for visiting Puppyists if ever I
# open that diner North of Lake Superior, near Montreal River
# and Highway 17. :-)
#######################################
BFN.
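
As a point of comparison for the counting loop above, here is a minimal sketch that counts candidate delimiters without tr, wc or expr: it deletes every occurrence of a character from the string and takes the drop in length as the count. The function name `count_delims` and the bracketed output format are assumptions made for this example; the candidate list is the one from the script.

```bash
#!/bin/bash
# Hypothetical helper: count how often each candidate delimiter
# occurs in a string, using only bash parameter expansion.
count_delims() {
    local text=$1 d stripped
    local candidates=(';' ',' '|' $'\t' ' ' ':' '-')
    for d in "${candidates[@]}"; do
        # Delete every occurrence of $d (quoted, so it is taken literally);
        # the difference in length is the number of occurrences.
        stripped=${text//"$d"/}
        printf '%d\t[%s]\n' $(( ${#text} - ${#stripped} )) "$d"
    done
}

# Highest count first; the top line names the most frequent candidate.
count_delims "ba;be;bi;bo;bu;by" | sort -rn | head -n 1
# prints 5, a tab, then [;]
```

To examine a file instead of a literal string, the contents can be passed in as `count_delims "$(cat ~/text)"`.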
Attachments
Diner-Menu.txt.jpg
The top is missing, but I am sure that you get the idea.

#6 Post by musher0 »

Hi.

This is where I'm at.

This version provides the true middle of the text and the middle of the lines.
(Please see attached illustration.)

It works fine with txt and csv files, as far as I can tell. However, I feel that the results
are not reliable with quasi-text file types such as xml and sh files.

Feedback welcome.

BFN.
~~~~~~~~~~~~~~~~~~~~~~~~

Code:

#!/bin/bash
# /opt/local/bin/FishInTheMiddle.sh # (Or place this
# script in any "/bin" directory in your $PATH.)
#
# Goal -- Find the delimiter in a text or csv file and show the middle field(s)
# -- of the entire text and of the individual lines.
#
# Uses Puppy-provided utilities: awk, paste, seq, sort. (I.e., no outside dependencies.)
#
# Usage -- In terminal, type < FishInTheMiddle.sh filename >,
# or just < FishInTheMiddle.sh >, and answer the prompt.
#
# Limitations at this time: plain *.txt and *.csv files; not reliable for *.xml or *.sh files.
# (Other quasi-text file types may also display strange results.)
#
# IMPORTANT -- Maximum size of the input file: 2Mb.
#
# © Christian L'Écuyer, Gatineau (Qc), Canada, 2018-03-04 and 07. GPL3.
# (Alias musher0 [forum Puppy].) #
#################   # https://opensource.org/licenses/GPL-3.0
#    This program is free software: you can redistribute it and/or modify it under the
#    terms of the GNU General Public License as published by the Free Software
#    Foundation, either version 3 of the License, or  (at your option) any later version.
#         This program is distributed in the hope that it will be useful, but WITHOUT ANY
#    WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
#    A PARTICULAR PURPOSE. See the GNU General Public License for more details.
#         You should have received a copy of the GNU General Public License along with
#   this program. If not, see <http://www.gnu.org/licenses/>.
##########
#   Ce programme est libre : vous pouvez le redistribuer ou modifier selon les termes
#   de la Licence Publique Générale GNU publiée par la Free Software Foundation (v. 3
#   ou toute version ultérieure choisie par vous).
#       Ce programme est distribué dans l'espoir qu'il sera utile, mais SANS AUCUNE
#   GARANTIE, ni explicite ni implicite, y compris des garanties de commercialisation
#   ou d'adaptation à un but spécifique. Pour plus de détails, veuillez vous reporter
#   au texte officiel de cette licence à https://opensource.org/licenses/GPL-3.0, à
#   http://www.linux-france.org/article/these/gpl.html pour une traduction et, pour une
#   explication en français, à https://fr.wikipedia.org/wiki/Licence_publique_générale_GNU.
################ # set -xe
Text="$1"
if [ "$Text" = "" ];then
     echo -e "\n\t\e[36m\e[4m\e[1mPlease type the filename of the text that\e[24m
\t \e[4myou wish to examine. Type the full path
\t\t\e[4mif not in this directory.\e[0m"
     read Text
fi
if [ ! -f "$Text" ];then
     echo -e "\n\t\e[1m\e[5m\e[7m\e[31mFile $Text does not exist. Please retry.\e[0m\n"
     sleep 5;clear
     exit
fi
echo

Pasted="`paste -sd'\0' "$Text"`"  # Pasted="ba;be;bi;bo;bu;by" # For test.
echo -e "The text is:\n"
more "$Text"
sleep 2s

# Process
delim="";delim=(';' ',' '|' ' ' ':' '-') ## Standard delimiters that one comes across in csv files.
Sommaire="";champ=0;fois=0
for i in `seq ${#delim[@]}`;do
     fois="`echo -e "$Pasted" | tr "${delim[$champ]}" "\n" | wc -l`" # Gives us N occurrences.
     fois="`expr $fois - 1`" # Removing the LF.
     Sommaire="$Sommaire$fois -${delim[$champ]}-\n" # Gathers data for the report.
     champ="`expr $champ + 1`" # To query the next delimiter.
done

# Report
echo -e "~~~~~~~~~~~~~~~~~~~~~~\n\nDelimiter statistics:\n$Sommaire"

Several="`echo -e "$Sommaire" | awk '$1 > 0 { print }' | wc -l`"
if [ "$Several" -gt "1" ];then # In case there is more than one delimiter in this text.
     echo -e "~~~~~~~~~~~~~~~~~~~~~~\n\nThere are several standard delimiters in this text:"
     echo -e "$Sommaire" | awk '$1 > 0 { print }' | sort -n -k 1 -r
fi

echo -e "\n~~~~~~~~~~~~~~~~~~~~~~\n\nThe main delimiter is:"
MainDlmtr="`echo -e $Sommaire | sort -n -k 1 -r | head -1 | cut -d" " -f2`"
if [ "${MainDlmtr}" = "-" ];then
     MainDlmtr="- -"
     echo "$MainDlmtr" # To make obvious the < space > delimiter.
else
     echo "$MainDlmtr"
fi
echo -e "\n~~~~~~~~~~~~~~~~~~~~~~"
sep="${MainDlmtr:1:1}"
Fish="`echo "$Pasted" | awk -F"$sep" '{print $((NF/2)+1)}'`"
MidLine="`cat "$Text" | awk -F"$sep" '{print "\t"$((NF/2)+1)}'`"
echo -e "\nThe field in the middle of the text is:\n\n\t$Fish"
echo -e "\n~~~~~~~~~~~~~~~~~~~~~~\n"
if [ "`wc -l < "$Text"`" -gt "1" ];then
     echo -e "The fields in the middle of each line are:\n\n$MidLine"
     echo -e "\n~~~~~~~~~~~~~~~~~~~~~~\n"
fi # set +xe
exit
### 30 ###
~~~~~~~~~~~~~~~~~~~~~~~~

The test texts were:
AA;BB;CC;DD;EE
---;---;---;---;---
ba;ze;ci;xo;du
bark;meow;howl;buzz;snort
along with the "restaurant menu" previously provided.
Attachments
Where-Im-at.jpg

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#7 Post by slavvo67 »

@Musher0

Interesting. I never thought to have the computer search for delimiters. Generally, I know them going in, but it is an interesting concept for sure.

Thanks,

Slavvo67


#8 Post by musher0 »

Hi slavvo67.

You're welcome.

Like you, I generally know what the delimiter is in a csv file. And if I don't, I can screen
the file with cat or more to find out.

The goal of these two scripts is to try to discover the "middle" of a text file or string
automatically. We have the head and tail utilities for tops and bottoms, but nothing for
middles, AFAIK.
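
For whole lines, a "middle" counterpart to head and tail can be sketched with wc and sed. The function name `middle_line` is hypothetical, and picking the upper of the two central lines when the line count is even is an arbitrary choice made for this example.

```bash
#!/bin/bash
# Hypothetical helper: print the middle line of a file.
# With an even number of lines, the upper of the two central lines is printed.
middle_line() {
    local file=$1 total mid
    total=$(wc -l < "$file")
    (( total > 0 )) || return 1     # empty file: nothing to print
    mid=$(( (total + 1) / 2 ))
    sed -n "${mid}p" "$file"
}

tmp=$(mktemp)
printf '%s\n' "one" "two" "three" "four" "five" > "$tmp"
middle_line "$tmp"   # prints: three
rm -f "$tmp"
```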

Now, to get to the middle, one method is to know the delimiter, so that the text can be
parsed coherently, as in this second script. It turns out to be quite precise.
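
When the delimiter is already known, the middle-field step can be isolated in a few lines. This is a sketch under the assumption of a single-character delimiter; the function name `middle_field` is hypothetical, and `int()` is used to make the truncation of NF/2 explicit.

```bash
#!/bin/bash
# Hypothetical helper: print the middle field of each input line,
# given a known single-character delimiter.
middle_field() {
    local sep=$1
    # int(NF / 2) + 1 is field 3 of 5, and field 3 of 4.
    awk -F"$sep" '{ print $(int(NF / 2) + 1) }'
}

printf '%s\n' "AA;BB;CC;DD;EE" "bark;meow;howl;buzz;snort" | middle_field ';'
# prints:
# CC
# howl
```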

The first script tried a different method, based on the length of the file, plus the
proportion (or fraction) of the middle that the user wants to see. That first script worked
like an accordion, an elastic, or a small window relative to the entire text. It is less
precise, but for running text (as opposed to databases), this "elastic" approach may perhaps
produce more "meaning".

I'm still not sure if these scripts will be useful. For now I consider them "Studies".
(Chopin wrote more beautiful ones, I know!!!)

BFN.

#9 Post by slavvo67 »

Really cool stuff. It's funny, because I see from the past that you and I look at very similar things, such as file conversions and how to make them work well in Puppy or Linux in general.

I like seeing your work. Thanks...

Slavvo67
