AWK: match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48

#1 Post by s243a »

I want to use AWK to match libcN, where 'N' is the major version number. I'm running AWK on a Puppy database file for a package repo. These repo files follow the petspet format, where the second field (i.e. $2) is the package name. To the best of my understanding, the regular expression for this should be:

Code: Select all

match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)
https://www.gnu.org/software/gawk/manua ... tions.html

but for some inexplicable reason it appears to be matching 'g' as a numeric digit even though the docs say the following:
A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.
....
[:digit:] Numeric characters
https://www.gnu.org/software/gawk/manua ... xpressions

Here is my debugging output which shows the awk program:

Code: Select all

++ cat /var/packages/repo/Packages-devuan-ascii-non-free
++ awk '    BEGIN{FS="|"}
    {
      match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)
      if ( pkg_split[1] == "libc" ) {
        print
      }
    }'
+ awk_result='libcg_3.1.0013-2+b1|libcg|3.1.0013-2+b1||BuildingBlock|11609K|pool/DEBIAN/non-free/n/nvidia-cg-toolkit|libcg_3.1.0013-2+b1_i386.deb|+libc6&ge2.3.6-6|Nvidia Cg core runtime library|devuan|ascii|'
+ '[' '!' -z 'libcg_3.1.0013-2+b1|libcg|3.1.0013-2+b1||BuildingBlock|11609K|pool/DEBIAN/non-free/n/nvidia-cg-toolkit|libcg_3.1.0013-2+b1_i386.deb|+libc6&ge2.3.6-6|Nvidia Cg core runtime library|devuan|ascii|' ']'

The debugging output is produced as follows:

Code: Select all

bash -x /usr/sbin/pkg-list-alias libc 2>&1 | tee pkg_list_alias.log
and my script can be found at:
https://pastebin.com/Yb7gNV2r

which is an updated version of a script which I discussed at:
http://murga-linux.com/puppy/viewtopic. ... 47#1037047

Here is the line of code which calls the AWK program:

Code: Select all

awk_result="$(cat $aRepoDB | awk "$AWK_PRG")"
Last edited by s243a on Sat 21 Sep 2019, 12:55, edited 6 times in total.
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#3 Post by technosaurus »

It may save you some time to test it here:
https://regex101.com/

When you use parens, you can usually print out the matches with \N for debugging (where N is the Nth set of parens), I don't recall how to do it in awk though.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#4 Post by MochiMoppel »

Please post a simple example. Your input string and the expected output.
Your regex pattern looks wrong as you may need an additional set of square brackets.

Burunduk
Posts: 80
Joined: Sun 21 Aug 2011, 21:44

#5 Post by Burunduk »

As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Note also that an array argument is a GAWK extension not supported by the busybox awk.
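
The difference between the two bracket forms can be checked with a small sketch. The helper names naive_class and posix_class are made up for illustration, and the sketch assumes an awk that understands POSIX character classes (gawk and busybox awk do):

```shell
# Test a single character against each bracket expression.
naive_class() {
  # [^:digit:] is just a negated set of the five characters : d i g t
  printf '%s\n' "$1" | awk '{ print (($0 ~ /^[^:digit:]$/) ? "match" : "no match") }'
}
posix_class() {
  # [^[:digit:]] is a negated set containing the class of decimal digits
  printf '%s\n' "$1" | awk '{ print (($0 ~ /^[^[:digit:]]$/) ? "match" : "no match") }'
}

naive_class g   # no match: "g" is one of the excluded letters
naive_class 6   # match: "6" is not in the set : d i g t
posix_class g   # match: "g" is not a digit
posix_class 6   # no match
```

This is why libcg was split into "libc" and "g" by the original pattern: the "g" was excluded from the first group just as a digit was meant to be.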

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48

#6 Post by s243a »

Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Thank you. That was most helpful :). Now I get the correct debugging output:

Code: Select all

++ cat /var/packages/repo/Packages-devuan-ascii-main
++ awk '    BEGIN{FS="|"}
    {
      match($2,/^(.*[^[:digit:]])([[:digit:]]*$|$)/,pkg_split)
      if ( pkg_split[1] == "libc" ) {
        print
      }
    }'
+ awk_result='libc6_2.24-11+deb9u4|libc6|2.24-11+deb9u4||BuildingBlock|9579K|pool/DEBIAN/main/g/glibc|libc6_2.24-11+deb9u4_i386.deb|+libgcc1|GNU C Library: Shared libraries|devuan|ascii|'
At first I didn't read your post carefully enough, so I only fixed the first bracket expression, [^[:digit:]], and didn't realize that I also needed to double the square brackets in the second one, [[:digit:]]. In hindsight, I can see how this would make parsing easier for awk.
Burunduk wrote:Note also that an array argument is a GAWK extension not supported by the busybox awk.
I was wondering about that. This is good to know. We can do something similar with grep, if we have the full version of grep but not awk. However, awk is more efficient for this application.

Edit: an updated version of the original script (with the bracket fix) can be found at: https://pastebin.com/KtUikhdS
Last edited by s243a on Sat 21 Sep 2019, 17:44, edited 1 time in total.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#7 Post by MochiMoppel »

Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here.
Yes, there :lol:
Burunduk wrote:Note also that an array argument is a GAWK extension not supported by the busybox awk.
Methinks that s243a doesn't need array arguments at all. For pulling out a leading non numeric string something like this should do

Code: Select all

       sub(/[0-9].*/,"",$2)
       print $2
but I don't know what his strings look like and what he wants to achieve.

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48

#8 Post by s243a »

MochiMoppel wrote:
Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here.
Yes, there :lol:
Note also that an array argument is a GAWK extension not supported by the busybox awk.
Methinks that s243a doesn't need array arguments at all. For pulling out a leading non numeric string something like this should do

Code: Select all

       sub(/[0-9].*/,"",$2)
       print $2
but I don't know what his strings look like and what he wants to achieve.
I want the entire repo db record. So something like the following might also work (untested):

Code: Select all

pkg = gensub(/[0-9].*/, "", "g", $2)  # gensub is gawk-only; $2 itself is not modified
if ( pkg == "libc" ) {                # == compares; a single = would assign
  print
}
https://www.gnu.org/software/gawk/manua ... tions.html

**Note that I like the array syntax because it is more general and more efficient than the gensub approach even if it isn't as widely supported.
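
For awk implementations without gensub() or the array argument to match() (busybox awk, for instance), the same filter can be sketched with sub() on a copy of the field, so the record stays intact for printing. The function name filter_stem is hypothetical, the two sample records are abbreviated from the debugging output earlier in the thread, and the pattern here strips trailing digits only, which is an assumption about the intended match:

```shell
# Print every record whose package name (field 2), with any trailing
# digits removed, equals the requested stem.
filter_stem() {
  awk -v stem="$1" 'BEGIN { FS = "|" }
  {
    name = $2                  # work on a copy so $0 is left unchanged
    sub(/[0-9]+$/, "", name)   # drop trailing digits, e.g. libc6 -> libc
    if (name == stem) print    # print the whole repo db record
  }'
}

# Abbreviated sample records (first three fields only).
printf '%s\n' \
  'libc6_2.24-11+deb9u4|libc6|2.24-11+deb9u4|' \
  'libcg_3.1.0013-2+b1|libcg|3.1.0013-2+b1|' | filter_stem libc
```

With these two records only the libc6 line is printed, since "libcg" has no trailing digits to strip and so does not reduce to "libc".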

As a side note, I thought the '6' in libc6 was the major package version, but I see from above that the package version is 2.24-11+deb9u4. However, if I look at the file names in the package, the actual lib is called:

Code: Select all

/lib/i386-linux-gnu/libc.so.6
https://packages.debian.org/stretch/i386/libc6/filelist

By Linux conventions, the '6' should be the version of the lib (the soname version) rather than the version of the package:
3.1.1. Shared Library Names

Every shared library has a special name called the ``soname''. The soname has the prefix ``lib'', the name of the library, the phrase ``.so'', followed by a period and a version number that is incremented whenever the interface changes (as a special exception, the lowest-level C libraries don't start with ``lib''). A fully-qualified soname includes as a prefix the directory it's in; on a working system a fully-qualified soname is simply a symbolic link to the shared library's ``real name''.
http://tldp.org/HOWTO/Program-Library-H ... aries.html

I didn't realize that the package versions were different than the lib versions for some linux packages. I will think about the implications of this distinction.

I will note that some package repos do not use the lib version as a suffix in the package name.

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#9 Post by technosaurus »

Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Note also that an array argument is a GAWK extension not supported by the busybox awk.
\d is also equivalent to [[:digit:]] in some regex engines, such as PCRE.
You can emulate n-dimensional arrays in busybox awk by separating the indices with a comma (awk joins them into a single key using SUBSEP).
... so instead of array[i][j][k] you'd use array[i,j,k] (works in gawk & mawk too)
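
A minimal sketch of the comma-subscript technique, using only POSIX awk features. The array name, keys, and the demo_subsep wrapper are made up for illustration:

```shell
# Store a value under a two-part key, then recover the parts again.
demo_subsep() {
  awk 'BEGIN {
    ver["libc", "major"] = 6        # key is "libc" SUBSEP "major"
    for (key in ver) {
      split(key, part, SUBSEP)      # part[1] = "libc", part[2] = "major"
      print part[1], part[2], ver[key]
    }
  }'
}

demo_subsep   # libc major 6
```

Because the indices are concatenated into one string key, this is an emulation rather than a true nested array, and a key part must not itself contain the SUBSEP character.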

The link I posted is useful to build and test your regex, but I used to always just do a Ctrl+h in geany.

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48

#10 Post by s243a »

technosaurus wrote:It may save you some time to test it here:
https://regex101.com/

When you use parens, you can usually print out the matches with \N for debugging (where N is the Nth set of parens), I don't recall how to do it in awk though.
I found an example where awk behaves differently than this test program.

Code: Select all

# echo ac | awk "{match(\$1,/(:?a|b)(c|d)/,matches); print matches[1]}"
a
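If AWK supported non-capturing groups, then with the conventional (?:a|b) syntax the result would be "c". (As typed above, (:?a|b) is an ordinary capturing group meaning an optional ":" followed by "a", or else "b", so this particular command prints "a" in any regex engine.) AWK doesn't appear to support non-capturing groups. The following link seems to agree with my claim about AWK's limitation here: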
https://comp.unix.programmer.narkive.co ... gawk-regex
