fold-BB-NOTUSED - IT SHOULD BE USED!

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

fold-BB-NOTUSED - IT SHOULD BE USED!

#1 Post by MochiMoppel »

I sometimes wonder about these strange ...-BB-NOTUSED symlinks in bin and sbin directories. Apparently these link to busybox versions of otherwise equally named utilities. But why NOTUSED? Are their full-featured cousins, e.g. those contained in coreutils, always considered preferable?

I understand that BB's utilities are bare-bones, but what if one of them offers exactly the same options as its counterpart, or just the options that the user needs? Wouldn't it be preferable then to use BB? After all, at least in frugal installs, busybox is already running, and calling one of its functions seems so much more efficient than executing the often heavy coreutils binary.

Lately I tried to use the fold utility (coreutils version 8.19) to wrap Japanese text. Without any options, fold wraps text into columns 80 characters wide. When I piped the text through fold in gtkdialog, I received an error: Gtk-CRITICAL **: gtk_text_buffer_emit_insert: assertion g_utf8_validate (text, len, NULL) failed. Sometimes it worked OK, but most often it did not, and I couldn't find a pattern. I then turned to busybox fold and it never failed. It appears that the coreutils version can't handle UTF-8 properly while busybox can.
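
One plausible explanation for the intermittent failure, assuming this coreutils fold counts bytes rather than characters: cutting a UTF-8 string at a fixed byte offset can land in the middle of a multi-byte character, and the fragment left behind is no longer valid UTF-8, which is the kind of input g_utf8_validate rejects. The Python sketch below only illustrates that effect. The helper names fold_bytes and is_valid_utf8 are made up for this example and are not part of fold, busybox or GTK.

```python
# Sketch: what happens when text is folded by bytes instead of characters.
# fold_bytes and is_valid_utf8 are hypothetical helpers for illustration only.

def fold_bytes(data, width):
    """Cut a byte string into chunks of `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def is_valid_utf8(chunk):
    """Return True if the chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Eleven symbols (U+2602 to U+260C), each encoded as 3 bytes in UTF-8.
TEXT = "\u2602\u2603\u2604\u2605\u2606\u2607\u2608\u2609\u260a\u260b\u260c"
DATA = TEXT.encode("utf-8")

# Width 5 is not a multiple of 3, so chunks end in the middle of a symbol.
for chunk in fold_bytes(DATA, 5):
    print(chunk, "valid" if is_valid_utf8(chunk) else "BROKEN")

# Width 15 is 5 symbols x 3 bytes, so every cut falls on a character boundary.
for chunk in fold_bytes(DATA, 15):
    print(chunk.decode("utf-8"))
```

If this assumption holds, a byte-counting fold only produces valid output when every cut happens to fall on a character boundary, which would explain why it sometimes works and sometimes does not.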

Here is a test case with UTF-8 symbols instead of Japanese characters. This triggers a segmentation fault error and Leafpad will not run:

Code: Select all

echo '☂☃☄★☆☇☈☉☊☋☌

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#2 Post by musher0 »

Hi MochiMoppel.

Your reasoning is sound, and we developers should indeed use the most
efficient tool for the job.

Except that coreutils is "big" only if you choose to compile it in one chunk.

Please find attached a tree of my compilation of coreutils-8.27: there are
105 of them, the smallest being 14 Kb and the largest, 60 Kb.

I think twice about using BB utilities: some BB utils are so trimmed down
they are almost useless. For example, the less and the lsof replacements
offered by BB are really awful.

Good find, though, this fold utility. I normally use fmt for this purpose.

As to the designation "BB-NOTUSED", it's an "editorial decision" by BarryK,
inventor of PuppyLinux, and that's all it is.

BFN.
Attachments
coreutils-bin.sort.zip
(801 Bytes) Downloaded 95 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

Sailor Enceladus
Posts: 1543
Joined: Mon 22 Feb 2016, 19:43

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#3 Post by Sailor Enceladus »

MochiMoppel wrote:Here is a test case with UTF-8 symbols instead of Japanese characters. This triggers a segmentation fault error and Leafpad will not run:

Code: Select all

echo '☂☃☄★☆☇☈☉☊☋☌

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#4 Post by MochiMoppel »

Sailor Enceladus wrote:Using -w15 instead of -w5 made fold show up like busybox,
Are you really sure? Look closer. Here the result is plain wrong. Wraps after 5, not 15 characters. Busybox wraps after 15 characters (of course).

I can tell you what does work for me in both versions: removing 1 character from my string and running fold without any options, i.e. fold would have no function other than passing the string to Leafpad unchanged. Useless, but successful:

Code: Select all

echo '☂☃☄★☆☇☈☉☊☋☌

Sailor Enceladus
Posts: 1543
Joined: Mon 22 Feb 2016, 19:43

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#5 Post by Sailor Enceladus »

MochiMoppel wrote:
Sailor Enceladus wrote:Using -w15 instead of -w5 made fold show up like busybox,
Are you really sure? Look closer. Here the result is plain wrong. Wraps after 5, not 15 characters. Busybox wraps after 15 characters (of course).
Haha yes, that's what I meant. fold with -w15 gave me the same as busybox fold with -w5 for those symbols... :lol:

edit: Even unicode does "literal bytes" with the full fold using that syntax it seems, I had to use 10 to wrap by 5

Code: Select all

echo '
Attachments
capture25954.png
(36.86 KiB) Downloaded 348 times

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#6 Post by MochiMoppel »

Sailor Enceladus wrote:edit: Even unicode does "literal bytes" with the full fold using that syntax it seems, I had to use 10 to wrap by 5

Code: Select all

echo '

misko_2083
Posts: 114
Joined: Tue 08 Nov 2016, 13:42

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#7 Post by misko_2083 »

^ fold is useless here.
awk on the other hand could do the work

Code: Select all

echo '☂☃☄★☆☇☈☉☊☋☌

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#8 Post by musher0 »

Difference between fold and fmt, using a silly sentence:
Attachments
difference.jpg
(87.32 KiB) Downloaded 301 times

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#9 Post by MochiMoppel »

misko_2083 wrote:^ fold is useless here.
coreutils fold is useless here.
awk on the other hand could do the work
Sure, if you try hard enough you will always find a way to make simple things complicated :lol:

I could coerce pure bash to do the job, but what's the point?

Code: Select all

# Fold string $2 into chunks of $1 characters; c<${#2} avoids a trailing empty line
function foldme { for ((c=0; c<${#2}; c+=$1)); do echo "${2:$c:$1}"; done; }
foldme 5 '☃☄★☆☇☈☉☊☋☌

misko_2083
Posts: 114
Joined: Tue 08 Nov 2016, 13:42

Re: fold-BB-NOTUSED - IT SHOULD BE USED!

#10 Post by misko_2083 »

MochiMoppel wrote:
awk on the other hand could do the work
Sure, if you try hard enough you will always find a way to make simple things complicated :lol:
It depends on your perspective. Some people use a straw to drink yogurt. Some people use a spoon to eat soup.
The point is I like complications. :lol:

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#11 Post by MochiMoppel »

musher0 wrote:Difference between fold and fmt, using a silly sentence:
Different tools for different purposes produce different results ...

Coreutils' fmt can't split strings after a defined length and - again the coreutils show stopper - like fold it can't handle Unicode.

Folding lines at spaces is not my topic here. If needed busybox fold can do the job using the -s switch. Still this is folding and not formatting. Spaces are not replaced by newlines, they are preserved and may end up at line starts.

Code: Select all

# echo 'abcde fghijkl mnopqrstuvwxyz' | fmt -w20
abcde fghijkl
mnopqrstuvwxyz

# echo '☂☃☄★☆☇ ☈☉☊☋☌

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#12 Post by musher0 »

Hi, MochiMoppel.

Why is it then that fmt from coreutils works for the French language
with a LANG=fr_CA.utf8 environment?

Example -- Some news about violent winds, taken from Radio-Canada.ca:

Code: Select all

echo "Des vents très violents, possiblement une tornade, ont complètement détruit une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides, dimanche. Une journée chaude et humide où le temps instable a donné lieu à une série d'alertes de tornades et d'orages de la part d'Environnement Canada." | fmt -w 80
Result:
Des vents très violents, possiblement une tornade, ont complètement détruit
une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean,
ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides,
dimanche. Une journée chaude et humide où le temps instable a donné lieu à
une série d'alertes de tornades et d'orages de la part d'Environnement Canada.

Code: Select all

echo "Des vents très violents, possiblement une tornade, ont complètement détruit une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides, dimanche. Une journée chaude et humide où le temps instable a donné lieu à une série d'alertes de tornades et d'orages de la part d'Environnement Canada." | fmt -w 60
Result:
Des vents très violents, possiblement une tornade, ont
complètement détruit une résidence de la municipalité
d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi
qu'une autre habitation à Sainte-Anne-du-Lac, dans les
Laurentides, dimanche. Une journée chaude et humide où
le temps instable a donné lieu à une série d'alertes
de tornades et d'orages de la part d'Environnement Canada.
Perhaps Japanese is using utf16? (Sorry if I sound ignorant. I would not
know this sort of thing.)

~~~~~~~
Various remarks:
-- Isn't utf8 chosen (or not) for one's language when one sets up the Puppy?

-- coreutils can be compiled with a "disable-nls" parameter... This means
that the developer can choose to have all his compiled coreutils
completely ignore the utf8 environment.

-- if you want to accelerate a sort or do a sort without taking into account
utf8 characters, you set LC_ALL=C and you set LC_ALL="" back on when
finished.

This is a relatively well documented trick. It also works if you wish to
greatly speed up some section of a bash script, whether this section has
some data to sort or not.

-- there is a report about the "cut" utility from coreutils misbehaving in a
utf8 environment here:
https://unix.stackexchange.com/question ... -utf-aware

Thus hoping to contribute to the discussion although I do not know your
language.

Best regards.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#13 Post by MochiMoppel »

OK, let's change "can't handle Unicode" to "can't handle Unicode reliably". Does this make it any better?

UTF-8 doesn't care if you use French, Greek, Russian or Japanese. What makes the difference is the number of bytes it uses to represent each character set. For French you never needed Unicode. Extended ASCII could handle it and coreutils should have no problems with French even if it is not UTF-8 aware.

UTF-8 includes the basic 128 ASCII characters (1 byte per character), all of the former extended ASCII variants (incl. French!) and some more (2 bytes per character), all kinds of symbols and - from a Western point of view - "exotic" languages like Korean or Japanese (3 bytes per character), and lastly there are even 4-byte characters, e.g. less frequently used Japanese Kanji. I expect a text manipulating tool to handle all of these characters.
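
The byte lengths described above are easy to check in Python 3. This is just a small sketch; the helper name utf8_len and the sample characters are chosen for illustration and do not come from the thread.

```python
# Sketch: number of bytes UTF-8 needs for characters from different ranges.
# utf8_len is a hypothetical helper name used only in this example.

def utf8_len(ch):
    """Return how many bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


SAMPLES = [
    ("ASCII letter a", "\u0061"),              # basic ASCII
    ("French e with acute", "\u00e9"),         # former extended ASCII range
    ("umbrella symbol", "\u2602"),             # symbol
    ("Japanese kanji (day/sun)", "\u65e5"),    # common kanji
    ("rare kanji U+2000B", "\U0002000b"),      # CJK Extension B
]

for label, ch in SAMPLES:
    print(f"{label}: {utf8_len(ch)} byte(s)")
```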

fmt handles only 1- and 2-byte characters flawlessly. Take your example and change only 1 character to a symbol and you already might end up with an unexpected result.

Code: Select all

# echo "Des vents très violents , possiblement une tornade, ont complètement détruit une résidence de la municipalité" | fmt -w 53
Des vents très violents , possiblement une tornade,
ont complètement détruit une résidence de la
municipalité

# echo "Des vents très violents , possiblement une t☠rnade, ont complètement détruit une résidence de la municipalité" | fmt -w 53
Des vents très violents , possiblement une
t☠rnade, ont complètement détruit une résidence
de la municipalité
Now, to end the discussion about fmt, here is another reason why I need fold and not fmt: fmt only wraps at spaces. Doesn't help me since Japanese text doesn't include space characters.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#14 Post by musher0 »

Hello, MochiMoppel.

Many thanks for the detailed explanation of utf8.
I learned something today.

Again a couple of thoughts:
-- I don't think of Japanese as "exotic", only different. Different Civilizations
make this Planet richer.

-- I believe that you should bring this 3-character bug to the attention of
the authors of coreutils fold at the GNU Foundation. It seems obvious that
they don't have testers for the Japanese language whereas the BusyBox
people do.

BFN.

misko_2083
Posts: 114
Joined: Tue 08 Nov 2016, 13:42

#15 Post by misko_2083 »

In Python that would be trivial with the textwrap library.

Code: Select all

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import textwrap

strs = str("☎

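The listing above is cut off, so here is a minimal, self-contained sketch of what a textwrap-based fold might look like in Python 3. It is not misko_2083's original code: the sample string and the helper name fold_chars are assumptions for illustration. Note that textwrap counts characters (code points), not bytes and not display columns, and that it normalizes whitespace, so for text containing spaces it behaves more like fmt than like fold.

```python
# Sketch: character-based folding with the standard textwrap module (Python 3).
# fold_chars is a hypothetical helper name; the sample string is illustrative.
import textwrap


def fold_chars(text, width):
    """Split text into lines of at most `width` characters."""
    # break_long_words=True lets textwrap cut inside a run without spaces,
    # which is what is needed for symbol strings or Japanese text.
    return textwrap.wrap(text, width=width, break_long_words=True)


# Eleven symbols (U+2602 to U+260C) without any spaces.
SYMBOLS = "\u2602\u2603\u2604\u2605\u2606\u2607\u2608\u2609\u260a\u260b\u260c"

for line in fold_chars(SYMBOLS, 5):
    print(line)
```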

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#17 Post by MochiMoppel »

musher0 wrote: I believe that you should bring this 3-character bug to the attention of
the authors of coreutils fold at the GNU Foundation..
Bring to the attention?
bug-coreutils is already full of reports. The oldest I could find relates to version 5.94 and dates back 11 years!
misko_2083 wrote:by the way wc has the same behaviour in coreutils and busybox.
Good! Both seem to work fine :wink:
Counts only bytes
Don't be fooled by the odd nomenclature:
-l counts lines
-w counts words
-c counts ...no, not characters, it counts bytes
-m counts ...well, characters
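
The difference between the four counters can be mimicked in a few lines of Python 3. This is only a rough sketch of the idea, assuming UTF-8 input; wc_counts is a hypothetical helper and does not reproduce every detail of how wc defines lines and words.

```python
# Sketch: the four wc counters, applied to a Python string encoded as UTF-8.
# wc_counts is a hypothetical helper, not a reimplementation of wc.

def wc_counts(text):
    """Return counts roughly corresponding to wc -l, -w, -c and -m."""
    data = text.encode("utf-8")
    return {
        "lines": data.count(b"\n"),   # like -l: number of newline characters
        "words": len(text.split()),   # like -w: whitespace-separated words
        "bytes": len(data),           # like -c: bytes, not characters
        "chars": len(text),           # like -m: characters (code points)
    }


# A single telephone symbol (U+260E) is 1 character but 3 bytes.
print(wc_counts("\u260e"))
```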

Code: Select all

# echo -n "☎
Last edited by MochiMoppel on Tue 20 Jun 2017, 10:40, edited 1 time in total.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#18 Post by musher0 »

MochiMoppel wrote:
musher0 wrote: I believe that you should bring this 3-character bug to the attention of
the authors of coreutils fold at the GNU Foundation..
Bring to the attention?
bug-coreutils is already full of reports. The oldest I could find relates to version 5.94 and dates back 11 years!
(...)
My God! It's high time we topple that government! :lol:

What I mean is: someone who knows his stuff should rewrite the utility
with proper multi-byte support, and shame gnu.org with it.

Linuxians should not tolerate incompetence.
