Shutdown hang when CIFS shares are mounted and potential fix

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#1 Post by ldolse »

I've been having problems with many recent Puppy builds hanging during shutdown/reboot. I finally decided to track it down and found that it is related to rc.shutdown's handling of stray mounted CIFS filesystems. Specifically, it hangs on the final line of this block of code near the end of the shutdown:

Code: Select all

STRAYPARTL="`echo "$MNTDPARTS" |grep -v "/dev/pts" |grep -v "/proc" |grep -v "/sys" |grep -v "tmpfs" |grep -v "rootfs" |grep -v 'on / ' | grep -v "/dev/root" | grep -v "usbfs" | grep -v "unionfs" | grep -v "/initrd"`"
STRAYPARTD="`echo $STRAYPARTL | cut -f 1 -d " " | tr "\n" " "`"
for ONESTRAY in $STRAYPARTD
do
 echo "Unmounting $ONESTRAY..."
 #091117 weird bug, no processes but when run this, x restarts...
 xFUSER="`fuser -m $ONESTRAY 2>/dev/null`" #091117 do this first, seems to fix it.
I'm guessing fuser is hanging because the network was already taken down much earlier in rc.shutdown. I'm not sure why this is occurring lately - the last Puppy I used heavily in the same way was Turbopup, so either this bug has been around since change #091117 or perhaps there is something different about fuser and/or how it works with newer kernels - I've mostly (maybe always) been using dpups when I've seen this.


Here is my proposed fix:

Code: Select all

	# unmount network shares before taking down the network
	for MOUNTPOINT in `mount -l | grep ^// | cut -d ' ' -f 3`
	do
		umount -f $MOUNTPOINT
	done
That goes just before the lines in rc.shutdown where the network is taken down:

Code: Select all

	#100301 brought down below call to 'stop' service scripts, needed for lamesmbxplorer.
	#bring down network interfaces (prevents shutdown sometimes)...
	[ "`pidof wpa_supplicant`" != "" ] && wpa_cli terminate #100309 kills any running wpa_supplicant.
	if [ "`grep 'net-setup.sh' /usr/local/bin/defaultconnect`" = "" ];then #see connectwizard and connectwizard_2nd.
		for ONENETIF in `ifconfig | grep -E '^wifi[0-9]|^wlan[0-9]|^eth[0-9]' | cut -f 1 -d ' ' | tr '\n' ' '`
 		do
			ifconfig $ONENETIF down 2> /dev/null
 			[ "`iwconfig | grep "^${ONENETIF}" | grep "ESSID"`" != "" ] && iwconfig $ONENETIF essid off #100309
 			dhcpcd --release $ONENETIF 2>/dev/null #100309
		done
	else
		/etc/rc.d/rc.network stop
	fi
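
For reference, a minimal sketch of the same unmount-first idea in a form that also copes with share names or mount points containing spaces. It assumes that mount prints lines in the usual "device on mountpoint type fstype (options)" form; network_mountpoints is a hypothetical helper name used for illustration, not something that exists in rc.shutdown.

```shell
# Hypothetical helper (not part of rc.shutdown): read `mount` output on stdin
# and print the mount point of every entry whose device starts with '//'.
# Assumes the "device on mountpoint type fstype (options)" line format.
network_mountpoints() {
    sed -n 's|^//.* on \(.*\) type [^ ]* (.*)$|\1|p'
}

# Intended use, before the network interfaces are brought down:
#   mount | network_mountpoints | while IFS= read -r MOUNTPOINT
#   do
#       umount -f "$MOUNTPOINT"
#   done
```

Reading the mount points line by line and quoting "$MOUNTPOINT" avoids the word splitting that a plain cut on spaces is subject to.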

gcmartin

#2 Post by gcmartin »

Thanks. I had been using a script to umount those LAN mounts. Hope this is seen elsewhere.

BarryK
Puppy Master
Posts: 9392
Joined: Mon 09 May 2005, 09:23
Location: Perth, Western Australia

#3 Post by BarryK »

ldolse,
One problem, you are showing a code snippet from an old version of rc.shutdown.

You should get the rc.shutdown out of the latest Woof, or a Puppy built from recent Woof, such as Racy, Wary or Slacko.

The particular section of code that you have shown now looks like this:

Code: Select all

#091117 110928 if partition mounted, when choose shutdown, pc rebooted. found that param given to fuser must be mount-point, not /dev/*...
STRAYPARTL="`echo "$MNTDPARTS" | grep ' /mnt/' |grep -v -E '/dev/pts|/proc|/sys|tmpfs|rootfs|on / |/dev/root|usbfs|unionfs|aufs|/initrd'`"
STRAYPARTD="`echo "$STRAYPARTL" | cut -f 1 -d ' ' | tr '\n' ' '`"
STRAYMNT="`echo "$STRAYPARTL" | cut -f 3 -d ' ' | tr '\n' ' '`"
for ONESTRAY in $STRAYMNT
do
 #echo "`eval_gettext \"Unmounting \\\${ONESTRAY}...\"`"
 echo "Unmounting $ONESTRAY..."
 xFUSER="`fuser -m $ONESTRAY 2>/dev/null`"
 [ "$xFUSER" != "" ] && fuser -k -m $ONESTRAY 2>/dev/null
 killzombies #v3.99
 sync
 umount -r $ONESTRAY
done
...which might perhaps have solved the problem. It did solve another shutdown problem.
[url]https://bkhome.org/news/[/url]

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#4 Post by ldolse »

I'll give the latest rc.shutdown a shot. I see that the call to "fuser -m" is still in the version you pasted, and that was the line causing the issue. Anyway, I will test with Racy/Wary and confirm again, since I see that you're passing slightly different info to fuser.

I hadn't tried the very latest Woof because for some reason it wasn't building a working dpup for me, and I hadn't got around to figuring out why. I've also been making changes to rc.shutdown and other scripts for my puplet, so each time I sync to a new Woof I've got to manually merge those changes. Will Woof's version control system actually let me branch changes à la Bazaar/git? I haven't tested that, so I wasn't sure.

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#5 Post by ldolse »

Ok, just tested. With the change to rc.shutdown that you pasted, it now converts //<hostname>/<sharename> to the actual mount point and passes the mount point to fuser, per the changelog. However, it still hangs on fuser.

If I let it sit for several minutes (roughly 5 or so) it seems like fuser eventually gives up and the shutdown continues/completes, which is another reason why I think the network being down is likely the reason fuser fails, and the timeout is just really long.

gcmartin

#6 Post by gcmartin »

That's one of the issues encountered at shutdown. Sometimes either the remote host is down or the LAN is down, and then some force is required for a smooth shutdown. When everything (LAN) is in place, the shutdown script processes without issue. But ...

Hmmm.... how best to handle this condition..... (AND if so, could there be a script/menu/network option to force umount of remote/all resources when necessary?) This way, no matter whether the need arises when running or during shutdown, the remote resources could be "un-connected" from the running system.

Hope this helps

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#7 Post by ldolse »

The shutdown script intentionally shuts down the LAN early because on some hardware variants it apparently can cause a hang if the network interfaces aren't explicitly taken down - this behavior has been around forever.

The command calling fuser (which is what hangs) was added toward the end of 2009 to fix an unrelated problem. I don't believe the unmount itself is the issue; I think it's the attempt to gather information about a mountpoint whose network connection is already down. Puplets using shutdown scripts from before this fix don't exhibit the hang - e.g. Turbopup and likely other 4.2 variants.

The fix I proposed is to unmount network shares before taking down the network. This allows both original fixes to operate for their respective purposes and eliminates this behavior which is apparently a regression that's been around for quite some time. An alternate fix would be to exclude any mountpoint starting with '//' from being passed to the fuser command.
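
The alternate fix could look roughly like the sketch below: read "device on mountpoint ..." lines and only hand local mount points on to fuser. Both function names are made up for illustration, and the input is assumed to be in the format printed by mount.

```shell
# Hypothetical helpers, not taken from rc.shutdown.

# Succeeds if the device name looks like a network share (//host/share).
is_network_share() {
    case "$1" in
        //*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Read `mount`-style lines on stdin and print only those mount points
# whose device is NOT a network share, i.e. the ones assumed safe for fuser.
fuser_candidates() {
    while read -r device _on mountpoint _rest
    do
        is_network_share "$device" || printf '%s\n' "$mountpoint"
    done
}

# Example of use:
#   mount | grep ' /mnt/' | fuser_candidates
```

Network shares are simply skipped here; they would still need to be unmounted by some other step.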

I only noticed it because I use network shares pretty religiously, and this is one of the main snags I hit when migrating from Turbopup to the latest and greatest puplets.

BarryK
Puppy Master
Posts: 9392
Joined: Mon 09 May 2005, 09:23
Location: Perth, Western Australia

#8 Post by BarryK »

Ok, I have implemented something, your suggested alternative solution. See if that does the job. Attached.
Attachments
rc.shutdown.gz
(8.76 KiB) Downloaded 603 times

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#9 Post by ldolse »

Thanks, I gave it a shot and I've got good news and bad news.

The good news is the update allows shutdown to get past the fuser command, so that particular change works as expected.

The bad news is it still hangs, it just hangs later on the final command in the shutdown script:

Code: Select all

busybox umount -ar > /dev/null 2>&1
That command hasn't changed at all since the 4.2 days, but I believe busybox has changed several times. The dpup I'm using is running busybox version 1.17.2. It hangs for 5 minutes again, just like the hang with fuser. I'm not sure if busybox can be called in a different way, or if unmounting the network shares before taking down the network is the only way to get rid of the hang. I tried adding -f and -l just to see if it helped; no dice.

ldolse
Posts: 367
Joined: Fri 23 Oct 2009, 16:33

#10 Post by ldolse »

I take it back about the problem not existing in 4.2 puppies, but it's not nearly as pronounced/obvious. I just re-tested with Turbopup to get a better idea of what happens, as the older code executes a umount for each ONESTRAY item without running fuser, so in those older scripts the share would have been unmounted before getting to the final busybox umount command.

It turns out the hang actually seems to be present when the umount command is executed here as well, but it's not as obvious because the network timeout is much lower, maybe only around 1 minute (edit: probably only 30 seconds, just re-tested). I'm not sure where the timer is defined, whether that's a lower-level kernel function or what.

BarryK
Puppy Master
Posts: 9392
Joined: Mon 09 May 2005, 09:23
Location: Perth, Western Australia

#11 Post by BarryK »

ldolse wrote:Thanks, I gave it a shot and I've got good news and bad news.

The good news is the update allows shutdown to get past the fuser command, so that particular change works as expected.

The bad news is it still hangs, it just hangs later on the final command in the shutdown script:

Code: Select all

busybox umount -ar > /dev/null 2>&1
That command hasn't changed at all since the 4.2 days, but I believe busybox has changed several times. The dpup I'm using is running busybox version 1.17.2. It hangs for 5 minutes again, just like the hang with fuser. I'm not sure if busybox can be called in a different way, or if unmounting the network shares before taking down the network is the only way to get rid of the hang. I tried adding -f and -l just to see if it helped; no dice.
Ok, I have put in your original solution. See attached.
Attachments
rc.shutdown.gz
(8.86 KiB) Downloaded 604 times

Karl Godt
Posts: 4199
Joined: Sun 20 Jun 2010, 13:52
Location: Kiel,Germany

#12 Post by Karl Godt »

--- /root/my-documents/tmp/rc.shutdown1 2011-12-27 11:43:00.000000000 +0100
+++ /root/my-documents/tmp/rc.shutdown2 2011-12-27 11:43:36.000000000 +0100
@@ -57,6 +57,7 @@
#110928 fixed, reboots when choose shutdown. very old bug, dates back to 2009.
#110928 modified i18n conversion, only for echo to /dev/console.
#111106 do not execute fuser if network share mount.
+#111107 ldolse: unmount network shares before taking down the network

#110923
. /usr/bin/gettext.sh # enables use of eval_gettext (several named variables) and ngettext (plurals)
@@ -180,6 +181,13 @@ if [ "$ACTIVE_INTERFACE" ];then
fi
fi

+#111107 ldolse: unmount network shares before taking down the network
+#(see 111106, need to do it sooner, but 111106 will remount read-only if failed to umount here)
+for MOUNTPOINT in `mount | grep '^//' | cut -d ' ' -f 3 | tr '\n' ' '`
+do
+ umount -f $MOUNTPOINT
+done
+
#v2.16 some packages have a service script that requires stopping...
for service_script in /etc/init.d/*
do
The above is the diff between the two rc.shutdown versions.
I cannot say anything about CIFS and network shares.

#

BUT

rc.shutdown has got a new bug in the list of stray partitions (STRAYPARTandMNT):

due to

Code: Select all

STRAYPARTandMNT="`echo "$STRAYPARTL" | cut -f 1,3 -d ' ' | tr ' ' '|' | tr '\n' ' '`"
the list would look like
"
/dev/sda1|/mnt/sda1
/dev/sda2|/mnt/sda2
/dev/sda3|/mnt/sda3"


The code goes further with

Code: Select all

for ONESTRAY in $STRAYPARTandMNT
do
 FLAGCIFS="`echo -n ${ONESTRAY} | grep '^//'`"
 ONESTRAYMNT="`echo -n ${ONESTRAY} | cut -f 2 -d '|'`"
 #echo "`eval_gettext \"Unmounting \\\${ONESTRAY}...\"`"
 echo "Unmounting $ONESTRAY..."
 if [ "$FLAGCIFS" = "" ];then
  xFUSER="`fuser -m $ONESTRAY 2>/dev/null`"
  [ "$xFUSER" != "" ] && fuser -k -m $ONESTRAYMNT 2>/dev/null
 fi
 killzombies #v3.99
 sync
AND

Code: Select all

 umount -r $ONESTRAY
done
The problem is that ONESTRAY becomes "/dev/sda1|/mnt/sda1",
which, as far as I know, cannot work as an argument to fuser or umount.

ONESTRAY should be either

"/dev/sda1" OR
"/mnt/sda1" OR probably
"/dev/sda1 /mnt/sda1" #not tested this third possibility

NOT /dev/sda1|/mnt/sda1 !!!

The problem is that the delimiter becomes a '|' bar, not '[[:space:]]'.

I don't think that a directory or file named /dev/sda1|/mnt/sda1 exists in the /dev/ directory.
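
A small sketch of how such a "device|mountpoint" pair could be taken apart again with plain shell parameter expansion, so that each command receives a real path. pair_device and pair_mountpoint are hypothetical helper names used only for illustration; they are not in rc.shutdown.

```shell
# Hypothetical helpers: split a "device|mountpoint" pair as built by
#   cut -f 1,3 -d ' ' | tr ' ' '|'
pair_device() {
    printf '%s\n' "${1%%|*}"     # everything before the first '|'
}

pair_mountpoint() {
    printf '%s\n' "${1#*|}"      # everything after the first '|'
}

# Example of use inside the loop:
#   ONESTRAYMNT="`pair_mountpoint "$ONESTRAY"`"
#   umount -r "$ONESTRAYMNT"
```

This assumes that neither the device name nor the mount point itself contains a '|' character.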

[UNDER CONSTRUCTION]
[after cleaned up i think i will provide a correct diff in some time]

SOLUTION :
#diff -up /mnt/+JUMP-10+puppy_slacko_5.3.1.sfs/etc/rc.d/rc.shutdown /etc/rc.d/rc.shutdown

Code: Select all

--- /mnt/+JUMP-10+puppy_slacko_5.3.1.sfs/etc/rc.d/rc.shutdown	2011-12-10 08:06:11.000000000 +0100
+++ /etc/rc.d/rc.shutdown	2011-12-26 22:39:32.000000000 +0100
@@ -523,14 +523,14 @@ do
  FLAGCIFS="`echo -n ${ONESTRAY} | grep '^//'`"
  ONESTRAYMNT="`echo -n ${ONESTRAY} | cut -f 2 -d '|'`"
  #echo "`eval_gettext \"Unmounting \\\${ONESTRAY}...\"`"
- echo "Unmounting $ONESTRAY..."
+ echo "Unmounting $ONESTRAY..." >/dev/console
  if [ "$FLAGCIFS" = "" ];then
   xFUSER="`fuser -m $ONESTRAY 2>/dev/null`"
   [ "$xFUSER" != "" ] && fuser -k -m $ONESTRAYMNT 2>/dev/null
  fi
  killzombies #v3.99
  sync
- umount -r $ONESTRAY
+ umount -r $ONESTRAYMNT
 done
 
 swapoff -a #works only if swaps are in mtab or ftab 
@@ -539,7 +539,7 @@ STRAYPARTD="`cat /proc/swaps | grep "/de
 for ONESTRAY in $STRAYPARTD
 do
  #echo "`eval_gettext \"Swapoff \\\${ONESTRAY}\"`"
- echo "Swapoff $ONESTRAY"
+ echo "Swapoff $ONESTRAY" >/dev/console
  swapoff $ONESTRAY
 done
 sync
NOTE : THE IMPORTANT PART is

- umount -r $ONESTRAY
+ umount -r $ONESTRAYMNT

[edit]
There is still
xFUSER="`fuser -m $ONESTRAY 2>/dev/null`"
which should also become
xFUSER="`fuser -m $ONESTRAYMNT 2>/dev/null`"

I also altered
ZOMBIES="`ps -H -A | grep '<defunct>' | sed -e 's/  /|/g' | grep -v '|||' | cut -f 1 -d ' ' | tr '\n' ' '`"
TO

Code: Select all

 ZOMBIES="`ps -H -A | grep '<defunct>' | sed 's/^[[:blank:]]*//g' | cut -f 1 -d ' ' | tr '\n' ' '`"
[edit2]

Code: Select all

 ZOMBIES="`ps -H -A | grep '<defunct>' | sed 's/^[[:blank:]]*//g' | cut -f 1 -d ' ' | sort -gr | tr '\n' ' '`"
which would kill all zombies with the highest pids first
[/edit2]
because i was getting a bunch of "kill: : arguments must be process or job ID" on the screen by the killzombies function
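
As an alternative sketch: awk splits on runs of blanks, so leading padding in the ps output does not matter and the first field is always the pid. zombie_pids is a hypothetical name; it only covers pid extraction and ordering, and makes no attempt at the parent filtering of the original killzombies function.

```shell
# Hypothetical helper: read `ps -H -A`-style output on stdin and print the
# pids of all <defunct> entries, one per line, highest pid first.
zombie_pids() {
    awk '/<defunct>/ { print $1 }' | sort -rn
}

# Example of use:
#   ZOMBIES="`ps -H -A | zombie_pids | tr '\n' ' '`"
```

Because the pid is taken as the first blank-separated field, no empty or '|'-prefixed values can end up in the list.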

[edit2]
AND in the ABSPUPHOME part: when ABSPUPHOME is empty,

ABSPUPHOME=""
if [ "`busybox mount | grep "$ABSPUPHOME"`" != "" ];then

the grep would always match everything left mounted, like /proc AND /sys, because an empty pattern matches every line.

AND
BADPIDS="`fuser -m $ABSPUPHOME 2>/dev/null`"
would be something like
BADPIDS="`fuser -m '' 2>/dev/null`"
Because the error output of fuser (its usage message) is redirected to /dev/null,
this would be harmless,
BUT
the killzombies function would also run again.
[/edit2]
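
To illustrate the empty-variable case, a minimal sketch of a guard. is_mounted_under is a hypothetical helper, not taken from rc.shutdown; it takes the path as its argument and expects mount output on stdin.

```shell
# Hypothetical helper: succeed only if the given non-empty path occurs
# in the `mount` output read from stdin.
is_mounted_under() {
    # An empty pattern would make grep match every line, so refuse it.
    [ -n "$1" ] || return 1
    # -F treats the path as a fixed string rather than a regular expression.
    grep -F -q -- "$1"
}

# Example of use:
#   if busybox mount | is_mounted_under "$ABSPUPHOME"; then
#       echo "something is still mounted under $ABSPUPHOME"
#   fi
```

With this guard an empty ABSPUPHOME skips the whole block instead of matching /proc, /sys and everything else.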

[/edit]

[edit2]
OK: the killzombies function is only meant to pick up parent-less zombies.

First minor problem:
busybox init --help
Init is the parent of all processes

If I assume that what is meant is not the top-level parents like busybox init OR
"||2 ?||||00:00:00 kthreadd"
BUT
the second-level parents
like
"|165 ?||||00:00:00| udevd"

then these would not have been killed, because they are filtered out by grep -v '|||'.

HERE

" 3102 tty1|| 00:00:00| xwin"
" 3236 tty1|| 00:00:00|| xinit"
" 3237 tty4|| 00:01:53||| X"
" 3288 tty1|| 00:00:01||| jwm"
" 3362 tty1|| 00:00:15|||| pup_event_front"
"18890 tty1|| 00:00:00||||| sleep"
" 3353 tty1|| 00:00:00|||| jwm <defunct>"

only xwin and xinit would have been picked up.

cut -f 1 -d ' ' would return an empty field instead of "3102" OR "3236", because of the leading space.

I have other pids that would become like

"||5 ?||||00:00:00| kworker/u:0"
"| 11 ?||||00:00:00| khelper"
"|165 ?||||00:00:00| udevd"

Without looking for the '||||' which would've been filtered by grep -v '|||'
cut -f 1 -d " " would assign "||5" " " "|165" to the list of ZOMBIES to be killed .

Here the output of the kill command for '||5' :

kill '||5'
bash: kill: ||5: arguments must be process or job IDs

Now I've tinkered around with the ps output, which is somewhat unusable, like that of cat -n: nice for the eye but awkward to use in shell scripts:

Code: Select all

ZOMBIES="`ps -H -A |sed 's/^[[:blank:]]*//g;s/\([0-9]*\)\ \([[:alnum:][:punct:]]*\)\([[:blank:]]*\)\(.*\)/\1 \2 \4/g' | grep '<defunct>' | sed -e 's/  /|/g' | grep -v '|' | cut -f 1 -d ' ' | tr '\n' ' '`"
My explanation :
ps -A -H for
-A all processes including the ? tty ie session leaders
-H process hierarchy
sed 's/^[[:blank:]]*//g
because using cut -f 1 -d ' ' later, instead of awk '{print $1}', would probably not pick up a pid but an empty field
s/\([0-9]*\)\ \([[:alnum:][:punct:]]*\)\([[:blank:]]*\)\(.*\)/\1 \2 \4/g'
should translate the formatted spaces between
19973 pts/6 00:00:00 ps-FULL
3236 tty1 00:00:00 xinit
into
22305 pts/6 00:00:00 ps-FULL
3236 tty1 00:00:00 xinit
by ignoring the group 3 \([[:blank:]]\) leaving everything behind this group unformatted (group 4) .
The hierarchy output uses two spaces per level to show the hierarchy steps.
These two spaces would now be edited by sed -e 's/  /|/g' like in the original.
Now every line not containing a bar would be a parent:
"3236 tty1 00:00:00 xinit"
"3288 tty1 00:00:00| jwm"
"3362 tty1 00:00:07|| pup_event_front"
"24494 tty1 00:00:00||| sleep"
"2581 tty1 00:00:00|| jwm <defunct>"

In the above output "jwm <defunct>" would not have been killed, because the two bars '||' would indicate two parents.
[/edit2]

[/UNDER CONSTRUCTION]
