"ssh | dd" pipeline breaks for cron-initiated job (Solaris 2.6)
David Wolfskill
david at catwhisker.org
Sun Jul 16 19:41:02 PDT 2006
[Caveat: long message: about 16 kB -- dhw]
OK; this is a little off-the-wall, and I realize that Solaris 2.6 is not
exactly cutting-edge technology. Trust me: the SS5/170 on which it
runs is also not cutting-edge technology. :-) And this is at home,
rather than ${paying_job}; I'm sure we're all familiar with adages about
"shoemaker's children" and all that, so there's little to be gained by
rehashing that sort of thing (IMO).
In the process of chasing this problem down, I have discovered a thing
or two I found not-so-intuitive, and sharing them here is about the
only way I have to thank those with the patience to wade through
the below prose. :-}
Summary: local "dd" process in an "ssh ${host} ... | dd ... >/dev/rmt/0n"
pipeline:
* (rarely, but) sometimes works
* sometimes hangs (and may thus be killed)
* often locks up, using >99% CPU so effectively that it is unkillable
and "reboot" doesn't reboot it; I need to resort to forcibly power-cycling
the machine to stop it and allow normal work to take place
on my SPARCstation 5/170 running Solaris 2.6 -- but *only* when the
process is invoked via "cron."
Whenever I invoke it "by hand," it works flawlessly.
I'd appreciate some help fixing the problem.
Yes, I am aware that "cron" runs things with a rather restricted (and
often different) environment; also, different shells tend to behave
differently. More on this below.
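For what it's worth, a quick way to see exactly what cron hands a job --
for comparison with an interactive login -- is a throwaway crontab entry
along these lines (the schedule and file names here are made up for
illustration):

    # Hypothetical one-off crontab entry for the "backup" user: dump the
    # environment cron provides, for later comparison.
    0 3 * * * /usr/bin/env | /usr/bin/sort > /tmp/cron.env 2>&1

    # Then, from an interactive "sudo su - backup" session:
    env | sort > /tmp/login.env
    diff /tmp/cron.env /tmp/login.env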
Further, this was never a problem until I migrated the OpenSSH running
on the SS5 from 2.5.1p2 to 4.3p2. Yes, I've checked for reported
problems with pipelines hanging under OpenSSH-4.3p2, and I've tried
applying a couple of the patches attached to the PR I found;
neither appears to have had the slightest effect on the observed
symptoms.
Historical background: The subject process is a Perl script (that
originally dates from late 1994) that runs in the wee small hours of the
morning to create and store backup images and log what happens. Its
design (such as it is) involves 3 categories of machine:
* 1 "controller" -- this is the machine where the Perl script runs,
kicked off by cron, and where logs are written.
* 1 "tape host" -- this is the machine where the tape drive is. In the
current scenario, the above are the same machine -- the SS5 running
Solaris 2.6.
* >0 "client hosts" -- the machines that house the file systems to be
backed up. In the current scenario, the SS5 is one of these, as are a
couple of x86-arch boxen running FreeBSD (4.11 & 6.1).
The original version of the script used "rsh" for accessing the "remote"
hosts; I didn't migrate to using OpenSSH until early 2002. At that
time, the version of OpenSSH I installed on the SS5 was 2.5.1p2. I
didn't know a huge amount about SSH at the time, and allowed nearly
everything to use default values -- so protocol 1 was the protocol used.
I did set things up to use PK ("public key") authentication (without
a passphrase), since the exercise was intended for unattended
operation.
It has always used a non-root user for its execution; the name of that
user in this environment is "backup."
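(For context, the unattended-operation setup is just the usual
passphrase-less key arrangement -- with protocol 2, roughly the following,
run as "backup"; the exact commands aren't the point here:)

    # Generate a key with an empty passphrase for unattended use:
    ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
    # ...then append $HOME/.ssh/id_rsa.pub to ~backup/.ssh/authorized_keys
    # on each client host.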
The normal mode of operation is to create level 0 "{,ufs}dump" images
early Monday morning; the other days of the week, it creates level
5 "{,ufs}dump" images (based on the level 0 images).
In operation, the script takes a list of hosts; for each host, in the
specified order, it:
* Queries the host (via "uname -s -r") as to what it's running.
* Based on the results of the above, determines:
* A program to use to display the current capacity of each mounted
file system, selecting only those file systems that are backed
by disk partitions (vs. swap space, for example).
* A program to use to create a backup image at the desired level,
as well as any flags to use when invoking it. (FreeBSD 5.x and
above, for UFS2 file systems, support "snapshots," and the -L
flag for dump in such systems causes a snapshot to be created and
the dump to be run from the snapshot. This is useful for file
systems that are currently mounted as writable.)
* For full backups, determines some commands (e.g. "sysinfo
-level all") to run in order to document the hardware configuration,
to the extent that that is feasible.
Note that the script is generally written to use complete pathnames for
the above commands, as well as most others (possibly all; I haven't
checked thoroughly recently).
* Queries the host to find the currently-mounted, disk-partition-backed
file systems.
* For each such file system found:
* Create a backup image (using SSH to access the host unless the
current host happens to be the "controller" host), directing the
resulting stream of bytes to standard output (thus, incidentally,
avoiding a lot of problems with {,ufs}dump guessing the capacity of
the tape).
* Pipe the standard output of the above to a "dd" process on the tape
host, the output of the dd process being directed to the (no rewind
on close) tape device, specifying a blocksize of 32 kB. (More on
this below; a sketch of the pipeline's shape appears just after this
list.)
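For reference, the shape of that per-filesystem pipeline is roughly as
follows (a sketch; $host, $level, and $device stand in for values the
Perl script fills in, and the real command lines appear in the log
excerpts further down):

    # Per-filesystem pipeline for a "remote" client host (the dd block-size
    # operands are the subject of much of what follows):
    ssh $host "/sbin/dump -${level}uf - $device" \
        | /bin/dd of=/dev/rmt/0n bs=32k
    # When the client host is also the controller/tape host, the dump runs
    # locally and is piped straight into dd, with no ssh involved.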
There are the usual reality checks on the results, with provisions for
re-doing things, possibly after switching tapes.
One of the "client hosts" was the box to which SSH connection requests
from the Internet were being forwarded; since (at the time) I was still
allowing password authentication, I restricted the allowed sources of
SSH connection requests to "believed to be friendly" IP ranges on the
packet filter.
A few months ago, my wife drove up into Oregon & Idaho; I used this as a
catalyst to suggest to her that she might be able to check her email
from where she was if she were to migrate from using password
authentication to using PK authentication. (Each of us reads mail by
logging in to one of the boxen either directly or via SSH.) After
some discussion, she agreed to do that.
Once that appeared to work, I set about:
* disallowing password authentication (finally!), permitting only PK
authentication;
* removing the packet filter restriction on SSH requests;
* upgrading the OpenSSH-2.5.1p2 on the SS5 to 4.3p2;
* ensuring that the OpenSSH configuration on each machine used SSH
protocol 2 as a server, and preferred protocol 2 as a client (a
configuration sketch follows this list).
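The relevant settings amount to roughly this (a sketch, not a verbatim
copy of my configuration files):

    # sshd_config (server side) on each machine:
    Protocol 2
    PasswordAuthentication no
    PubkeyAuthentication yes

    # ssh_config (client side), preferring protocol 2:
    Protocol 2,1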
My (interactive) tests all appeared to work as expected; I was
reasonably pleased with the outcome at this point.
The following morning, I found that the backup of the first file system
on the first-specified remote host had failed with a
write: Invalid argument
and
select: Bad file number
Huh??!? This script has -- modulo a few minor tweaks here & there --
been running Just Fine for over a flaming DECADE! What could cause such
breakage?
I spent a couple of weeks "banging my head against" that wall, trying
all kinds of evasive maneuvers -- to no avail (of course).
I couldn't even determine with any semblance of assurance which program
was issuing those messages, or on which machine.
I finally(!) re-re-reviewed the Solaris 2.6 man page for "dd," and found
something rather counter-intuitive:
When dd reads from a pipe, using the ibs=X and obs=Y
operands, the output will always be blocked in chunks of
size Y. When bs=Z is used, the output blocks will be
whatever was available to be read from the pipe at the time.
Now, the reason I noticed this is that up to the change I made after
reading that passage, the "dd" invocation had used "bs=32k" with the
intent to specify that both input and output block sizes were to be 32
kB.
And I now realized that the whines about "write: Invalid argument" were
from dd on the SS5, complaining (bitterly) when told to write to the
tape device with an oddball number of bytes.
Accordingly, I changed the "dd" invocation to specify "ibs=32k obs=32k,"
and ensured that "{,ufs}dump" invocations wrote 32 kB records.
As a result, those errors stopped occurring.
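To spell out the difference as I now understand it (a sketch, not lifted
verbatim from the script):

    # With bs=, dd writes whatever each read() from the pipe happened to
    # return, so the tape device can be handed oddly-sized blocks:
    ssh $host "/sbin/dump -5uf - -b32 $device" \
        | /bin/dd of=/dev/rmt/0n bs=32k            # output block sizes vary

    # With ibs=/obs=, dd re-blocks its output into obs-sized chunks:
    ssh $host "/sbin/dump -5uf - -b32 $device" \
        | /bin/dd of=/dev/rmt/0n ibs=32k obs=32k   # output blocks are 32 kB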
The problems that then surfaced, however, gave me reason to wax
nostalgic for the above-described problem. Here's a cut/paste of an RCS
log entry for the next salient change I made to the script; I think it
expresses things fairly well:
----------------------------
revision 1.29
date: 2006/06/02 11:40:37; author: david; state: Exp; lines: +1 -1
Change the PATH to be exactly the same directories, in exactly the same
order, as we get via "sudo su - backup; exec tcsh; echo $path" -- because
once I do that, running the script from the command line always works,
while the script (as run from cron) always fails on the first or second
"remote" file system. Usually, the mode of failure involves a "dd"
process on the tape host consuming CPU like there's no tomorrow, and
being unkillable -- to the point that while I can issue commands (VERY
slowly), a "reboot" has no effect after waiting 15 minutes.
I have been resorting to physically powering off the machine, then
relying on the fsck to clean things up -- and there's no polite way
to express just how undesirable that is.
I think I'm getting desperate.
----------------------------
That didn't help. Neither did tweaking the input block size for dd.
Nor did "conv=sync."
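For the record, since Solaris cron does not (as far as I can tell) allow
setting environment variables in the crontab itself, doing the equivalent
from outside the script would look something like this hypothetical
wrapper (the actual change was made inside the Perl script):

    #!/bin/sh
    # Hypothetical cron wrapper: pin PATH to the interactive value, then
    # exec the real backup script.
    PATH=/usr/local/sbin:/usr/local/bin:/usr/ucb:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/etc/backup
    export PATH
    exec /usr/local/etc/backup/script -h /usr/local/etc/backup/hostslist  # other flags elided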
I recently augmented the "debugging" flag's output to include a dump
of the %ENV hash; here's what it reports when the script is invoked
via cron:
Environment:
HOME=/usr/local/etc/backup
LOGNAME=backup
PATH=/usr/local/sbin:/usr/local/bin:/usr/ucb:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/etc/backup
SHELL=/usr/bin/sh
TZ=US/Pacific
Here's what it reports when I invoke the script interactively from the
Bourne shell (which is the login shell for the user in question), via
"sudo su - backup":
Environment:
HOME=/usr/local/etc/backup
LOGNAME=backup
PATH=/usr/local/sbin:/usr/local/bin:/usr/ucb:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/etc/backup
SHELL=/bin/sh
TERM=xterm
USER=backup
Finally, here's what it reports when I invoke the script interactively
from tcsh, which is the way I normally invoke it for a restart (via
"sudo su - backup" followed by "exec tcsh"):
Environment:
GROUP=nogroup
HOME=/usr/local/etc/backup
HOST=pogo
HOSTTYPE=sun4
LOGNAME=backup
MACHTYPE=sparc
OSTYPE=solaris
PATH=/usr/local/sbin:/usr/local/bin:/usr/ucb:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/etc/backup
PWD=/usr/local/etc/backup
REMOTEHOST=bunrab.catwhisker.org
SHELL=/bin/sh
SHLVL=1
TERM=screen
USER=backup
VENDOR=sun
At this stage, I have been unable to re-create the failure symptoms if I
invoke the script interactively. I always get a failure -- though not
always on the first remote file system; sometimes it's the 2nd -- when
the script is invoked via cron. There does not appear to be an issue of
interactive use allowing reporting & recovery from an error condition;
it Just Works.
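For completeness, one way to approximate cron's environment from an
interactive session, using the values from the cron %ENV dump above (a
sketch; it still won't reproduce cron's lack of a controlling tty):

    # Run from an interactive "sudo su - backup" session; "env -" clears
    # the environment and substitutes exactly what cron provides.  Other
    # script flags elided.
    env - HOME=/usr/local/etc/backup LOGNAME=backup \
        PATH=/usr/local/sbin:/usr/local/bin:/usr/ucb:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/etc/backup \
        SHELL=/usr/bin/sh TZ=US/Pacific \
        /usr/local/etc/backup/script -h /usr/local/etc/backup/hostslist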
Here's a sample script output from a failing attempt:
===========================================================
##### # # # # ##### ## #####
# # # # ## # # # # # # #
##### # # # # # # # # # #####
# # # # # # # ##### ###### # #
# # # # # ## # # # # # #
##### #### # # # # # # #####
bunrab runs "FreeBSD 4.11-STABLE"
Using "/bin/df -kt ufs" for filesystem query on bunrab
Using "/sbin/dump -" to perform the dump(s) for bunrab
Backups for bunrab on Sun Jul 16 03:08:21 PDT 2006
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s2a 161262 43006 105356 29% /
/dev/ad0s2e 1903014 747254 1003520 43% /usr
/dev/ad0s3e 2064302 292154 1607004 15% /var
/dev/ad0s3f 25286020 2633974 20629166 11% /common
Dump level for bunrab: 5
/dev/ad0s2a 161262 43006 105356 29% / ...
dev=/dev/ad0s2a; kb=161262; used=43006; avail=105356; cap=29%; mount=/; fudged=51607.2
Starting dump of /dev/ad0s2a at Sun Jul 16 03:08:33 PDT 2006
Vendor 'ARCHIVE ' Product '4586XX 28887-XX' tape drive:
sense key(0x0)= No Additional Sense residual= 0 retries= 0
file no= 95 block no= 0
Dumping /dev/ad0s2a on tape 1, file 95 (bunrab:/)
Trying to issue "ssh bunrab /bin/df -kt ufs /"
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s2a 161262 43006 105356 29% /
Issuing command "ssh bunrab /sbin/dump -5uf - -b32 /dev/ad0s2a | /bin/dd of=/dev/rmt/0n conv=sync ibs=32k obs=32k"
DUMP: Date of this level 5 dump: Sun Jul 16 03:08:46 2006
DUMP: Date of last level 0 dump: Mon Jul 10 04:31:12 2006
DUMP: Dumping /dev/ad0s2a (/) to standard output
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 162 tape blocks.
DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: DUMP: 190 tape blocks
DUMP: finished in less than a second
DUMP: level 5 dump on Sun Jul 16 03:08:46 2006
DUMP: DUMP IS DONE
0+18 records in
18+0 records out
[And at this point, the "dd" process on the control/tape host hung, but
was killable.]
I then interactively re-started the backup:
Backup script restarted as
/usr/local/etc/backup/script -h /usr/local/etc/backup/hostslist -vxjl /var/log/dumplog.incremental -r bunrab:/
on Sun Jul 16 04:27:54 PDT 2006
Capacity=7500000; density=54000; length=36000; add_fudge=0; mult_fudge=1.2
Current tape position is now 96
Saved file number is 95; current tape position is 96
Issuing /bin/mt -f /dev/rmt/0n bsf 2
Current tape position is now 95
^L===========================================================
##### # # # # ##### ## #####
# # # # ## # # # # # # #
##### # # # # # # # # # #####
# # # # # # # ##### ###### # #
# # # # # ## # # # # # #
##### #### # # # # # # #####
bunrab runs "FreeBSD 4.11-STABLE"
Using "/bin/df -kt ufs" for filesystem query on bunrab
Using "/sbin/dump -" to perform the dump(s) for bunrab
Dump level for bunrab: 5
/dev/ad0s2a 161262 43006 105356 29% / ...
dev=/dev/ad0s2a; kb=161262; used=43006; avail=105356; cap=29%; mount=/; fudged=51607.2
Starting dump of /dev/ad0s2a at Sun Jul 16 04:28:27 PDT 2006
Vendor 'ARCHIVE ' Product '4586XX 28887-XX' tape drive:
sense key(0x0)= No Additional Sense residual= 0 retries= 0
file no= 95 block no= 0
Dumping /dev/ad0s2a on tape 1, file 95 (bunrab:/)
Trying to issue "ssh bunrab /bin/df -kt ufs /"
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s2a 161262 43006 105356 29% /
Issuing command "ssh bunrab /sbin/dump -5uf - -b32 /dev/ad0s2a | /bin/dd of=/dev/rmt/0n conv=sync ibs=32k obs=32k"
DUMP: Date of this level 5 dump: Sun Jul 16 04:28:37 2006
DUMP: Date of last level 0 dump: Mon Jul 10 04:31:12 2006
DUMP: Dumping /dev/ad0s2a (/) to standard output
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 162 tape blocks.
DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: DUMP: 190 tape blocks
DUMP: finished in 1 seconds, throughput 190 KBytes/sec
DUMP: level 5 dump on Sun Jul 16 04:28:37 2006
DUMP: DUMP IS DONE
0+17 records in
17+0 records out
Finished dump of bunrab:/dev/ad0s2a at Sun Jul 16 04:28:41 PDT 2006 (bunrab:/) [tape 1, file 95]
Current tape position is now 96
Vendor 'ARCHIVE ' Product '4586XX 28887-XX' tape drive:
sense key(0x0)= No Additional Sense residual= 0 retries= 0
file no= 96 block no= 0
...
[and so the re-started backup completed normally....]
Insights for diagnosing -- or even better, resolving -- this problem
would be most welcome. Requests for additional information should
probably be directed to me, rather than the list; I've set Reply-To
accordingly, and promise to follow up with anything useful or
interesting that's relevant.
Thanks!
Peace,
david
--
David H. Wolfskill david at catwhisker.org
Doing business with spammers only encourages them. Please boycott spammers.
See http://www.catwhisker.org/~david/publickey.gpg for my public key.