Adventures in Hell
Last modified: Jan 29, 2008
This page documents (in short form) some cases where software
misbehavior caused massive time loss, especially cases where there was no
obvious way to fix the underlying problem, and where someone less
persistent than I would have simply given up. I'm documenting them
here in the hope that you got here because of a similar problem, and
that this will help you.
Adventure 2: (Unix) Diagnosing System Misbehavior
Symptom: A Unix virtual
private server is occasionally killing random processes and issuing
"memory alerts" in the control panel.
This is more serious than it sounds: once in a while, a completely
randomly selected process that needs to allocate a little memory is
killed instead. The system otherwise runs fine, but
the overall effect is something like standing inside a building and
periodically knocking another hole in a wall. Eventually,
something really bad will happen. This story illustrates
how difficult and dangerous the Unix environment can be.
First, to start at the end, I finally found the
problem by reading through all the obscure log files I could find
scattered around on the server. I could not have done this
without root access. What I found was a "failed, ran out of
memory" note in the logs for qmail. This message was associated
with a mail filter I had written and debugged a month before.
The intended output of this filter is to note bouncing email addresses
in a database. This filter is run as user 'vpopmail' as
part of the mail delivery process, but it's an ordinary user script,
requiring no particular privileges to run. Most Unix systems
permit such scripts to be authored by ordinary users. As part of
its "good citizen" contract, this script logs what it does, and
any problems it encounters, in a text file.
In the process of developing and debugging this script, I had
discovered that user 'vpopmail' needed write access to the log file,
which I granted on an ad-hoc basis by changing the write
permissions on the file. I knew at the time that this wasn't a
permanent solution, but I forgot about it before I wrapped up the project.
Some days later, my log file rotator created a new log file, without
the ad-hoc write permission. This caused the mail filter to
fail because it couldn't write the log file.
-- but wait -- being a "good
citizen", the program tried to log the fact that it couldn't open
the log file. You can guess what happened next.
... so the NEXT thing that happened is that the mail filter went into a
recursive dive, used up all the virtual memory available to it, and
eventually was killed by the system. There was no entry in
my log file, though, and because I wasn't actively expecting a flood
of bounced email messages, I didn't notice the absence of a few
new bounced email notations in my database. What I did
notice was that once in a while, some-process-or-other had
mysteriously died and needed to be restarted... I also
noticed that this mysterious unreliability was a new problem, but it
took a long time, chasing red herrings and pointing fingers at innocent
parties, before I found the smoking gun that allowed it to be fixed.
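The failure mode above (a logger that logs its own failure by calling itself) can be avoided with a single rule: the error path of the logging routine must never re-enter the logging routine. Here is a minimal sketch in Python, under the assumption that a fallback stream such as stderr is acceptable; the original filter's language is not stated, and the name `log` and its parameters are hypothetical.

```python
import sys
import time


def log(message, path, fallback=sys.stderr):
    """Append one timestamped line to `path`.

    Returns True on success. On failure, writes a single notice to
    `fallback` and returns False. It never calls itself.
    """
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n"
    try:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(line)
        return True
    except OSError as err:
        # Do NOT call log() from here: reporting a logging failure through
        # the same logger is exactly the recursion described above.
        try:
            fallback.write(f"cannot write {path}: {err}; message was: {line}")
        except Exception:
            pass  # last resort: drop the message rather than recurse or crash
        return False
```

With this shape, a log file that has lost its write permission costs one line on the fallback stream per message, not an unbounded dive through memory.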
Chapter 2: (yes, there's more!)
The other end of this same procedure is sending email. I have a
script that sends these emails, and I've always used it sparingly: I
don't want to be considered to be spamming my players.
Since I'm such a good citizen, and have been running a web site for a
number of years, I've accumulated quite a few email addresses which do
NOT bounce. On the most recent occasion when I sent a mass
email to all (~3000 now) addresses, instead of the usual trickle of new
bounce messages, I received a flood, and my server's email send/receive
functions seemed to be severely wedged, which I eventually fixed by
rebooting.
Another protracted investigation ensued, which eventually led to the
observation that when I sent my 3000 messages (one at a time, but as
fast as possible), Unix apparently did its best to fire up 3000
concurrent "send" processes, which opened up all sorts of resource
exhaustion possibilities. In this case the thing that hit the wall
first was the number of open files. Who knew there even was a
limit?
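For anyone else surprised by that limit: a process can ask what its own open-file ceiling is. The sketch below uses Python's standard `resource` module (Unix only) to read the per-process limit. Whether the limit that was actually hit here was this per-process one or a system-wide or VPS-imposed one is not known, so treat this as a starting point for looking, not as a diagnosis. The function names are hypothetical.

```python
import resource


def open_file_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft limit {describe(soft)}, hard limit {describe(hard)}")
```

The shell's `ulimit -n` reports the same soft limit for the current shell.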
... and, after more investigation, I found that Qmail has a
configurable "concurrency limit" which can be set to a suitable value,
but which was actually set to 255.
At least there was already a configuration variable I could tweak.
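The fix described above was made in the mail server's configuration. The same idea, capping how many sends are in flight at once, can also be sketched on the sending side. This is an illustration and not the script used here: `send_all`, `send_one`, and the default of 20 are all hypothetical, and `send_one` stands in for whatever actually hands a message to the mail system.

```python
from concurrent.futures import ThreadPoolExecutor


def send_all(addresses, send_one, max_concurrent=20):
    """Call send_one(address) for every address, at most max_concurrent at a time.

    Returns a list of (address, exception) pairs for the sends that failed.
    """
    failures = []
    # The pool size is the concurrency cap: extra work waits in a queue
    # instead of each message getting its own process or thread.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(send_one, addr): addr for addr in addresses}
        for future, addr in futures.items():
            error = future.exception()  # blocks until this send has finished
            if error is not None:
                failures.append((addr, error))
    return failures
```

Sending 3000 messages through this with `max_concurrent=20` takes longer than firing them all at once, but the number of simultaneously open files and processes stays bounded.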
Adventure 1: (Windows) Network Activity Graphs
Symptom: DuMeter stopped
working. DuMeter is a Windows application that graphs network activity,
which I find entertaining, and occasionally really useful. One day
I noticed it was reporting no activity, when I knew for sure there was
a lot. DuMeter wasn't giving any clues why it didn't see
any network traffic, and there was no way to tell when it had stopped
working. I tried reinstalling DuMeter and installing several other
network metering programs, with results as follows:
- DuMeter is just dead.
- NetMeter is just dead.
- Network Activity Diagram reports "Nothing to Monitor".
- Netmeter (b) works.
Netmeter (b) is a demo application for WinPcap,
a non-Microsoft, Unix-compatible metering package. I had to
compile this application myself.
Some email help from the developers of Network Activity Diagram (NAD)
led me to iphlpapi.dll, Microsoft's recommended network helper API.
I wrote a test program and found that the GetIfTable function
was not reporting the existence of my network adapter. So I
concluded that NAD for sure, and probably the other non-functioning
programs, were being misled by this. Some poking around in the
Windows registry revealed that my network card had migrated into an
unusual position, as adapter number 4, and there were no adapters
number 2 and 3. This odd state of affairs probably arose as a
result of adding and removing PGPnet and VMware, which create virtual
adapters. I'm guessing that something of this kind is what is confusing
iphlpapi.dll. Theoretically this problem could be fixed by
some small rearrangement of the registry, but it's all completely
opaque and undocumented, so I used the "big hammer" approach: I removed
all my network adapter drivers (both real and virtual) and reinstalled
them. Problem fixed.
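The test program mentioned above is not reproduced here. The following is a rough sketch of the same kind of check, written in Python with `ctypes`, assuming the `MIB_IFROW` layout declared in the Windows SDK headers. The helper name `list_interfaces` is hypothetical. It only runs on Windows; elsewhere it raises `OSError`.

```python
import ctypes
import sys

DWORD = ctypes.c_uint32
ERROR_INSUFFICIENT_BUFFER = 122  # Win32 error code


class MIB_IFROW(ctypes.Structure):
    # Field order follows the MIB_IFROW declaration in the Windows SDK.
    _fields_ = (
        [("wszName", ctypes.c_uint16 * 256),      # WCHAR[MAX_INTERFACE_NAME_LEN]
         ("dwIndex", DWORD), ("dwType", DWORD), ("dwMtu", DWORD),
         ("dwSpeed", DWORD), ("dwPhysAddrLen", DWORD),
         ("bPhysAddr", ctypes.c_ubyte * 8)]       # MAXLEN_PHYSADDR
        + [(name, DWORD) for name in (
            "dwAdminStatus", "dwOperStatus", "dwLastChange",
            "dwInOctets", "dwInUcastPkts", "dwInNUcastPkts",
            "dwInDiscards", "dwInErrors", "dwInUnknownProtos",
            "dwOutOctets", "dwOutUcastPkts", "dwOutNUcastPkts",
            "dwOutDiscards", "dwOutErrors", "dwOutQLen", "dwDescrLen")]
        + [("bDescr", ctypes.c_ubyte * 256)]      # MAXLEN_IFDESCR
    )


def list_interfaces():
    """Return [(index, description)] for every interface GetIfTable reports."""
    if sys.platform != "win32":
        raise OSError("GetIfTable is only available on Windows")
    get_if_table = ctypes.WinDLL("iphlpapi").GetIfTable
    # Start with room for one row; GetIfTable reports the size it really needs.
    size = ctypes.c_ulong(ctypes.sizeof(DWORD) + ctypes.sizeof(MIB_IFROW))
    buf = ctypes.create_string_buffer(size.value)
    rc = get_if_table(buf, ctypes.byref(size), 0)
    if rc == ERROR_INSUFFICIENT_BUFFER:
        buf = ctypes.create_string_buffer(size.value)
        rc = get_if_table(buf, ctypes.byref(size), 0)
    if rc != 0:
        raise OSError(f"GetIfTable failed with Win32 error {rc}")
    # MIB_IFTABLE is a DWORD entry count followed by that many MIB_IFROWs.
    count = DWORD.from_buffer(buf).value
    rows = (MIB_IFROW * count).from_buffer(buf, ctypes.sizeof(DWORD))
    return [
        (row.dwIndex,
         bytes(row.bDescr[:row.dwDescrLen]).decode("ascii", "replace").rstrip("\x00"))
        for row in rows
    ]


if __name__ == "__main__":
    for index, description in list_interfaces():
        print(index, description)
```

If the physical network card is missing from this list while traffic is clearly flowing, any meter built on this API will show nothing, which matches the symptom.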
No thanks to Microsoft, which produced the buggy API through some
complex interaction of its slopware manipulations of the Windows
registry. Microsoft would make me pay them for the privilege of
reporting this bug to them, and I've spent too much time already.
June 2003, Windows 2000