This article was originally published in Visionen Issue 5/2019, the official magazine of VIS.
A little bug hunt
Some weeks ago, just about the time of the launch of the great new VIS website, the CAT approached me for help with a very weird bug. I hadn’t actually been involved in the development before, so things must have been grim. This is the report of a bug hunt that turned into a long journey from neat “high-level programming languages” down to low-level GNU/Linux. But let’s start at the beginning…
The message I received from the CAT was roughly: “If we start the website with docker, cinit, gunicorn, a custom worker and gevent, it hangs during startup. Running without cinit works. Running without docker works as well. And running without gevent was how we developed the site in the beginning, so this works too, but is slow.” That is quite a lot of unfamiliar words. If you don’t know docker 1, think of it as a small hard drive (“image”) onto which you install just a single program, in our case the website. When run, the drive behaves like a VM (but without its own kernel) and again only runs what the program needs (“container”). Then there is cinit 2. This is a small UNIX init system the CIT wrote because both VIS and VSETH decided to run entire apps inside one docker container instead of only a single process; cinit takes care of starting up the processes. gunicorn 3 is a Python WSGI HTTP server which glues together Python programs like the website’s code and the HTTP protocol. The CAT went one step further and used a custom implementation of a so-called worker, which is the child process responsible for the actual HTTP request handling. The last of these names was gevent 4, which was completely new to me and is a Python library for asynchronous programming. So the battlefield was laid out nicely.
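To make “WSGI” a bit more concrete: a WSGI application is just a Python callable that the server invokes for every HTTP request. A minimal example (hello.py is a made-up name, this is of course not the VIS website):

    # hello.py -- a minimal WSGI application
    def app(environ, start_response):
        # gunicorn calls this once per request and ships the
        # returned bytes back to the HTTP client.
        body = b"Hello from the VIS website!\n"
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]

gunicorn would serve it with gunicorn hello:app.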
I started by staring at the problem. If the website was run inside docker with all the named components active, I observed a normal startup until the worker processes of gunicorn spawned from their parent and did - nothing. After a timeout the parent notices that its children are stalling, prints an error message like “TIMEOUT” and kills them (which is best practice in education, I guess?). However, it wouldn’t terminate itself but stalled as well, leaving the container idling around, which I thought was just a symptom of this strange state. So I was able to reproduce the problem, which is good. Running gunicorn inside docker without cinit starting it for me, the workers didn’t stall and everything went fine. So was cinit to blame? The CAT also ran the website with cinit outside of docker, and this also worked fine. So the question was: who made the workers stall? We needed more information.
As with all CAT projects having strange problems, my first suspicion was their code (please don’t tell them). So I got rid of the custom worker they implemented and kept only the stock gevent worker they inherited from (in an OOP sense). But removing the custom code didn’t fix the problem, which at least eliminated quite a large source of errors. But where to continue?
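For context: in gunicorn, such a worker is simply a subclass of the stock gevent worker, selected via the --worker-class flag. The import path below is gunicorn’s real API; the body is a hypothetical sketch, not the CAT’s actual code:

    from gunicorn.workers.ggevent import GeventWorker

    class CustomWorker(GeventWorker):
        # Hypothetical: override hooks of the stock worker to
        # add custom behaviour around request handling.
        def init_process(self):
            # ...custom setup would go here...
            super().init_process()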
Since I really wanted to know what the worker was doing while waiting, I started instrumenting the gunicorn/gevent code by hand. Armed with vim and grep inside the container (I was desperate, ok?) I inserted print statements to trace what the workers were up to. A word on gevent: it achieves its asynchronous magic by so-called monkey patching 5. At startup it replaces blocking functions of the Python standard library, such as fork(), with its own cooperative implementations; since Python is Python, it can simply overwrite fork() and just write code like stdlib.fork = gevent_custom_implementation. So we focused on gevent monkey patching, and the CAT Presi David was motivated enough to try all the flags offered by patch_all() 6, finding out that only the flag subprocess=True made our bug occur. If we disabled patching subprocess, it worked. OK, so was gevent to blame? We even found an issue 7 pointing in that direction, saying gevent doesn’t always get along with SIGCHLD.
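For reference, patch_all() and its subprocess flag are gevent’s real API; toggling it looks roughly like this:

    from gevent import monkey

    # Replace blocking stdlib functions with cooperative gevent
    # versions. With subprocess patching (gevent's default) the
    # workers hung for us:
    monkey.patch_all(subprocess=True)

    # ...while disabling only that patch made the bug disappear:
    # monkey.patch_all(subprocess=False)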
In case you didn’t make it to the great operating systems lecture (which I heard is optional by now): what is this SIGCHLD? In UNIX-like operating systems a process creates a new process simply by cloning itself, a task known as forking. This creates a so-called child process which is still tied to the original parent process by the operating system. When the child is done with its work, it terminates itself and the parent is notified via the signal SIGCHLD. This indicates that the parent is now supposed to wait() on its child to collect its result. If the parent (for some reason, say gevent) doesn’t receive SIGCHLD, it might very well be that it enters a state of grief and refuses to continue.
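In Python, the whole dance fits in a few lines (a toy example, nothing website-specific):

    import os, signal, sys, time

    def on_sigchld(signum, frame):
        print("parent: got SIGCHLD")

    signal.signal(signal.SIGCHLD, on_sigchld)  # register a handler

    pid = os.fork()        # clone ourselves; returns 0 in the child
    if pid == 0:
        sys.exit(42)       # child: terminate, the kernel notifies the parent
    else:
        time.sleep(0.5)    # toy example: give the child time to exit
        _, status = os.waitpid(pid, 0)         # reap the child
        print("child exited with", os.WEXITSTATUS(status))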
So this is where I left the world of shiny Python code and entered the gory details of UNIX programming. To determine whether SIGCHLD was involved in this, I used strace, which can run other programs and trace all the syscalls (anything the process wants the operating system to do for it, like reading/writing files and forking; you should really attend the OS lecture) and signals that happen while that program runs. So I ran the website twice in the container, once with strace on the gunicorn process and once on the cinit process (which starts gunicorn for me). This gave me two ~6 MB text files of system interaction craving to be inspected. Armed with less I started looking for SIGCHLD signals and immediately found some in the working setting, where each worker forks itself to run ldconfig -p (you can see this because strace records the execve calls as well) and receives a SIGCHLD once ldconfig is done. Tracing the setup with cinit gave me: no SIGCHLDs at all. So had they never been sent? Or was the operating system to blame in the end (VERY unlikely)?
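If you want to play along at home: strace’s -f flag follows forked children and -o writes the log to a file (the traced command here is just an example):

    strace -f -o trace.log gunicorn hello:app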
A quick Google search told me that SIGCHLD can be masked by a process, meaning that the process tells the OS “don’t bother me with that one”. In that case no SIGCHLD is delivered. There is also a different way to deal with signals: a signal file descriptor. Instead of the operating system interrupting the process to handle the signal immediately, the process can open a special file and read signals from there. This has the advantage that the process can decide for itself when it wants to deal with received signals; the process then masks the signals because it has other means of receiving them. But since signalfd is a syscall and I didn’t find it in the strace log, this was not the case for gunicorn. So SIGCHLD should definitely be unmasked for gunicorn to work. But was it actually masked? As a nice man page told me, every process’s signal mask can be viewed in /proc/<PID>/status, a special file showing the state of a process from the perspective of the OS. For the gunicorn workers it contained the line SigBlk: 0000000000014006 in the blocked-signals field, a hexadecimal number. This is a bitmask where each bit represents one signal. An expert might realise that bit 17, which represents SIGCHLD, is set to 1. I used a calculator. So the workers had this signal blocked, which is sad.
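You can recreate the whole situation in a few lines of Python (Linux only, since it reads /proc):

    import signal

    # Block ("mask") SIGCHLD, as cinit does for its own purposes...
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})

    # ...then decode the SigBlk line of /proc/self/status by hand.
    with open("/proc/self/status") as f:
        line = next(l for l in f if l.startswith("SigBlk:"))
    mask = int(line.split()[1], 16)
    print(hex(mask))                                 # bit 17 is set
    print(bool(mask & (1 << (signal.SIGCHLD - 1))))  # True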
It turned out that even the gunicorn parent had this signal masked. This was very suspicious, because it definitely needs this signal to coordinate its workers. Moreover, it is not involved in monkey patching at all; that only happens in the workers. This turned my focus to cinit, which as already mentioned is also a self-written application by CIT members (including me). And it does precisely what I described above to handle its signals: it uses a signal file descriptor and masks all the signals present in the bitmask. So, looking at the man page man 2 sigprocmask, which obviously is the function used to mask signals, I found the magic sentence: “A child created via fork(2) inherits a copy of its parent’s signal mask; the signal mask is preserved across execve(2).” So was cinit the culprit? I patched cinit to reset the signal mask in its own forked children before calling execve 8, and the website was running again. Hurray! Then all that was left to do was to write a test for cinit covering that case 8 and finally publish a fixed version of cinit.
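cinit is not written in Python, but translated into Python the fix boils down to this (a sketch assuming a plain fork/exec, not cinit’s actual code):

    import os, signal

    pid = os.fork()
    if pid == 0:
        # Child: clear the inherited signal mask before execve, so
        # the new program starts with a clean slate.
        signal.pthread_sigmask(signal.SIG_SETMASK, set())
        os.execvp("gunicorn", ["gunicorn", "hello:app"])
    else:
        os.waitpid(pid, 0)  # parent: reap the child as usual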
Maybe you enjoyed this little bug review, maybe you even learned something new (as I definitely did). And maybe you are by now so excited about all of this that you would like to participate and become a member of CIT or CAT, helping us do a good job for the entire VIS.