Jan Veen

A Little Bug Hunt

This article was originally published in Visionen Issue 5/2019, the official magazine of VIS.

Some weeks ago, just around the time of the launch of the great new VIS website, the CAT approached me to help with a very weird bug. I hadn't actually been involved in the development before, so things had to be grim. This is the report of a bug hunt that became a long journey from neat "high-level programming languages" down to low-level GNU/Linux. But let's start at the beginning...

The message I received from the CAT was roughly: "If we start the website with docker, cinit, gunicorn, a custom worker and gevent, it hangs during startup. Running without cinit works. Running without docker works as well. And running without gevent was how we developed the site in the beginning, so this works as well, but is slow." That was quite a lot of unknown words. If you don't know docker [1], think of it as a small hard drive ("image") onto which you install just a single program, in our case the website. When run, the drive behaves like a VM (but without its own kernel) and again only has running what the program needs ("container"). Then there is cinit [2]. This is a small UNIX init system the CIT wrote because both VIS and VSETH decided to run entire apps inside one docker container instead of only a single process; cinit takes care of starting up those processes. gunicorn [3] is a Python WSGI HTTP server which glues together Python programs like the website's code and the HTTP protocol. The CAT went one step further and used a custom implementation of a so-called worker, which is the child process responsible for the actual HTTP request handling. The last of these names was gevent [4], which was completely new to me and is a Python library for asynchronous programming. So the battlefield was laid out nicely.
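
To make the moving parts a bit more concrete, here is a minimal sketch of how such a gunicorn setup is wired together. The values and the module path below are made up for illustration; they are not taken from the actual VIS website.

    # gunicorn.conf.py -- illustrative sketch, not the CAT's real configuration
    bind = "0.0.0.0:8000"       # where gunicorn listens for HTTP requests
    workers = 4                 # number of child worker processes
    worker_class = "gevent"     # use gevent-based workers instead of sync ones

gunicorn would then be started with something like gunicorn -c gunicorn.conf.py vis_website.wsgi:application, and cinit's job is to launch exactly that command as one of the processes inside the container.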

I started by staring at the problem. If the website was run inside docker with all the named components active, I observed a normal startup until the worker processes of gunicorn spawned from their parent and did - nothing. After a timeout the parent noticed that its children were stalling, printed an error message like "TIMEOUT" and killed them (which is best practice in education, I guess?). However, it wouldn't terminate itself but in turn stalled, leaving the container idling around, which I thought was just a symptom of this strange state. So I was able to reproduce the problem, which is good. Running gunicorn inside docker without cinit starting it for me, the workers didn't stall and everything went fine. So was cinit to blame? The CAT also ran the website with cinit outside of docker and this worked fine too. So the question was: who made the workers stall? We needed more information.

As with all CAT projects having strange problems, I was first suspicious of their code (please don't tell them). So I got rid of the custom worker they implemented and fell back to the stock gevent worker they inherited from (in an OOP sense). But removing the custom code didn't fix the problem. This at least eliminated quite a large source of errors. But where to continue?
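
For illustration, such a custom worker is roughly just a subclass of gunicorn's stock gevent worker. The class below is a hypothetical sketch of what "getting rid of the custom worker" removes, not the CAT's actual code:

    from gunicorn.workers.ggevent import GeventWorker

    class CustomWorker(GeventWorker):
        """Hypothetical custom worker: inherits all gevent behaviour."""

        def init_process(self):
            # ... site-specific setup would go here ...
            super().init_process()

Switching back to the parent class simply means selecting worker_class = "gevent" again, so whatever was still broken could not have been hiding in the subclass.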

Since I really wanted to know what the worker was doing while waiting, I started instrumenting the gunicorn/gevent code by hand. Armed with vim and grep inside the container (I was desperate, OK?) I inserted print statements everywhere (luckily, in Python I don't have to recompile), looked at the last log message a worker sent before stalling, and inserted more print statements after each subsequent instruction. Some appeared when running, some didn't. In parallel I noticed that the workers printed more log messages after they had died. This should not be possible, and it smelled like a buffering issue. And indeed, setting a magic environment variable PYTHONUNBUFFERED [5] gave a more chronological view of the log.

While instrumenting I learned quite a lot about the internals of gunicorn and gevent. Most notably, gevent performs heavy monkey patching [6] on the Python standard library (stdlib). This roughly means taking a well-known function, say fork(), and just writing code like stdlib.fork = gevent_custom_implementation. So we focused on gevent's monkey patching, and the CAT Presi David was motivated enough to try all the flags offered by patch_all() [6], finding out that only the flag subprocess=True made our bug occur. If we disabled patching subprocess, it worked. OK, so was gevent to blame? We even found an issue [7] pointing in that direction, saying gevent doesn't respect SIGCHLD.
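
A tiny example of what this monkey patching looks like in practice (the exact class names printed may differ between gevent versions):

    from gevent import monkey

    # Replace blocking stdlib primitives with gevent's cooperative versions.
    # subprocess=True is the flag that made our bug appear.
    monkey.patch_all(subprocess=True)

    import socket
    import subprocess

    print(socket.socket)      # gevent's socket class, not the stdlib one
    print(subprocess.Popen)   # gevent's Popen replacement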

Excursion: SIGCHLD

In case you didn't make it to the great operating systems lecture (which I heard is optional by now): what is this SIGCHLD? In UNIX-like operating systems a process creates a new process simply by cloning itself, a task known as forking. This creates a so-called child process which is still tied to the original parent process by the operating system. When the child is done with its work it terminates, and the parent is notified via the signal SIGCHLD. This indicates that the parent is now supposed to wait() on its child to collect its result. If the parent (for some reason, say gevent) doesn't receive SIGCHLD, it might very well enter a state of grief and refuse to continue.
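
In Python on Linux, the whole dance looks roughly like this (a minimal sketch, not production code):

    import os
    import signal
    import time

    def on_sigchld(signum, frame):
        # Reap the terminated child; without wait() it would linger as a zombie.
        pid, status = os.waitpid(-1, os.WNOHANG)
        print("child", pid, "exited with status", status)

    signal.signal(signal.SIGCHLD, on_sigchld)

    if os.fork() == 0:
        # Child: do some work, then terminate.
        os._exit(0)

    time.sleep(1)  # the parent keeps working; SIGCHLD arrives asynchronously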

So this is where I left the world of shiny Python code and entered the gory details of UNIX programming. To determine whether SIGCHLD was involved, I used strace, which can run other programs and track all the syscalls (anything the process wants the operating system to do for it, like reading/writing files and forking - you should really attend the OS lecture) and signals that happen while that program runs. So I ran the website twice in the container, once with strace on the gunicorn process and once on the cinit process (which starts gunicorn for me). This gave me two ~6 MB text files of system interaction craving to be inspected. Armed with less I started looking for SIGCHLD signals and immediately found some in the working setting, where each worker forks itself to run ldconfig -p (you can see this because strace logs execve calls as well) and receives SIGCHLD once ldconfig is done. Tracing the syscalls with cinit gave me - no SIGCHLDs at all. So had they never been sent? Or was the operating system to blame in the end (VERY unlikely)?
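
Reading megabytes of strace output by hand does not scale, so a few lines of Python can pre-filter such a log (assuming it was written to a file, e.g. with strace's -f and -o options) down to the interesting lines:

    import re
    import sys

    # Keep only the lines showing execve() calls and SIGCHLD deliveries,
    # which is enough to spot the fork/exec/SIGCHLD pattern described above.
    interesting = re.compile(r"execve\(|SIGCHLD")

    with open(sys.argv[1]) as trace:
        for line in trace:
            if interesting.search(line):
                print(line, end="")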

A quick Google search told me that SIGCHLD can be masked by a process, meaning that the process tells the OS "don't bother me with that one". In that case no SIGCHLD is delivered. There is also a different way to deal with signals, namely a signal file descriptor (signalfd): instead of the operating system interrupting the process to handle the signal immediately, the process can open a special file and read signals from there. This has the advantage that the process can decide for itself when it wants to deal with received signals. The process then masks those signals because it has other means to receive them. But since signalfd is a syscall and I didn't find it in the strace log, this was not the case for gunicorn. So SIGCHLD should definitely be unmasked for gunicorn to work. But was it actually masked? As a nice man page told me, every process's signal mask can be viewed in /proc/<PID>/status, a special file showing the state of a process from the perspective of the OS. For the gunicorn workers it contained the line SigIgn: 0000000000014006 in the "Signals Ignored" field, a hexadecimal number. This is a bitmask where each bit stands for one signal. An expert might realise that bit 17, which represents SIGCHLD, is set to 1. I used a calculator. So the workers had this signal blocked, which is sad.
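
If you don't trust the calculator, a few lines of Python decode such a mask into signal names (the numbering below is that of x86-64 Linux):

    import signal

    def decode_mask(hex_mask):
        """Turn a SigBlk/SigIgn value from /proc/<pid>/status into signal names."""
        mask = int(hex_mask, 16)
        names = []
        for n in range(1, 65):
            if mask & (1 << (n - 1)):
                try:
                    names.append(signal.Signals(n).name)
                except ValueError:
                    names.append(str(n))
        return names

    print(decode_mask("0000000000014006"))
    # ['SIGINT', 'SIGQUIT', 'SIGTERM', 'SIGCHLD']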

It turned out even the gunicorn parent had this signal masked. This was very suspicious, because it definitely needs this signal to coordinate its workers. Moreover, it is not involved in monkey patching at all; that only happens in the workers. This turned my focus to cinit, which, as already mentioned, is also a self-written application by CIT members (including me). And it does precisely what I described above to handle its signals: it uses a signal file descriptor and masks all the signals present in the bitmask. So looking at the man page man 2 sigprocmask, which obviously is the function used to mask signals, I found the magic sentence: "A child created via fork(2) inherits a copy of its parent's signal mask; the signal mask is preserved across execve(2)." So was cinit the culprit? I patched cinit to reset the signal mask before calling execvpe in its own forked children [8], and the website was running again. Hurray. Then all that was left to do was to write a test for cinit covering that case [8] and finally publish a fixed version of cinit.
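
Translated into Python for illustration, the essence of the fix looks roughly like this (a conceptual sketch, not cinit's actual code; the gunicorn arguments are placeholders):

    import os
    import signal

    # Signals the init process consumes through its signal file descriptor.
    handled = {signal.SIGINT, signal.SIGQUIT, signal.SIGTERM, signal.SIGCHLD}
    signal.pthread_sigmask(signal.SIG_BLOCK, handled)

    pid = os.fork()
    if pid == 0:
        # The fix: without this line the child would carry the blocked mask
        # across exec*() and, like gunicorn, never see a SIGCHLD again.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, handled)
        os.execvp("gunicorn", ["gunicorn", "app:app"])
    else:
        # Real cinit reads child exits from its signalfd; here we just wait.
        os.waitpid(pid, 0)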

Maybe you enjoyed this little bug review, maybe you even learned something new (as I definitely did). And maybe you are so excited about all that stuff by now that you would like to participate and also become a member of CIT or CAT and help us do a good job for the entire VIS.

[1] https://www.docker.com/
[2] https://j.njsm.de/git/veenj/cinit
[3] https://gunicorn.org/
[4] http://www.gevent.org/
[5] https://docs.python.org/3/using/cmdline.html#cmdoption-u
[6] http://www.gevent.org/api/gevent.monkey.html
[7] https://github.com/gevent/gevent/issues/622#issuecomment-135898056
[8] https://j.njsm.de/git/veenj/cinit/issues/38 (then follow the commit link)