TIL Memory Fragmentation

2020/08/22

I was recently sent down a very nice rabbit hole of low-level operating system knowledge, which I would like to share here.

What the hell is happening?

We’re using Concourse as the main driver for pretty much all of the deployment automation in our team at my job. For a while around the end of 2019 and the beginning of 2020, we sometimes hit rather strange errors in pipeline jobs, which then caused those jobs to fail. The error always looked something like

iptables: vmalloc: allocation failure, allocated 0 of XXXX bytes

where the number in place of XXXX was different every time. But it was never something gigantic, and the worker node where the job was running definitely had enough total memory available at that point in time to serve that allocation request from iptables. And usually the next run of that job went through just fine. So this was all very confusing and strange in the beginning.

Let’s ask the internet

As is common nowadays, we just put the error message from above into $search_engine and discovered that we’re not the only ones with that problem. Somebody had already opened an issue on the GitHub project of Concourse: https://github.com/concourse/concourse/issues/3127

As @circosta pointed out at https://github.com/concourse/concourse/issues/3127#issuecomment-484895481, this all seems to be caused by a bug in iptables (https://bugzilla.kernel.org/show_bug.cgi?id=200651). Basically, there was (at least for some time) a very restrictive flag set when requesting memory via kmalloc (search for GFP_NORETRY on https://www.kernel.org/doc/htmldocs/kernel-api/API-kmalloc.html), which caused iptables to fail hard instantly. This was then fixed, as I understood it, by changing the flag to a slightly less restrictive one.

OK, so memory was short and iptables decided to fail because it couldn’t get the amount it wanted. But wait… our monitoring did not show the worker node being under so much memory pressure that it couldn’t satisfy the request in question. Well, at that point in time, I had no good explanation. But we accepted that it’s basically a kernel bug that was fixed in a later version. So we made sure our workers were running on a recent enough kernel version and were happy that those cryptic iptables errors disappeared. And I shoved all of this to the back of my mind.

…aaaand down the rabbit hole

Back then, I subscribed to the GitHub issue, because I was majorly annoyed by the problem and I also have a weakness for such low-level issues :-) And recently I got an e-mail that somebody had posted a new comment there: https://github.com/concourse/concourse/issues/3127#issuecomment-663520233

As @blairboy362 stated, this all seems to be a memory fragmentation problem. There was potentially enough total memory available on the host, but no contiguous chunk that was big enough.
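
To make that a bit more tangible, here is a toy sketch in Python. It has nothing to do with how the kernel really tracks memory; the 16-page "memory" and the function names are made up for illustration. It only shows that the total number of free pages and the largest contiguous run of free pages are two very different numbers.

```python
def longest_free_run(pages):
    """Length of the longest run of consecutive free (True) pages."""
    best = current = 0
    for free in pages:
        current = current + 1 if free else 0
        best = max(best, current)
    return best


def can_allocate(pages, wanted):
    """True if 'wanted' contiguous free pages exist somewhere."""
    return longest_free_run(pages) >= wanted


# Hypothetical 16-page memory where every other page is in use.
pages = [i % 2 == 0 for i in range(16)]

total_free = sum(pages)
print("free pages in total:", total_free)
print("longest contiguous run:", longest_free_run(pages))
print("can allocate 4 contiguous pages:", can_allocate(pages, 4))
```

Half of the pages are free here, but a request for 4 pages in one piece still can't be served, because no two free pages sit next to each other.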

The mention of /proc/buddyinfo was the first step of this fall into the rabbit hole, because I didn’t know it existed up until that moment. If you run cat on it, you’ll get a result that looks something like

$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      1      2      1      1      0      1      1      3
Node 0, zone    DMA32      7      8      7      7     13     10      9     10      3      8    419
Node 0, zone   Normal   1795  19012  11279   6581   5116   2335   1008    416    136     62   5294

…OK, what the fuck am I looking at? I did a quick search for /proc/buddyinfo in the interwebs, and stumbled upon this very nice explanation: https://www.uninformativ.de/blog/postings/2017-12-23/0/POSTING-en.html

tl;dr: /proc/buddyinfo tells you how fragmented your memory is. The Linux kernel manages memory in units called pages, usually 4KB each. Each column shows how many free blocks of a certain size are available. The leftmost column counts single 4KB pages, the next column counts blocks of 2 pages (8KB), the one after that blocks of 4 pages (16KB), and so on, doubling with every column. With the 11 columns in the output above, the last one counts blocks of 1024 pages (4MB). So the columns on the right show the largest contiguous chunks of memory the kernel can hand out. Low or even zero counts on the right side are bad: they indicate very fragmented RAM, which leads to larger allocation requests failing, like with iptables in Concourse in the example above.
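
If you don't want to do the column math in your head, a few lines of Python can do it. This is just a minimal sketch: the function names are mine, it assumes 4KB pages, and `can_satisfy` is a big simplification, since the real allocator has more rules than "is there a free block that is large enough". It parses the sample output from above; on a Linux box you could feed it the contents of /proc/buddyinfo instead.

```python
PAGE_SIZE = 4096  # assumption: 4KB base pages


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order, order 0 first]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal <count> <count> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(n) for n in fields[4:]]
    return zones


def block_size(order):
    """Size in bytes of one block in column 'order' (0 = leftmost)."""
    return PAGE_SIZE * 2 ** order


def free_bytes(counts):
    """Total free memory in a zone, summed over all columns."""
    return sum(count * block_size(order) for order, count in enumerate(counts))


def largest_free_order(counts):
    """Rightmost column that still has a free block, or None."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order] > 0:
            return order
    return None


def can_satisfy(counts, nbytes):
    """Simplified check: is there any free block of at least nbytes?"""
    return any(count > 0 and block_size(order) >= nbytes
               for order, count in enumerate(counts))


SAMPLE = """\
Node 0, zone      DMA      0      0      0      1      2      1      1      0      1      1      3
Node 0, zone    DMA32      7      8      7      7     13     10      9     10      3      8    419
Node 0, zone   Normal   1795  19012  11279   6581   5116   2335   1008    416    136     62   5294
"""

for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
    order = largest_free_order(counts)
    largest_kb = block_size(order) // 1024 if order is not None else 0
    print(f"node {node} zone {zone}: {free_bytes(counts) // 1024} KB free, "
          f"largest free block {largest_kb} KB")
```

For the sample above, every zone still has free blocks in the last column, so that machine is in good shape. On a badly fragmented host, the right-hand columns would be at or near zero while the total free memory could still look perfectly fine.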

It’s pretty logical that memory can get fragmented, but I honestly never gave it a thought before. And I didn’t know I could check something in /proc for information about the state of fragmentation. It definitely makes sense to keep an eye on weird issues, because you’ll most likely learn something new :-)