RE: https://haunted.computer/@binarygolf/115878068970692998
Last day of BGGP6 today!
The British are to blame for this, aren’t they?
A glimpse into what a kernel engineer debugs for enterprise customers.
A bank is running a "security" solution that installs kprobes to intercept, among other things, calls to do_execveat_common(), and monitors all the arguments that could have been passed to execveat(). Because do_execveat_common() can be reached not only from userspace but also via call_usermodehelper_exec(), a kprobe written with poor assumptions may erroneously double-dereference what it believes is argv**, causing a General Protection Fault.
The kernel is not dumb, however. If a GPF is triggered by a kprobe, it is handled gracefully: nothing bad happens, and the kprobe just returns a safe value. For a GPF to be triggered, however, the CPU first has to actually try to read the wrong memory address. The address is pretty random each time, meaning it can point to memory regions that are not mapped by the kernel but have some special meaning for the platform.
Enter the platform. It is configured by the hardware vendor in such a way that if an unaligned access to an MMIO region happens, an MCE is generated. And it is not some MCE for a correctable error, but an MCE indicating process context corruption, in other words, it's fatal. So, once it happens, the system dies with a kernel panic.
And this is exactly what the customer experienced. A socket() syscall caused modprobe to be invoked via the call_usermodehelper_exec() → do_execveat_common() chain to load the ipv6 module. This triggered a kprobe that dereferenced a wrong pointer twice, provoking a GPF. The kernel began to handle the GPF gracefully, but the platform saw that the second dereference accessed the MMIO region, and that the access was unaligned, so the platform raised an MCE. And the system died.
It was fun to investigate this and to explain to the customer that three legitimate things in their system being hit together can trigger a crash.
And of course we joked we should have moved the whole case to the networking team, because it's always IPv6.
I recently had to deploy a change to #Debian Code Search to limit the amount of memory used while indexing a single package, because of #Firefox, which now ships as 388_859 files, totaling 1.78 GB! The resulting search index is 2.76 GB. Doing this entire indexing in one go is just too much for typical servers.
So now we flush into intermediate index files and merge them at the end: https://github.com/Debian/dcs/commit/8e76d5b9408cd12cfb6b728c1f1f3a96a9775310
The resulting drop in max heap usage is nicely visible on the graph by now :)
But then I discovered https://github.com/builtbybel/Winslop. A tiny tool helped a lot in disabling all of that crap. Nice and simple and it gave me a feeling of control. Thank you Belim! :)
(Yes, this is a lesson in empathy with the user and pretty much accidentally a metaphor that somehow connects to my employer and their main product. No, I will not take any further questions.
P.S. I would prefer arkenfox as a config system over any kind of fork)
2/2