@shortridge While working tech support, I got a call on a Monday. Some VPNs which had been working on Friday were no longer working. After a little digging, we found the negotiation was failing due to a certificate validation failure.
The certificate validation was failing because the system couldn’t check the certificate revocation list (CRL).
The system couldn’t check the CRL because it was too big. The software doing the validation only allocated 512kB to store the CRL, and it was bigger than that. This is from a private certificate authority, though, and 512kB is a *LOT* of revoked certificates. Shouldn’t be possible for this environment to hit within a human lifespan.
Turns out the CRL was nearly a megabyte! What gives? We check the certificate authority, and it’s revoking and reissuing every single certificate it has signed once per second.
The revocations say all the certificates (including the certificate authority’s) are expired. We check the expiration date of the certificate authority, and it’s set to some time in 1910. What? It was around here I started to suspect what had happened.
The certificate authority isn’t valid before some time in 2037. It was waking up every second, seeing the current date was after the expiration date and reissuing everything. But time is linear, so it doesn’t make sense to reissue an expired certificate with an earlier not-valid-before date, so it reissued all the certs with the same dates and went to sleep. One second later, it woke up and did the whole process over again. But why the clearly invalid dates on the CA?
The CA operation log was packed with revocations and reissues, but I eventually found the reissues which changed the validity dates of the CA’s certificate. Sure enough, it reissued itself in 2037 and the expiration date was set to 2037 plus ten years, which fell victim to the 2038 limitation. But it’s not 2037, so why did the system think it was?
The OS running the CA was set to sync with NTP every 120 seconds, and it used a really bad NTP client which blindly set the time to whatever the NTP server gave it. No sanity checking, no drifting. Just get the time, set the time. OS logs showed most of the time, the clock adjustment was a fraction of a second. Then some time on Saturday, there was an adjustment of tens of thousands of seconds forward. The next adjustment was hundreds of thousands of seconds forward. Tens of millions of seconds forward. Eventually it hit billions of seconds backwards, taking the system clock back to 1904 or so. The NTP server was racing forward through the 32-bit timestamp space.
At some point, the NTP server handed out a date in 2037 which was after the CA’s expiration. It reissued itself as I described above, and a date math bug resulted in a cert which expired before it was valid. So now we have an explanation for the CRL being so huge. On to the NTP server!
Turns out they had an NTP “appliance” with a radio clock (i.e, a CDMA radio, GPS receiver, etc.). Whoever built it had done so in a really questionable way. It seems it had a faulty internal clock which was very fast. If it lost upstream time for a while, then reacquired it after the internal clock had accumulated a whole extra second, the server didn’t let itself step backwards or extend the duration of a second. The math it used to correct its internal clock somehow resulted in dramatically shortening the duration of a second until it wrapped in 2038 and eventually ended up at the correct time.
Ultimately found three issues:
• An OS with an overly-simplistic NTP client
• A certificate authority with a bad date math system
• An NTP server with design issues and bad hardware
Edit: The popularity of this story has me thinking about it some more.
The 2038 problem happens because when the first bit of a 32-bit value is 1 and you use it as a signed integer, it’s interpreted as a negative number in 2’s complement representation. But C has no protection from treating the same value as signed in some contexts and unsigned in others. If you start with a signed 32-bit integer with the value -1, it is represented in memory as 0xFFFFFFFF. If you then use it as an unsigned integer, it becomes the value 4,294,967,296.
I bet the NTP box subtracted the internal clock’s seconds from the radio clock’s seconds as signed integers (getting -1 seconds), then treated it as an unsigned integer when figuring out how to adjust the tick rate. It suddenly thought the clock was four billion seconds behind, so it really has to sprint forward to catch up!
In my experience, the most baffling behavior is almost always caused by very small mistakes. This small mistake would explain the behavior.
Voices of Open Source: The European regulators listened to the Open Source communities! https://blog.opensource.org/the-european-regulators-listened-to-the-open-source-communities/
Open Source Entwickler doch nicht für Sicherheitslücken (etc) in ihrer Software haftbar wie kommerzielle Entwickler. Das war auch *extrem* weltfremd.
Im Fall „Modern Solution“-Hacking (haha) hat der verurteilte „Hacker“ (haha) Berufung eingelegt.
25 years ago, my mentor at uni showed me how to interrupt autoconfig runs at just the right time so the generated scripts that yielded wrong results wouldn't be deleted and we could check and fix them.
Today, a friend looks for just the right time to intercept Ansible Tower execution environments so he can debug the podman containers that yield wrong results and fail a deployment.
25 years of "progress" and we still run into the same terrible stuff.