Edited 18 hours ago

I feel like a lot of the "it was DNS" is conceptually oversimplifying, and masking true root cause(s).

It makes sense to centralize naming. It makes sense to abstract naming away from IP addresses. People choose the risk of DNS centralization for understandable reasons.

But many of these outages are "this other thing broke, and it made DNS break". Most folks would never say "it was DNS" when it was a network problem that was preventing reaching the DNS servers. But a lot of the time, this isn't much different from that.

Don't get me wrong, every outage is an opportunity to learn and improve, both locally and centrally. I just want to shift the conversation to "it was DNS, because ...", and help people make informed risk trade-offs.

Approach it like the NTSB would: keep going until you know, and have recommendations for, all of the failure modes.

(And yes, I know a lot of it is humor. But we have to make sure that even the humor doesn't influence people to stop short in their analysis, or come to the wrong conclusion.)

@tychotithonus As I see it, "this other thing" usually doesn't break everything. But if it breaks DNS, everything breaks, so DNS has a special significance IMO, esp. because people often overlook this component.

I feel like outages in, and access to, the Internet itself are a good analogy here.

The Internet is very useful, and it makes sense to understand what happens when it's not available and ensure that you are consciously choosing what to make sure keeps working when it's gone, and how to do so, and how to test it.

This also applies to DNS.
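
To make "how to test it" concrete, here is a minimal Python sketch, not a recommendation for any particular setup. It assumes you keep a small table of pinned addresses for the few names that must keep working. The names `resolve_with_fallback`, `broken_resolver`, and `db.internal.example` are all hypothetical, and the resolver is passed in as a parameter so that an outage can be simulated without touching the network.

```python
import socket


def resolve_with_fallback(hostname, pinned, resolver=socket.getaddrinfo):
    """Resolve hostname, falling back to a pinned address if DNS fails.

    Returns (addresses, source), where source is "dns" or "pinned".
    `pinned` is a hypothetical dict of hostname -> last-known-good IP.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        # The name lookup itself failed. Say so explicitly, so the
        # fallback does not hide the failure from whoever reads the logs.
        if hostname in pinned:
            return [pinned[hostname]], "pinned"
        raise RuntimeError(
            f"DNS lookup failed for {hostname} and no pinned address exists"
        ) from exc
    # Each getaddrinfo entry ends with a sockaddr tuple; its first item is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, "dns"


def broken_resolver(*args, **kwargs):
    """Stand-in resolver that simulates a DNS outage."""
    raise socket.gaierror("simulated DNS outage")


if __name__ == "__main__":
    pinned = {"db.internal.example": "192.0.2.10"}
    print(resolve_with_fallback("db.internal.example", pinned, broken_resolver))
```

Injecting the resolver is the point of the sketch: you get to decide, on purpose and ahead of time, what happens when the lookup fails, and you can rehearse it in a test instead of discovering it during an outage.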


@buherator Oh absolutely, I didn't mean to downplay the importance of DNS. I just want people to keep going, instead of hitting a DNS "wall" in the process of tracing root cause, and then just mentally stopping. 😅


@tychotithonus Loved the "masking the true root cause" point, which Amazon is likely investigating, resulting in changed processes and procedures (one hopes).


@tychotithonus

According to Amazon's issue page:

"Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1." & "we recommend flushing your DNS caches."

So this was NOT caused by DNS, but by a bad change made by Amazon themselves. DNS worked as intended and provided the information as designed.

You don't blame the database for wrongly entered information either.

It seems this outage was caused by pilot error, and by the system and procedures failing to mitigate or prevent that pilot error.


@itisiboller I also hope everyone downstream is similarly diversifying, and by extension understanding their infrastructure better in the process!


@tychotithonus

However, the "It's DNS" thing is a valid meme, oftentimes meant to be humorous and ironic.


@simonzerafa It's a fair point, I don't mean to be a Debbie Downer on that!


@tychotithonus @buherator "It was DNS, because the hardware running DNS was unplugged from the Internet. By a 🚜 tractor." 😅


@tychotithonus

I've seen one or two toots today where I wonder if those people are aware of the meme or are taking it far too seriously 🙂🖖
