Everything looks ok to me, the error must be on your end!

Earlier this week I got a couple of emails from different people, both telling me that they'd just created a new user account but hadn't yet received the validation email. In both cases my mail server logs showed that the destination SMTP server had rejected the message due to a DNS error while trying to resolve the hostname part of the envelope sender. Since I knew very well that my DNS setup was working at the time, I was all set to reply with something along the lines of "It's just some kind of transient error somewhere, wait a while and it'll sort itself out".

But I decided to check the outgoing mail queue just in case. It contained 2700 messages going back around 5 days, most with error messages that looked at least a little bit DNS-related.

Oops.

Now, 5 days for this server usually means something like 50-60k outgoing email messages, so those 2700 queued messages represented a pretty decent chunk of traffic. The mail logs suggested that the errors had started weeks ago, around November 12th. And indeed, while an A query for the hostname was working fine, an MX query returned no results.
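The check itself is nothing fancy: a couple of dig queries from the command line will do it, but here's roughly the same thing as a Python sketch using dnspython (with example.com standing in for the real domain):

    # Rough equivalent of the two lookups, using dnspython.
    # "example.com" is a placeholder for the actual sender domain.
    import dns.resolver

    def lookup(name, rdtype):
        try:
            return [str(rr) for rr in dns.resolver.resolve(name, rdtype)]
        except dns.resolver.NoAnswer:
            return []  # the name exists, but has no records of this type

    # At the time, the A lookup returned an address just fine,
    # while the MX lookup came back empty.
    print("A: ", lookup("example.com", "A"))
    print("MX:", lookup("example.com", "MX"))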

I didn't touch anything, it just broke!

I was completely sure that the MX setup used to work just fine. And I hadn't made any DNS changes at all for at least a year. Any computer setup will rot eventually, but it shouldn't happen quite this fast.

Wait... Was that date November 12th? That's when I bought a new SSL certificate through my registrar, who also does my DNS hosting. Hmm... And I even chose to use DNS-based domain ownership validation rather than the 'email a confirmation code to hostmaster@example.com' method, and allowed the registrar to automatically create and delete the temporary authentication record.

Ok, so technically I did make a change, even if it was just to authorize another system to make an automated DNS configuration change on my behalf. But clearly my registrar must have screwed up these automated config changes, and completely deleted the MX record along the way!

That config looks valid, just apply it!

Well kind of, but not really.

When I logged into the DNS management UI yesterday, it turned out that the MX record was still there, but it was marked with a validation error complaining about a conflict with the CNAME. When I did the original configuration, I'd set up the relevant host with both MX and CNAME records. Apparently that's more than just questionable practice: a name with a CNAME isn't supposed to carry any other records at all, and the combination can cause problems with some mail servers. And who am I to argue with that, even if it had seemingly worked for the past year.

I changed the CNAME to an appropriate A record, the validation error cleared, and as if by magic my MX queries started returning results. Five minutes later the outgoing mail queue was draining rapidly.
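In zone file terms the change amounted to something like this (the names and the address below are made up, just to illustrate the shape of it):

    ; Before: the same name carried both a CNAME and an MX record,
    ; which is exactly the combination the validation check flagged.
    foo.example.com.   IN  CNAME   other.example.com.
    foo.example.com.   IN  MX  10  mail.example.com.

    ; After: the CNAME replaced with a plain A record, the MX left in place.
    foo.example.com.   IN  A       192.0.2.10
    foo.example.com.   IN  MX  10  mail.example.com.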

So clearly my service provider had added some helpful functionality to prevent bogus DNS setups from being exported. The old zone files would keep working as-is during the transition period, but the next time the user made any changes, they'd have to fix the setup before it would be exported again.

That's a reasonable design, and I'm sure it would work marvelously most of the time. But in this case the zone file export was triggered by an automated process, so there was no way for me to notice that the configuration was now considered erroneous, let alone fix it. The DNS host was serving a mostly functional zone; it was just missing one record, and even that one record only mattered for a fraction of my outgoing mail, making the problem even harder to spot. (Just a fraction, since it looks like most mail servers either don't validate the sender domain, or validate it with some other kind of query.)
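For what it's worth, the usual recipe for figuring out where mail for a domain should go is to ask for MX records first and fall back to plain address records if there aren't any, so a receiving server that does the full fallback would have been perfectly happy with my broken zone. A rough sketch of that kind of sender-domain check, again in Python with dnspython and with plenty of corner cases glossed over:

    # Hedged sketch of a "does this sender domain resolve to something that
    # could receive mail" check: try MX first, then fall back to address
    # records, roughly in the spirit of RFC 5321. Placeholder domain again.
    import dns.resolver

    def sender_domain_ok(domain):
        for rdtype in ("MX", "A", "AAAA"):
            try:
                dns.resolver.resolve(domain, rdtype)
                return True  # found at least one usable record
            except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
                continue
        return False

    # A server that only checks for MX records would have bounced my mail;
    # one that falls back to address records would have accepted it.
    print(sender_domain_ok("example.com"))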

There's a bit of guesswork involved in the last couple of paragraphs. The error can no longer be replicated, since the management UI will no longer allow me to create a setup analogous to the original, so it's hard to be completely certain of the mechanisms on the registrar's side of the story. But I'm still fairly confident that this is at least a pretty close approximation of what happened. The certificate purchase and the start of the spike in DNS-related mail delivery errors line up far too well in time for any other explanation to be credible.

Of course frobnicating the wibblerizer could break the dingit! Everyone knows that...

There are all kinds of morals one could draw from this story: proper monitoring would have detected this immediately, the registrar should have accounted for this corner case, maybe I shouldn't default to assuming that the other party is at fault when something breaks, you should always check the results of any automated config change the first time it runs, and probably many other excellent lessons in either life or systems engineering.

But really I'm just telling this story because I find the endpoints in the chain of causality completely hilarious. In a sensible world my action A really should not have led to the final result B, but it did. It's unfortunate that the title of this blog post ended up looking like linkbait of the worst kind, when it's actually the best 10-word summary I could think of :-)