At Sensson we use Zabbix to monitor a variety of systems, environments and other variables that could potentially impact the service to our customers. It helps us to diagnose issues quickly and in most cases we know about an issue before you do. So far so good.
But then we were notified of a complete meltdown of nearly every system we monitor.
We quickly concluded that no outage was affecting our customers and that Zabbix was in fact sending false positives about nearly our entire infrastructure. Nearly. As it turned out, only hosts that Zabbix connects to by DNS name rather than by IP address were affected.
We are using Google DNS to resolve hostnames on our monitoring server and that turned out to be the wrong decision.
We noticed the following when we were monitoring our DNS traffic:
07:44:57.719922 IP google-public-dns-a.google.com.domain > p-m1.51136: 6480 Refused 0/0/0 (39)
07:44:57.720010 IP p-m1.34733 > google-public-dns-a.google.com.domain: 11009+[|domain]
07:44:58.285993 IP p-m1.45455 > google-public-dns-a.google.com.domain: 23443+ AAAA? subdomain.fqdn.net. (36)
07:44:58.416281 IP p-m1.42427 > google-public-dns-a.google.com.domain: 65510+ AAAA? subdomain.fqdn.fqdn.net. (49)
Aside from the two IPv6 (AAAA) lookups, one other thing stood out: Google refused a DNS lookup. This wasn't a one-time failure; Google DNS was actively refusing all our DNS lookups. Looking into it a bit further, we found that Zabbix does a lot of lookups and unfortunately doesn't do any DNS caching at all. As a result, Google had likely rate-limited or temporarily blocked our requests once it noticed how many were coming in. The result: Zabbix couldn't resolve, and therefore couldn't reach, the hosts it was supposed to monitor, and reported them as down.
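To check whether a refusal like the one above is a one-off or a pattern, the captured lines can be tallied per resolver and response code. The following is only a rough sketch: it assumes the tcpdump line layout shown in the capture above, and the names `RESPONSE_RE` and `count_rcodes` are ours, not part of any tool.

```python
import re
from collections import Counter

# Matches DNS *responses* with an error code in tcpdump's text output, e.g.
#   07:44:57.719922 IP resolver.domain > client.51136: 6480 Refused 0/0/0 (39)
# Queries (client > resolver.domain) and successful answers do not match.
RESPONSE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+)\.domain > (?P<dst>\S+): (?P<id>\d+) (?P<rcode>[A-Za-z]+)"
)


def count_rcodes(lines):
    """Count error responses per (resolver, response code) pair."""
    counts = Counter()
    for line in lines:
        match = RESPONSE_RE.match(line)
        if match:
            counts[(match.group("src"), match.group("rcode"))] += 1
    return counts
```

Feeding it the lines of a saved capture and printing the resulting counter should make a resolver that keeps answering "Refused" easy to spot.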
It would be great if Zabbix could cache the results of a DNS lookup. Until that’s implemented it’s probably safer to use your own resolvers or to monitor servers by their IP address.
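To illustrate what such caching could look like in principle, here is a minimal sketch of a lookup cache with a fixed expiry time. This is not how Zabbix works and not a patch for it. The class `CachingResolver` and its parameters are hypothetical, and the fixed `ttl_seconds` is an assumption on our part: the system resolver call used here does not expose the real TTL of a DNS record.

```python
import socket
import time


class CachingResolver:
    """Resolve hostnames, remembering each answer for a fixed time."""

    def __init__(self, ttl_seconds=300.0, lookup=None, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        # 'lookup' and 'clock' can be swapped out, e.g. for testing.
        self._lookup = lookup or self._system_lookup
        self._clock = clock
        self._cache = {}  # hostname -> (expires_at, addresses)

    @staticmethod
    def _system_lookup(hostname):
        # Ask the system resolver; keep unique addresses in order.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return list(dict.fromkeys(info[4][0] for info in infos))

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no DNS query is sent
        addresses = self._lookup(hostname)
        self._cache[hostname] = (now + self.ttl_seconds, addresses)
        return addresses
```

With something like this in place, repeated checks against the same hostname within five minutes would trigger one upstream query instead of one per check. In practice a local caching resolver on the monitoring server achieves the same effect without touching the monitoring software.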