DNS Hell in Kubernetes

Here at Codacy, everyone has been working really hard over the last few months to move all of our services to Kubernetes, and it has been a bumpy road. From having to run an NFS server provisioner to share files between pods, to launching our own scheduler to avoid scaling issues, we keep hitting interesting problems along the way. That gives us the opportunity to do and learn new things every day, which is pretty cool.

In this blog post I will talk about one of the most common issues that almost everyone, including us, seems to struggle with when moving to Kubernetes: intermittent DNS delays.

One of our components, simply called worker, runs as a short-lived pod and is responsible for spawning other pods that run the analysis tools (ESLint, Pylint, etc.), gathering their data, and saving it. Because of some pitfalls of our architecture, we agreed that this was going to be one of the toughest parts of the system to put on K8s and therefore decided to tackle it first.

During one of our first attempts to run these workers on K8s in production, we noticed a problem after a few minutes that we hadn’t seen during our tests on the development and staging environments. Some workers were throwing UnknownHostExceptions while trying to access Graylog and the databases that were running outside the cluster. The number of exceptions grew as the number of workers running on the nodes increased. After some research, we found lots of users complaining about intermittent DNS delays of ~5s in this GitHub issue. This was a problem, since the default timeout for a DNS query is also 5 seconds. We went through almost all of the solutions mentioned in the issue thread:

  1. ARP table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
    We checked this on our nodes and it was not a problem for us, since we had around 50 entries on all of the nodes.
  2. dnsPolicy: Default works without delays
    This one was a bit confusing for us because, despite being called Default, this dnsPolicy is not the default policy in k8s. The default DNS policy is “ClusterFirst”, i.e., “any DNS query that does not match the configured cluster domain suffix is forwarded to the upstream nameserver inherited from the node,” while the “Default” DNS policy just “inherits the name resolution configuration from the node that the pods run on.” We tested this by running some test pods that tried to resolve www.google.com, and this configuration decreased the delay from an average of ~5s to ~2s. That is still a long time to resolve a name, but we decided to try it on our workers. Unfortunately, after a few minutes we still got UnknownHostExceptions while trying to access the external services on startup.
  3. “Use fully-qualified names when possible”
    As explained by Marco Pracucci in his blog, if you’re trying to resolve the name of something outside the cluster in K8s with the default configuration in /etc/resolv.conf (which sets ndots:5), any name that contains fewer than 5 dots is first tried against every search domain. For example, to resolve codacy.com, codacy.com.kube-system.svc.cluster.local., codacy.com.svc.cluster.local., codacy.com.cluster.local., codacy.com.ec2.internal. and finally codacy.com. must be looked up, for both A and AAAA records.
    While this didn’t solve our issue, it was good to be aware of so that we can tweak our apps and get better performance.
  4. Using the option single-request-reopen in /etc/resolv.conf fixed the problem
    Fortunately, the folks at Weaveworks and XING investigated this problem in depth and explain in detail why this can be a solution. Basically, the root cause of these delays is the Linux connection tracking mechanism, aka conntrack, which is implemented as a kernel module and is inherently racy. By default the resolver sends the A and AAAA requests over the same socket. According to the man page for resolv.conf, with single-request-reopen the resolver closes the socket and opens a new one before sending the second request when the two requests from the same port are not handled correctly, which reduces the exposure to the race conditions.
    We also tried this but, after some minutes, workers kept failing, and conntrack -S on the nodes showed that the insert_failed counter was still increasing.
  5. Use TCP for DNS requests
    Since UDP gives no delivery guarantee, when the kernel drops one of the UDP packets because of the races, the client only re-sends it after a timeout (5 seconds). Therefore, we decided to try using TCP for DNS as a workaround for this issue, as it was also one of the workarounds suggested by Pavithra Ramesh and Blake Barnett in their recent talk at KubeCon Europe. Although lookups over TCP are a bit slower, this actually solved the problem for us and the pods stopped failing.
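
The search-list expansion described in item 3 can be sketched in a few lines of Python. This is a simplified model of how a glibc-style resolver orders its attempts, not the resolver's actual code; the function name candidate_names is made up for illustration, and the search list is the one implied by the codacy.com example above.

```python
def candidate_names(name, search_domains, ndots=5):
    """Return the names a glibc-style resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    # With fewer than `ndots` dots the search domains are tried first and the
    # name itself last; otherwise the name is tried as-is first.
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded


# Search list implied by the example in item 3 (a pod in kube-system on EC2).
SEARCH = [
    "kube-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

for fqdn in candidate_names("codacy.com", SEARCH):
    print(fqdn)
```

Each printed name is queried for both A and AAAA records, so a single lookup of codacy.com can turn into ten DNS queries, while codacy.com. (with the trailing dot) needs only two.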

Some good news is that the race conditions in the connection tracking mechanism that cause these DNS problems (fun fact: this is briefly “documented” in Linux’s source code) already have two patches to fix them. (If you’re brave enough, you can take a look at them here and here.) However, the most recent patch is only available from version 5 of the Linux kernel onward, and it’s not always possible to control the kernel version of the nodes where your pods will run. In our case, since we are running on EKS, we are using Amazon Linux 2, and the most recent update (7/18/2019) only supports the 4.19 kernel, which contains only one of the patches.
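
To see whether a node is still hitting these races, the insert_failed counter from conntrack -S (mentioned in item 4) can be summed across CPUs. The snippet below is a minimal sketch: the helper name total_insert_failed is made up, and the sample text is illustrative rather than output captured from our nodes. The exact set of fields printed by conntrack -S varies between versions, so the parser only looks for the one counter it needs.

```python
import re


def total_insert_failed(conntrack_stats: str) -> int:
    """Sum the insert_failed counters over all per-CPU lines of `conntrack -S` output."""
    return sum(int(n) for n in re.findall(r"\binsert_failed=(\d+)", conntrack_stats))


# Illustrative sample only; real output has one line per CPU and may carry other fields.
SAMPLE = """\
cpu=0 found=0 invalid=12 insert=0 insert_failed=3 drop=3 early_drop=0 error=0 search_restart=7
cpu=1 found=0 invalid=9 insert=0 insert_failed=5 drop=5 early_drop=0 error=0 search_restart=2
"""

print(total_insert_failed(SAMPLE))
```

Running this against fresh output every few seconds and comparing the totals shows whether the counter is still growing after a workaround has been applied.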

Meanwhile, the most suitable solution seems to be NodeLocal DNSCache (beta in k8s v1.15), which is already detailed in the official k8s documentation. It aims to improve overall DNS performance on the cluster by running a DNS caching agent on every node as a DaemonSet. Pods reach out to the agent running on their own node, which reduces the number of upstream trips that still go through conntrack and, with it, the probability of being hit by the race conditions mentioned above.

We did not try other solutions, such as using a sidecar on every pod running tc to delay DNS packets, since they seemed more complex and required more configuration. We also discarded every solution that required any modification to the node configuration since we also deploy our application on-premises and, in this case, the cluster nodes are managed by our customers.
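
By contrast, forcing DNS over TCP can be done per pod, without touching the nodes. The sketch below shows one possible way to do it and is an assumption, not a description of our exact setup: it adds the glibc resolver option use-vc to the pod's dnsConfig, which Kubernetes writes into the pod's /etc/resolv.conf. The helper name with_tcp_dns, the pod name, and the image are placeholders, and resolvers other than glibc may ignore this option.

```python
import copy
import json


def with_tcp_dns(pod_manifest: dict) -> dict:
    """Return a copy of a Pod manifest with the `use-vc` resolver option added."""
    patched = copy.deepcopy(pod_manifest)
    dns_config = patched["spec"].setdefault("dnsConfig", {})
    options = dns_config.setdefault("options", [])
    # Add the option only once, so the function is safe to call repeatedly.
    if not any(opt.get("name") == "use-vc" for opt in options):
        options.append({"name": "use-vc"})
    return patched


# Hypothetical worker pod; the name and image are placeholders.
worker = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "worker-example"},
    "spec": {"containers": [{"name": "worker", "image": "example/worker:latest"}]},
}

# kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(with_tcp_dns(worker), indent=2))
```

Because the change lives in the pod spec, it travels with the application and also works on clusters where the nodes are managed by someone else.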

In the end, after some research and the typical “trial and error”, we were able to find a workaround for this issue until we get to a proper solution in the future, enabling us to proceed and hit the next problem. 😄
