June 2, 2020

The Elixir/Kubernetes Synergy

Elixir + Kubernetes, an awesome combination for any software deployment. Why isn't everyone using it?

There's a bit of a division in the Elixir community about whether or not to deploy with Kubernetes. At Codedge, we think it's a great idea, and it's only made our deployments more robust with time. Let's dispel some of the myths that seem to come up.

Myth: Kubernetes and Elixir's BEAM do the exact same thing.

While it is true that there's significant overlap between the two, they aren't exactly the same. Let's look at the similarities:

  • Supervision and restart of failed processes.
  • Failover of processes to other nodes.
  • Internal DNS management.

Pretty big similarities, right? Proponents of this idea ask: if Elixir already does all of the supervision and failover that Kubernetes is praised for, why use Kubernetes at all?

That would be a fair point if your application were your entire system. However, even in the simplest of CRUD applications, that's not the case. You still have a database. You still (probably) have a load balancer terminating connections. The BEAM will not supervise your database, only your application's connections to it.

So what then, duplicate your supervision by running both together? Yes, but not exactly. Software is inherently fractal, as described in my article about instantaneous complexity. Think of Kubernetes as a system-level supervisor, whereas Elixir serves as a project-level one. In this context, the two fit together perfectly. What you've created is a cluster within a cluster, where only truly catastrophic errors are bubbled up from the sub-cluster to the parent (think pod restarts).
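To make the project-level half of that picture concrete, here is a minimal sketch of an application supervision tree. The names Demo.Cache, Demo.Application and Demo.Supervisor are hypothetical and chosen only for illustration. Calling start/2 by hand is an assumption made so the snippet runs as a standalone script; in a real project the release boots the application for you.

```elixir
defmodule Demo.Cache do
  # Hypothetical in-memory cache, used only for illustration.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end
end

defmodule Demo.Application do
  # Hypothetical application callback module (project-level supervisor).
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # A database connection pool would typically be listed here too.
      # The BEAM supervises those connections; the database itself is
      # left to the system-level supervisor (Kubernetes).
      Demo.Cache
    ]

    # :one_for_one restarts only the child that crashed.
    Supervisor.start_link(children, strategy: :one_for_one, name: Demo.Supervisor)
  end
end

# Started by hand here so the sketch runs as a script.
{:ok, _sup} = Demo.Application.start(:normal, [])
IO.inspect(Supervisor.which_children(Demo.Supervisor), label: "children")
```

Everything inside this tree is restarted by the BEAM without Kubernetes ever noticing. Kubernetes only gets involved if the tree as a whole gives up and the VM exits.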

Side note: this is also one of the most elegant ways to recover when an Elixir project uses C extensions, which can sidestep supervision and crash the entire VM.

Myth: Running Elixir in Kubernetes leads to "zombie pods."

This is actually a pretty easy problem to run into when starting out in Elixir, but it stems from not fully understanding OTP supervision on the BEAM.

So what makes it a zombie? If a process (like a GenServer) is spawned unsupervised, and then crashes, all requests to it will begin to fail even though the Elixir VM is up and healthy. These issues are frequently only caught with an error reporting tool, assuming you've taken the time to install one.
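The failure mode can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical Demo.Counter GenServer that is started with GenServer.start/3, so it has no link and no supervisor. After it crashes, the VM keeps running but every call to it fails.

```elixir
defmodule Demo.Counter do
  # Hypothetical GenServer used only to illustrate the failure mode.
  use GenServer

  # GenServer.start/3 (not start_link/3): nothing is linked to or supervising this process.
  def start, do: GenServer.start(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}

  @impl true
  def handle_cast(:boom, _count), do: raise("simulated crash")
end

{:ok, pid} = Demo.Counter.start()
1 = GenServer.call(Demo.Counter, :increment)

# Crash the process and wait until it is really gone.
ref = Process.monitor(pid)
GenServer.cast(Demo.Counter, :boom)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

# The VM is still up, but nobody restarted the process, so calls now exit.
result =
  try do
    GenServer.call(Demo.Counter, :increment)
  catch
    :exit, reason -> {:failed, reason}
  end

IO.inspect(result, label: "call after crash")
IO.inspect(Process.whereis(Demo.Counter), label: "registered pid")
```

From the outside the container looks healthy, which is exactly why this kind of failure tends to show up only in an error reporting tool.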

What's interesting, however, is that this problem is not unique to Elixir on Kubernetes. You can have zombie applications in any Elixir deployment strategy.

So what's the answer? Supervise everything. There's a saying in the Elixir community:

If a process is worth running, it's worth supervising.

Even temporary task processes should be supervised (with the appropriate restart setting). This way, if a process that should always be up hits a truly catastrophic error and cannot recover, its supervisor eventually gives up restarting it, the failure escalates up the supervision tree, and it brings the whole Elixir VM down. Scary, but a good thing. This is where Kubernetes can kick in and restart the pod, reschedule it on another node, etc.
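The escalation described above can be observed directly. The following is a minimal sketch, assuming a hypothetical Demo.Doomed worker that crashes right after every start, and restart limits (3 restarts in 5 seconds) picked only for the demo. Once the limit is exceeded, the supervisor itself exits instead of restarting the worker forever.

```elixir
defmodule Demo.Doomed do
  # Hypothetical worker that can never recover: it crashes right after every start.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    send(self(), :work)
    {:ok, :ok}
  end

  @impl true
  def handle_info(:work, _state), do: exit(:unrecoverable)
end

# Trap exits so this script can observe the supervisor giving up
# instead of being taken down along with it.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([Demo.Doomed],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

reason =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    5_000 -> :still_running
  end

IO.inspect(reason, label: "supervisor exit reason")
```

In a real project that exit would keep propagating upward. If the top-level supervisor of an application running in a release exits this way, the VM shuts down, the container stops, and Kubernetes takes over.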

By structuring your Elixir projects like this, you can actually achieve a more robust Kubernetes pod than with pretty much any other programming language.

Pod restarts for an Elixir container should happen almost exclusively during cluster rebalancing or rolling updates. One-off process failures don't bring down the whole VM. This lets you keep your caches warm and reduces the likelihood of issues with interrupted long-running tasks.
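As a rough illustration of the warm-cache point, the sketch below runs a cache and a worker under one supervisor, kills the worker, and checks that the cached value survives the restart. Demo.WarmCache and Demo.FlakyWorker are hypothetical names, and the polling loop is only there so the script can wait for the restart.

```elixir
defmodule Demo.WarmCache do
  # Hypothetical in-memory cache backed by an Agent.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def put(key, value), do: Agent.update(__MODULE__, &Map.put(&1, key, value))
  def get(key), do: Agent.get(__MODULE__, &Map.get(&1, key))
end

defmodule Demo.FlakyWorker do
  # Hypothetical worker that we crash on purpose below.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, :ok}
end

{:ok, _sup} =
  Supervisor.start_link([Demo.WarmCache, Demo.FlakyWorker], strategy: :one_for_one)

Demo.WarmCache.put(:user_42, "cached profile")

# Simulate a one-off failure in the worker.
old_pid = Process.whereis(Demo.FlakyWorker)
Process.exit(old_pid, :kill)

# Poll until the supervisor has registered a replacement worker.
new_pid =
  Stream.repeatedly(fn ->
    Process.sleep(10)
    Process.whereis(Demo.FlakyWorker)
  end)
  |> Enum.find(&(is_pid(&1) and &1 != old_pid))

IO.inspect(new_pid != old_pid, label: "worker was restarted")
IO.inspect(Demo.WarmCache.get(:user_42), label: "cache after worker crash")
```

With :one_for_one, only the crashed child is replaced. The cache process is never touched, so nothing has to be rebuilt and the pod itself never restarts.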

Final Thoughts

In some sense, Kubernetes is an attempt at what Elixir/Erlang have done on the BEAM for years. But that doesn't mean you should ditch Kubernetes if you're already using Elixir. They work together wonderfully when each is applied to its specific role.

Do you have an interest in using Elixir or Kubernetes for your software application? Reach out to us. We can help!