MeteorOps | Practical Tips for Kubernetes Upgrades for Startups

The all-too-popular Kubernetes upgrade-storm

There comes a day when you get a notification that the Kubernetes version you are running is reaching its end of life. Best case scenario, you open a ticket, knowing full well that it will either be pushed down the list of priorities or forgotten completely. After all, you have other priorities such as releases and bug fixing. You are a fast-running startup that needs to bring in new business in order to grow. Upgrading Kubernetes is the least of your concerns right now.

You will have to upgrade eventually

But the day finally comes when you need to upgrade and one of the following could be the trigger:

Your Kubernetes version has finally reached its end of life.
R&D management has finally decided it is time to upgrade.
You need to upgrade regardless of end of life because some critical components in your cluster need upgrading for a bug fix or a feature that you need.

You look up the Helm chart or operator running in your cluster and realize that you cannot upgrade it, because the newer versions are incompatible with your current Kubernetes version. So you need to upgrade the cluster in order to upgrade the Helm charts. And to top it all off, it has been decided to upgrade Kubernetes all the way to the latest version, and you find yourself needing to upgrade four versions forward.

Every Kubernetes upgrade has the potential to introduce breaking changes. In most cases it’s deprecated APIs that are finally removed, or APIs that moved from one API group to another. This will affect anything in your cluster that relies on these Kubernetes APIs. Now take this and do it four times. You need to carefully plan how to approach and execute the upgrade:

Scope and price the process in terms of effort and time to completion.
Create an upgrade plan and iterate over it by testing on lower environments.
Set a maintenance window.
Declare code freeze.
Upgrade and verify.
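
To make the "do it four times" part concrete, here is a minimal Python sketch that breaks an upgrade into single-minor-version hops, each of which needs its own pass through the plan above. It assumes your tooling upgrades the control plane one minor version at a time (kubeadm, for example, does not support skipping minor versions). The function names and the 1.24 and 1.28 versions are ours and purely illustrative.

```python
from __future__ import annotations


def parse_minor(version: str) -> tuple[int, int]:
    """Turn 'v1.24', '1.24' or '1.24.3' into (1, 24)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def upgrade_hops(current: str, target: str) -> list[tuple[str, str]]:
    """List the single-minor-version hops needed to get from current to target."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current; downgrades are out of scope")
    return [
        (f"{cur_major}.{m}", f"{cur_major}.{m + 1}")
        for m in range(cur_minor, tgt_minor)
    ]


if __name__ == "__main__":
    # Illustrative versions only: four hops, four rounds of plan, test and verify.
    for src, dst in upgrade_hops("1.24", "1.28"):
        print(f"{src} -> {dst}: read release notes, test on lower env, upgrade, verify")
```

Each hop in the list is a separate upgrade with its own release notes and its own potential breaking changes, which is why scoping the effort up front matters.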

This is a challenging process that will exhaust you. It is labor-intensive and very error-prone. If you upgrade a library in some microservice you can test it locally, the scope is almost always isolated to that specific microservice, and in any case the blast radius is relatively small. But when you upgrade a Kubernetes cluster you are upgrading the entire system, and anything going wrong could have serious consequences.

How to approach the upgrade

Before discussing ways to approach an upcoming upgrade we need to address the elephant in the room. Once you need to upgrade, you’re probably short on time, short on resources, and need to upgrade several versions forward while making sure that the app stack itself and everything else that runs on your Kubernetes cluster remains functional. That is not the way to go.

What we derive from this situation is the first principle of how to approach the upgrade - upgrade small, upgrade continuous. Once we realize and implement this principle we can move on to what you need to do in order to successfully upgrade your Kubernetes cluster.

Upgrade small, upgrade continuous

Remember the day when you got the notification that your Kubernetes cluster version is reaching its end of life? Well, this is the day you waste no time and put this task in the sprint.

Opponents of this approach might say that a startup cannot afford to jump on every upgrade because there’s more pressing business to conduct. But a startup also cannot afford system instability. The longer the wait the more unstable the system might become and the upgrade will be harder and harder, especially if more than one upgrade is in question. So upgrade small, upgrade continuous. This goes for components in the cluster as much as the cluster itself.

Keeping your Helm charts, operators and controllers up to date will almost always guarantee that you will not have to upgrade them when upgrading your Kubernetes cluster. There is no doubt that a startup should think first about how to bring in money. If the Kubernetes cluster upgrade competes with a feature that will bring in new business, the feature will almost always win.

However, by insisting on upgrading small and keeping up to date, you provide yourself with breathing room for when a feature or a bug fix is really critical. You can allow yourself to skip the upgrade and focus on business because the end of life of your Kubernetes cluster version is farther down the road.
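
How much breathing room you have can be estimated with a few lines of Python. This is a rough sketch that assumes the upstream policy of maintaining the three most recent minor releases; managed providers publish their own support calendars, so treat the threshold as an assumption to adjust. The function names and the version numbers are hypothetical.

```python
from __future__ import annotations

# Assumption: upstream Kubernetes maintains the three most recent minor releases.
SUPPORTED_MINORS = 3


def parse_minor(version: str) -> tuple[int, int]:
    """Turn 'v1.24', '1.24' or '1.24.3' into (1, 24)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def minors_behind(current: str, latest: str) -> int:
    """How many minor releases the cluster lags behind the latest one."""
    cur_major, cur_minor = parse_minor(current)
    new_major, new_minor = parse_minor(latest)
    if cur_major != new_major:
        raise ValueError("this sketch only compares versions within one major version")
    return new_minor - cur_minor


def support_status(current: str, latest: str) -> str:
    """Classify the cluster as 'current', 'supported' or 'unsupported'."""
    behind = minors_behind(current, latest)
    if behind < 0:
        raise ValueError("current version is newer than the latest known release")
    if behind == 0:
        return "current"
    if behind < SUPPORTED_MINORS:
        return "supported"
    return "unsupported"


if __name__ == "__main__":
    # Illustrative versions only.
    for version in ("1.30", "1.28", "1.26"):
        print(version, support_status(version, latest="1.30"))
```

A cluster that stays in the "supported" range can afford to postpone one upgrade for a critical feature; a cluster that is already "unsupported" cannot.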

Test and verify on lower environments

Another principle worth discussing is testing and verifying the upgrade on lower environments. The term lower environments obviously means dev and stg, but in many cases the dev and stg environments represent the app stack and less so the infrastructure.

This means that the lower environments you test on have to be identical to the production cluster that you are about to upgrade. Because they are identical but non-critical, you can allow yourself to make mistakes there, which is the best way to learn.

Upgrading Kubernetes is a difficult task. Having the ability to try it out without fear of service disruption is liberating and will allow you to experiment more, leaving you better prepared for the upgrade day.

When you eventually test on lower environments, don’t just upgrade and settle for a working cluster. Remember that lower environments are meant to represent the architecture and app stack of higher environments. Consider the following when upgrading:

Monitor the environments through metrics and logs to check for anything suspicious or out of the ordinary:
- Critical components fail to run or scale: pods stuck in CrashLoopBackOff, pods failing their liveness and readiness probes, nodes that don’t scale up when the cluster is under load, etc.
- kube-proxy is unhealthy and services cannot talk to each other.
- Cluster DNS (kube-dns or CoreDNS) is unhealthy and services fail to resolve host names.
Test your app stack. Verifying that your app stack functions as it should in an upgraded environment will give extra confidence that the upgrade is going well:
- Run e2e tests.
- Run integration tests.
- Run load tests to verify that deployments scale accordingly.
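
Part of the monitoring checklist above can be scripted. The sketch below is one possible way to do it, not a replacement for your metrics and logs: it takes the JSON that `kubectl get pods --all-namespaces -o json` produces and flags pods that are pending, stuck in CrashLoopBackOff, or not ready. The function name and the sample data are ours and made up for illustration.

```python
from __future__ import annotations


def unhealthy_pods(pod_list: dict) -> list[str]:
    """Return one human-readable line per pod problem found in a PodList document."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        name = f"{meta.get('namespace', '?')}/{meta.get('name', '?')}"
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Jobs are not a problem
        if phase == "Pending":
            problems.append(f"{name}: Pending (not scheduled or still starting)")
            continue
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                problems.append(f"{name}: container {cs.get('name')} in CrashLoopBackOff")
            elif not cs.get("ready", False):
                problems.append(f"{name}: container {cs.get('name')} not ready")
    return problems


# Made-up sample in the shape of `kubectl get pods -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "kube-system", "name": "coredns-abc"},
            "status": {
                "phase": "Running",
                "containerStatuses": [
                    {"name": "coredns", "ready": False,
                     "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ],
            },
        },
        {
            "metadata": {"namespace": "default", "name": "api-1"},
            "status": {"phase": "Pending"},
        },
    ]
}

if __name__ == "__main__":
    for line in unhealthy_pods(SAMPLE):
        print(line)
```

In practice you would load the real kubectl output with `json.load` instead of the sample, and run the check before and after the upgrade so that pre-existing problems are not blamed on the new version.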

There’s a caveat in this approach that we need to address. You have to provision and maintain these environments, which means resources are allocated to the upgrade process even when no upgrade is in the pipeline. But there’s no better approach. It’s “measure twice, cut once” and “invest money, not time” rolled into one.

Try out the upgrade on preallocated environments first, and only then execute it in production. The stability you will achieve will contribute to the overall health of the system and the organization itself. No amount of money can compensate for dev teams overworked by system instability. Instability could also prove a source of churn, driving away clients and even prospects.

Now it’s time to upgrade

Let’s assume that we are in an ideal world where you have your lower environments ready and well-maintained and you have allocated time and resources for continuous upgrades. How do you prepare for an upgrade? There are several things you need to do.

First of all, you have to thoroughly read the release notes. And it doesn’t mean scrolling through them but reading them line by line. It’s a time consuming task but it follows the principle of “measure twice cut once”. A lot of what you will read won’t be relevant. A lot of what you will read will be invaluable. Dedicate time and patience to this task. Tutorials and guides are obviously welcome but try to remember that not all environments are alike.

Now that you have a sense of what’s heading your way in terms of the effort put into research, you could use an automated process to give you a head start. You can find exactly that in kubepug which is an open-source Kubernetes pre-upgrade checker. What you can and should do with kubepug:

Run kubepug against your current Kubernetes cluster, with the version you plan to upgrade to as the target, to get the following:
- A list of deprecated APIs.
- Any objects affected by API changes.
- What APIs should be used instead of deprecated ones.
Once the upgrade is done, and we hope it goes well, run kubepug once more, because it can also validate the cluster against its current version.
Trust kubepug, but verify that everything it drew your attention to was indeed upgraded or replaced and that it’s consistent with the release notes.
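
The "trust but verify" step can be as simple as comparing the findings from the run before the upgrade with the findings from the run after it. The sketch below assumes you have already converted each report into a set of records; the `Finding` record and both function names are hypothetical, and the conversion itself is left out because the report layout depends on the kubepug version and output format you use.

```python
from __future__ import annotations

from typing import NamedTuple


class Finding(NamedTuple):
    """Hypothetical normalized record for one object flagged by a pre-upgrade check."""
    api_version: str
    kind: str
    namespace: str
    name: str


def still_unresolved(before: set[Finding], after: set[Finding]) -> list[Finding]:
    """Findings reported before the upgrade that are still reported afterwards."""
    return sorted(before & after)


def newly_reported(before: set[Finding], after: set[Finding]) -> list[Finding]:
    """Findings that only show up in the second run and need a fresh look."""
    return sorted(after - before)


if __name__ == "__main__":
    # Made-up findings for illustration.
    before = {
        Finding("batch/v1beta1", "CronJob", "default", "nightly-report"),
        Finding("policy/v1beta1", "PodDisruptionBudget", "prod", "api"),
    }
    after = {
        Finding("policy/v1beta1", "PodDisruptionBudget", "prod", "api"),
    }
    for finding in still_unresolved(before, after):
        print("still unresolved:", finding)
```

Anything in the "still unresolved" list is a candidate to check by hand against the release notes.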

Once you gather the information and mode of operation from release notes and guides, go look at your Kubernetes cluster and find everything that both exists in your cluster and is referenced in the information you gathered. The match between the two is the basis for your upgrade plan.
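
The matching step described above is essentially an intersection between what the release notes say and what your cluster uses. Here is a minimal Python sketch of that idea. The table holds a handful of API removals taken from the Kubernetes deprecated API migration guide and is deliberately incomplete; the inventory of API versions in use is assumed to come from your own manifests or rendered Helm charts, and the function names are ours.

```python
from __future__ import annotations

# (apiVersion, kind) -> (release that stopped serving it, replacement or None).
# A small, incomplete excerpt; always check the release notes for your own hops.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): ("1.22", "networking.k8s.io/v1"),
    ("networking.k8s.io/v1beta1", "Ingress"): ("1.22", "networking.k8s.io/v1"),
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): ("1.22", "apiextensions.k8s.io/v1"),
    ("batch/v1beta1", "CronJob"): ("1.25", "batch/v1"),
    ("policy/v1beta1", "PodDisruptionBudget"): ("1.25", "policy/v1"),
    ("policy/v1beta1", "PodSecurityPolicy"): ("1.25", None),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): ("1.26", "autoscaling/v2"),
}


def parse_minor(version: str) -> tuple[int, int]:
    """Turn 'v1.24', '1.24' or '1.24.3' into (1, 24)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def plan_items(in_use: set[tuple[str, str]], target: str) -> list[tuple]:
    """Return (apiVersion, kind, removed_in, replacement) for APIs gone by the target version."""
    target_version = parse_minor(target)
    items = []
    for key in sorted(in_use):
        if key not in REMOVED_APIS:
            continue
        removed_in, replacement = REMOVED_APIS[key]
        if parse_minor(removed_in) <= target_version:
            items.append((key[0], key[1], removed_in, replacement))
    return items


if __name__ == "__main__":
    # Made-up inventory for illustration.
    in_use = {
        ("apps/v1", "Deployment"),
        ("batch/v1beta1", "CronJob"),
        ("autoscaling/v2beta2", "HorizontalPodAutoscaler"),
    }
    for api_version, kind, removed_in, replacement in plan_items(in_use, "1.26"):
        print(f"{kind} ({api_version}) removed in {removed_in}; use {replacement}")
```

Each line this prints is one item for the upgrade plan: an object you own, an API it depends on, and what to migrate to.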

Automate and summarize

We’ve mentioned a few times that the upgrade process, especially everything that precedes it, is very arduous and time consuming. This is where you automate the discovery and summary process.

Use LLMs to summarize and highlight information that you gathered and other tools to scan, analyze and inform you on changes between versions. Another aspect of the upgrade process is to compile, document and implement the upgrade process itself. Laying down the foundations of upgrading small and continuously is perhaps the most important aspect even more than the upgrade itself.
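
As a small example of automating the discovery and summary, the sketch below compares the output of `kubectl api-versions` (one group/version per line) captured on a lower environment before and after the upgrade, and turns the difference into a short summary you can paste into the upgrade ticket. The function names and the sample outputs are ours and only illustrative.

```python
from __future__ import annotations


def parse_api_versions(output: str) -> set[str]:
    """Parse `kubectl api-versions` output into a set of group/version strings."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def api_version_changes(before_output: str, after_output: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) group/versions between two captured outputs."""
    before = parse_api_versions(before_output)
    after = parse_api_versions(after_output)
    return sorted(before - after), sorted(after - before)


def summarize(before_output: str, after_output: str) -> str:
    """Build a plain-text summary of served API versions that changed."""
    removed, added = api_version_changes(before_output, after_output)
    lines = [f"removed ({len(removed)}):"]
    lines += [f"  - {version}" for version in removed]
    lines += [f"added ({len(added)}):"]
    lines += [f"  - {version}" for version in added]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up, shortened outputs for illustration.
    before = "apps/v1\nbatch/v1\nbatch/v1beta1\nv1\n"
    after = "apps/v1\nbatch/v1\nv1\n"
    print(summarize(before, after))
```

A summary like this does not replace reading the release notes, but it tells you quickly which of the announced removals actually took effect in your environment.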

It’s true that the goal is the eventual Kubernetes cluster upgrade, but how it’s carried out will determine the measure of peace of mind that you will have when approaching this important task.

Yours is a startup and should start well

Kubernetes is one of the best things to have happened to the tech industry. By using it, your startup avoids the pain of having to build and run its own orchestrator. Take into account that the time you save by using Kubernetes rather than maintaining your own solution is time you can invest back into keeping Kubernetes in good shape.

And to do that you need to consider that for Kubernetes to continue serving you it needs to be up to date and well-maintained. Then and only then will it guarantee the highest level of stability. And a stable infra for a startup is priceless as it allows you to grow and rarely holds you back.

So for all your successful upgrades to come, adopt the mindset that we are trying to convey.

Give the Kubernetes upgrade its place in the development pipeline. Like we highlighted, an upgraded, well-maintained cluster is an invaluable resource. Small to medium effort every now and then is better than an out-of-the-blue urgent upgrade.

Be ahead of the upgrade. Don’t wait for it to come to you. Seek it proactively and open a ticket with a due date. Add yourself a calendar reminder. Allow yourself time to educate and prepare yourself. Yes, there are automated tools like kubepug that we mentioned, but you need to know how to use these tools and rely on them only to the extent that they don’t have the final say.

Test on lower environments and verify and validate by looking at metrics and logs. Validate further by making sure that the app stack functions as it should.

These principles don't guarantee smooth upgrades, as the unexpected is almost always bound to happen. However, they do guarantee successful upgrades that will instill in you greater confidence for the current upgrade as well as future upgrades. You’re a startup, and adopting these principles and this mindset will prove itself not only when upgrading your cluster, but in anything that you set out for your startup to become and achieve.
