Importing multiple high-scale Kubernetes clusters into Pulumi

How we organized infrastructure management of a high-scale system in the cloud by using Pulumi and standardizing environment creation
Company name: Taranis
Industry: AgTech
R&D size: 10-50 engineers
Scale: 150-800 K8s nodes

1. Initial state

Taranis is an agriculture tech company.

They take images of crop fields from satellites and drones, analyze them, and provide farmers with insights regarding the state of their fields. Some examples:

  • Alerts on damaging insects
  • Alerts on areas lacking certain minerals

Behind the scenes, everything runs on Kubernetes on one of the large cloud providers (undisclosed at the client's request).

Their state when they met us:

  • Everything was created manually through the cloud provider's console UI
  • Jenkins crashed frequently due to high-scale usage and plugin modifications
  • Some parts of the system had both QA and production environments, while others had no QA or development environment at all
  • Scale varied between 150 and 800 nodes

Managing the infrastructure became complex; manual fixes were error-prone, frequently broke production, and left no clear audit trail.

2. Project goals

  1. Make it easier and safer to provision infrastructure
  2. Enable testing Jenkins changes

3. Decisions

To achieve these goals, we made a few key decisions:

  1. Use Pulumi to manage the infrastructure - the developers already know TypeScript, so Pulumi makes it accessible for them to create and own infrastructure (see the sketch after this list)
  2. Create DEV & STAGING environments and gradually add components to them
  3. Use the new environments to test both infrastructure and application changes before touching production
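
To make the first decision concrete: this is roughly what a developer-owned piece of infrastructure looks like in Pulumi. The resources and names below are a minimal illustrative sketch, not Taranis' actual code:

```typescript
import * as k8s from "@pulumi/kubernetes";

// A namespace plus a resource quota, owned by a (hypothetical) team -
// plain TypeScript, reviewed and merged like any other code.
const ns = new k8s.core.v1.Namespace("team-vision", {
    metadata: { name: "team-vision" },
});

new k8s.core.v1.ResourceQuota("team-vision-quota", {
    metadata: { namespace: ns.metadata.name },
    spec: {
        hard: { "requests.cpu": "64", "requests.memory": "256Gi" },
    },
});
```

Because it is ordinary TypeScript, developers get types, autocompletion, and code review for infrastructure changes, the same as for application code.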

4. Restrictions

We had one main restriction:

  1. No recreation of resources

So we decided to import the existing resources into Pulumi, basing the code on the existing cloud architecture.
The idea was to create the new DEV and STAGING environments from the same Infrastructure-as-Code that manages production.

5. Strategy

The above goals, decisions, and restrictions resulted in the following strategy:

  1. Create a Pulumi Project per cloud organizational unit (Project / Resource Group / Organization)
  2. Define an environment as a Pulumi Stack, so that one Stack can span multiple Pulumi Projects and still represent the same environment (DEV / PROD / etc.)
  3. Start with a 'PROD' Pulumi Stack, import the existing infrastructure into it, and then use the same code to spin up entire DEV / STAGING environments from scratch
  4. Guiding principle: the only difference between environments must be configuration (stack configuration, in this case) - no environment-specific logic in the code (see the sketch after this list)
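
A minimal sketch of what the guiding principle looks like in practice - the config keys here are hypothetical, but the pattern is standard Pulumi stack configuration:

```typescript
import * as pulumi from "@pulumi/pulumi";

// Every environment difference lives in the stack's config file
// (Pulumi.dev.yaml / Pulumi.staging.yaml / Pulumi.prod.yaml).
// The keys below are illustrative, not the actual Taranis config.
const config = new pulumi.Config();

const nodeCount = config.requireNumber("nodeCount");        // e.g. 3 in dev, hundreds in prod
const machineType = config.require("machineType");          // e.g. smaller machines in dev
const enableGpuPool = config.getBoolean("enableGpuPool") ?? false;

// The cluster code consumes these values directly - there is no
// `if (stack === "prod")` branching anywhere in the program.
```

Spinning up a new environment then amounts to creating a new stack with its own config file and running `pulumi up`.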

6. The process

The process of transforming Taranis' infrastructure was methodical and detailed:

  1. Created a Pulumi Project per cloud project/resource-group
  2. Started by importing the resources of the largest cloud project/resource-group

The import process:

  1. We used a temporary stack: we imported the existing cloud resources into it and ran 'pulumi preview' just to see whether they imported as expected. This surfaced resources that couldn't be imported as-is - for example, the state of an imported Kubernetes cluster also contains the NodePools / NodeGroups, even though those are separate resources (see the first sketch after this list).
  2. With every import, we refactored the code so that the next resource we create comes bundled with the best practices we had already implemented (see the second sketch after this list). For example, we ended up creating new Kubernetes clusters for argo-workflows with autoscaling, GPU/CPU/memory-intensive nodes, custom annotations, custom storageClasses, etc. - all out of the box.
  3. The last step was creating new stacks for dev and staging and spinning up environments identical to production - we even went as far as creating a dev permutation of Jenkins to test upgrading plugins.
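
For the import itself, Pulumi's `import` resource option is the standard mechanism: it adopts an existing resource into state instead of creating it, and `pulumi preview` flags any mismatch between the declared properties and the live resource. A sketch of the pattern - we use the GCP provider purely for illustration, since the actual cloud provider is confidential, and the names and IDs are hypothetical:

```typescript
import * as gcp from "@pulumi/gcp"; // illustrative provider only

// Adopt the existing production cluster rather than recreating it.
// Pulumi reads the live resource into state; a mismatch between these
// declared properties and reality shows up in `pulumi preview`.
const cluster = new gcp.container.Cluster("prod-cluster", {
    name: "prod-cluster",
    location: "us-central1",
    // ...the remaining properties must match the live cluster exactly
}, {
    import: "projects/my-project/locations/us-central1/clusters/prod-cluster",
    protect: true, // refuse to delete or replace the cluster once adopted
});
```

And a sketch of the "best practices bundled in" idea from step 2 - a hypothetical component that every new cluster goes through, so autoscaling and specialized node pools come out of the box (again, names and values are illustrative, not Taranis' actual code):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp"; // illustrative provider only

// Hypothetical component wrapping the practices accumulated during the
// import: autoscaling CPU nodes always, an autoscaling GPU pool on demand.
class StandardCluster extends pulumi.ComponentResource {
    constructor(name: string,
                args: { location: string; gpuNodes?: boolean },
                opts?: pulumi.ComponentResourceOptions) {
        super("taranis:infra:StandardCluster", name, {}, opts);

        const cluster = new gcp.container.Cluster(name, {
            location: args.location,
            removeDefaultNodePool: true,
            initialNodeCount: 1,
        }, { parent: this });

        new gcp.container.NodePool(`${name}-cpu`, {
            cluster: cluster.name,
            location: args.location,
            autoscaling: { minNodeCount: 1, maxNodeCount: 100 },
        }, { parent: this });

        if (args.gpuNodes) {
            new gcp.container.NodePool(`${name}-gpu`, {
                cluster: cluster.name,
                location: args.location,
                autoscaling: { minNodeCount: 0, maxNodeCount: 20 },
                nodeConfig: {
                    guestAccelerators: [{ type: "nvidia-tesla-t4", count: 1 }],
                },
            }, { parent: this });
        }
    }
}

// e.g. the argo-workflows cluster:
//   new StandardCluster("argo-workflows", { location: "us-central1", gpuNodes: true });
```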

7. Results

  1. The entire infrastructure is managed using Pulumi
  2. There are new DEV & STAGING environments (which we even used to test Jenkins plugin upgrades)
  3. The only difference between the environments is the Pulumi Stack configuration
  4. The developers started contributing code and taking ownership over the infrastructure they need
  5. Gaps between environments gradually decreased

Worth mentioning:

We did other things with Taranis as well, such as supporting multi-cluster communication between services, improving monitoring, streamlining Kubernetes cluster upgrades, handling Kubernetes deprecations, streamlining the development of gRPC services, and more.

8. Before & After

Before ❌ → After ✅

  • Manual infrastructure management via the cloud console UI → Automated infrastructure management with Pulumi
  • Frequent Jenkins crashes causing delays → Stable CI/CD pipelines with dedicated environments
  • Partial development environments for infra and some parts of the system → Introduction of development and staging environments

Explore how we can achieve something similar with you