The CTO DevOps Handbook: Simple Principles and Examples

Nail the DevOps part as your company's CTO

Michael Zion
9 min read

The goal of this handbook is to give you clarity on DevOps:

  1. Understand what DevOps is (in simple words)
  2. Know what’s possible with DevOps (in simple goals)
  3. Get simple “when-to-do-what” DevOps guidelines

I added a bonus at the bottom of the article.

It's a production-ready setup example you could take inspiration from.

Who this article is for

You might be a founder who wishes to get started with DevOps the right way.

You might be the CTO of a 1,000-employee company who wants simple principles.

Or, maybe you’re a Software Engineer, and you want to understand if your company’s DevOps approach is good.

If you’re looking for a simple DevOps playbook, this is it.

Understand the desired result

Two things your company needs to be able to do

  1. Serve its product to customers
  2. Build and improve the product

Abilities you need to build, improve, and serve software

  1. Run experiments and test changes

DevOps has a simple meaning

Developers and Operators have shared responsibility for building and improving the system.

In practice:

  1. Developers are responsible for “Operating”
  2. DevOps Engineers are responsible for enabling developers to “Operate” AND for doing some of it themselves

Operate = provision, monitor, secure, configure, deploy, scale.

Choose a balance: Enabler, Doer, or Automator

The DevOps role will end up as a balance between:

  1. Enabler: Provides the tools and knowledge to fulfill the DevOps goals
  2. Doer: Does the tasks that fulfill the DevOps goals
  3. Automator: Automates any repeating operation

Know what things you should enable, do, or automate

  • Provision infrastructure
  • Secure the system
  • Deploy workloads
  • Monitor the system
  • Recover from issues
  • Scale up or down
  • Track & test changes
  • Automate processes

Choose the right tools

  • Has state management = Saves time automating state-aware processes (e.g., Terraform)
  • Has a big community & good docs = Saves time dealing with common issues (e.g., Kubernetes)
  • Has multiple interface types: API, CLI, UI = Saves time integrating with the existing system (e.g., Vault)

Set useful goals

Adopting the following DevOps goals will point you in the right direction:

  1. One-Click Environments: makes e2e tests easy and quick
  2. Atomic Commits: provides confidence that a tested change will work in production
  3. Separate the Shared & Env-Specific Parts: enables e2e tests as the company scales up

If you want to learn about more useful DevOps goals, feel free to book a free consultation here.

Enablers: Choose the Tools-to-Knowledge Balance

Developers can either have the knowledge or the tools to do something.

  1. More knowledge-reliance: if you want the developers to contribute to the DevOps efforts
  2. More tools-reliance: if you want to abstract the operations from the developers

If the balance between the two is not intentional, it’s accidental.

Doers: Have a good reason to do it

  1. Is it a one-time task?
  2. Does it teach you how the developers work?
  3. Are you directly accountable for the results of the task?

If you answered “no” to the above questions, enable or automate it instead.

Doing more = Learning the system's use-cases

Doing too much = Not scalable, too much knowledge-reliance
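
The three questions above can be read as a small decision rule. Here is a minimal sketch in Python, assuming that answering “no” to all three questions is what sends a task to “enable or automate”. The function and parameter names are illustrative and not part of the handbook.

```python
def should_do_it_yourself(one_time: bool,
                          teaches_dev_workflow: bool,
                          directly_accountable: bool) -> bool:
    """Return True when at least one of the three "Doer" reasons applies."""
    return one_time or teaches_dev_workflow or directly_accountable


def doer_recommendation(one_time: bool,
                        teaches_dev_workflow: bool,
                        directly_accountable: bool) -> str:
    """Translate the three answers into the handbook's advice."""
    if should_do_it_yourself(one_time, teaches_dev_workflow, directly_accountable):
        return "do it"
    # "No" to all three questions: there is no good reason to do it by hand.
    return "enable or automate it"
```

For example, a recurring task that teaches you nothing and that you are not accountable for gets `doer_recommendation(False, False, False)`, which returns `"enable or automate it"`.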

Automators: Have a good reason to automate it

  1. Did it happen before?
  2. Is it likely to happen again?
  3. Will automating it take less time than doing it?
  4. Will automating it teach you an important company process?

If you answered “yes” to at least 2 of the 4 questions, automate it!

More automations = Less reliance on knowledge to operate the system.

Too many automations = No system awareness.

P.S. - you can also enable developers to automate it.
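
The four-question checklist above can be sketched as a simple score. This is an illustration only: it assumes that “2 out of the 4” means “at least 2”, and the function name, parameter names, and `threshold` parameter are made up for the example.

```python
def should_automate(happened_before: bool,
                    likely_again: bool,
                    faster_than_doing: bool,
                    teaches_process: bool,
                    threshold: int = 2) -> bool:
    """Return True when enough of the four questions were answered "yes"."""
    # In Python, True counts as 1 and False as 0, so sum() counts the "yes" answers.
    yes_count = sum([happened_before, likely_again, faster_than_doing, teaches_process])
    return yes_count >= threshold
```

A task that happened before and is likely to happen again already qualifies: `should_automate(True, True, False, False)` returns `True`. Raising `threshold` is one way to push back against having too many automations.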

Create available DevOps Capacity

The DevOps needs of a company have spikes.

One month you need 2 DevOps Engineers, and half of that the next month.

Switchovers between big efforts and small tasks are common.

This is especially true for new companies.

Break the assumption: “DevOps tasks must be done by a DevOps Engineer”.

There are 3 types of DevOps capacity

  1. Non-Flexible: A full-time DevOps Engineer on the team
  2. Semi-Flexible: Key developers that can contribute to the DevOps goals
  3. Fully-Flexible: A flexible DevOps Services company or freelancer

You can read more about calculating the DevOps capacity your company needs here.

When to focus on what: Common Dilemmas

When: You work alone, and the system is simple

Focus: On simplifying development: Dockerize your apps and create a post-commit pipeline that runs tests

When: You need to be able to create new environments quickly (for development, or for clients)

Focus: On implementing “One-Click Environments”: Using IaC (e.g., Terraform) + Deployment tool (Depends on the platform).
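
As a rough illustration of the IaC half of a “One-Click Environment”, the sketch below builds the Terraform commands that would create or update one named environment. It assumes a hypothetical layout with one variables file per environment at `config/<env_name>.tfvars` inside the IaC project, and one Terraform workspace per environment. The `-or-create` flag needs Terraform 1.4 or newer. The deployment step is left out because it depends on the platform.

```python
def one_click_commands(env_name: str, config_dir: str = "config") -> list[list[str]]:
    """Build the Terraform commands that create (or update) one environment.

    Assumption: each environment has its own variables file at
    <config_dir>/<env_name>.tfvars, relative to the IaC project directory.
    """
    var_file = f"{config_dir}/{env_name}.tfvars"
    return [
        ["terraform", "init", "-input=false"],
        # One workspace per environment keeps each environment's state separate.
        ["terraform", "workspace", "select", "-or-create", env_name],
        ["terraform", "apply", "-auto-approve", "-input=false", f"-var-file={var_file}"],
    ]
```

To actually create the environment, each command could be run from the IaC project directory, for example with `subprocess.run(command, cwd="infra", check=True)`, where `infra` is a placeholder for wherever the Terraform code lives.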

When: You want to e2e test every code modification, but there are many code modifications

Focus: On splitting the “One-Click Env” into a “base” with shared resources, and “env” with env-specific resources

When: You want to unify & standardize how you deploy, monitor, scale, configure, and secure your workloads

Focus: On implementing an orchestrator such as Kubernetes

When: You have many moving parts and want to be certain a tested change will work

Focus: On implementing GitOps and considering a Monorepo (the sooner the better)

When: You want the DevOps efforts to be done by the dev team

Focus: On using “actual” IaC tools (e.g., Pulumi with TypeScript or Python) and writing full “how to operate” documentation (see “Operate” above)

Never:

  • Invest lots of time in new tech without a strong reason

Always:

  • Have your code in Git
  • Monitor the basic stuff: CPU, Memory, Disk, Network, App Logs, Cloud Costs
  • Architect for high-availability
  • Test before you deploy

BONUS: An example setup for a CTO approaching Production

2 AWS Accounts

  • One for development and staging
  • Another for production

Monorepo in GitHub

  • Docker-Compose for local development

2 Infrastructure-as-Code projects: 'base' & 'apps'

  • base = shared resources (e.g., VPC, RDS, ECS Cluster, EKS Cluster)
  • apps = env-specific resources (e.g., Lambda Functions, ECS Services, Kubernetes Namespaces)
  • config file per environment

GitHub Actions Workflow: Development workflow

  • Check out a branch, then develop and test changes locally
  • Create a Pull Request: Deploys a Pull-Request ‘apps’ environment on the ‘development’ environment ‘base’
  • On merge to main: Deploys an ‘apps’ environment from the ‘main’ branch onto the ‘development’ environment ‘base’
  • Manual: Deploy from the ‘main’ branch onto the ‘staging’ / ‘production’ environment ‘base’
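
The workflow above boils down to a mapping from a trigger to “which ‘apps’ environment goes onto which ‘base’”. Here is a minimal sketch of that mapping in Python. The event names match GitHub Actions trigger names (`pull_request`, `push`, `workflow_dispatch`). The `pr-<number>` naming, the name of the ‘apps’ environment for `main`, and all function and field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeployTarget:
    branch: str    # Git branch the 'apps' code is deployed from
    apps_env: str  # name of the 'apps' environment to create or update
    base_env: str  # 'base' (shared resources) that the apps environment runs on


def resolve_target(event: str,
                   branch: str = "main",
                   pr_number: Optional[int] = None,
                   manual_env: Optional[str] = None) -> DeployTarget:
    """Decide what to deploy, and where, for a given workflow trigger."""
    if event == "pull_request":
        if pr_number is None:
            raise ValueError("pull_request events need a pr_number")
        # Each Pull Request gets its own 'apps' environment on the shared dev 'base'.
        return DeployTarget(branch, f"pr-{pr_number}", "development")
    if event == "push" and branch == "main":
        return DeployTarget("main", "development", "development")
    if event == "workflow_dispatch":
        if manual_env not in ("staging", "production"):
            raise ValueError("manual deployments target 'staging' or 'production'")
        # Manual deployments always come from the 'main' branch.
        return DeployTarget("main", manual_env, manual_env)
    raise ValueError(f"no deployment defined for event={event!r}, branch={branch!r}")
```

Keeping this decision in one place means the pipeline steps themselves stay identical for every environment. Only the resolved target changes.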

Notes:

  • Avoid mentioning an environment's name in the code for conditional resource deployment
  • Use each environment’s config file to declare if a resource should be created
  • Could be implemented using Terraform, Terragrunt, Pulumi, CDK, and other IaC tools
  • Production should have 2 instances of every workload for high-availability
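
To make the first two notes concrete, here is a tool-agnostic sketch in Python. The config keys (`create_metrics_dashboard`, `instances_per_workload`) and the resource names are hypothetical. In a real project the two configs would live in each environment's config file, and the loop would create actual IaC resources rather than strings.

```python
import json

# Stand-ins for the per-environment config files (hypothetical keys).
DEVELOPMENT_CONFIG = json.loads("""
{
  "create_metrics_dashboard": false,
  "instances_per_workload": 1
}
""")

PRODUCTION_CONFIG = json.loads("""
{
  "create_metrics_dashboard": true,
  "instances_per_workload": 2
}
""")


def plan_resources(config: dict, workloads: list[str]) -> list[str]:
    """Return the names of the resources to create, driven only by the config."""
    resources = []
    for workload in workloads:
        for index in range(config["instances_per_workload"]):
            resources.append(f"{workload}-{index}")
    # The flag decides; the code never asks "is this production?".
    if config["create_metrics_dashboard"]:
        resources.append("metrics-dashboard")
    return resources
```

With these configs, `plan_resources(PRODUCTION_CONFIG, ["api"])` returns `["api-0", "api-1", "metrics-dashboard"]`, while `plan_resources(DEVELOPMENT_CONFIG, ["api"])` returns `["api-0"]`. A new environment then only needs a new config file, not a code change.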

If you’d like to see this setup in your startup, click here to book a call 👈🏼

P.S. - I'll be updating this page occasionally, so you might want to visit again


Another Bonus: DevOps Dictionary for Human Beings


| Term | Definition | Tools |
| --- | --- | --- |
| Environment | A working instance of the entire system | |
| CI (Continuous Integration) | Enable developers to collaborate by agreeing on a single source-of-truth (master/main) | Jenkins, GitHub Actions, GitLab CI |
| CD (Continuous Delivery) | Create an artifact that’s ready for production (tested, tagged) | JFrog Artifactory, Nexus, AWS ECR |
| CD (Continuous Deployment) | Every available deliverable (artifact) gets deployed automatically | ArgoCD, Jenkins, AWS CodeDeploy |
| Monitoring / Observability | Collect metrics/traces/logs from apps and infrastructure, analyze them, display them, and set up alerts | Prometheus, Jaeger, Elasticsearch, Fluentd, OpenTelemetry |
| Infrastructure | The resources on which the workloads run, in which the data is stored, and through which the network flows | Servers, Databases, Network Routers & Switches |
| Cloud Infrastructure | Same as the above, but specifically in the cloud | AWS EC2, AWS RDS, GCP Compute Engine, Azure Virtual Machines |
| Cloud | Computing & Data services served from remote locations for you to build your system | AWS, Azure, GCP |
| Containerization & Virtualization | Technologies utilizing Kernel & OS features to create virtual machines, or isolate processes (AKA run containers) | Docker, vSphere, KVM |
| Secrets Management | Storing and retrieving sensitive configurations (e.g., tokens, passwords) | HashiCorp Vault, AWS Secrets Manager, SealedSecrets |
| Configuration Management | Usually refers to preparing servers for workloads (e.g., creating directories & files, starting processes) | Ansible, Chef, Puppet |
| Version Control | Saving the code in a versioned way (Git) | GitHub, GitLab |
| GitOps | Making the system the same as it’s described in Git | Flux, ArgoCD, Jenkins |
| Monorepo | All of the company’s code is in one Git Repository | Nx, Turborepo |
| Polyrepo | Multiple Git repositories for different components | |
| IaC (Infrastructure-as-Code) | Creating Cloud infrastructure with idempotent code and state management | Terraform, Pulumi, CDK, Crossplane |
| Deployment | Execute, serve, or install the artifacts | ArgoCD, Jenkins, AWS CodeDeploy, Scripts (Bash, Python, etc.) |
| Orchestrator | Dynamically allocating workloads to a pool of nodes | Kubernetes, Nomad, AWS ECS |
| Authentication & Authorization | Making sure each person, workload, or resource has access only to what’s necessary (other workloads and resources) | AWS IAM, OpenID, OpenVPN, Twingate, Istio |
| Service Discovery | Exposing available workloads using DNS | Consul, CoreDNS |