Calculating your company's required DevOps capacity

Required DevOps Capacity = (Scale * Complexity) / Leverage

Michael Zion
12 min read

Once upon a time 🧚

About a week ago a client of mine called.
“Michael, our CFO asked me why we need 3 DevOps Engineers on the team, what do I tell him?”

Until that point, justifying budgets was easier.
It’s easy to calculate disk utilization to justify an extra $300 of disks a month.
It’s hard to calculate DevOps Engineering utilization to justify an extra 1.5 DevOps Engineers a month.

That’s how the DevOps capacity formulas were born.




🛠️ Useful for companies, Criticized by Mathematicians

The formulas I came up with were useful to the client, and hopefully for you, too!

If you are a mathematician or a physicist, you’ll hate my guts for calling these formulas.
I’m counting on the fact that most DevOps Engineers didn’t major in maths or physics.




🎯 My goals with this article

  1. Help you understand how many DevOps Engineers you need
  2. Provide you with tools to reason with your CFO
  3. Encourage you to learn martial arts to successfully defeat your CFO in combat if all else fails



🤷 Caveats

  1. It’s not written in stone - I’m open to learning and improving it based on your experience
  2. Getting precise numbers requires fitting the formulas to YOUR company
  3. The formulas include mostly quantity variables, and not quality variables (except for ‘coverage’ and ‘score’ metrics)
  4. DevOps Capacity ≠ Full-Time DevOps Engineer: DevOps work can be done by developers or DevOps agencies as well



📜 How to use this article

  1. See what your existing DevOps capacity manages to handle, and define it as a unit:
    160h of DevOps Engineering / Month = 20 microservices, 3 development teams, etc (you’ll find out what the ‘etc’ means if you read this article)
  2. As the company grows, look for indicators that the existing capacity can’t handle any more than it already does, then add more DevOps capacity
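The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the formulas: the class name, the field names, and the "largest growth ratio" heuristic are assumptions made for the example, and the numbers reuse the 160h / 20 microservices / 3 teams unit from step 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityBaseline:
    """What a known amount of DevOps capacity currently handles."""
    hours_per_month: float
    microservices: int
    dev_teams: int


def load_ratio(baseline: CapacityBaseline, microservices: int, dev_teams: int) -> float:
    """Return how much larger the current load is than the baseline.

    Uses the largest growth ratio across the tracked dimensions,
    a hypothetical and deliberately pessimistic heuristic.
    """
    return max(
        microservices / baseline.microservices,
        dev_teams / baseline.dev_teams,
    )


# Baseline unit taken from step 1 above.
baseline = CapacityBaseline(hours_per_month=160, microservices=20, dev_teams=3)

# Hypothetical growth: 10 more microservices, same number of teams.
ratio = load_ratio(baseline, microservices=30, dev_teams=3)
print(f"Load is {ratio:.2f}x the baseline")
```

A ratio well above 1.0 is not a hiring decision by itself; it is a prompt to check for the indicators described in the next section.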



🚨 The indicators that more DevOps capacity is required

There are 2 main indicators that more DevOps capacity is required:

  1. More time spent on support, less on everything else.
    It could be OK that your DevOps capacity goes more towards support IF the growth rate is slow.
    Why only if it's slow? Because spending time on support teaches the DevOps Engineer what use cases should be supported, and s/he gets the time to build, document, and automate them.
    This increases your DevOps-to-Developers efficiency.
    (I’m inventing terms right and left over here)
  2. Different engineers and teams are solving the same problems in different ways for themselves.
    This starts happening when the hiring of another DevOps Engineer gets delayed for too long.
    The developers start solving their own unanswered requirements.
    It could be OK that different teams are solving DevOps use-cases, IF the solutions provided are done as contributions to a centralized system, and there isn’t duplicated work.

Client Indicator Examples:

  1. Company #1 Example:
    - 300 developers, and a DevOps group of about 40 DevOps Engineers, divided into 6 teams: Stateless Infrastructure, Data Infrastructure, Monitoring, CI/CD, Data Streaming,
    - Each team had a Slack support channel where developers would send help requests
    Extra Capacity Required Indicator: The DevOps teams were forced to dedicate 1-3 team members from each team to just provide support for requests
    Solutions: Analyzed the most popular requests every week, automated them and provided a self-service interface to resolve them, and increased capacity by meeting weekly with key developers to train them to support their own development teams
  2. Company #2 Example:
    - 30 developers, a DevOps team of 1 DevOps Engineer
    Extra Capacity Required Indicator: The DevOps Engineer worked mostly on support tickets opened by developers, and the developers started deviating from the standard deployment process to bypass common errors
    Solution: Increased monthly capacity using a part-time DevOps Consultant from our team, and trained key developers in the organization to provide support within their teams.

I think you’re ready to jump into the big formula for DevOps capacity!




🧮 The Big Formula - Calculating DevOps Capacity

The formula for finding the ‘Required DevOps Capacity’ in your company is:

Required DevOps Capacity = (Scale * Complexity) / Leverage

Why is this the formula?

  • The bigger the stuff to build and maintain - the more working hands you need
  • The more variations you need to support - the more working hands you need
  • The more time- and effort-saving things you have - the fewer working hands you need
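To see how the three ingredients pull against each other, here is a minimal Python sketch of Required DevOps Capacity = (Scale * Complexity) / Leverage. The function name and the unitless scores are assumptions for illustration; real values only mean something relative to your own company's baseline.

```python
def required_devops_capacity(scale: float, complexity: float, leverage: float) -> float:
    """Required DevOps Capacity = (Scale * Complexity) / Leverage."""
    if leverage <= 0:
        raise ValueError("leverage must be positive")
    return (scale * complexity) / leverage


# Hypothetical, unitless scores relative to a baseline of 1.0.
today = required_devops_capacity(scale=1.0, complexity=1.0, leverage=1.0)
doubled_scale = required_devops_capacity(scale=2.0, complexity=1.0, leverage=1.0)
more_leverage = required_devops_capacity(scale=2.0, complexity=1.0, leverage=2.0)

print(today, doubled_scale, more_leverage)  # 1.0 2.0 1.0
```

Doubling scale doubles the required capacity, and doubling leverage at the same time brings it back to where it started.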

Let’s dive into the ingredients of scale, complexity, and leverage.




🔑 The Basic Formula - What is ‘DevOps Capacity’?

It’s not a real DevOps article if the word “philosophy” doesn’t appear, so to check this item off the list:
The guiding philosophy I’m using is that the role of a DevOps Engineer is to ENABLE.
Enable provisioning infrastructure - Enable deploying workloads - Enable storing data, etc.

DevOps = Enable ownership of the relevant stakeholders over the infrastructure, monitoring, security, architecture, data, configuration, deployment, orchestration, testing, and development.

‘DevOps Capacity’ is the time & effort invested in achieving “DevOps” as defined above.




⛰️ Scale - Size Matters

The formula for understanding the ‘Scale’ in your company is:

“Michael, did you just create a recursive definition?”

Why yes, of course! This is the rationale:

  1. Bigger system - More time and effort required to learn, build, and maintain
  2. Bigger team & client base - More time and effort are required to support a growing number of internal & external use cases

To more clearly define the two types of scale mentioned above, I’ve created… 🥁🥁🥁 … - that’s right! Formulas!
These are the system and organization scale formulas:

System Scale

  • Resources Used could mean many things, some of them are: Storage volume, Number of CPU cores, Number of nodes, Memory used, etc.
  • System Throughput refers to the volume of requests your system processes over a period of time: User-initiated requests, Scheduled data processing, etc.
  • Instances per Service refers to the number of replicas of the services developed within your company, for example: Kubernetes Pods, serverless function executions, etc.

A Client's System Scale Example - 1-Year-Old Startup:

- Scale: 3 AWS ECS clusters, 6 Microservices, 3 main environments, 3-150 ephemeral development environments, 6-40 container instances per environment
- Approach: Both infrastructure and deployments are managed using Infrastructure-as-Code, Hired a part-time DevOps consultant to implement an initial system and train the development team to take ownership over it


Organization Scale

  • Number of Developers refers to the number of developers, and shame on you for bothering yourself with reading this line
  • Total Users refers to the number of active clients
    Noteworthy:
    B2C will usually have more clients, with more standardized interfaces - DevOps Engineers will most likely be less directly involved with client-facing features.
    B2B, on the other hand, will usually have fewer clients, with more customized interfaces per client - DevOps Engineers are more involved with client-facing features through the developers.

A Client's Organization Scale Example - Healthcare Company:

- Scale: 200 Developers divided into teams of 4-12 people, 10 DevOps Engineers and SREs, High autonomy per-team, No standardized process for all developers
- Approach: Both infrastructure and deployments are managed using Infrastructure-as-Code to achieve "One-Click Environments", Hired a part-time DevOps consultant to implement an initial system and train the development team to take ownership over it




Bonus:

Check our availability for a call about scaling DevOps in your organization in a more gradual way using MeteorOps




🥴 Complexity - More Options, Less Clarity

I know, I know… It looks similar to the chapter about Scale.

For the sake of this article, complexity means more options and less certainty about which option to choose, resulting in more time spent choosing and a lower likelihood of making the right choice.
You know you are in a complex environment when you want to do something, but you are not sure if THAT’S the way to “do things around here”.

So, what does complexity have to do with both the system and the organization?

  1. More complex system - More time and effort are required to map the available options, and customize (copy, paste, modify) or generalize (refactor, parameterize) solutions
  2. More complex organization - More time and effort are required to communicate, provide support, and cascade initiatives through the company, and initiatives are less likely to get accepted

This time I’ll present you with the formulas for complexity without drumrolls:

System Complexity

  • Number of Tools refers to any tool, either existing or custom-built, that helps build and maintain the system. More tools mean more capabilities, more potential integrations between tools, and more potential ways to achieve the same goals. (Tools have an upside as well!)
  • Number of Platforms means any platform on which you run workloads or consume resources (e.g., AWS, Kubernetes, Linux, Jenkins). Using more platforms requires supporting more ways of running each workload or consuming each resource.
  • Number of Codebases refers to code repositories: the more separate places you use to manage your code, the more the developers’ ways of working diverge, because it becomes harder to support and enforce the same development processes across multiple different codebases.
  • Number of Services is the number of services developed by the developers: more services, more different units with different logic to them, which in turn creates more unique operational requirements.

To sum up, ‘System Complexity’ increases when there are more services, running on more platforms, managed by more tools, across more codebases.

A Client's System Complexity Example - Low Scale, High Complexity:

- Complexity: The entire system was 1 service deployed on just 1 server, but it was deployed using multiple deployment tools (Chef, Ansible, bash scripts), from multiple different repositories (about 10), with a specific set of steps required to fully set it up
- Approach: Containerize everything, Consolidate the deployment to one docker-compose file, Consolidate the configuration and automation to one tool (Ansible), Consolidate the repositories into one

Organization Complexity

  • Number of Teams - More teams mean supporting and maintaining more ways of work
  • Number of Hierarchy Levels - More middle-management layers in the organization mean decisions and communication have to cascade through more levels, which increases the time & effort for decision-making and communication
  • Level of Hierarchy-Reliance for Communication - Exactly the same effect, but focused on the culture of the company



🪄 Is there anything that makes DevOps capacity more efficient?

So far, we only touched on time-spenders, effort-increasers, and sweat-producers.
You might be asking - “What gives me leverage with my DevOps capacity?”.

Valid question! And this is the formula for DevOps ‘Leverage’:

  • Documentation Coverage - Out of the internal company’s tools and processes, what percentage has documentation?
  • Automated Processes Coverage - Out of the manual processes used by the developers, what percentage was automated?
  • Engineering Talent Concentration Score - The seniority (relevant experience) of your engineering team, divided by the acceptance rate of engineers to your company.
    More experience and lower acceptance rates improve the score.
  • Tools Community Size - How widely adopted are the tools being used by the company? And how active is the open community working on and with them?
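The two coverage variables above are plain percentages, so they are easy to track once you have an inventory. A minimal Python sketch follows; the function name and the inventory counts are hypothetical, and how you count "tools and processes" is up to you.

```python
def coverage(covered: int, total: int) -> float:
    """Return the covered share as a percentage (0 to 100)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= covered <= total:
        raise ValueError("covered must be between 0 and total")
    return 100.0 * covered / total


# Hypothetical inventory counts.
documentation_coverage = coverage(covered=12, total=40)  # documented tools and processes
automation_coverage = coverage(covered=9, total=30)      # automated developer processes

print(f"Documentation coverage: {documentation_coverage:.0f}%")
print(f"Automated processes coverage: {automation_coverage:.0f}%")
```

Recomputing these numbers every quarter shows whether leverage is growing along with scale and complexity.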

Example:

Another client of ours had all of its infrastructure managed manually.
To increase the documentation-coverage and the automated-processes-coverage, it decided to adopt an infrastructure-as-code tool.
They considered Terraform and Pulumi, and ended up choosing Pulumi, in order to avoid hiring more full-time DevOps Engineers, and delegate responsibilities for infrastructure management to the developers.
It increased the documentation-coverage by the sheer fact that all of the infrastructure is now described in a Git repository.
It increased the automated-processes-coverage because every new piece of infrastructure introduced into Git would exist in every new environment, and would get automatically reconciled if its state drifted.

And the formula for the score (One last formula, I promise):

Note that this variable does NOT say “DevOps Engineering Talent Concentration Score”, because the assumptions are that:

  1. Every engineer on the team can take part in filling the DevOps capacity - it doesn’t have to be a specific team or person
  2. More experienced engineers require less training and support, and so consume less time for the DevOps Capacity

Example:

MeteorOps hires 7 engineers out of every 1,000 vetted.
We only vet DevOps Engineers with extensive production experience in at least two previous companies.
Our median seniority is 8 years of experience.
It means the ETCS (Engineering Talent Concentration Score) is 114.
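The arithmetic can be checked with a short Python sketch. It assumes the score is the median seniority in years divided by the acceptance rate, as described in the leverage list; the formula image itself is not reproduced in this text, so the scaling of the acceptance rate is an assumption, and the function name is hypothetical.

```python
def talent_concentration_score(median_seniority_years: float, acceptance_rate: float) -> float:
    """Seniority divided by acceptance rate (assumed form of the score)."""
    if acceptance_rate <= 0:
        raise ValueError("acceptance_rate must be positive")
    return median_seniority_years / acceptance_rate


hired, vetted = 7, 1000

# Acceptance rate expressed as a fraction (0.007).
as_fraction = talent_concentration_score(8, hired / vetted)

# Acceptance rate expressed as a percentage (0.7).
as_percent = talent_concentration_score(8, 100 * hired / vetted)

print(round(as_fraction), round(as_percent, 1))
```

With the rate as a fraction the result is about 1,143, and with the rate as a percentage it is about 11.4. The figure of 114 quoted above matches neither, so the original formula presumably scales the acceptance rate differently; whichever scaling you pick, use it consistently when comparing teams.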




🧠 Engineering Talent Concentration Score - The biggest leverage you can get

If you can take only one thing from this article, make it the Engineering Talent Concentration Score.

Having a DevOps candidates pool of engineers with relevant experience, and hiring only the top percentile of that pool, is an insanely high-leverage action.

You get engineers with perspective on how to do things right, and perhaps more importantly, how to do things wrong.
Benefiting from someone else’s lessons, from both successes and failures, saves your company lots of time otherwise spent learning the hard way.

Now, our marketing director forced me to squeeze in a word about how MeteorOps hires only the top 0.7% of DevOps Engineers, how its engineers work with multiple companies so they have awesome perspective, and how valuable it is for your company to use MeteorOps’ Consulting & Hands-on Services as an alternative to spending insane amounts of time hiring a DevOps Engineer, and blah, blah, blah - he kept yapping about all of that.

I’m not going to do that.

(P.S. - I’m the Marketing Director)




To Summarize

Instead of writing more words, here are all of the formulas:




I hope you find it useful!

I’m also interested in learning about how you used the formulas, and what insights you had as a result.
If you came up with new useful variables and formulas, please send them to me so I can take credit for inventing them.




If your CFO isn’t convinced