On-call policy

On-call policies are taken for granted. So much so that there's a market for applications that manage the process. It often goes like this:

  • The company needs engineering support 24×7.
  • Therefore someone needs to sacrifice their sleep.
  • The company pays for it. Often not nearly enough.

Taking on-call policies for granted means assuming that no one can ever build software that runs 24×7 without human support. What happened to building software that can run without humans looking at it all the time? Imagine if we built bridges that required engineers to constantly check that there aren't too many cars on them. No one would ever want to cross such a bridge.

Our industry sets the bar for quality as low as it gets. Applications that barely work are considered good enough. This is the underlying assumption behind "24×7 engineering support". Relying too much on on-call duty makes us lazy: we build software knowing that it's OK if it fails in production.

Instead, I suggest you focus your efforts on designing processes that make on-call policies obsolete. There's probably no way to eliminate on-call duty entirely, but designing the best on-call policy is the wrong incentive for good software. Striving for software that doesn't break every day creates better design incentives. Of course, I'm not trying to trivialise the challenge. I know this is a difficult problem, so let's break it down into smaller ones and try to solve those.

Why does software stop working in production and need human intervention? Here's a list of reasons to get the conversation started:

  • Resource saturation: a machine runs out of memory, processing power, disk space, or bandwidth.
  • Third-party systems stop working.
  • The failure scenario wasn't covered by any test (manual or automated).
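
Each of these deserves its own small counter-measure, and some can be automated outright. As a taste, here is a minimal, hedged sketch for the first cause: a Linux-only headroom check with made-up thresholds, which you could run on a schedule so that saturation surfaces as a daytime warning instead of a night-time page.

    #!/usr/bin/env python3
    """Sketch of a headroom check for the 'resource saturation' cause above.

    Assumptions: Linux (it reads /proc/meminfo), and thresholds that are
    purely illustrative. It exits non-zero so a scheduler or CI job can
    flag the problem.
    """

    import shutil
    import sys

    DISK_PATH = "/"            # assumed mount point to watch
    MIN_DISK_FREE = 0.20       # assumed: keep at least 20% of the disk free
    MIN_MEM_FREE_KB = 512_000  # assumed: keep roughly 500 MB of memory available


    def disk_free_fraction(path: str) -> float:
        usage = shutil.disk_usage(path)
        return usage.free / usage.total


    def available_memory_kb() -> int:
        # Read MemAvailable from /proc/meminfo (Linux only).
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
        raise RuntimeError("MemAvailable not found in /proc/meminfo")


    def main() -> int:
        problems = []
        if disk_free_fraction(DISK_PATH) < MIN_DISK_FREE:
            problems.append(f"less than {MIN_DISK_FREE:.0%} of {DISK_PATH} is free")
        if available_memory_kb() < MIN_MEM_FREE_KB:
            problems.append(f"available memory is below {MIN_MEM_FREE_KB} kB")

        for problem in problems:
            print(f"headroom warning: {problem}", file=sys.stderr)
        return 1 if problems else 0


    if __name__ == "__main__":
        sys.exit(main())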

Designing a little process for each of these problems mitigates risk and makes the systems more resilient. One effective approach is to design a checklist and insert it into your workflow. For the sake of this discussion, let's assume you have a stage in your workflow called "code review". When features are ready for review, a minimum of N developers must approve the change before it can move forward. Having a checklist in place to guard the code being deployed works great in practice. What goes into such a checklist depends on your workflow and your business, but here are some examples to get you started:

  • Are third-party tools configured correctly in the environment?
  • Did we benchmark this query against users with more than X orders in our system?
  • How many requests per second can this endpoint process?

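What matters more than the exact wording of the checklist is making it impossible to skip. Here is a minimal sketch of one way to enforce it, assuming the team keeps the checklist as a Markdown task list in a file called DEPLOY_CHECKLIST.md (a hypothetical name) and runs this script as a step before the deploy:

    #!/usr/bin/env python3
    """Block the deploy if the checklist still has open items.

    A sketch under two assumptions: the checklist lives in
    DEPLOY_CHECKLIST.md (hypothetical name) and uses Markdown task-list
    syntax, "- [ ]" for open items and "- [x]" for completed ones.
    """

    import re
    import sys
    from pathlib import Path

    CHECKLIST = Path("DEPLOY_CHECKLIST.md")  # hypothetical location
    ITEM = re.compile(r"^\s*[-*]\s*\[(?P<mark>[ xX])\]\s*(?P<text>.+)$")


    def unchecked_items(text: str) -> list[str]:
        """Return the text of every checklist item that is still open."""
        return [
            match.group("text").strip()
            for line in text.splitlines()
            if (match := ITEM.match(line)) and match.group("mark") == " "
        ]


    def main() -> int:
        if not CHECKLIST.exists():
            print(f"error: {CHECKLIST} not found", file=sys.stderr)
            return 2

        remaining = unchecked_items(CHECKLIST.read_text())
        if remaining:
            print("Deploy blocked. Open checklist items:", file=sys.stderr)
            for item in remaining:
                print(f"  - {item}", file=sys.stderr)
            return 1

        print("Checklist complete; proceeding with the deploy.")
        return 0


    if __name__ == "__main__":
        sys.exit(main())

The tooling itself is secondary: any mechanism that forces reviewers to tick every box before the change moves forward does the job.
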
Crafting a good checklist is hard, but don't worry too much about it. As with everything else, the best approach is to start small and iterate. Look out for problems that come up often and ask yourself: how do we prevent this from happening altogether?

Designing great checklists is fun! It's like writing unit tests for processes.
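
To make the analogy concrete, here is a hedged sketch that turns one of the checklist questions above, "are third-party tools configured correctly in the environment?", into an actual test. The environment variable names are hypothetical; substitute whatever your own integrations require.

    """Sketch: encoding a checklist item as an automated check."""

    import os
    import unittest

    # Hypothetical configuration every deploy is expected to carry.
    REQUIRED_ENV_VARS = ["PAYMENTS_API_KEY", "SMTP_HOST", "SENTRY_DSN"]


    class ThirdPartyConfigTest(unittest.TestCase):
        def test_required_env_vars_are_set(self):
            """Fail loudly if any third-party integration is unconfigured."""
            missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
            self.assertEqual(missing, [], f"missing configuration: {missing}")


    if __name__ == "__main__":
        unittest.main()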

The idea is to build small processes that keep raising the quality of the systems we work on. Instead of incorporating on-call into your design process, build a culture that fosters quality over speed and people over their code. Working code, velocity, and stable systems will follow.

We all know that systems sometimes fail anyway, no matter how hard we try to prevent it. There's a certain degree of randomness we have to account for in any non-trivial production system. In practice, people may need to work overtime because of a bad outage. So take that into account in your compensation structure. Pay people handsomely for overtime. On-call is a disruptive policy: it damages people's private lives, and the employer should pay accordingly for it.