After years of being ops-adjacent as a developer, I was finally put in a position to experiment with some ideas of what sustainable on-call could be and how adopting some best practices could make my employer’s services more reliable. I volunteered to become an SRE-in-practice, and six months in seems like enough time to jot down some lessons learned.
Why go on-call?
A lot of organizations lump development and operations into two separate buckets. In the past I’ve been responsible for checking alerts in Slack or putting up dashboards showing how code is performing in production, yet I’d never had to experience the pain of a poorly constructed alert in the middle of the night or of untangling a buggy service with minimal documentation. This was an excellent opportunity to change that, lead by example and get some experiences that build empathy. I would hang up my ability to commit code (compliance reasons) and pick up the pager. With my manager’s blessing and organizational buy-in, I became part of a two-person on-call rotation.
As with any experiment, some assumptions turned out to be invalid. Folks say that distributed systems are complex, and let me tell you: someone who constructs alerts that match “a bad state” for a system without ever seeing them trigger isn’t getting the big picture.
Here are some of the invalid assumptions I ran into within the first couple of weeks.
- Failure in a distributed system is an anomaly that should trigger an alert immediately, and all failures are equally important
- Third party dependencies are always available
- SLOs, SLIs, SLAs - this should be a breeze!
- Sprinkle some observability on it - there’s probably a library or service for that
- Two people in an on-call rotation is probably enough
I’ll address each of these individually and correct each assumption as I think these are applicable to just about any smaller organization.
Not all failures should trigger alerts
- Failure is endemic to most systems - at any point in time there’s a chance that a request is timing out, a connection to a SQL database is timing out or a 500 is being served. Plenty of alerts were in place that treated each of these, in isolation, as cause for high alert. These were quickly brought up to the team, discussed and in most cases removed from the rotation entirely. In this case, I’m defining an alert as something-that-wakes-you-up-at-night. Once it was phrased that way and we talked through what each alert actually meant, they were demoted as necessary. Several of them just became part of a daily checklist to be done first thing in the morning, as in our case they represented batch data processing failures that were not time critical. Defining a guideline for severity levels and what they correspond to is critical for meaningful alerting as well.
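To make the severity guideline concrete, here is a minimal Python sketch of how alerts might be sorted into "wakes you up", "handle during the day" and "morning checklist". The level names, the two yes/no questions and the example alert names are all hypothetical; our real guideline was a conversation with the team, not code.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PAGE = "page"            # something-that-wakes-you-up-at-night
    TICKET = "ticket"        # handled during working hours
    CHECKLIST = "checklist"  # reviewed first thing in the morning


@dataclass
class Alert:
    name: str               # hypothetical alert name
    customer_facing: bool   # do customers notice while it is firing?
    time_critical: bool     # does waiting until morning make it worse?


def classify(alert: Alert) -> Severity:
    """Map an alert to a severity level using two yes/no questions."""
    if alert.customer_facing and alert.time_critical:
        return Severity.PAGE
    if alert.customer_facing or alert.time_critical:
        return Severity.TICKET
    return Severity.CHECKLIST
```

Under these assumed rules a failed overnight batch job (not customer facing, not time critical) lands on the morning checklist, while a customer-facing outage still pages.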
Third party dependencies will fail
- I dealt with this at a prior job as a developer. Basically, if any part of your system has a health check built around a dependency, you should build in a circuit breaker so that the service / website / API stops sending requests to that dependency and avoids surfacing raw errors when it goes down. Alerting should still happen, but features like this have to be put in place as well to make the system more reliable and fail more gracefully. It’s definitely a different experience to advocate for fixing this without actually being able to touch the code.
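For anyone who hasn’t seen one, a circuit breaker can be sketched in a few lines of Python. This is an illustration of the pattern rather than the code from any of the services I supported; the class name, the thresholds and the `fetch_with_fallback` helper are all assumptions made up for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so it can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the dependency.
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback):
    """Return the dependency's answer, or a fallback value if it is unavailable."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

The point of the fallback helper is the "fail more gracefully" part: the caller gets a cached or default value instead of an error page, and the alert can fire off the breaker opening rather than off every individual failed request.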
SLOs, SLIs, SLAs - More like SL..Why?
- If the basics aren’t there - you’ve got a handful of alerts strewn together around exceptions and gut feeling - then the odds that your organization can carve out time to throw down some SLOs and SLIs are also probably not so great. This was probably one of the more surprising facets of this process. Product and management folks are pretty comfortable talking about Agile concepts when it comes to prioritizing feature work or visualizing the cost of doing a big item to pay down tech debt, but setting goals around what the expectations for services should be was a struggle. It felt like carving out time to do this was always a low priority compared to feature work. Advocating for ‘Reliability as a Feature’ became a part-time job, but I will say that the ‘Site Reliability Engineering’ chapter was stellar and easy for most folks to grok.
- Make sure that everyone knows what your SLAs are (because you are bound to have those if you have customers!) and then work backwards from there. Dig in to make sure that your organization is making good choices there as well. I found an interactive tool like uptime.is tremendously helpful in visualizing exactly what four or five nines mean in terms of downtime.
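The arithmetic behind those nines is simple enough to sketch in Python. This assumes a plain 365-day year, so the numbers may differ slightly from what a given calculator reports; the function name is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year, ignores leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget, in minutes, for a given availability target over a period."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Four nines works out to roughly 52.6 minutes a year and five nines to a little over 5 minutes, which tends to change the conversation about what is worth promising.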
Observability doesn’t come in a box
- It’s a mistake to think that instrumenting your apps with off-the-shelf observability services is your one stop shop for calling a system ‘observable’. The only way this works is by changing the culture of an organization. We had code that was using at least five well-known observability vendors, yet only one or two people knew what was happening at any given point in production. The most important thing you can do is to socialize every aspect of what you are working on.
- Dashboards may not be in vogue, but they help show what production looks like at any moment and that is a tactile and powerful tool. When you can point and say ‘this is a high volume part of the day’ or ‘look, we deployed without an issue during the day’ it is incredibly persuasive. Create a couple and share them with stakeholders and people who don’t have access to production. When questions are asked about things, walk through it and explain what the graphs represent. It may not tell the whole story, but it’s a starting point and the active work is making this a social activity - instrumenting the applications is only half the battle.
On-call rotation team size needs to be bigger than you think
- The last bullet point - about the size of an on-call rotation - is absolutely critical. Even though the services I support are now very stable, if you are primary and secondary all the time you are effectively always on-call. This has a tremendous drag on morale and it’s hard to push for increasing team size once things are relatively stable. My gut instinct here is that a culture of ‘you build it, you operate it’ is what organizations should strive for as product engineering teams are typically constructed in such a way that they can support a legit on-call rotation. I would feel comfortable being on-call one week out of every four as the primary on-call, especially if my job in between that time is making sure that it’s not awful to be on-call. This has earned itself a spot on the ‘things to ask in interviews’ checklist as I can’t imagine I’d advocate for keeping the rotation the same anywhere I work.
- As it stands though, running multiple 24x7 services with two people isn’t sufficient. It has the potential to get very ugly if someone happens to be sick or needs to take time off, as they effectively always have to be switched on to deal with a potential production issue. It is not sustainable. The flip side is that this was a quick way to cram 2 years’ worth of on-call experience into 6 months.
- My intuition is that you need around 8 people to have a healthy, meaningful on-call rotation: 2 people as primary and secondary each week, across the 4 weeks in a month. That way no one is afraid to take vacation or turn their phone off at night.
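Here is a small Python sketch of what that 8-person rotation looks like on a calendar. The pairing rule (walk the list two at a time) and the placeholder names are assumptions for illustration; a real schedule would also handle swaps, holidays and time zones.

```python
def build_rotation(engineers, weeks):
    """Return (week_number, primary, secondary) tuples, pairing people in list order."""
    if len(engineers) < 2:
        raise ValueError("need at least two people for primary and secondary")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[(2 * week) % n]
        secondary = engineers[(2 * week + 1) % n]
        schedule.append((week + 1, primary, secondary))
    return schedule


if __name__ == "__main__":
    team = ["A", "B", "C", "D", "E", "F", "G", "H"]  # placeholder names
    for week, primary, secondary in build_rotation(team, 4):
        print(f"week {week}: primary={primary} secondary={secondary}")
```

With 8 names each person carries the pager one week in four. Pass in a list of 2 and the same two people show up every single week, which is exactly the situation described above.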
What leads to success?
Production readiness checks
- Chances are if you are doing this from scratch there are a number of apps, services, and programs that are in production that never went through a production readiness process. A deadline had to be met, the code was done, let’s get it in production! I would say that this has to be locked down a bit further before getting an SRE involved. This is going to vary from organization to organization, but JBD’s blog post on this is very good and I’d suggest reading through it. I wish we had taken an inventory and done a retroactive production readiness check rather than just hopping right in. JBD’s checklist includes things like ‘how is this released’, ‘document dependencies’, ‘environment configurations’, etc. I’m not saying it would have been a panacea, but this process alone would have saved quite a bit of pain.
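A retroactive inventory doesn’t need to be fancy. As a rough Python sketch, using only the three checklist items quoted above and made-up service names, it could be as simple as recording an answer per item per service and listing what is still blank:

```python
# Only the three items quoted above; a real checklist would be longer.
READINESS_ITEMS = (
    "how is this released",
    "document dependencies",
    "environment configurations",
)


def missing_items(answers):
    """Return the checklist items that have no (non-blank) answer yet."""
    return [item for item in READINESS_ITEMS if not answers.get(item, "").strip()]


def readiness_report(inventory):
    """Map each service name to the checklist items it is still missing."""
    return {service: missing_items(answers) for service, answers in inventory.items()}


if __name__ == "__main__":
    # Hypothetical services and answers, purely for illustration.
    inventory = {
        "billing-api": {"how is this released": "manual deploy script"},
        "report-batch": {},
    }
    for service, missing in readiness_report(inventory).items():
        print(service, "is missing:", ", ".join(missing) or "nothing")
```

Even a list like this would have told us up front which services had no documented release process or dependencies.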
Set clear boundaries and objectives
- One of the things that surprised me after getting things rolling and providing more visibility into the state of production was that people expected code to be instrumented and alerts to be constructed for services we weren’t aware existed, especially in light of fairly frequent firefighting. In some cases we had to be creative in creating alerts based on infrastructure (!!) metrics for some services as the applications and infrastructure were not in a good state for observability. Set clear boundaries around what the role should be and what services are being prioritized. If you completed a production readiness check, you should know where the most important services and applications are for your business. Put strict boundaries around them to ensure focus. There’s bound to be some excitement initially and if you let it, ‘busyness’ will eat into any slack time you may have - which is typically when you want to be working on reducing toil.
Define a common language and set of tools around production
- This was primarily a culture shift - as the year progressed I found previous attempts at instrumenting apps, things that had been configured by previous operations staff, but never really embraced by the company culture as a whole. Repetition and sharing the same resources to make them the normal touchstones of day-to-day work were most effective in getting buy-in for what we were doing. It won’t be perfect at first, but starting somewhere at least gives you a point that you can start iterating from. The tool itself matters less than getting everyone to understand how to use it and look at it on a regular basis. You can pay for a service without changing how you do business day-to-day and plenty of vendors will let you do that.
Set clear expectations around availability
- While working with operations folks, I was also privy to quite a few war stories about what they had experienced over the years. I would suggest connecting with the folks operating your services today and asking what their pain points are. It may not even be on their radar that it doesn’t have to be crazy and that it’s okay to not be responsible for everything 365 days a year. Acts of heroism may feel good in the moment, but software shouldn’t require Herculean effort to operate smoothly. Hopefully your manager can advocate for a healthy balance of on-call responsibilities, engineering and project time.
This was a worthwhile experience. I think the next time I work as a developer I want to be on-call for the services I work on. It helps steer the focus toward what is really important, and the value of humanizing operations work can’t be overstated. I definitely crammed about 2 years of on-call experience into 6 months - though I don’t think I’d want to do that again.