Should you do on-call rotations in your engineering org?
On-call rotations can be very stressful 🤯 → simplifying is the way to go!
All Quiet (sponsored)
This week's newsletter is sponsored by All Quiet. It’s a solution that makes your:
on-call alerting and rotations,
monitoring and
incidents
easy to handle.
I’ve spoken with one of the founders and I’ve thoroughly checked their platform. I really like the simplicity it provides. It’s very easy to set up and use.
If you are looking for a platform to manage your on-call rotations, monitoring and incidents, this would be a great one to try!
Let’s get back to this week’s thought!
Intro
While having a good on-call rotation has clear benefits for the business, it might not make sense for every org.
Work-life balance is especially hard for people who are on-call, and there might not be that many benefits if you have the right systems in place.
Let’s talk about what on-call actually is first!
What exactly does on-call mean?
On-call is a way for organizations to have a person or a group of people prepared to be alerted at any time and to respond immediately, either in person or online, to help mitigate a situation.
A lot of companies do 24/7 on-call, especially if they offer high availability of their product or service (all the way up to 99.999%; 100% does not exist!). That means that any potential situation needs to be detected and resolved very swiftly.
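To put those numbers into perspective, here’s a quick back-of-the-envelope calculation (a minimal Python sketch) of how much downtime each level of availability actually allows per year:

```python
# Yearly downtime budget for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% uptime -> ~{downtime_minutes:.0f} min of downtime per year")
```

At 99.999% (five nines), that’s roughly 5 minutes of downtime for the whole year, which is exactly why detection and response need to be so fast.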
In engineering, we call such situations → incidents, and proper incident management is very important for any organization.
I’ve written about the importance of a good incident management process and you can read my step-by-step process here: How to handle Incidents (paid article).
I’ve also included a 🎁 Notion Template: Incident Retrospective that you can use in the article.
On-call should be focused on resolving the incident and not on blaming or looking for the person that caused the incident.
That’s also very important to mention, because I’ve seen many different situations where engineers or managers who were on-call jumped straight into finding the person they thought caused it.
The funny thing here is that when an incident occurs, there is no such thing as one person's mistake.
Every change goes through:
code reviews,
QA testing,
UAT, etc.
We have collectively made a mistake somewhere. It's important that we as a team can learn from it and make sure we do whatever we can to prevent it in the future.
It’s important to focus on resolving the incident ASAP. Blaming never does anyone any good.
You can read more about the importance of blameless culture here: Blameless culture should be a standard in the engineering industry (paid article).
We now know exactly what on-call is, what about on-call rotation?
To scale on-call properly, you need to develop a good rotation between people. On-call can be quite stressful, especially if there are many incidents at odd hours.
A lot of companies also have very strict guidelines on response times when alerted. That means the person on rotation needs to immediately jump on their computer and put 100% of their focus on resolving the issue.
Normally, rotations are set up on a weekly basis, where a certain person or group of people is responsible for handling incidents 24/7.
A great way to make it a tad easier for people is to use the follow-the-sun model, where 2 people or groups share the on-call, for example 12 hours a day each.
This is possible due to the difference between time zones: for example, 1 person or group in San Francisco and the second in India. That works well to give people who are on-call a better work-life balance.
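As a rough illustration (not a production scheduler!), here’s a minimal sketch of how a follow-the-sun handoff could be decided. The exact handoff hours are hypothetical assumptions:

```python
from datetime import datetime, timezone

# Hypothetical split: the India group covers 02:00-13:59 UTC (roughly their
# daytime), and the San Francisco group covers the other 12 hours.
def current_on_call(now_utc: datetime) -> str:
    return "India group" if 2 <= now_utc.hour < 14 else "San Francisco group"

print(current_on_call(datetime.now(timezone.utc)))
```

In practice you’d let an on-call tool manage the schedule, but the idea is the same: nobody has to be paged in the middle of their night.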
But, should you even do on-call and on-call rotations? Here is my take
So, let’s first talk about on-call. As soon as you have users, it’s important to resolve showstopper-type situations swiftly, for example when users can’t even access the product. That means some form of on-call is important.
However, simplifying is always the way to go, and in a lot of cases there’s no need to set up a complex process and rotations if you have a small number of users and haven’t reached product-market fit yet.
So this is also my advice: keep it as simple as possible. Add complexity where there’s really a need for it. This is also very important to keep in mind:
Focus on preventing issues in the first place.
As soon as you develop a well-defined on-call system, it becomes a sort of safety net for people.
So, how do you simplify this process? I’ll share how I did it as a CTO at Zorion next.
This is how I simplified the on-call process
At Zorion, we were building a mobile app for investing in private assets. I knew and understood that our first mission was to get to product-market fit as soon as possible.
Implementing a complex on-call rotation would not have been aligned with that, because we would have wasted our valuable time and focus.
So what I did is very simple. I defined the levels of incidents, which made it clear what was critical and what wasn’t (see the sketch after the list):
Showstopper → Users can’t even access the product.
Critical → Impact on a core functionality.
Major → Impact on non-essential functionality that disrupts many users.
Medium → Impact on non-essential functionality that disrupts certain users only.
Minor → Existing functionality that could be updated or improved.
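If you want to make these levels explicit in your tooling, one simple option is to encode them directly. This is just an illustrative sketch of the levels above, not what we actually had in code:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Incident severity levels, ordered from most to least urgent."""
    SHOWSTOPPER = 1  # users can't even access the product
    CRITICAL = 2     # impact on a core functionality
    MAJOR = 3        # non-essential functionality, disrupts many users
    MEDIUM = 4       # non-essential functionality, disrupts certain users only
    MINOR = 5        # existing functionality that could be improved

def pages_someone(severity: Severity) -> bool:
    # Only showstoppers interrupt someone's time off; everything else
    # waits for working hours (matching the approach described below).
    return severity is Severity.SHOWSTOPPER
```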
We had 2 ways to identify the showstoppers.
We used n8n to check the health of our app every 5 minutes, and we would be notified if the response was unsuccessful 2 times in a row (sketched in code below).
And then I instructed everyone in the org to contact me in case of any showstoppers (reports from our users).
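n8n is a low-code workflow tool, so our check wasn’t code, but the logic fits in a few lines. Here’s a minimal, hypothetical Python equivalent, assuming a /health endpoint and a notify() hook that you’d wire to Slack, email or a phone call:

```python
import time
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical endpoint
CHECK_INTERVAL_SECONDS = 5 * 60            # check every 5 minutes
FAILURES_BEFORE_ALERT = 2                  # alert after 2 failures in a row

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            return response.status == 200
    except Exception:
        return False

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder: wire to Slack, SMS, etc.

consecutive_failures = 0
while True:
    if is_healthy():
        consecutive_failures = 0
    else:
        consecutive_failures += 1
        if consecutive_failures == FAILURES_BEFORE_ALERT:
            notify("Health check failed twice in a row: possible showstopper!")
    time.sleep(CHECK_INTERVAL_SECONDS)
```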
In the span of 6-12 months, it only happened once that I was called on my time off. Just that one event; everything less critical we resolved swiftly inside our working hours.
This has worked really well, and I believe we saved the team countless hours and a lot of stress by doing it this way.
The focus was on preventing issues in the first place, not on the complex process of mitigating issues.
And this is how we focused on that. Let’s get to that next.
We had a well-organized incident management process and we did incident retrospectives
I am a huge believer in mitigating issues well, and that’s where a good process helps a lot. So we defined in detail what we needed to do in case of any showstopper or critical issues.
We used the exact process I’ve defined in the article How to handle Incidents (paid article), and we also used the same template included there for doing incident retrospectives.
I am a huge believer in learning together as a team from mistakes and improving to prevent such issues in the future. As mentioned above:
No need to resolve issues if you don’t even have them in the first place.
Last words
Of course every organization and business is different, so make sure to take a look at what’s best for your case.
My rule of thumb is to add processes once they are really needed. Let’s end the article with this:
Keep it simple, use simple tools and look for ways to help the business the most!
We are not done yet!
Senior Engineer to Lead: Grow and thrive in the role (less than 24 hours left to enroll!)
It’s getting close to the start of the November cohort-based course! It starts in 2 days! I’m very excited!
In the course, we will particularly focus on developing the much-needed people, communication and leadership skills required to grow from engineer to leader!
If you are not enrolled yet, you have less than 24 hours left!
Use code ENGLEADERSHIP for 25% off or use this link: Senior Engineer to Lead where the code is already applied.
Looking forward to seeing some of you there.
Build A Redis Server Clone: Master Systems Programming Through Practice
My friend John has recently re-launched his cohort-based course. If you are an engineer looking to level up by building a real-world project, this would be a great course for you.
John has also given 20% off to Engineering Leadership readers.
You can use the code: ENGLEADERSHIP or just use the link below!
Liked this article? Make sure to 💙 click the like button.
Feedback or addition? Make sure to 💬 comment.
Know someone that would find this helpful? Make sure to 🔁 share this post.
Whenever you are ready, here is how I can help you further
Join the Cohort course Senior Engineer to Lead: Grow and thrive in the role here.
Interested in sponsoring this newsletter? Check the sponsorship options here.
Take a look at the cool swag in the Engineering Leadership Store here.
Want to work with me? You can see all the options here.
Get in touch
You can find me on LinkedIn or Twitter.
If you wish to request a particular topic you would like to read about, you can send me an email at info@gregorojstersek.com.
This newsletter is funded by paid subscriptions from readers like yourself.
If you aren’t already, consider becoming a paid subscriber to receive the full experience!
You are more than welcome to find whatever interests you here and try it out in your particular case. Let me know how it went! Topics are normally about all things engineering related: leadership, management, developing scalable products, building teams, etc.