Blameless culture is all that matters. Great article, Gregor!
Right, best kind of culture! Thanks Petar.
Leveraging offshore talent is super great for helping with this problem. You can have part of your engineering team working while the rest are sleeping. That is how I recommend doing it!
Indeed, Matt! The follow-the-sun model is a great way to make on-call less stressful for people, with no need for anyone to be on-call 24/7.
Simplicity should be maintained until complexity becomes unavoidable. Often, adding layers of complexity only introduces more problems than it resolves. As a company scales, on-call rotations may become necessary, but the real value lies in creating systems that reduce the need for them.
Couldn't agree more, Tomek! Reducing the number of issues as much as possible is always the right way of thinking.
Let's imagine the company has an on-call process in place.
One thing I find very useful for avoiding calls during on-call duty is having proper runbooks.
Those runbooks should be clear enough for the SRE team to use when an incident happens, so they can solve or mitigate the issue without calling you in the middle of the night.
+1 for runbooks Marcos! Very true, they can save so much stress, anxiety and time in general. Thanks for pointing this out.
The line "On-call should be focused on resolving the incident and not on blaming or looking for the person that caused the incident" speaks volumes. I am not an engineer, but I do work closely with our developer and architect teams. The blame game, when it comes from stakeholders, takes time away from resolving the matter at hand. I work to bring us all back to targeting the root issue so we can get to a resolution promptly.