As part of an upcoming episode of Shift Shift Forward I answered a few questions about incident response. The description of the episode is:
We’ve all experienced it before – you go to your favorite website, but it’s not loading. Or you try clicking on a link to complete a transaction, but your browser times out and you get an error message. It can be frustrating to deal with outages and similar issues from a user perspective, but let’s see what it looks like from the other side. What happens when these incidents occur, and what does it take to get everything running smoothly again?
The Shift Shift Forward team interviewed the SREs and our manager at Glitch - as well as a lot of other people - and asked a bunch of great questions.
The question I liked the best was “What are you feeling during incident response?” or something to that effect. I wasn’t able to give a good answer in the moment, but I thought it was such a good question that I wrote down some quick notes afterwards and made a short recording.
Here are the notes on the points I intended to make - they’re a bit different from what I ended up saying, but that’s usually how it goes for me 😅 I think I like the audio version better, so if you want to hear me clumsily work through the notes I’ve included the audio recording as well 😉
As for the feeling you experience during incident response, for me, I go through almost the full spectrum of feelings, not just the bad ones as you might think.
When I first get paged there’s a short time where I’m feeling dread, or at least a bit afraid - I’m worried it might be an actual incident that will affect our users.
If it turns out it is an incident, then you go into incident response mode and you’re very focused. You assemble a team, which means picking a scribe to take notes and a communicator who’s responsible for keeping the rest of the company updated as we work through the incident.
Once the incident response is going it can be very exciting - it’s still extremely stressful - but it can be quite fun. You’re trying to figure out what’s going wrong, coming up with hypotheses and trying to prove, or disprove, them together with the team. That is extremely challenging and can be very fun. You also learn more about your systems by looking at them when they’re broken than when they’re happy.
Finally, once the incident is over and you can mark the incident as resolved, that is extremely fulfilling. At that moment I’m always feeling extremely proud of the team that worked on the incident, and I’m feeling really proud of myself for not breaking down.
I think that’s what makes it worthwhile to be part of the on-call rotation: it’s not just stress and dread, it can also be very exciting, challenging, fulfilling, and fun.
Eventually, as you get more experience with incident response, the bad feelings take up less space.
The episode airs on May 13 - while this little clip might not be included in it, I know it’s going to be a great episode, so head over and subscribe 😉