Thundering herds, noisy neighbours, and retry storms

Thundering herds, noisy neighbours, retry storms.

I love the names that people have come up with over the years. Some of them describe observed patterns; as Lorin Hochstein so eloquently put it, “Operators give names to recurring patterns of system behavior that they observe” (tweet). Others describe techniques used to mitigate these observed patterns.

I don’t know what you’d call these names, and I haven’t been able to find a dictionary or list of them anywhere, so I’ve wanted to create a list for a while now.

Update (17th May 2021): Lex Neva suggested calling these “operational patterns” and I love it.

So here we go. I’ll add some now based on the few I can remember and notes I have on my computer, and then I can always come back and add more as I come across them.

I’d love your help growing this list. If you know of a name that is missing from the list, please send me a tweet with the name and a short description of it, and I’ll include it in the list with a link to your tweet 😍

Names

Thundering herd

[Wikipedia]

I first came across this term from a colleague at Glitch who used it to describe the situation where we had just recovered from an incident, only to have everything break once all our users tried to start their projects again 😅 The surge of projects overwhelmed the system and everything broke. It was truly a thundering herd.

Other related names are: Dogpile, Cache stampede [Wikipedia]
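
One common way to soften a cache stampede is to coalesce concurrent requests for the same key, so that only one caller does the expensive work while the rest wait for its result. The following is a minimal Python sketch of that idea, not a description of how Glitch handled it; the `SingleFlightCache` class and its `loader` callback are hypothetical names made up for illustration.

```python
import threading


class SingleFlightCache:
    """Hypothetical helper: coalesce concurrent loads of the same key."""

    def __init__(self, loader):
        self._loader = loader          # expensive function: key -> value
        self._lock = threading.Lock()
        self._values = {}              # finished results
        self._in_flight = {}           # key -> Event set when the load ends

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._in_flight.get(key)
            leader = event is None
            if leader:
                # First caller for this key: it will do the load.
                event = threading.Event()
                self._in_flight[key] = event
        if leader:
            try:
                value = self._loader(key)
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()            # wake the followers, success or not
        # Follower: wait for the leader instead of loading again.
        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        return self.get(key)           # the leader failed, so try again
```

With this in place, twenty threads asking for the same uncached key trigger a single call to `loader`. The sketch never expires entries, which a real cache would need to do.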

Noisy neighbour

[wikipedia]

I don’t remember when I first came across this, but it has come up quite a lot both at Glitch and Gitpod.

Nosy neighbour

Okay this isn’t a thing, but I really think it should be (tweet).

Retry storm

It’s a bit like the thundering herd, but specific to retries. Still a great name.

[Microsoft - Retry Storm antipattern]
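
A widely used way to avoid feeding a retry storm is to cap the number of attempts and to wait a randomised, exponentially growing delay between them. Here is a minimal Python sketch of that approach, assuming a "full jitter" delay; the function names and the default values for `base`, `cap`, and `max_attempts` are illustrative assumptions, not taken from the Microsoft article.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random):
    """Yield a jittered delay before each retry (none before the first try)."""
    for attempt in range(max_attempts - 1):
        # Exponential ceiling, capped, with the actual wait picked at random
        # so that many clients do not retry in lockstep.
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Run operation(), making at most max_attempts attempts in total."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget used up: give up rather than pile on
            sleep(delay)
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. In practice you would also want to retry only errors that are actually transient.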

Banding

[tweet]

Dimensions of doom

This is specific to time series databases that don’t support high cardinality labels, but I still love it.

The number of time series quickly becomes overwhelming, and impossible to store for tools that aren’t designed to handle it, much less read it back quickly enough to help you figure out where issues lie.

~~I don’t know the source of the description above. I found it in a note on my computer, but I doubt I wrote it. If you know where it might be from send me a tweet and I’ll link to it here.~~

This is from How Does Honeycomb Compare to Metrics, Log Management, and APM Tools?, thanks to Kevin Collas-Arundell for spotting the reference (tweet)!
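
To get a feel for how quickly the number of time series grows, note that in the worst case it is the product of the number of distinct values of every label. A small Python sketch of that arithmetic follows; the label names and counts are made-up numbers for illustration only.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series: one per combination of label values."""
    return prod(label_cardinalities.values())


# Hypothetical metric with three modest labels.
labels = {"endpoint": 50, "status_code": 5, "host": 200}
print(series_count(labels))  # 50000

# Adding a single high-cardinality label multiplies everything.
labels["user_id"] = 100_000
print(series_count(labels))  # 5000000000
```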

Load shedding

Purposefully dropping some of the requests to your systems, usually the least important ones first, to avoid them falling over. Netflix wrote a great post about it here: Keeping Netflix Reliable Using Prioritized Load Shedding
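
As a rough illustration of the idea, here is a minimal Python sketch of priority-based load shedding. It is not Netflix's implementation; the `Priority` levels, the `SHED_AT` thresholds, and the `LoadShedder` class are all assumptions invented for this example.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BULK = 2


# Start rejecting each priority once utilisation reaches this fraction.
# These thresholds are hypothetical values, not recommendations.
SHED_AT = {Priority.BULK: 0.6, Priority.NORMAL: 0.8, Priority.CRITICAL: 1.0}


class LoadShedder:
    """Admit or reject requests based on how busy the service already is."""

    def __init__(self, capacity):
        self.capacity = capacity   # max requests we are willing to run at once
        self.in_flight = 0

    def try_admit(self, priority):
        utilisation = self.in_flight / self.capacity
        if utilisation >= SHED_AT[priority]:
            return False           # shed: the caller should get a fast error
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

With a capacity of 10, bulk requests are turned away once 6 are in flight, normal ones at 8, and critical ones only when the service is completely full. The sketch is not thread-safe; a real server would need a lock or atomic counter around `in_flight`.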

Circuit Breaker

[wikipedia]
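
For readers who have not met the pattern: a circuit breaker stops calling a dependency after repeated failures, fails fast for a while, and then lets a trial call through to see whether things have recovered. Below is a minimal Python sketch under those assumptions; the class name, the defaults for `failure_threshold` and `reset_timeout`, and the `CircuitOpenError` exception are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast, dependency is unhealthy")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `clock` parameter exists so that tests can control time; by default the breaker uses `time.monotonic`.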

Haunted Graveyard

Submitted by Lorin Hochstein in this tweet

I like “haunted graveyards” (learned this one from @john_p_looney), about systems that people are afraid to change.

Another related name for this is Haunted Forest (see tweet from Jacob)

Flapping

Submitted by James Cheng in this tweet

“flapping”. When something repeatedly switches back and forth between “good” and “bad”. Imagine a health check for something that is healthy, then gets overwhelmed, then recovers, then again gets overwhelmed.

With follow-up additions by rat rancher in this tweet

I’ve heard “flapping” mainly with regard to flapping links on network devices:

And Lorin Hochstein in this tweet

I’ve heard of flapping alerts.
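
One common way to damp a flapping health check or alert is hysteresis: only change the reported state after several consecutive results disagree with it. Here is a minimal Python sketch of that idea; the `DampedHealth` class and its `rise` and `fall` parameters are hypothetical names chosen for this example.

```python
class DampedHealth:
    """Report a health state that only flips after consecutive agreement."""

    def __init__(self, rise=3, fall=2, healthy=True):
        self.rise = rise        # consecutive passes needed to become healthy
        self.fall = fall        # consecutive failures needed to become unhealthy
        self.healthy = healthy
        self._streak = 0        # consecutive results that disagree with state

    def observe(self, ok):
        """Feed one raw check result and return the damped state."""
        if ok == self.healthy:
            self._streak = 0    # result matches the state, so reset the streak
        else:
            self._streak += 1
            needed = self.rise if ok else self.fall
            if self._streak >= needed:
                self.healthy = ok
                self._streak = 0
        return self.healthy
```

A raw check that alternates between passing and failing never builds up a streak, so the reported state stays put instead of flapping. The trade-off is that real changes are reported a few checks later.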

Flaky

Jan Keromnes calls out the similarity to Flaky in this tweet; I’ve heard them used interchangeably and think of them as synonyms:

The definition of “flapping” makes me think of “flaky” (as in “flaky tests” – personally I’ve never heard “flapping” used that way)

Death spiral

Submitted by Lex Neva:

I’m talking about the pattern where the system reacts to a failure or degradation in certain ways that act to amplify the problem. An example is the db struggling under load, and the auto-scaler notices the degradation and starts adding front-ends, but the front-ends have to boot up by running a few intensive queries, exacerbating the load, so the auto-scaler adds more instances…

Verbalmatic

This was submitted by Paul de Lange who writes “I would like to contribute one that we use at Expedia. This was coined by Mike Peterson, and first appeared in an internal conversation on 18 September, 2020.”

When the answer should be automatic but it isn’t, and so we rely on talking to a bunch of people to come up with the answer. This goes against the tenets of SRE because it is manual toil to answer the question each time and the reliability of the answer depends on the specific person you ask.

Numbers from Lost

Submitted by Mark Ellens

Another is ‘numbers from Lost’ wherein a human being has to perform a certain action (entering a series of numbers, applying a specific piece of config) on a regular basis, at a specific interval, otherwise dire consequence (island explosion, catastrophic system failure) will ensue. Like in Lost, you see.

turn it off and on again…and again