NOV 22, 2021

42 MIN

Ep. #1, The October 4th ‘21 Facebook Outage

about the episode

In this inaugural episode of Getting There, co-hosts Nora Jones and Niall Murphy unpack the October 4th ‘21 Facebook Outage and the unforeseen challenges and responsibilities that emerge when responding to an incident of such magnitude.

transcript

Nora Jones: Today, I'm joined by my co-host, Niall Murphy.

Niall is an award-winning author, speaker, technologist, and executive leader.

He's best known as the instigator, as he calls himself, editor and co-author of the best selling Google SRE book, which just a short while ago, he ripped up on camera to create a claim.

He's devoted over 20 years of mission critical engineering roles to the craft of software, large scale service operations, teaching machine learning, and diversity in computing.

He worked at Amazon, Google, and most recently Microsoft where he was the global head of SRE.

He helps organizations to get better results in production and describes himself as an infrastructure esthetician.

Thank you so much for joining me today, Niall. I'm super happy to chat with you about this.

Niall Murphy: Thank you, Nora.

I am joined by Nora Jones who best describes herself as a distributed systems cultural anthropologist.

She's worked in many organizations as a software engineer, but has best been deployed as someone who navigates between PR, engineering, and organizational psychology.

This has led to her writing two books on chaos engineering, working at large companies with many interesting incidents like Slack, Netflix, jet.com, and most recently led to her founding Jeli.io, the first incident analysis platform.

Her goal is to help the software industry navigate incidents in a productive way, so we can use this knowledge to improve from those incidents, basically get the results out of them, having paid for them already, and get better all together.

Nora: Thank you, Niall.

Niall: The main idea of this podcast is to look at public incident documents, public incidents generally, and discuss them in detail with an exploratory mindset.

We try and be respectful and non-judgmental.

When we're looking at the incidents, we're trying to hold in our minds, continually the question, "What would have to be true for that to have been a reasonable idea at the time?"

Of course, this doesn't prevent us from being critical where we think it's deserved, and we will be particularly if we see power differentials contributing negatively to an incident.

But that's not our main focus, we're actually just trying to explore with open minds.

Nora: Absolutely, Niall. I really like that.

I think, one thing that we're really trying to do here is share all the different perspectives that needed to be true, like you said, for an incident to exist.

In order for us to do that, we'll have a few ground rules.

We'll do respectful analysis of public outage documents.

We'll be empathetic, hashtag HugOps.

We're going to be organizationally minded, and so we'll go into the technical as well where appropriate, but you won't learn the fundamentals of DNS just by hearing us talk.

We're going to take a sociotechnical systems approach.

The system is the people and the technology together.

It's ultimately impossible to separate them.

The reason that we're doing this podcast is we find that online, they are frequently separated, which is not doing great service to us and our organizations and the technology industry as a whole.

Niall: Which one are we going to study, Nora?

Nora: Yeah, I'm super glad you asked.

Today, we're going to study a recent, fairly large software outage.

It's notably one of the few that already has its own Wikipedia page.

This is the October 4th Facebook outage. Niall and I have been chatting about this outage for a while.

We've been talking about this podcast for a while too, but this one was pretty fascinating for us to choose.

I'll let Niall go into why we chose it.

Niall: Well, I think the first thing to say is that this incident is now more famous than I am, because it has its own Wikipedia page.

But actually the real reason is, I mean, it's a combination of things, right?

The first thing to say is it's so large and so noticeable and notable, right?

Because when you look at how outages, particularly in the large cloud providers, which I have to be extremely familiar with for various reasons, when you look at those outages, they are very rarely what I will call saw-toothed style outages, like everything's completely dead and now everything's completely back up.

You generally see some kind of spectrum of disorder or unavailability or similar.

That may or may not be related to the fact that these large public cloud providers are in more or less a B2B, a business to business, configuration with a random business running its VM on Azure or whatever. But Facebook is a wholly owned service provider in more or less a consumer to consumer model where folks are using the platform to talk to each other. That's already a kind of a significant difference.

I think another interesting thing about this outage is that you don't really get as big as Facebook is and not have unexpected or unintended effects, which I find interesting.

I mean, it is noticeable in things like login with Facebook, for example, where a whole bunch of people were unable to use their services because login with Facebook itself was down.

I know myself, I have sometimes when logging into some new online service, I have stopped and kind of stroked my chin and gone actually, should I make this a Google login or should I create a separate account?

I know, I think that way, but I'm pretty sure most other people don't think that way.

The impact of login with Facebook being unavailable is probably pretty severe, but one of the other interesting kind of unexpected effects is this question of the uptick in replacement services.

I suppose, in the language of economics, we would say substitutable goods, right?

Facebook is down, so everyone piles onto Twitter.

I think actually Twitter themselves released this famous tweet saying, "Hello, everyone. And we do mean everyone," or words to that effect, right?

I think there's also the human impact that we know must be there but hasn't really been officially talked about and, given how really large incidents tend to be run by our industry, probably won't be. I think all of those factors together make it a sensible choice.

Nora, do you want to run us through what actually happened? Like what we know about the timeline and the official documents and so on?

Nora: Yeah, absolutely.

What happened at a high level was that Facebook and all of its subsidiaries were down on October 4th, 2021. And so from the Wikipedia page as well, it says that the social network Facebook and all of its subsidiaries, which included Messenger, Instagram, WhatsApp, Mapillary, and Oculus, became globally unavailable for a period of six to seven hours.

It wasn't just their companies that were down. Facebook has their own domain registrar and resolvers.

Those were all down along with the 3,500 domains that they own.

Effectively, routing was restored for the affected prefixes at 21:50 and DNS services began to be available again at 22:05 UTC, with application-layer services gradually being restored to Facebook, Instagram, and WhatsApp over the following hour and service generally restored for users by 22:50.

Like Niall said, everyone noticed this, and people on a lot of the other social platforms that were not related to Facebook were chatting about it publicly too, which I imagine does impact how this gets responded to internally.

It puts a lot of pressure on responders when the whole world is taking notice.

Allegedly, employees at Facebook could not get into buildings or conference rooms because all of these systems were interconnected.

Facebook uses Facebook internally quite heavily. They don't use email. They don't use Slack.

They use Messenger to communicate with each other.

And so you can imagine that that was really hard for them that particular day.

But I haven't seen that piece talked about a lot in accounts of what happened in the incident, but I imagine it affected the response and how they went about the response in a pretty fascinating way. In a lot of safety-critical industries, that's known as a strange loop.

The most shortened way I can break that down, and this is something that Dr. Richard Cook says a lot, it's when the thing you need to fix what's broken is in fact the thing that's broken.

So you can imagine that that makes it very hard to go about addressing things, especially as a responder.

Because of these missing DNS records for facebook.com, every device with the Facebook app was then DDoSing recursive DNS resolvers, which was causing overloading.

And so I saw a pretty funny tweet about that during that day, it said, "DNS didn't break Facebook. Facebook is breaking DNS."

And so Facebook actually put out a public response pretty quickly. In the very first sentence, they said they were focused on learning, which is really great.

That's how we ultimately improve from incidents.

The author of that post was a VP of infrastructure.

It is quite good that they chose that role to talk publicly about the outage.

One of the things I wanted to know, after seeing that that was the role talking about the outage, is which other roles that person spoke to that informed the perspective on the story itself.

That's super important, organizationally, for people to feel psychologically safe afterwards, that their perspectives informed the writeup that was written.

Something that Niall and I had talked about too, he had asked me, "What happens to your morale when millions of people are clamoring for you to fail?" Right?

That comes up online and it is cheeky in effect, but it was true.

I can't imagine being a responder in that situation, but it certainly had to make things challenging. This becomes a point, like Niall said earlier, when PR mixes with engineering mixes with organizational psychology and all those factors make it very hard to learn from this incident in a productive way that Facebook as an org will improve from.

Niall: Yeah, I think one of the other interesting things about the hugely public nature of this outage, and as you say, the millions of people clamoring for you to fail and so on and so forth is the fact that the other social networks were full of commentary about what was going on.

Some of this was leaked information. Some of it was leaked misinformation.

If you remember the angle grinder story about getting into the cages and so on and so forth, I'm told, I believe reliably that this is in fact totally wrong.

A lot of people were just making up stuff because not knowing what is actually happening is fertile ground for somebody to fill it with something that seems plausible, possibly even amusingly plausible, and so a lot more potential for going viral in those circumstances.

Of course, we live in a viral age in more ways than one.

But I just find it interesting because you don't see that kind of thing about the smaller outages from the cloud providers. When US East-1 goes down, people are all over it like a bad suit.

There's lots of jokes and so on, but I don't think it reached quite the pitch that it did for the Facebook outage.

Nora: You bring up a really good point too, which is that when it's an outage that impacts a lot of people, it's talked about a lot more, which means internally in the organizations, people are probably feeling a number of different emotions.

But those emotions become more heightened because the whole world is talking about it.

A lot of organizations, especially in the software industry, don't practice how they review incidents that aren't that big of a deal.

And so by not doing that as much, it makes the big deal incidents, like the Facebook one and like what you said, US East-1, it makes it harder to recover from those organizationally, because we haven't been here before, right?

It feels so anomalous. It feels so stressful. People probably feel like they did something wrong.

People might be subtly blaming each other, but unless we've practiced how we learn from each other and how we talk to each other in those situations, it makes these incident reviews a lot more difficult.

Niall: Yeah, I think of course, the company culture or even team culture or division culture, which can be different or at odds with each other in various ways, would contribute to the ability to heal or otherwise, right?

I do think that given the past two years, more or less, have provided many of us with all kinds of reasons to be traumatized.

There would certainly be even more reason to be traumatized by these six or seven hours if you happen to be unlucky enough to be responsible for restoring service.

I do think in one sense, talking about the humanistic approach to this, there just hasn't been a lot that's been said.

There's been some things that have been said by Facebook leadership, but there just hasn't been a lot that's been said.

I think on a human level, other than the trauma, we could say some other things about what must have been happening behind the scenes and how people conducted the incident.

Do you have any experience of this kind of thing, Nora?

Nora: Yeah, absolutely.

It is fascinating what you said, and it's not an anomaly that Facebook, more or less, mentioned what happened rather than shared the difficulties of responding to the event.

I wonder if they did that more internally, but I'll go into a little bit of background.

Incidents, especially the giant ones are super fascinating in that they're the one time where some rules and processes in our organizations go out the window because effectively everyone is trying the best they can to stop the bleeding.

They're trying the best they can to restore services.

And so at that point, they're ultimately doing whatever they can.

We miss an opportunity if we don't review what actually happened during that time period, when people were trying to stop the bleeding. If we do review it after the incident is complete, you get an incredible insight into how your organization is actually functioning versus how you might think it's functioning.

By actually looking into that delta and exposing it a little bit more, it will give you pointers to areas in the organization that may need attention.

Now, if you don't do that, you're going to have a number of areas to look into afterwards, but those show you the most productive areas.

Niall: I hear you saying that there might be quite a difference between how we've written down the incident management protocol, and who's going to get on the call and what they're going to do and so on and so forth, versus what actually happens in situations like this, where the entirety of the rug and everything on it has been pulled from underneath you.

You see a lot of people, what tends to happen psychologically of course, is that people fall back into old patterns that they've learned and are scored deeply in the surface of their brain rather than reflecting more because the physiological effects of stress often make it hard to be creative and reflective in the moment.

This might also be a contrast between work as written and work as done, Nora.

Do you want to talk about that a little bit?

Nora: Yeah, it's absolutely true.

We're really talking about the coordination of the event, like maybe you and I are at this organization.

We've worked at Facebook together for like 10 years, but there's a lot of newer engineers.

Because you and I have this rapport and this camaraderie and we know how to work together, it might just be faster for us to just jump into a room in this particular incident than it would be to bring in new folks that we haven't had the chance to build rapport with, or haven't had the chance to do as many disaster exercises with.

That's a good thing, maybe that we did that, however, by not addressing, not having someone interview each of us and share and try to extract what we knew about each other, like why I pulled you in or why you pulled me in, or why you looked at this particular dashboard or how you knew where to go at a certain moment, by not extracting some of those details, we missed the opportunity to allow all these other engineers that weren't perhaps involved in this particular situation to learn so that we are effectively disseminating expertise in the organization, which can ultimately be the best action point.

Niall: Yeah, I'm also thinking about the situation where there's some, let's say, well documented process for incidents.

I mean, I don't know if it is or not, but let's suppose it is.

What you have to do is first of all, look up the phone number or VC that you have to join.

Of course, you can't resolve any name that has FB or equivalent in it, the internal domain for Facebook stuff, as I understand it, which means of course that your point about the social networks that already exist, because people have bothered to manually put their numbers and their names in each other's phones and so on and so forth, those social networks already existed.

I mean, existed at the time of the outage.

Obviously, people are going to reach out to the people they know best or have these numbers for.

I expect there was quite a period of confusion while people were trying to figure out, like, "Who do I talk to? Who can I talk to? What's happening?" and so on.

Of course, these are things that once you've got enough islands of people talking together, linked together, you can get a critical mass and agree on one phone number or VC or whatever it is.

But I would expect that a certain amount of flailing around happened, because none of the existing processes were going to yield anything, right?

In some sense, I suppose you could think of it as being a process failure and a human success, because the humans are providing adaptive capacity.

Can you tell us a little bit about that?

Nora: Yeah, I mean, humans are the reason our systems are up all the time. Incidents are inevitable.

They're going to happen throughout this industry, especially as companies are getting better.

Something John Allspaw has said in the past is that you're having incidents because you're successful as an organization.

And so by not taking the opportunity to understand how things went right, you won't actually be understanding how things ultimately went wrong.

And so you brought up a lot of good points about having phone numbers in there and knowing how to contact each other.

A lot of the time those questions are not asked in incident reviews. From my research, it's because we might feel silly asking those questions. They might be just understood truths in the organizations. We feel we really have to get to the hard stuff, which is like, how did this database fail?

But if we ask these coordinative questions first about how people work together, we will ultimately get to how the database failed in a much better way, right?

Because we're understanding who knows about the database and who knows about a particular nuance of that database and how they know about it and how long it took for them to get that information and ramp up on that particular thing.

This incident was particularly fascinating because they effectively had to restore from nothing in a lot of ways.

Niall, I have a hunch that you have been in this situation before.

What are some things you think responders might have been experiencing, just from your experiences, in that kind of situation?

Niall: "Hmm, maybe," he said.

I suppose the first thing to say is this kind of thing does fall into what I'll call the long tail of disaster distribution, which is to say that I know as a matter of certain fact that all of the cloud providers, for example, and I'm using them as a stand-in for large online service runners, but let's say everyone for the moment, they all model to some extent, they all care deeply for sure about large incidents.

The thing which is tricky about this is the ability to model what could happen successfully and most people, or I suppose, most decision makers that I'm aware of in this field, which we'll call business continuity or disaster response, for lack of a better word, they're really thinking about the first or at most, the second standard deviation of thing that could possibly happen to you, right?

Typically speaking, the events I've been involved in which attempt to simulate very large disasters do take what I'll call a, again, to use that language, second standard deviation style approach.

They're thinking about things which are plausible, but which are known.

They cannot model an unknown to happen.

They can say, "What happens if Mountain View falls into the sea and none of the EU people can access Mountain View desktops?"

Or they can say, "What happens if a sudden climate event or a storm or whatever strikes the Puget Sound area and none of the execs can get to communications?"

Those kinds of events have wide impact, but are inherently things you can predict.

You have seen storms before. There will be another storm.

But sufficiently large and complicated distributed systems are the generative grammar of outages.

They produce new things that we haven't seen before and things which are lurking in configurations often contribute to the novelty of these outages.

In short, I'm pretty sure that Facebook and many other institutions have a long list of things about disaster recovery and business continuity and check marks beside them and so on and so forth.

But of course, it's what happens when you come up against something fundamentally new like this, and you're left with no tools to do it with that is the trickiest thing.

I know, for example, that Google spent a lot of time thinking about what happens in the situation when data centers are totally off the air and you have to restore from nothing.

There were several plans put in place, which I left before I saw the full outcome of, but in general, the difference really is that restoring service without tooling and restoring service with full access to your tooling are very different situations.

One of them generally leads to a much longer outage than the other.

Nora: Yeah, amazing points.

One of the things I want to grab onto a little bit is those unknown unknowns and how an absolute and infinite number of things could happen to your system.

People could go on vacation. Certain things could go down.

Like you said, an entire data center could go down. We can prep for all these scenarios.

We can have times when we come up with disaster recovery drills, but those take a lot of time to plan and prep for.

We might not have a sense as to how likely it is that something of a certain failure magnitude is even going to happen.

What's your sense on how to weigh if it's worth the prep work for what seems like a totally off the wall failure scenario happening?

Niall: Yeah, it's a great question.

I think there are a couple of different models that different engineering leaders use to make those kinds of decisions.

I suppose the first model is an insurance style model, which is to say there is a 0.1% chance of this $1 million thing going down.

Therefore, it is economically rational to spend 0.1% of a million dollars on it.

The difficulty with that model is that it's very much a point in time style thing, and it doesn't necessarily capture the value. When we say this system is worth $1 million, we often mean something like in the previous three months, it generated a run rate of $1 million or whatever, and saying that we can spend the percentage chance of it going down in this really unlikely way, multiplied by its revenue, ignores the possibility that in the future, it will be worth more or that perhaps we have counted these risks poorly.
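As a rough illustration of the insurance-style arithmetic described here, the sketch below multiplies the chance of failure by the value at risk. The function name `insurance_budget` and the second set of figures are hypothetical, chosen only to show how the answer moves if the system's value or the estimated risk turns out to be different; only the 0.1% and $1 million figures come from the conversation.

```python
def insurance_budget(p_failure: float, value_at_risk: float) -> float:
    """Expected loss: probability of the failure times the value lost if it happens.

    Under the insurance-style model, this is the amount it would be
    "economically rational" to spend on preventing that failure.
    """
    return p_failure * value_at_risk


# Figures from the conversation: a 0.1% chance of a $1 million system going down.
budget_today = insurance_budget(0.001, 1_000_000)

# Hypothetical figures (not from the conversation): the same system later is
# worth $5 million and the real chance of failure was closer to 0.5%.
budget_later = insurance_budget(0.005, 5_000_000)

print(f"Point-in-time budget: ${budget_today:,.0f}")
print(f"Budget under the revised assumptions: ${budget_later:,.0f}")
```

With the quoted figures the budget comes to about $1,000. Under the hypothetical revised assumptions it is 25 times larger, which is the weakness being pointed out: a single point-in-time calculation is only as good as its estimates of value and likelihood.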

And so I do think the insurance model does have some traction in higher level decision making about what's appropriate to spend on business continuity or disaster recovery.

But as a whole, I think that cloud providers are in a slightly different place, and they don't tend to use that model.

They tend to divide the systems by importance and to not worry too much about the kinds of risks that could happen to them, but to say that, "Okay, this system is at the center of everything that we do. If the network goes down or if Chubby goes down or if the Azure VM launcher goes down or whatever it is, then we're in very serious trouble."

So it is worthwhile spending a constant amount of effort in design, in people who are charged with looking after it and so on and so forth, to ensure that outcome either never happens, or if it does happen, you're well positioned to restore service relatively quickly, rather than nickel-and-diming the increasingly unlikely scenarios.

I'm pretty sure that for the six or seven hour outage, I think it's calculated they lost something like $60 or $70 million, but I'm going to assert that the total damage to their brand and to being understood as people who can run an online service is going to be worth more than $60 or $70 million.

I'm not even sure that a point in time, narrow dollar calculation is the right model here.

Nora: It also impacts the confidence in the working styles of people internally.

Even though there's a public calculated cost for how much money they're actually able to calculate that they lost in the moment, like you said, there's also a brand cost, and there's also an internal cost in how people talk to each other and in whether or not certain people might voluntarily leave the company after situations like this.

And you mentioned the business calculated risk of deciding if it's a good thing to prepare for a failure scenario like this.

I think that is a good model. No models are perfect.

All models are wrong, some are helpful kind of thing.

I think the act of going through that exercise is probably more valuable than the exercise itself, a lot of the time.

Because I think in order to inform that calculated risk, it's necessary and most valuable to go through the incidents or near misses that you've already had in the past that will help inform if it's that 0.1% or how difficult it will be internally for people to account for that 0.1%.

We both worked at organizations that have had very public facing incidents before and just how to talk through things internally after some of this stuff happens.

What's some of your experience there? How do you, outside of the public facing response, how do you relate that to the internal response?

Niall: Yeah, I think there's a couple of things to say about the dichotomy between public and private incident reports.

I mean, obviously, the public ones or maybe it's not obvious at all, right?

But the public ones are generally written for an audience who could, if they don't like your words sufficiently well, sue you, so they are relatively timid about what they say.

They tend to be heavy on the, I suppose, apologetics end of things, and light on the detail end of things.

This is not actually necessarily, well, it certainly isn't helpful for the internal folks who generally speaking are suffering from second victim syndrome or are acutely conscious of bad things that have happened under their watch.

But it's also, I would argue not particularly helpful for the public either.

I mean, I'm sure there's some people who want to see an abject apology and that's fine, but actually as a public consumer of services, the best postmortems that I have seen are the ones from GitHub.

The reason is because they have the follow up items connected to them and you can go and browse them and you can see what they're doing.

It's just treating me like an adult. Often I feel the really large outages for really large companies, don't treat me as an adult or they treat me as an adult who is accompanied everywhere by an attorney.

Nora: Yeah, I think you're right. GitHub has had some really great public facing incident reviews.

I'd also add Cloudflare to that list.

I found them incredibly detailed and sharing what folks have learned from them.

I think one of the things that's interesting about very public facing incidents that I know GitHub has had, Cloudflare has had, Salesforce has had, Facebook has had is the time you are given to write that public apology or write that public incident review becomes a lot shorter than the time you're given to write that incident review when it's a smaller-scale incident, which is fascinating.

A lot of that has to do with public pressure, consumer pressure, but it's fascinating because usually you need more time to really understand what had happened.

And so I think, what you mentioned on the follow up items can be important, but if I'm a consumer reading it, and it's only a few days after the event, I don't know that I want to quite see follow up items.

I want to almost see that, "Hey, here's what we know so far. We're going to keep updating this the more and more we learn," which is kind of what Facebook did here, which I do have to applaud them for that.

They did that on October 4th and then they wrote a write up on October 5th, but I do want more, right, as a consumer.

If they're really trying to improve their engineering brand in certain ways and improve their brand in certain ways, I think they would be a bit more transparent here with some of these learning styles.

Like you said before, there's a dichotomy between the public facing incident reviews and the internal incident reviews.

But they're also interrelated in a lot of ways because that public facing incident review, even though it might not have involved all the individual responders and it might be more of a PR piece, it does really impact how the responders talk about the incident too, because it has to match that public review in some way.

Niall: Well, if they're very different and continue to be very different for a long while, that even causes us to ask questions about the culture.

But in fairness, I think historically, Facebook have been super transparent.

I mean, possibly more than you would like in some cases about how some of their things work.

I have attended in the past, for example, talks about how their externally facing DNS works, as it happens, but also their front-end load-balancing architecture, et cetera, and their Open Compute platform.

They do a lot of technical stuff, which is relatively in the open.

I would be, I hope optimistic, foundedly optimistic about the chances of being a bit more transparent about what we're doing here.

But I think your earlier point is absolutely critical, that actually in some sense, it's a countervailing that the larger outages produce more pressure to get something out the door sooner and the something often isn't what you want or at least what we want.

I suppose there's an argument for getting something out the door as soon as possible, which is maybe even regularly refreshed every 15 minutes or whatever, still down, still working on it.

Okay. Now we have a clue. It's probably this thing and so on.

I'm convinced that was happening internally because they would have internal consumers.

The incident management teams, engineering teams, and so on would have internal consumers of, "Hey, are we still down for the world?"

There's going to be thousands of people who are asking that question inside of Facebook.

I think it might be a question to ask whether or not this could be done in a way which would be publicly viewable at some point.

Nora: Oh yeah. We would hope. I would love to be a fly on the wall in that meeting.

I think what was missing from this public writeup that I think would be helpful, even though we can't control the time constraint that we're given after we're in an organization that faces a public incident, what we can control is telling the public how we went about conducting this writeup, right?

I am curious, like I mentioned earlier, about the perspectives that informed the writeup that the VP of infrastructure posted online.

There's this technique in other safety critical industries that they've done for years called cognitive interviewing.

It happens a lot in healthcare and aviation after these large incidents.

We're really only just beginning to see it adopted in the software industry, but it's effectively the process of asking people questions one-on-one after an outage, prior to having a large group meeting, or even prior to doing the writeup.

Obviously, this takes a lot of time, but it informs a higher quality perspective and ultimately higher quality fixes and organizational psychology improvement later on.

It's not in the sense of what you're thinking, but in the sense of debiasing held beliefs about a situation.

It's vital to getting the most value that you possibly can out of this incident.

It's not asking the person, "Why did this happen?"

It's basically stating, "Can you tell us a little bit more about these bits and pieces?" that the person that you're interviewing knows about and how they might be applied to this incident.

The role of the interviewer is not to interrogate or be the expert in that situation.

It's to put the person you're interviewing in a position of expertise.

Because ultimately, they did have a piece of expertise about the incident, even though a lot of things might have been wrong.

They know things that no one else knows.

And so it's your job, as an interviewer, to extract those things afterwards, which can inform a lot of different reviews.

It can inform the public review. It can inform the actions that are taken afterwards.

It's really a vital and magnificent tool to use if trained correctly in organizations.

Niall: Yeah, I think one of the interesting things, keeping that in mind, is to me, this smells like one of those areas where the public understanding a thing is very different from the actual practitioner understanding of a thing.

You've probably had this feeling if you've gone to a movie or whatever and the movie has a scene where they use computers, because almost all of those scenes are total and complete hokum for understandable and good reasons, right?

But often, you watch that scene and it decreases the credibility of the rest of the film for you, right? I mean, this is particularly true for computers, but it's also maybe true if you happen to have some kind of background in international relations and two countries are going to war in the movie and you're like, "They would never go to war. They've had a trade deal for decades. No way they'd do that." Or something like that.

Similarly, I think the other interesting thing about this outage is first of all, that maybe the public doesn't expect that the best thing to do to get value out of this outage, all of the damage having happened and being unrecoverable and so on, so forth, the way to get value out of this outage is to talk to the humans involved in it, which I think is a key organizational lesson.

But also the point that I'm told, I mean, I stopped thinking like an ordinary human being about incidents and internet services and so on a long while ago, but I'm told a lot of people were shocked by this kind of outage, the duration.

"Oh my gosh, does this mean the rest of the internet is built on these unstable stacks of turtles and could fail at any moment? This is horrifying."

To them, I have to say, "A, it is horrifying. Yes, it is all built on stacks of turtles. We are all juggling and keeping it going by our own effort, in many cases."

This is something which is surprising to the public.

Nora: It is surprising to the public.

You bring up a really good point, because with other safety-critical industries, software isn't necessarily using aviation in everyday life, but aviation is using software in everyday life, and healthcare is using software in everyday life.

I've heard this very strong statement that what we do in the software industry doesn't harm people in any way and isn't safety critical, but it absolutely is.

It's powering all these things that I think the general public just takes for granted sometimes, like you pull up your phone every morning and you pull up email, you pull up certain things.

You rely on software to get your kids to school.

You rely on software to understand what you need to do next in your day.

We just assume that it's going to work all the time, but like you said, it is built on stacks of turtles.

And so I think we're still early on as an industry in a lot of ways, and I think what we're going to find soon is that we need to be taking these with a bit more of a safety-critical lens.

Niall, thank you so much for chatting with me about this today.

This was, I feel like we could talk about this all day. I feel like we probably will.

One thing you pointed out to me earlier was that there isn't a lot known about this outage yet.

It only happened a couple weeks ago, and so I'm definitely going to keep an eye on the future communications and learnings about this incident and also the updates to that Wikipedia page.

Niall: I would definitely hope there would be more, if only a biogenetics program engineered in order to make turtles bin-packable, so they're steadier when you put them on top of each other.

Square turtles, you heard it here first. Anyway, thank you for your time, Nora.

I really appreciate it and tune in next time.

Nora: Thanks, Niall.
