Ep. #59, Learning From Incidents with Laura Maguire of Jeli
In episode 59 of o11ycast, Jess and Martin speak with Laura Maguire of Jeli and Nick Travaglini of Honeycomb. They unpack Learning From Incidents (LFI), resilience engineering, process tracing, safety science, key takeaways from the LFI Conference, and the human side of observability.
Laura Maguire, PhD is a Cognitive Systems Engineering Researcher at Jeli. Laura has a background in resilience engineering and earned her PhD in Cognitive Systems Engineering from Ohio State University in 2020.
Nick Travaglini is Technical Customer Success Manager at Honeycomb.
Transcript
Laura Maguire: So one of the interesting ways in which these things are connected is that it's about observability into not just the technical aspects of the system, but the sociotechnical aspects of the system. So in resilience engineering and human factors work, we use a technique called process tracing which is similar to a lot of the tracing that you might do within a software system in that you look at data both above the line and below the line. So both what's happening-
Jessica Kerr: What line is that?
Laura: That's the line of representation. I'm referring to a model that the late Dr. Richard Cook presented in the STELLA Report. Basically it was a way of broadening our thinking about observability in the system beyond just the technical aspects and the technical traces of what's going on, and instead also looking at the human side of things.
How are the people that are using the system, building the system, maintaining the system, repairing the system, actually coordinating, collaborating, noticing what's happening? And bringing their knowledge about the goals and priorities of the organization to bear in how they interact with the system below the line.
Jessica: So below the line is the software? And above the line is the people?
Laura: Yeah. That's a good way to think about it. Below the line is the abstracted sense of working in software systems. So it's things that we can't necessarily directly touch, we can't necessarily directly interact with, but we interact with the software system through clicks, through command line and through these representations about what's happening.
Jessica: So Honeycomb is all about giving you representations of what's happening below the line in the software?
Laura: Yeah, absolutely. It does so in a way that helps make sense of the goals, the priorities of the organization as well because it helps us to say, "What is an anomaly? What is system performance? What's nominal system performance? What's off-nominal system performance?" And then that helps the people that are tasked with managing and maintaining that system to take their knowledge of the goals and priorities of the organization and to make sense of that, to actually derive insight from the monitoring that they're doing below the line.
Jessica: Right. I should've said observability in general, Honeycomb is just one instance of that. But yeah, observability is about the system telling you about itself, and that brings information from below the line to above the line. Then you're talking about continuing the tracking, the tracing of what's going on, with what the people are doing?
Laura: Yeah, it's a way of making sense of what does that information tell us about what the... I won't use the word appropriate here. But what kinds of actions are going to help steer the system towards the performance that we want and away from the kinds of performance that we don't want. But I think the bigger point as it relates to LFI is that it gives us observability into how the people side of the system and the software side of the system actually work together effectively.
This is broadening this lens away from purely technical failure to say, "Well, if properly handling incidents and properly managing the system also involves knowing who the right person to recruit to an event or to an incident is, that's part of the coordination. If it is about being able to get access or more information, it's about being able to pull the organizational levers that enable you to do that." So it's really about thinking very broadly about what the system actually means.
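To make the "above the line / below the line" framing concrete, here is a minimal sketch of what process tracing might look like in code. This is a toy illustration only, not Jeli's or Honeycomb's method; every event name, timestamp, and field is invented. The idea is simply to interleave machine-generated telemetry with human actions from an incident channel into one chronological trace, so coordination and diagnosis can be read alongside the technical signals they responded to.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    when: datetime
    source: str   # "below the line" telemetry or "above the line" human action
    detail: str

# Below-the-line events: what the software reported (invented sample data).
telemetry = [
    Event(datetime(2023, 2, 15, 3, 2), "alert", "p99 latency on /checkout > 2s"),
    Event(datetime(2023, 2, 15, 3, 9), "deploy-log", "rollback of build 4812 completed"),
]

# Above-the-line events: what the responders did (invented sample data).
human_actions = [
    Event(datetime(2023, 2, 15, 3, 4), "chat", "alice pages bob: 'have you seen checkout latency?'"),
    Event(datetime(2023, 2, 15, 3, 6), "chat", "bob shares a dashboard link, suspects build 4812"),
    Event(datetime(2023, 2, 15, 3, 7), "decision", "bob decides to roll back rather than debug live"),
]

# Process tracing, loosely: merge both streams into one timeline.
for e in sorted(telemetry + human_actions, key=lambda e: e.when):
    print(f"{e.when:%H:%M}  [{e.source:>10}]  {e.detail}")
```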
Martin Thwaites: I think that's a really good time for Laura to introduce herself so we know who we're talking to.
Laura: Yeah, absolutely. So I lead the research program at Jeli.io, and our product is an incident management and incident analysis platform that is really designed to connect these two parts of the system, the people side and the technical side of things. My background is in risk and safety in natural resource industries and physical resource industries, and I arrived in software by way of cognitive work.
And so when we start to think about how do fighter pilots or astronauts or emergency room physicians, how do they think about risk? How do they detect what's happening in their environment around them? How do they diagnose the problem? And then how do they react and respond? There's commonalities between any kind of high risk, high consequence environment, and software engineering in that it's about how the humans are perceiving the world around them and how they're making sense of that.
Then how they're directing not only their own activities and their own actions, but that of a broader group. And so I've had the good fortune of studying software engineers for about six years now and looking at what are the ways that we can design tooling, what are the ways that we can design work teams and design work practices to help support that process of noticing what's happening, responding appropriately to what's happening and then coordinating, collaborating to help maintain the broader organization goals.
Martin: So you mentioned LFI, I believe that everybody else here went to a conference called LFI. I'd love to know a little bit about the conference.
Jessica: And what LFI stands for.
Laura: LFI stands for Learning From Incidents, and LFI is a bit of a shorthand, a bit of a catchall, to describe a range of approaches and orientations towards how we think about site reliability and how we think about operating continuous deployment systems. It involves safety science, resilience engineering, human factors, and engineering psychology. It draws on a number of different exemplars of how safety science researchers and domain practitioners have partnered and worked together to enhance and improve our understanding of how we support work.
An example of this is in healthcare, in the early 90s many of the technical conferences that were going on involved just physicians and nurses and healthcare practitioners. Then some very insightful practitioners said, "Hey, we have a bit of a problem in understanding fully how to handle the types of complex and changing problems that we see in our world." So they invited some safety science researchers at the time to help give them a fresh perspective on the nature of the problems that they were facing, some of the techniques and strategies to help uncover what makes that work hard.
That has happened in nuclear power, in aviation, in healthcare as I mentioned, in a wide number of domains. And so software is really extending that legacy by inviting safety science researchers and human factors specialists to help understand and give fresh perspective to the types of problems that we see in software. So, to answer your question, Martin: what was the LFI conference?
This was the first time we brought together the community. It started in a Slack community, around some folks who had attended Lund University's human factors and system safety Masters program. We brought together this group of interested software practitioners with safety science researchers, academics, and just a broad range of people thinking about and working with complex adaptive systems to share stories about incidents, to share their understanding of theory, to talk about research that has been going on, and to talk about some of the programs that software organizations are... I should say the programs and the experiments that many software organizations are undertaking to try and improve resilience and reliability within their organization.
Jessica: You mentioned something earlier, as we improve reliability, you said that safety science these days is about how do we support work? And I noticed that that's different from what I used to think about safety science as being about, which was more like how do we prevent accidents?
Laura: Yeah, that's a really good insight. So there's this academic street fight that's happening right now within safety science, and this is between two different camps. One which says, "it's about preventing accidents, it's about preventing humans from making mistakes and preventing heuristics and biases from influencing thinking in these kinds of domains."
And the other camp is saying, "No, no, no. As the world is getting more and more complex and more and more interdependent and the types of technologies and the systems that we're building, be that healthcare, aviation, energy distribution, as those systems are getting more complex and interdependent it's not necessarily about preventing the accidents. It's about supporting the people that are involved in building the kinds of technologies that can help adapt to unexpected events and to surprises."
And so this kind of dichotomy that's happening right now, software has recognized right out of the box your systems are complex, they're dynamic, they're always changing and so it's hard. It's hard work, hard cognitive work to actually be able to maintain and manage these systems and to be able to anticipate all of the kinds of problems that you're going to have to solve. In that way, it's less about preventing accidents and more about being prepared for surprising or unexpected events.
Jessica: Because we recognize that we're not going to prevent all the accidents. We're not going to predict all the things that could possibly go wrong, so we're going to learn as we go, and we learn during incidents. But what you're talking about is a kind of getting better at responding to incidents that also supports everyday work.
Laura: Yeah. So I think that first of all the comment about we recognize that we're not going to prevent all accidents, I don't think that's as widespread. I think that's a signal that you have been thinking about this and diving into some of the conversations about resilience and reliability.
Jessica: That's true. Some old school monitoring is like, "Well, we need an alert if any request takes more than two seconds." Welcome to reality.
Laura: Yeah. Well, that's a really good example, because does that actually take into account the fact that if all we do is create alerts, then we're increasing the cognitive difficulty for a practitioner? Three years on, sometimes even two months on, we're just looking at a whole cluster of alerts, many of which may be contradictory or conflicting or confusing to the practitioner.
And so we need to be thinking about these things in terms of: in time-pressured, ambiguous and uncertain environments, how does the human brain work? How does it work with the tools and technology people are working with? And then how do they work collectively across a distributed team?
And that's a very different way of thinking about supporting a software engineer with their hands on the keys, than saying, "Just don't make mistakes. Just be correct in all of your assumptions and all of your interpretations all of the time."
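Jessica's "alert if any request takes more than two seconds" example can be made concrete. The sketch below is an illustration only; the data is invented, and the SLO/burn-rate framing is a common industry practice rather than anything prescribed in the episode (the 99% target and 10x burn-rate threshold are arbitrary choices). It contrasts a per-request static threshold, which pages on every noisy tail latency, with a budget-based check that stays quiet when the service is actually meeting its objective.

```python
import random

random.seed(42)

# Invented request latencies in seconds: mostly fast, with a noisy tail.
latencies = [random.lognormvariate(-1.0, 0.7) for _ in range(10_000)]

# The naive rule from the conversation: page if ANY request exceeds 2 seconds.
naive_pages = sum(1 for latency in latencies if latency > 2.0)

# An SLO-style alternative: target 99% of requests under 2 seconds, and only
# page when the failure rate burns the error budget much faster than allowed.
slo_target = 0.99
allowed_failure_rate = 1 - slo_target
observed_failure_rate = naive_pages / len(latencies)
burn_rate = observed_failure_rate / allowed_failure_rate

print(f"requests over 2s: {naive_pages} (one page each would be noise, not signal)")
print(f"observed failure rate: {observed_failure_rate:.2%}, burn rate: {burn_rate:.1f}x")
print("page the on-call" if burn_rate > 10 else "stay quiet, the budget is fine")
```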
Jessica: And this is learning from incidents versus root cause analysis?
Laura: Yeah. That's a shorthand way of saying it, for sure. There's a collection of techniques that include the root cause analysis or failure modes analysis, that have this paradigm, this worldview that all of the problems can be predicted and that we can eradicate human error, we can eradicate mistakes and slips and cognitive biases and create this perfect responder.
What we say is, "No, people are doing their best. They're coping with a lot of complexity and they're coping with a lot of variability." The time scales that they're dealing with these things, they're not sitting back and having one piece of information come in and then they can reason about it and make sense of it, then another piece of information comes in. These are happening on the order of seconds and microseconds, and you're getting a whole lot of new information coming in.
Oftentimes thanks to monitoring and observability tools, and so the cognitive work that goes into making sense of that in real time is actually extraordinarily challenging. And so we're looking at how do we support that capacity, we support that ability? As opposed to how do we limit and constrain the ways in which sometimes people get overwhelmed by information.
Martin: There's that inner conflict, isn't there? Of engineers, that you have to accept that your system is broken. You don't know where yet, but it is broken somewhere. It's just a matter of degrees as to whether you're willing to accept that level of broken, because if we sit back and just wait for things to be perfect and find every single bug in our platform, nothing will ever be delivered ever. So we've just got to accept those things, and I think that idea of learning from the incident and then trying to be better is a really, really important thing.
Laura: What you just said is so on point, and it's amazing how insidious that worldview is, that we can... that systems are always performing optimally and that humans are always performing optimally, and that the world is not full of surprise and it's not messy and unstructured. It's amazing how prevalent that is. Even for myself, coming from the safety and risk domain, I came from a world in which I thought we could create rules and regulations and work processes that help people to color within the lines and stay within the lines.
But also, my experience of working in those worlds is that no plan survives first contact with the environment in which you're trying to execute it, because the world is messy and unstructured and problems are often difficult to see. Another aspect of what you're saying about systems operating in degraded mode is that it doesn't account for the hidden pressures and constraints that organizational goals and objectives impose, which often limit our ability to optimize a system around a single criterion, because we're thinking about profitability, we're thinking about compliance, we're thinking about quality, we're thinking about user experience. And so we're constantly making trade-offs and adjusting performance relative to what is the most important criterion to optimize on here.
Martin: So do you think a lot of this is about thinking about it from a customer perspective? I've been thinking a lot about how those of us in SRE-type roles are thinking more about the customer, thinking more about the individuals who are using the platform versus the platform itself. How does that relate to that line you were talking about, where under the hood you've got CPU metrics and things like that, and then actually turning that around to thinking about a customer, because a customer is way more important?
Laura: Yeah. Way more important than what?
Martin: Than thinking about the underlying hardware, the underlying systems. It's about that user experience. The hardware can fail, but the user experience is what's important.
Laura: Right. Yeah. I think what you're pointing to here is an interesting problem: there are multiple perspectives that are relevant to consider when we think about what we mean by performance within these systems. The customer is absolutely one of them, alongside other important perspectives such as profitability and where we're going strategically.
But those multiple diverse perspectives often have different goals and priorities, and they have different ways of accomplishing what they need to do within the system. Sometimes organizations are good at being explicit to say, "Here's where the user experience is a top priority. Under these conditions the user experience is a top priority." But most of the time, it's often left to the frontline practitioners to say, "How do I prioritize and, most importantly, reprioritize goals when the system starts falling down around me?"
An example of this is if you have a service outage and you are suddenly faced with this possibility that you can either keep the system down for a little bit longer and so you're going to degrade the user experience, but it's going to enable you to get more information about the nature of the problem so that you have better capacity to understand what's been going on and to be able to respond in future.
You have this trade off here between what are the priorities and how does that factor into the long term goals of the organization. And so I think that the ways in which you are reprioritizing, you're revising, you are reconfiguring as the world unfolds and as the system continues to grow and change, that that is often left to frontline practitioners or teams to determine this and to communicate back up. So there's this constant dance around the pressures from different parts of the organization.
Nick Travaglini: So Laura, thanks so much for joining. One of the things, going back to the topic of the conference itself, was that there were a lot of great speakers, as you talked about, both from academia and from industry, doing a lot of really interesting work.
We had Sarah Butt over at Salesforce and Alex Elman over at Indeed talking about the issues of the multi-party dilemma, I think is what they call it, where you can have an incident that involves multiple organizations, and so how do you handle that and how do you learn from that so you can build a more resilient system when you've got these dependencies?
We also had some folks like Dr. Ivan Pupulidy, who spoke about his work with the Forest Service and instituting the learning review there. Those were a couple that I found super interesting, but what are some of the other talks that happened at the conference, and what did you find really interesting?
Laura: So those were all excellent talks, and I think the way we are thinking about the program was how do we have this mix of software engineers, site reliability engineers, managers? The organizational perspective of, "Here's what we're trying, here's what's working, here's what's not working." That sort of perspective as well as the research and the theoretical side of things.
Then we had a number of talks that merged those two as well. They said, "Hey, we learned about this kind of systems analysis and we tried to apply it to our incidents, and here's how that went." So there was a good mix across the program of that, and then the incident stories as well. Those were in our "hallway track", I'm making air quotes here, which was not recorded because we wanted to protect the ability to talk very frankly and transparently about failure and try to normalize that a little bit.
The feedback that we got from participants was, "That was amazing and well worth the price of admission to hear that some of the things that I'm facing in my organization are very widespread, and so hearing other people's strategies and approaches was really useful." In terms of some specific talks, the ones that you mentioned were fantastic. David Lee who's a distinguished engineer at IBM, and Rennie Horowitz talked a little bit about what IBM is doing. You've got this big organization-
Jessica: 12,000 people just in the office of the CTO. Yeah, they were shifting it from root cause analysis focus to really learning from incidents.
Laura: Yeah, absolutely. So that was many years of diligent work and a lot of experimenting, and I think they did a great job of saying, "We got some traction with this, but this fell a little bit flat." And so that was really great. Lorin Hochstein gave a keynote on advocating for change when there's different orientations towards change, and what he called... I'll definitely air-quote this because that was part of his talk... "Your perception of reality is wrong," or, "Your reality is wrong."
And I thought that was really insightful because oftentimes you're not going to change anybody's mind or anybody's perspective by telling them they're wrong. You got to bring the receipts, so to speak.
We had David Woods, who talked a lot about his work with resilience engineering across multiple domains and the importance of finding patterns in incident response work. He had a call to action which was, "Hey, this is great if we're doing this within one team or within one organization." Clint Byrum from Spotify talked a little bit about some work that he did in trying to find problems across Spotify.
But Dave was saying, "This isn't just about one organization and the patterns within it. We can abstract those patterns and ask what makes this hard in software, period." And that has a lot of value. That was a very provocative statement to make, because it shifts the mentality: if we are thinking about resilience and safety in software, this is a software problem.
And so it's going to be solved by many voices, many perspectives and the coalescing of those perspectives. So I thought that was really exciting as well. Courtney Nash talked about some of the work that she's done with the VOID and the value of doing deeper incident analysis.
Jessica: Okay. Yeah, no. You're going to have to tell us what you mean by the VOID.
Laura: The VOID is the Verica Open Incident Database, and it's a project that's been running for a couple of years now, collecting both submitted and public software-related incident reports. She's collected... Nick, do you know how many incidents she has there now?
Nick: I think about 10,000.
Martin: Wow, that's a lot.
Laura: Yeah, so she's extracted different characteristics of these incidents from across this database and made some comments, observations, and analysis around that. One of which, also another provocative thing to say, was, "Hey, MTTX... mean time to respond, mean time to detect, mean time to whatever... is actually not that useful. It's not a great metric for us to use." And in the most recent report she pointed to some more promising directions there. At Jeli we're actually working on a report that's trying to extend some of that thinking as well. So that was really exciting to hear what she's been learning about.
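The point about mean-time metrics is easy to demonstrate on paper. The sketch below uses invented incident durations, not VOID data, to show why a single mean collapses a heavily skewed distribution into a number that describes almost none of the actual incidents.

```python
from statistics import mean, median, quantiles

# Invented incident durations in minutes; real incident data tends to be
# heavily right-skewed, which is exactly the property the mean hides.
durations = [8, 11, 14, 15, 19, 22, 25, 31, 38, 47, 55, 70, 95, 240, 780]

mttr = mean(durations)
p50, p90 = median(durations), quantiles(durations, n=10)[-1]

print(f"MTTR (mean): {mttr:.0f} min")   # dragged upward by two long incidents
print(f"median:      {p50:.0f} min")    # what a 'typical' incident looked like
print(f"p90:         {p90:.0f} min")    # the long tail that deserves its own analysis
# Comparing MTTR month over month mostly measures whether a long-tail
# incident happened that month, not whether response is improving.
```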
Martin: I absolutely love that idea, that idea of people not feeling isolated around, "My system is broken." Well, no, no. Everybody's system is. Everybody has incidents.
Jessica: Yeah, it's not you, it's software as a concept.
Laura: Yeah. It's a really powerful thing, it seems really simple. But it's also really hard to say because if you think that everybody else's system isn't held together with duct tape and glue-
Martin: Hopes and dreams, hopes and dreams is what keeps this system together.
Laura: Aspirations. If you think that everyone else has their stuff together, you're less likely to talk about failure. You're less likely to talk about what's hard. Then that really drives our ability to learn across the community, to learn from one another. It drives that underground, and so it's not inconsequential to try and normalize this failure and to try and talk about what's hard.
There were a couple of other talks that I really loved. Pierman Sherman's talk was called You Are The Resistance, which referenced a comment that Richard Cook had made at REdeploy a few years ago, saying, "Sometimes you've got to do this work under the table. You have to experiment, you have to show that a different way of thinking actually has merit, because these paradigms about you need to measure it and you need to manage it are so prevalent."
Nick: And this is really related to your talk about doing incident analyses without a template. That was another fantastic talk.
Jessica: Yeah, because, Laura, you had said earlier that the other camp in safety science was about getting people to color between the lines. And yet, your current camp, the LFI camp, represents that it's only by coloring outside of the lines that we can ever get better at this.
Laura: Yeah, I don't know if it's about that we have to color outside the lines. But to your point and Martin's point about these systems being degraded, it's that we're going to color outside the lines. And so if we disregard that fact then we're not actually going to understand what life outside the lines actually looks like.
That means that we leave a lot of people with their hands on the keyboard very unsupported, because the work-as-imagined in that world, which is that you don't ever go outside the lines, versus the work-as-done, which is that you're outside of the lines a lot of the time, means that we have a gap between those two models of how we think the world works.
And how that gap gets filled is by experimenting. Someone who's on call at 3:00 in the morning and who's like, "I don't know what to do here. I'm going to try something." So if we have this idea that, well, you should have, could have, didn't, we lose the ability to say, "Well, what made sense to you about doing what you did?"
Jessica: And then maybe rearrange the lines.
Laura: Yeah, absolutely. Redraw those boundaries and rethink about how do we help people cope with the next unexpected, unanticipated, unknowable event?
Martin: I think that's very key to being on call, and that anxiety that you get from, "Well, maybe I'll make things worse." And that anxiety of making things worse means that you do end up making things worse, because you hesitate.
Giving people that, maybe an overused term at the moment, that psychological safety of being able to take that chance and just commit to it and do it, and if it goes wrong, it's fine. We'll learn from it.
I love that idea that we're giving these new on call engineers, these people who are new to that supporting live systems, that confidence to be able to go, "I'm sorry, it went down and I tried some things and some things didn't work, some things did work. But we got it back, it's good, and we'll learn from that and then we'll redraw the lines."
Laura: Yeah. And I'll give a shout out here to Stripe actually. Bridget Lane and Will Carhart gave two companion talks. They were different: one was Stripe's recipe, the ingredients for heart-healthy on call, and then Will talked about using pre-mortems to support, or to prepare for, incidents. That speaks to that orientation that you're talking about, which is, A, how do we help people to be prepared for scary, unanticipated events?
And then also, what are some different techniques and models? The pre-mortem is something that the naturalistic decision making researcher Gary Klein proposed. So yeah, it's this shift in the organization, thinking, "Let's think about surprise. Let's think about the ways in which you may not be prepared for this and help you be as comfortable operating in those ambiguous and uncertain states as possible."
That goes back to this idea of why coordination matters so much in this world, because one person's mental model of how that system works when it's large scale distributed, continuously changing is always going to be partial and incomplete. So you have to, by virtue of the nature of the environment you're working in, you have to work with other people.
And so figuring out how to more effectively recruit those people, bring them up to speed so that when they jump in they're as useful as possible, during a moment in time where you don't have extra attention, you don't have extra time, this stuff really matters. This is where incident analysis and really using the actual data from your incident can help you to take that apart and say, "Where did people look? Who did they call? What information was hard to access?" And then those kinds of conversations can help you to improve your systems and also improve the mental models of the practitioners.
It always comes back to how we're trying to think about this at Jeli. I will say that one thing that's really exciting about what we're doing is that we're trying to give access to what actually happened during the event. Not the retrospective view where, after we figured out what went wrong, all of the uncertainties, all of the rabbit holes we went down, and all of the difficulties in getting the right people in the room at the right point in time... all of that disappears and fades away and we don't notice it anymore. That's not really learning from your incidents.
Jessica: So learning from incidents isn't just about what went wrong in the software, it's about what went well in the response itself? One thing I learned from the conference is that this concept of incident analysis is really interesting, and people use these analyses to learn from their incidents. Not just about the software; the mitigations there are usually separate. They certainly are at Honeycomb when we do our incident analysis. But it's studying the response itself, and some organizations have whole teams that do only this.
Laura: Yeah. That is a great point, that even trying to understand, "Okay. I'm on board, I want to take a new approach to this. But how do you do it?" That's where I think John Oz's talk came in; he talked about some of the work that Indeed has been doing, and you referenced the team that they spun up around this. I think that was really useful. We had a talk from Cooper Benson at Quizlet, who is at the other end of the spectrum.
They're just getting their program up and running, but how do you as an IC influence some of this change and help to drive some of this change if you're just one person or you're part of a small team? You don't have a dedicated team to do this. And so really seeing the ways in which different organizations... what are the techniques that they're using, what are the practices that they're using? That can help increase the amount of experimentation that we have going on within the industry about what are the best ways to do this.
Jessica: Yeah, the conference, the first LFI Con was really amazing because it was a bunch of innovators, because it's a new space and it's a cross disciplinary space. We have so much to learn from each other and everybody was really there to do that and to make these connections, and to find the patterns across our organizations.
Laura: Yeah. Absolutely. I think that we talked about this different orientation and many of the people that were there, and I think many of the people that have been involved in this community and trying to change some of the ways that we have these conversations within software are really committed to this idea of normalizing failure, of knowledge sharing, not keeping the ways in which we have had success internally within our organization.
Not keeping that as a competitive advantage to ourselves, but rather sharing this across the industry and helping other organizations to improve as well. That's not insignificant because this is a competitive advantage, understanding your systems at this level and being able to take the learnings from incidents and apply that to your training, to your onboarding, to your recruiting, to your retention, to your tooling, to how you build your tooling and calibrate your tooling.
That all makes you more reliable and gives your users a better experience, and so if you're going to keep that to yourself the industry as a whole is not going to improve. And so it's not insignificant that people were so willing to share what they've learned from experimenting in this space.
Martin: One thing I've noticed around organizations that have onboarding processes and deployment processes is that you can go down the list of the ways that you do things. If a step isn't just, "Do the thing, deploy the thing," anything that's in between has generally been added because you've learned from an incident that something has happened, and that you might need to think about it in a better way, and so you put something into a process or into an education context for people.
So if you look through people's processes, you can actually go back and pinpoint exactly where the incidents happened, because that was when the documentation was updated, when the education for onboarding was updated.
Jessica: Yeah, just like your dashboards. It's all gravestones.
Laura: Yeah, absolutely. I think it's an interesting point, because it says, "Oh, there is something about incidents that makes the organization act, right? That triggers these reflexes within the organization, and the reflex is to update documentation or create a new dashboard." Which, to your point, Jessica, have limited value and are perishable. They're going to get stale, or outdated, or they're going to be insufficient in some ways. And so, I should say and, not instead: we build these processes and we help create these repeatable processes within the organization, but we recognize that that's in addition to helping to improve, update, and recalibrate the mental models of people who are going to have to respond in real time to unanticipated events.
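Martin's earlier point, that you can pinpoint past incidents by when process docs were updated, can even be checked mechanically. The following is a rough sketch under assumptions not discussed in the episode: it presumes a git repository with operational docs under a runbooks/ directory, and simply charts when those files changed so you can eyeball the bursts against your incident dates.

```python
import subprocess
from collections import Counter

# Assumed layout: operational docs live under runbooks/ in the current repo.
log = subprocess.run(
    ["git", "log", "--pretty=format:%ad", "--date=format:%Y-%m", "--", "runbooks/"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

# Months with a burst of runbook edits are good candidates for "an incident
# taught us something here" -- compare against your incident timeline.
for month, edits in sorted(Counter(log).items()):
    print(f"{month}: {'#' * edits}")
```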
Jessica: Yeah. The thing about incidents is that it's not just that we get it fixed. It's not only the mitigations that we put into the software. It's who we become as a team.
Laura: Yeah, and I love that. I think that was one of the things that I took away from your talk as well, was that we are shaping culture, we're shaping interactions, we are shaping the approaches to how we think that we need to improve and where we need to go. And so I think that one of the ways, one of the things that I really liked about your talk as well was that you brought in a number of different voices from your colleagues.
That is also very important when we think about this work: it's not just the engineers in the room. It's also the marketing people who launched the Super Bowl ad that all of a sudden drove unexpected amounts of volume that the SRE team did not know was coming. It's about the sales people who are on the sharp end of those conversations and hearing about where the pain points are for your customers.
It's about understanding leadership and the pressures and constraints they're working under relative to economic or regulatory or organizational pressures. So it's really about being able to bring these multiple, diverse perspectives together to say, "How do we understand these multiple, sometimes competing priorities, so that we can be more adaptive in real time and try to optimize on as many criteria as possible?" It makes us more resilient than simply following a script or following a playbook.
Jessica: Nick, you were one of the people at Honeycomb who was instrumental in getting all of us to show up at LFI and sponsor and everything. Tell us who you are and how you got into this.
Nick: Sure. I'm Nick Travaglini. I'm a technical customer success manager over at Honeycomb, and I've been involved with the learning from incidents community for several years now, since before I started working at Honeycomb last year. I really was moved by the idea in LFI of mutually supporting colleagues and teammates, of mutually beneficial relationships and reciprocity: that we work together, that we share our learnings, that we improve each other by supporting each other, and that we keep the system up by supporting each other. That's really crucial.
For me, why I wanted to participate in this is because I think that people are essential to making the system work. The technical components, the stuff below the line, have to be made to work together, because data needs to be transformed from one format into another in order for one part of the system to work with another.
And guess what? It's the people who are making the decisions about how to do that transformation so that the data can move smoothly from one part of the system to another, for example. And so the idea of learning from incidents, of sharing our experiences, to improve each other and support each other, that helps us to build and maintain excellent digital infrastructure. That's what really moves me about LFI and one of the reasons why I'm so passionate about this subject and contributing to the community.
Laura: I think you bring up a really interesting point, because it's like the data moves from one part of a system to another. But I can't tell you the number of times that someone will say something in an incident and you're like, "How did they know that?" It was this obscure piece of information, like, "Oh, Martin's actually off this week and he's the only one that has... he has this amazing dashboard that shows all of these things." Or, "This team's about to do a big migration, we need to talk to them before we push that deploy because otherwise it's going to interrupt what they're doing."
And a lot of how that information flows through the organization is not necessarily through structured meetings or project updates. It is very ad hoc and emergent as well, and so there is an interesting aspect to how we think about structuring interactions between different teams across boundary lines, both internal and external boundary lines, that is very important to this idea about resilience and reliability. It often doesn't show up because it's not represented in official documents or diagrams or architectural representations.
Jessica: And that's part of that process tracing thing that happens above the line, right? You find these connections that exist. You may have an org chart, but I promise you don't have a diagram of who knows who and who talks to who in your organization.
Laura: Yeah, absolutely. It's these hidden aspects of work, or difficult-to-see aspects of work and of cognitive work, and there are techniques out there to help surface some of these. I think that the LFI community, and some of the tooling that is being built around this, is helping to increase that kind of observability as well.
Martin: I think for me one of the biggest things that I've seen about learning from incidents is learning how other people had debugged that particular problem, because like you say, if you get that one person who has that one special dashboard that tells them about the thing that fixes the thing in half a second, and that for me, when I've been running retros... Obviously that's not a great term any more.
But running incident retros at the time was, well, that person who fixed it, they fixed it quicker than I would've fixed it, so how did they do that? And that is invaluable, because it allows you to say, "Well, maybe that dashboard should be public. Maybe that dashboard should be part of the onboarding process. Maybe that dashboard should actually be the front page. Maybe it should be an email."
Jessica: Or maybe not. Maybe another dashboard isn't the solution. But spreading around the knowledge that that one person has is, so that the relevant dashboard can be constructed ad hoc through interactive queries.
Martin: I mean, I was getting to that, obviously. But just that idea of how did that person do it? Because everybody has it, everybody has that person in the organization that knows every system inside and out, and they're the ones that you want on call because they're the ones that can fix it really quickly. But there's only one of them, and we haven't invented cloning in a nice way yet, so we can't take that person and replicate them 1,700 times.
But what we can do is we can take that person's knowledge and we can allow the rest of the teams to understand it, and give them that ability. Maybe it's access, maybe it's knowledge, we don't know. But that's, for me, the biggest value add from doing that learning from that incident is how did that person do it, socialize what that person knows.
Maybe it's the opposite way around, maybe that person didn't know this thing and then somebody in that meeting at the time when you're trying to look at that incident, goes, "Well, why didn't you use this thing?" And that just makes everybody better, it makes everybody safer, it makes everybody feel like they're empowered to do the things.
Laura: Yeah. I think, to go back to an earlier comment about the difficulties of cognitive work, one of the things that you're describing there is these kinds of orientations and assumptions and beliefs and values that we bring, or that we use as lenses to bring our knowledge to bear. And so these orientations and assumptions about the system often come out when we go look at a certain part of the system, or when we create a dashboard because we're like, "There's no monitoring on this, but I have this sense that it's really important for us to know what's going on over here."
And so that level of diversity and variability in the way people think about a system and their beliefs about the system is actually a feature, not a bug. Right? We want a certain amount of variability because it helps us to cover more bases, so to speak. And so I 100% agree with you, sharing that knowledge, helping level everybody up, and not trying to create homogenous incident responders, but rather encouraging that variability is really where resilience lives.
Jessica: Beautiful. Laura, thank you so much for joining us.
Laura: This was so fun.
Martin: Absolutely, it was an incredibly fun conversation.
Laura: Yeah, absolutely.
Nick: Yeah. Thanks, everybody.
Jessica: Martin, welcome to the learning from incidents community.
Martin: It sounds like a very accepting community.
Laura: Absolutely. I will do a little shout out for LearningFromIncidents.io. There's a blog there, there's some more information about the videos that were at the conference and more ways to engage, so come join us there.
Jessica: Great. And where can people find you?
Laura: You can find me at Laura.Jeli.io, or @LauraMDMaguire on Twitter, M-A-G-U-I-R-E.