Ep. #77, Observability 2.0 and Beyond with Jeremy Morrell
In episode 77 of o11ycast, Charity, Martin, Ken, and Jess welcome Jeremy Morrell to talk about OpenTelemetry, the future of observability, and how small teams can get started. Jeremy shares stories of debugging breakthroughs, adopting standards, and building tools that help engineers focus on what matters.
Jeremy Morrell is a principal engineer at Cloudflare. With years of experience leading observability efforts at Heroku, Jeremy is passionate about modern observability practices, including OpenTelemetry, structured logs, and feature flags. He is a blogger at jeremymorrell.dev, where he shares insights on platforms, systems, and observability.
Transcript
Jeremy Morrell: Standards enable investment, societal investment. Physical standards, the stuff that we feel and touch every day, make the world work.
I can buy a two inch 1/4-20 bolt for a few cents and I know it's going to fit into a 1/4-20 thread that was made even years ago on the opposite side of the planet because it conforms to that standard.
But if I needed to go and hire a machinist to make a custom bolt, it would be hundreds of dollars in order to get that, and standards enable the investment in the big industrial machinery that exists halfway around the world and the supply chain that makes it so that a bolt costs just a few cents.
And so standardization enables that broad societal investment. I think the same thing is broadly true in software. Having OpenTelemetry as the standard allows us to send data to the observability vendors, sure, and to switch between them. But that's kind of just the start of that movement. If the standard is successful, it's going to be built into libraries, frameworks, tooling, IDEs, languages, platforms.
For users, it's likely that the standard is going to sort of just disappear and it'll just be an expected part of the system. And I think that's ultimately the goal of the project.
Charity Majors: I love that. Standards are what allow us to progress. Standards are what allow us to go, okay, we figured this out, then we can build on that.
There was a tweet that I did five or six years ago that I haven't been able to find, where I remember just like bemoaning, why are we so far behind when it comes to logging and telemetry?
It feels like every single shop has its own like hacked together, schema-ish, you know, something that's evolved or designed or whatever. But it's different every place you go, which means that engineers can't build on it, can't improve, we can't move forward, we can't take things for granted.
We can't be like, okay, this is a skillset, I only had to learn the basics once? No, you have to bootstrap every single place that you work. And so yeah, I wish we had done this 20 years ago. I'm glad we're doing it now.
I also feel like, Jeremy, I love what you said about it's going to recede into the background for users. Like OTel is a complicated beast, right? Like it's become the top CNCF project and you don't become bigger than Kubernetes without some complexity, let's say.
And I think a lot of people are just like, uh. But the point is that not everybody needs to know everything about everything. A lot of this is going to be taken for granted.
The interfaces are going to get simpler and cheaper and sometimes you won't even need to care about the interfaces, but the underneath, the plumbing will still be structured and predictably named in ways that we can leverage the data and do great things with it.
Martin Thwaites: I love the fact that OpenTelemetry is basically the USB-C of the telemetry world.
Charity: Oh god.
Martin: On that note, Jeremy, I think this would be a really good time for you to introduce yourself, let us know who you are and what you're about.
Jeremy: I'm Jeremy Morrell. For the last several years I've led the internal observability team at Heroku. And by the time anyone is hearing this, I'll be starting as a principal engineer at Cloudflare working on the Workers platform.
Charity: The Workers platform. What does that mean?
Jeremy: During my time at Heroku, I spent some time as the Node owner and did a whole bunch of work answering support tickets for users. And you'd end up with a lot of people coming from the front end space, and now they're coming and they're building their first server, and what they need is a backend job.
And so unfortunately what I had to tell them in response was like, oh, you need a Redis, you need a Postgres, here's a queuing library. Here's this infrastructure you need to set up. And now you have to productionize it and monitor it and build all of these things.
And this is, I think, where the functions-as-a-service platforms, and the Cloudflare Workers platform in particular, are giving you a whole bunch of these primitives sort of out of the box, and you do not have to build or operate them. So it's like standards.
And I think that there is a big opportunity within platforms, platforms like Heroku, platforms like Cloudflare to build in OpenTelemetry and standards and these opinionated ideas about how you instrument things, and you can get a lot more out of the box than you can today.
Jessica Kerr: Nice.
Martin: But should you start with OpenTelemetry? You know, that is, you know, something that we hear quite a lot, you know, this is hard to do. It's clunky. But you know, how did you get started with that? What was your sort of epiphany moment for moving on to that? How did you start?
Jeremy: So I'll give a little bit of my history. Like Charity, I also come from having spent time at Facebook and then used the system Scuba.
And if you haven't entered a FAANG before, the experience is you join and you get a whole bunch of new systems and tools thrown at you really fast and there's a whole bunch to learn and you kind of just accept it, okay, this is the tool I use for, I use this thing, and it works this way, moving on.
And then you leave that company and you have all of these new skills that you've built up over time and you're like, oh, how do I do that thing that I used to do? And so I experienced this when I went into Heroku, and I wanted like, oh, there's these failures.
Like, oh, can I see which users those correspond to and can I dig into that? And the person just sort of looked at me and was like, no, that is not a thing that you can do, that's not possible. And I was like, but why?
And it took me quite some time to sort out that, oh, the way we're storing data, we're actually not storing, you know, these structured things. We're actually just storing counters.
Charity: Yeah.
Jeremy: And it wasn't until I saw one of the Honeycomb talks at Strange Loop, Sam Stokes' Why We Built Our Own Distributed Column Store, that a lot of stuff started clicking into place.
And that led me down a path that eventually ended up at OpenTelemetry, but the start wasn't OpenTelemetry at all. It was logs. I couldn't wave a magic wand and get Honeycomb approved as a vendor. So I had to make an argument for my case. Okay, here's this new workflow, here's how it works.
And the way that I approached that was by building up these really wide log lines with a whole bunch of context, putting them into our log tool and then showing how you could query that. Then I took that data, downloaded it and uploaded it to Honeycomb in one of their test accounts and showed the same workflow side by side.
And in one I would have to make a query and it would take several minutes for the answer to come back, and then Honeycomb came back in under a second, so that I could show the next query and the next query, and that was how I was able to make the argument that, hey, this tool is going to be beneficial to us.
Martin: Well, you started basically with logs, that was the thing, it wasn't tracing, it wasn't anything OpenTelemetry-based, it was just logs.
Jeremy: Yeah.
I think that there's this tendency to shortcut and say OpenTelemetry is observability, but OpenTelemetry is just a set of standards and a set of tools, and you can have opinions on top of those tools. My sort of hot take around it is if you are in a small team and you're struggling with your existing systems, you're on fire, you have limited engineering cycles, maybe starting with OpenTelemetry isn't the way to go. Start by building up very wide structured logs, do something just radically simpler.
Go through my blog post and add as many of the fields that I lay out as possible and work in the three techniques I talk about. Figure out how you can take that data and visualize it, filter it, and then group by specific fields.
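To make that concrete, here is a minimal sketch of the wide-structured-log approach in a plain Node/TypeScript HTTP service. The field names and handler are illustrative assumptions, not the specific fields from Jeremy's post: the idea is simply to accumulate context through the request and emit one JSON line per unit of work that you can later filter and group by any field.

```typescript
// Hypothetical sketch: one wide, structured event per unit of work (an HTTP request).
import * as http from "http";

const server = http.createServer((req, res) => {
  const start = process.hrtime.bigint();
  // Accumulate context as the request progresses, instead of logging line by line.
  const event: Record<string, unknown> = {
    "http.method": req.method,
    "http.path": req.url,
    "user.id": req.headers["x-user-id"] ?? null, // high-cardinality fields are the point
    "app.build_id": process.env.BUILD_ID ?? "dev",
  };

  try {
    // ... do the actual work, adding fields as you learn things ...
    event["cart.item_count"] = 3;
    event["db.query_count"] = 7;
    res.writeHead(200);
    res.end("ok");
    event["http.status_code"] = 200;
  } catch (err) {
    event["error"] = true;
    event["error.message"] = (err as Error).message;
    res.writeHead(500);
    res.end();
    event["http.status_code"] = 500;
  } finally {
    event["duration_ms"] = Number(process.hrtime.bigint() - start) / 1e6;
    // Emit exactly one JSON line per request; a log pipeline can slice by any field.
    console.log(JSON.stringify(event));
  }
});

server.listen(3000);
```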
Charity: It kind of goes back to your first point about OTel as an investment, and people sometimes use investment as an argument for doing something, but it's actually just an acknowledgment that there are costs and trade-offs involved.
You know, and I think to your point, it's not the right investment for every team at every place, and sometimes you need to sort of bootstrap, you need to get to a better place to free up the cycles to make the investment, or sometimes the payoff just isn't there. You know, I will often say, which is a bit hyperbolic, but I'm no stranger to hyperbole.
I'll sometimes say that nobody sits around eager and excited to rip and replace one of their developer tools. In order to have a strong compelling argument for replacing a dev tool, what you're offering has to be, I think, an order of magnitude better or cheaper or some combination of both for the payoff to be unquestionable.
And so, you know, what do you think are the sort of criteria that folks should look for to be like, yeah, it's worth investing in OTel for us at this time?
Jeremy: Do you have someone on your team that's willing to become the OTel expert?
I think this is sort of just the nature of standards. I think you, in any sort of engineering discipline, you can point to a standard and you're going to find some cranky engineer in the corner going like, oh no, you know, the US is using the 120 volt standard and that's totally wrong. Everything would be so much better if we use this other thing. But the societal benefits of having a standard outweigh any of those individual drawbacks.
OpenTelemetry is this big complicated beast, I think, in part because it has to answer to so many people.
Charity: So many.
Jeremy: I think the user experience for the average user instrumenting a CRUD app is maybe a little bit more complicated than it otherwise could be.
But what I have found is that as I've watched the SDKs develop, all of those extension points were added for reasons, for good reasons: people had needs that couldn't be accommodated any other way. And that additional complexity is the cost of bringing everyone in and getting everyone on the same page, and overall, it will pay off.
In the long run, I really hope that there's not so much upfront work for the end user. I'd really like to see frameworks start to have opinions around this stuff. If you are building and deploying a Ruby app on an opinionated platform like Heroku, there's a ton of information that you should probably just get out of the box, and that is not the case today.
But to answer your question, I think I would start with logs unless you have someone who has the time and cycles to become an expert, and then once that has shown value in your organization, once you can point to like, hey, we're able to debug this much, much quicker because of this instrumentation, then you can argue for the investment and the cycles to now swap that over to OpenTelemetry.
Martin: I think there's levels of standards that we're talking about here. There's, you know, OpenTelemetry provides a few different levels of standards, you know, from the data model of how we transfer data but also the naming of some of those parameters and things like that.
I think taking it back to where you started in logs, that idea of building those sort of wide structured logs, which we talk about as being the basis of sort of what modern telemetry looks like, which is wide structured logs, there's a level of standardization that even at an organizational level you can build your own standards around.
Is that something that you looked at, like consistent naming of parameters, the mandatory things that needed to be there? Because you don't need OpenTelemetry to get there on that first step. Standardization starts in your organization; you can adopt those things yourselves.
Jeremy: And that falls into the sociotechnical problem. Naming things is quite difficult. Naming consistently is very difficult.
Generally, if you have a handful of people, you might be able to write a doc and standardize. If you are in an org with hundreds of people, then you probably need to create some sort of library or something that you can share.
That was our approach internally at Heroku: we wrapped the OpenTelemetry libraries into distributions that built in a number of attributes that would be consistently named, because they're pulled out automatically, and it's built on top of OpenTelemetry. Then it built in a mechanism and a convention for creating these wide events, because that's not something you get with OpenTelemetry out of the box.
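As a rough illustration of what such an internal distribution might look like, here is a hedged sketch built on the public @opentelemetry/api package. The wrapper name, attribute names, and environment variables are assumptions for illustration, not Heroku's actual implementation.

```typescript
// Hypothetical sketch of an internal "distribution" wrapper around OpenTelemetry.
import { trace, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("acme-internal-otel");

// One place that decides what every service calls these things.
const standardAttributes = () => ({
  "service.team": process.env.TEAM_NAME ?? "unknown",
  "deploy.build_id": process.env.BUILD_ID ?? "dev",
  "cloud.region": process.env.REGION ?? "local",
});

// Convention for wide events: one span per unit of work, and callers keep
// enriching that same span instead of scattering attributes across many spans.
export async function withWideSpan<T>(
  name: string,
  fn: (span: Span) => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttributes(standardAttributes());
    try {
      return await fn(span);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

The design choice here is that consistency comes from the wrapper, not from every engineer remembering the naming doc.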
Charity: I love that you're talking about this, so like I wrote this white paper recently called "Logs Are the Bridge From Observability at 1.0 to 2.0."
You know, I really, I feel like the workhorse of telemetry in systems for decades has been metrics, and I feel like the single most powerful thing that people can start to do to invest in the future is to start taking those cycles away from metrics and start investing those into the wide structured logs.
You know, for a long time we tried to define observability. In 2016 to 2018 we were like, observability as compared to monitoring. And now of course, everything is observability, right? If you're doing anything with telemetry it's observability.
And so we've been starting to like sort of talk about the sort of generational gap between older tools and newer tools as being like observability 1.0 at least the traditional like three pillars world, right?
You've got observability has three pillars, metrics, logs and traces. Well actually most people, for every request to enter their system, they're storing it in a RUM tool and an APM tool, and a metrics tool, and unstructured logs, and structured logs, and a profile tool, and a trace tool.
And it's just like the cost multiplier alone is absurd even before you start to take into account things like high cardinality, which you referred to earlier.
And I think, in thinking about observability 2.0, I think it's getting away from that many sources of truth that are connected by mostly just an engineer sitting in the middle, like eyeballing shapes, to a world where you have this single source of truth, right?
Where you've got these arbitrarily wide structured log events that are modeled after a unit of work, which you can hopefully visualize over time as a trace. You can slice and dice, you can zoom in, you can zoom out, you can derive metrics, you can derive SLOs, you can derive all these other data types, but because you have that single source of truth, that connective tissue gets preserved, which means less guessing, which means it's more cost effective.
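A toy sketch of what deriving a metric from that single source of truth could look like in code; the WideEvent shape and field names are assumptions for illustration, not any particular tool's data model.

```typescript
// Hypothetical sketch: deriving a metric (error rate per group) at read time from
// stored wide events, rather than pre-aggregating counters at write time.
type WideEvent = Record<string, unknown>;

function errorRateBy(events: WideEvent[], groupField: string): Map<string, number> {
  const totals = new Map<string, { total: number; errors: number }>();
  for (const e of events) {
    const key = String(e[groupField] ?? "unknown");
    const bucket = totals.get(key) ?? { total: 0, errors: 0 };
    bucket.total += 1;
    if (e["error"] === true) bucket.errors += 1;
    totals.set(key, bucket);
  }
  const rates = new Map<string, number>();
  for (const [key, { total, errors }] of totals) rates.set(key, errors / total);
  return rates;
}

// e.g. errorRateBy(lastHourEvents, "user.id") answers "which customers are affected?"
// without having decided at write time that user.id would matter.
```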
You know, and I think I kind of messed up when I started talking about o11y 2.0 because I got really excited. I'm like, 1.0 is about how you operate your code and 2.0 is about how you debug your code, and I think that I accidentally muddied the waters when it really comes back to this one difference from which all of the other things flow, right?
So I'm starting to see mentions of o11y 2.0 pop up in the wild and it's a bunch of people, a bunch of vendors are like, well obviously we do it and it's just like, fine, it's not serious. Right?
So the thing I loved about your blog post, Jeremy, on instrumentation, which you're going to link in the show notes, everyone should read it, I have it open like in a permanent bookmark. It's so good.
I think it's like the canonical guide to how to instrument your code for modern systems, for observability 2.0, and it's vendor neutral. There are Honeycomb examples, but you also reference other tools, and I think if this thing's going to take off, if we're really going to move forward as an industry, there have to be open source analogs.
There have to be like sort of composable, you know, build your own on top of ClickHouse using Grafana and other things, right? It can't just be Honeycomb, it has to be a broad movement. And I feel like this year is the first year that we're starting to see a lot of this groundswell.
Martin: I think the, what Jeremy was saying at the start around that commoditization of the telemetry data, it's if there are seven different, dare I say pillars, signals of data, maybe that becomes 15, maybe that becomes 20. All of a sudden you essentially end up with proprietary again. It's like, oh well we just do the first three signals. We don't do the last three signals, you know, we kind of do the middle ones.
If everything becomes one type of signal you derive things from, it becomes so much easier for things to be built, like you say with ClickHouse or building it on top of some other open source platforms, building it on top of standardized query languages and all of that kind of stuff where, well, you know, it doesn't really matter which signal you are using because it's all the same signal.
That I think, is at the sort of heart of it. One of the things I loved about one of the presentations I've seen you do on this, which is there's only three types of data.
There's metrics data which we can do maths on, there's unstructured data which is strings and we can search them, and then there's structured data, because you know, we can do everything with that.
You know, if you take those three different data types and just think, well, what if OpenTelemetry was just one signal type, just a structured log? I think that would be amazing: well, everybody just has the same data now and it looks the same.
Charity: Jeremy, you said something about how the reactions to your blog post had been one of two things: either, well obviously we've always done it this way, or, whoa, you're blowing my mind, this is so new. Did I get that quote right?
Jeremy: Yeah, the reactions to my blog post kind of fall into two camps. One is like, oh this is new information, this is interesting, I've never heard of this, and the other one is, oh this is that thing that we've been doing for a decade and a half.
And so I think that there's a schism in the industry. There's the hyperscalers and the FAANGs of the world who have had to do this stuff out of necessity for the better part of two decades, and there's people who are just now reaching those levels of complexity where they're drowning unless they can figure this stuff out.
Charity: Yeah. Part of it is just that like Amazon has been doing this for 15 years, as long as EC2 has existed, they've been storing our data in these wide, you know, structured events in like basically a flat file and root partition. But like there wasn't really a term for it and they didn't really evangelize it. But yeah, no I think, I think that's super interesting.
Jeremy: Even now there's so many terms. I've heard of this as canonical log lines, wide events. Charity now calls it observability 2.0, and there was observability without the number, the version that was the same idea, but it sort of got co-opted and now it just means everything all at once.
We've also had a Cambrian explosion of observability vendors, and I think a lot of that is because the costs have just dropped enormously. When Honeycomb started, there wasn't really a good open source column store that you could pull off the shelf and build this against.
And so now there's ClickHouse, now there's DuckDB, now there's a number of them you can choose from. And then OpenTelemetry also: Honeycomb had to build their own instrumentation libraries at the start, which is just an enormous amount of work, and now you can pull an off-the-shelf database and an off-the-shelf SDK through OpenTelemetry and start graphing.
You can get something working very, very quickly. And so I think that there's just, competition in this space is heating up.
Martin: I think that's something that happened sort of before this was, well Prometheus and Influx, they were metrics databases that people knew that they could kind of run at scale.
So, well, I can host those myself, and you know, every engineer loves sort of hosting and building their own things. I am a bugger for it, I swear. But everybody loves doing it.
So they were like, oh great, I can take that off the shelf and I can host my own metrics store. I can kind of use Elasticsearch and throw all my stuff in Elasticsearch and do some things. But they'd rely mostly on metrics 'cause that was easier to do.
And now, as you say, the barrier to entry, that cost of entry into that space is: well, I'm using standardized telemetry libraries that emit the signal data for me. I don't need to worry about that anymore.
I can use an open source database that can run well at scale. So my only bit in the middle now is, you know, a bit of React front end, which is, React, it's easy, just to like do some bit of querying and stuff.
So now there's way more people who are thinking DIY, the observability 2.0 stuff, because what they're doing is they're just dumping all of their data into ClickHouse or DuckDB or wherever and saying, great, I can do 2.0-type things now because I can start just using that data.
Charity: Once you get hooked on being able to slice and dice by things like app ID and user ID and request ID and strings and unique response codes and stuff, like, it's like so hard to go back.
Like you feel hampered, like you're trying to develop with a blindfold on and one hand, and you're just like, I don't know how to engineer anymore without this stuff.
But it's also a little ironic that I think the business side has had nice things for like a decade, maybe 15 years. They've been using columnar stores. Vertica, was it, that came out almost 20 years ago? Blew our minds for data warehousing.
Can you imagine trying to run a business if you're like in marketing or sales if you had to predefine the buckets of cohorts in advance and you couldn't see what people were actually doing like for sales pipeline or whatever.
You can't run a business that way. It's so crazy to me that the cobbler's children have no shoes. Software engineers are just still here sort of limping along with this 30-year-old technology, 'cause oh, we could make it work with some duct tape and shoe laces. I feel like the falling costs, the wider availability, it's just, yeah, I am really excited. I think it's a really exciting time to be an engineer.
Martin: Yeah, if you went to your business people and said, if you want to ask this question, you've got to ask a team and that team will put it on their backlog and you know, two days later you'll get an answer as to whether you need to order more paper for the printer and that office over there, like, no.
Charity: When we started working with Slack engineers, they had the ability to ask these questions, but they had to wait for overnight jobs to run so they could get it out of their data warehouse.
So they wanted to ask questions about what was happening in production. They just had to wait until tomorrow.
Ken Rimple: That problem may go away too, and like what is the issue anymore?
Charity: Yeah.
Ken: Is it something magical, just was there and gone? You know, the feedback loop time is the issue, right?
Jeremy: Yeah, I think you can judge your team based off of, hey, how able are you to answer these questions, if you have an incident and you see a spike in errors?
I've seen a lot of teams at companies where if someone asks, well, how many customers are affected, which customers are affected, the answer is, well, we don't know. It might be this many. We got three support tickets opened.
Oh, and maybe we can go and you can find the answers to that by running these overnight queries, or something that's very tedious. And so all the incentive is there against the engineer to say, oh well we don't really need to do that because it's just too hard.
Martin: Yeah, I think that's a lot about what 2.0 is about, is how do we get those answers? It's not really about the data. You know, we could store it in 17 data stores if we really wanted to, but it's actually way easier to just store it in one and do it that way. But it's about how do we ask those questions?
It's one of the things I think we were talking on socials recently about the idea that the visualization, I mean we've got one called BubbleUp and there's lots of other people using that same kind of visualization where what we want to be able to do is segment data automatically and sort of bring up those ideas of which things are common and that kind of stuff.
It's all about how do we use that data now to get answers. Like you say, if a user, a user being a developer in the system, can you answer these questions? That's really, in my eyes, sort of a measure of 2.0. Like can you ask this question?
Can you ask this weird question that you didn't know about yesterday? Like, is that possible based on you just having the data you've got?
Charity: Yeah, so much of this is about shifting the time that you ask questions from at write time, are you making these decisions at write time and locking yourself into a set of questions, or are you allowing your future self to ask these questions at read time?
Jeremy, the jeremymorrell.dev/blog is fantastic. You have two posts up there. You were talking a little bit earlier about, you know, you're a principal engineer; do you wish that you had started writing sooner?
Jeremy: Absolutely. Earlier in my career I gave a handful of talks, and in each case I put in a similar amount of effort to what I put into each of these blog posts: preparing the talk, traveling, occasionally paying for my own hotel and travel in order to speak at these events, and then maybe 300 people heard it.
And so I recently started this blog, so I had these, you know, four plus years of experience trying these things and I just really had a lot of stuff I wanted to get out, and I wrote these and I didn't really know what to expect.
And I think I've had 40,000 plus visitors to the blog in the last three months. Just the level of impact and visibility that it brought led me to rethink is like, oh, I think I've actually dramatically underinvested in public writing in my career.
I've put similar amounts of effort into internal documents, internal blog posts, and you know, maybe had a couple dozen people read it, the ones that were really successful. Occasionally people are like, oh well I'll get to that and then they never read it.
But just being able to put something on the internet, there's no barrier to entry and the reach is, it has exceeded every expectation that I had going into it.
Charity: I found like a pro tip is actually people internally pay more attention to things that I post externally than things that I post internally.
Jeremy: This affected me a lot really early on when I was advocating for observability internally at Heroku. I've experienced this a number of times in my career: there's such a thing as being too early.
And so if, internally, it's just your voice and you're advocating for something, there's a whole bunch of noise. But there's the question of who do we trust, whose technical vision do we trust, and a lot of that is external to your company.
And so a lot of my own internal advocacy efforts greatly accelerated once there started to become more external voices advocating for the same thing. But that took years.
Charity: Yeah, another of my like pro tips is that anytime you write a blog post, you should probably turn it into a talk and anytime you write a talk you should turn it into a blog post. So it's only like 10% more effort and then it exists in multiple--
Some people are video learners, some people are readers, you know, and it's just like, I could not agree more. I think writing is one of the best things that people can, people often, I think, think that they need to be like masters before they can communicate out, but part of achieving mastery is learning how to communicate.
And I feel like some of the most valuable content is actually from people who are figuring things out at more intermediate levels, then processing and reflecting back what they've learned and what they're achieving.
I also feel like for women, people of color, you know, folks for whom there's a little bit of a tax, people don't automatically assume that you're technical. But I feel like it's kind of a life hack that can compensate for that, because the social proof of having your name show up in a Google search next to technology is like, oh, well, they must be an expert.
You know, so they like perfectly counteract and balance each other. I think everybody who is from marginalized background in tech should make a point of doing some writing or speaking about the technologies they care about because it gives you that.
It gives you more of a stamp of credibility than it probably should because like you say, barrier to entry is nothing, and anyone can write shit on the internet, but why not use that to your advantage?
Jeremy: I'll say that it's scary. Even writing this blog post, this thing in this domain that I've been working in pretty much solely for four plus years, I found I had a lot to say, but I was like, what if I'm wrong?
What if I'm saying something and someone comes in with some new information that I've just missed? And so I felt like I had to polish and polish and polish to try to get the thing out.
And also I was like, I think this is good enough. I have as much confidence in this as I'll ever have in anything, and so I'm going to publish it, and everyone was super, super kind.
Martin: How do you think that helped you internally, having to sort of put things into words on paper and review them? Do you think that helped you mentally sort of understand these things in more depth? Did it help you beyond just your career and obviously your outward perception?
Jeremy: Absolutely. I think it definitely sharpens your thinking, 'cause you start to write a thing, you're like, oh, I think it's like this. And you're like, wait, is it, am I sure? How sure am I? How does that actually work?
And then it forces you to go in and dig just a little bit deeper to sort of polish those things. Whereas if you're sitting in, you know, across the table from someone and you're trying to argue and advocate for a thing, maybe just hand wave over that bit.
But it becomes just glaringly obvious when you're trying to write down that, oh, my thinking actually isn't very clear on this point. Maybe I should go fix that.
Charity: Yeah.
Martin: I tend to find that I kind of take it as if I've sort of chiseled something into stone somewhere that people are going to sort of take down from a mountain and, you know, show everybody about.
Like, that's not going to happen. I know that. My conscious brain knows that, but I feel like, say, it has to be right, it has to be verifiable, which means that you really end up spending a lot more time on a talk.
I mean, Charity, you said it takes, you know, only 10% more time to write the blog post. But I tend to get way more in depth about the references and making sure the images are just right, and you know, maybe even going even further than I would in a talk. But it really helps me sort of solidify my knowledge, 'cause I've written it down.
Jeremy: My drafts are full of to-dos. Like, it's like, oh, put this reference here, and then I have to go back, like to do this, to do this, to do this, make a graph here. And then I have to go back and reread all of the articles that I'm referencing and make sure that, do they actually say what I remember them saying?
Charity: You know, my constant challenge to myself is to write shorter, simpler pieces. Stop feeling like you have to boil the ocean. Like, this is literally, I'm holding up the sticky that I literally have on my monitor: write shorter, simpler pieces. Not every piece needs to change the world.
Ken: Right?
Jeremy: That is one of my goals with the blog. This took me, you know, this is four years of knowledge condensed essentially into one blog post. I can't do that again. So I do want to develop a writing practice so that it doesn't take me, you know, a month and a half to get a blog post out.
Charity: Well, so much of becoming, you know, a post-senior engineer, a staff engineer, a principal engineer, a manager, is about learning to have influence, right? Influence without authority.
Even managers, even people who are people managers, most of your leadership is not exercising formal powers on behalf of the org. It's having an opinion, communicating that, rallying people behind it, convincing, persuading.
And like, kind of to your point, Martin, I think that writing is the best way to be persuasive, to make sure that you have confidence in what you're saying, to make sure that it's true, to like look at, you know, revising it to make it pithy, to make it memorable, like having influence, and like, I feel like the best writing comes from a place of, I care about this.
Not, I'm trying to do this for my career, but the career benefits are so real because these skills are crossover skills, right? Learning to be crisp in your own thinking and writing and to outline things, you know, it gives you more influence internally, more influence externally, it opens up opportunities to you.
You know, these skills are not confined to one domain unless you confine yourself to that domain.
Jeremy: I think the social proof is a really big component of that, especially for people that are underrepresented. I've gone into meetings and I've advocated for a position and it's only contained within that room.
If the other person has more social clout, a better reputation or something, they can really easily just say, oh, I'm not interested in that. I don't think that's right.
But if I am also bringing a whole bunch of social proof, in terms of like, oh, it was linked to here, it's read by all these people, and I've had these responses externally, your argument in that meeting is so much more effective.
Charity: Yeah.
Martin: I think the first time that somebody presented my blog post back to me was a little bit of an epiphany moment, because they were in a meeting and they said, oh yeah, so you're talking about that thing, I'll just bring up the blog post I was reading.
And then you could see them sort of looking at the blog post and then looking at me and looking at the blog post, which was utterly hilarious. But my favorite is when you actually start googling a problem
Charity: And then find your own words from like three years ago and you're like, shit.
Martin: Yeah, using the internet as your own memory is great.
Charity: It's both humbling and a reminder that everybody else's shit stinks too. Like there are no authorities. Everybody else is just a dog on the internet pounding away just like you.
Jeremy: Oh, that's hard to believe though, because they have, like, the books, they're published, and there's so much authority in that, and it feels like, oh, I'm not allowed to do that. But it turns out you can just do things in life.
Ken: And you can revise them too. So if you find a better way of explaining something or if you realize, oh, I was wrong about this one thing, something I've actually said, you know, in one of my posts about something technical, you know, that's my understanding at the time, but I'm updating this now because this is a better way of looking at it, or this is a more correct answer.
Martin: That is my favorite type of blog post where you go to the blog post and the top of the thing says, this is a link to my previous one, but I've changed my mind.
Ken: I've got a fair number of those.
Martin: It's like, it's okay. Like, changing your mind when new information comes in is perfectly fine. Strong opinions, weakly held, is one of my favorite statements.
Ken: I completely agree with that.
Martin: And my opinions are very strong.
Charity: It's so true. There's also, I feel like, you know, you've heard that quote, we tend to compare our insides to everyone else's outsides.
You know, people who show up and are eloquent, they didn't start that way. Like, people think that I'm an extrovert, I'm not. When I started giving my talks 12 years ago, I had nightmares, like for weeks. Like I had to write out every single word.
The first talk I ever gave was internally in front of 12 people and it bogged me up. I was so scared I couldn't walk. I had to get a prescription from my doctor for propranolol. I had to take drugs in order to get up in front of people and give a talk for like two years.
But you do it again and again and again and like you get better. You know, there's some people are maybe just born with these skills. I'm not, most of us are not.
Jeremy: I'm definitely not. When I first started giving talks, my foot would start to uncontrollably vibrate and there was nothing I could do to stop it. When I've given talks, the week up to it, I'm reviewing everything I'm trying to say and running back through it. The couple of hours before, I can't have people talking to me.
Charity: Yep.
Jeremy: And then afterward there's the release of, I'm done with this, this is over and I can go talk to people, and that is so good. And then the second release is getting back to your house or your hotel room and just being able to collapse, and there's silence, and you're like, yes, that is done.
Charity: Yeah.
Jeremy: However, every single time I've done it, it's been worthwhile.
Martin: I think there's a lot to be said for the fact that the rush I get is from normalizing certain things. Like when you're starting to talk to people, and, call it vulnerability if you like, being able to say, here's a thing.
Like the post that you did about wide structured events, where there are people looking at it going, this is brand new information, or, I've been doing this, or, I wanted to do this and everybody said it was wrong, but actually, wait, there are real people doing this and they're writing about it.
There are all those benefits, maybe societal benefits is probably a better way of putting that, that I really like when I start talking about these things, because I know from the talks that I do that I'm talking about things that my peer group have talked about doing.
So me then being up on stage and talking about it, maybe they are the people who don't want to do talks, but I'm normalizing these ideas, and then everybody's going, well, yes, absolutely we're going to do that. Why wouldn't we?
Charity: You're part of a groundswell of change.
Martin: Yeah, you know, talking about the weird and wonderful things is interesting as well. I think of your blog, Charity, where it's not all of one thing, it's meanderings around the entirety of your brain, which is, let's just say, a weird and wonderful place.
So what else do we need to know about the observability, sort of, I mean, you talk about wide structured events, what else is it that you are hoping for from a 2.0 world, Jeremy?
Jeremy: The other point I wanted to make was around how important outliers are and how wide events allow you to see those.
Your most important users are frequently outliers within your own system. This has been true basically every company I've ever worked for. There's a handful of users that are responsible for double digit percentages of your company's revenue. If they're having a bad experience, that is literally the difference between being able to make payroll or not.
And they're frequently not very many and the behaviors they have on the service you're providing probably don't look like your average user. They may send you more data, create more resources, they use different features than your average user.
So if you're looking at a P99 and you're saying, okay, well then I assume this is the P99 for any given user. But I've definitely had experiences where, say, the user responsible for 15% of the company's income created more resources and it prevented the homepage from being able to load.
Charity: Yeah.
Jeremy: And so yeah, the average user is getting a P99 of under a second, but their service is just down
Charity: Yes. For one of your most important users.
Jeremy: Yeah.
Charity: And the same is true of outliers when it comes to cost. One of my favorite stories from early Honeycomb days was Intercom. When I was in charge of marketing, I was like, cardinality is a great marketing term.
They were like one of the only people to find us via that. And they added an app ID to all their stuff and they rolled it out. The story, as I remember it, was that MySQL was about to outgrow the largest EC2 instance size and they were about to have to undergo this multi-month sharding migration, something-something.
They added Honeycomb and she ran it with app IDs and they just started like kind of bumping around, slicing it. I went, oh shit. Something like 80% of execution time was being eaten up by this one app who was paying them $20 a month.
So they could do this multi-month migration thing, and this app didn't show up in any of their top 10 lists of queries or anything. But when you had the ability to add up cumulative execution time, time spent waiting on MySQL, it became super clear: you could do this migration, or you could throttle this app that's paying you practically nothing and kick that can down the road. And that's the kind of thing you can't predict in advance, right?
Jeremy: I have multiple stories like that, where, if no one has ever looked at what is actually happening in any particular system or set of systems, once you can start to look, the roaches scurry. You're going to find, oh, actually we have this thing that's calling in a loop thousands of times and no one ever knew.
We're producing terabytes of data that we don't need and don't ever use and no one has ever just looked and said, Hey, do we actually need this?
Charity: Yeah.
Martin: We talk about that in terms of MTTWTF--
Charity: Yeah.
Martin: Which is the mean time to, what the F is that? That one spike that comes up. And you know, that is a lot about how do you analyze the data that you've got, how do you slice and dice it, and how do you get that high cardinality?
But every tool has that. I think we probably should put this in the DORA metrics somewhere, you know, alongside the measure of lead time. The mean time to what-the-F is, you know, a measure of how good your observability is. Because how long until you can go: that one user, that one database?
Charity: But also on the sociotechnical front, I feel like, for so long our tools have not encouraged us to be curious in these ways because they've kind of punished our curiosity and creativity.
I've said very many times that like, I have so many memories of like putting a software engineer on call for the first time. You know, and they start to see something they don't quite understand, and they like, oh, I want to go investigate.
And you're like, oh, grasshopper. Like, don't pick up that rock, and you could spend days trying to investigate this.
You might never, you know, and I feel like that's an artifact of not having these wide events with all of this rich context that preserves the connective tissue and the relationships, which allows you to identify the outliers like you were just talking about. And if our tools are punishing that curiosity, you just start to learn to not ask these questions.
And one of the most exciting consequences of the shift from 1.0 to 2.0 is that I feel like your tools start to reward this curiosity. You start getting this dopamine hit of, I was wondering what this was and I figured it out.
Oh my God, did you know that this was happening? Did you know that this was happening? And it's contagious. We're engineers because we're curious, 'cause we love solving problems and puzzles and I feel like it's so exciting to see our tools finally stepping up to the job of helping us.
Jeremy: Just anecdotally, my own experience is in--
There's two tools that have really made a difference in my engineering career. One of them is early 2.0 wide events. The other one is really embracing feature flags and gradual rollouts.
And when I've, you know, helped mentor junior engineers straight out of code schools, they're eager and they're excited to learn new things and these concepts are not difficult, but those, when they can adopt wide events and feature flags, I've seen them run circles around even very experienced senior developers who haven't made that shift.
Charity: The combination of the two of those is lightning. I have this shirt on that says, test in prod. Feature flags plus observability 2.0 are what allow you to do this safely. It allows you to play in production like it's your sandbox.
It allows you to separate the concept of deploying, like consistently shipping small deploys, from releasing, which is more of a marketing and product experience.
I think there's this whole shift of, what I think of as, this wave of software engineering best practices, which are all about moving that center of gravity from pre-production to production, and giving you these scalpel-like tools for really inspecting and understanding the consequences of what you've just done.
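As a hedged sketch of how feature flags and wide events can combine, here is a hand-rolled percentage rollout check that also records the flag decision on the event. The flag name, helper functions, and event shape are illustrative assumptions, not any particular vendor's API: the point is that the decision lands on the wide event, so you can later group errors or latency by flag variant.

```typescript
// Hypothetical sketch: a gradual rollout check plus recording the decision on
// the wide event for the request.
import { createHash } from "crypto";

function inRollout(flag: string, userId: string, percent: number): boolean {
  // Deterministic bucketing: the same user always gets the same answer.
  const hash = createHash("sha256").update(`${flag}:${userId}`).digest();
  const bucket = hash.readUInt32BE(0) % 100;
  return bucket < percent;
}

function handleCheckout(userId: string, event: Record<string, unknown>) {
  const useNewPricing = inRollout("new-pricing-engine", userId, 5); // 5% of users
  event["flag.new_pricing_engine"] = useNewPricing;

  if (useNewPricing) {
    // new code path, shipped dark and released gradually
  } else {
    // existing code path
  }
}
```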
Jeremy: I think that's very clear when you leave a workplace, an organization that has this sort of tooling baked into the way it works, and then you go and try to work somewhere else that doesn't have it. It's like everyone turned off the lights.
And now it's like, oh, I'm scared to hit merge and deploy on this because it's just going to go out to everyone immediately and I don't know what's going to happen.
Charity: Yes.
Martin: And it's a vicious cycle, isn't it, because then they're like, they're scared to put this thing out so they pause, they wait. And then all of a sudden you have 50 PRs that you're merging at once into one big deployment thing, and then you-
Charity: And then it's really scary to deploy.
Martin: You know, it gets scarier and scarier. And I think people then rely too much on the things that are way too far left.
You know, they'll kind of reach for architectural design patterns and unit testing and really low level stuff to give themselves confidence, where, you know, one sort of end-to-end test that has some telemetry hooks that you can then query when it goes into production, with solid rollback strategies that can happen really quickly.
All of that kind of stuff, and you go, well, yeah, I'll just deploy it. If it breaks, I'll revert it.
Charity: Exactly.
Martin: That kind of idea is just game changing. And you know, I have a few friends who went from an organization where we built that in, and then they go into another organization and they're like, I want to go back. You know, hey, can I go back there? I know I said bad things.
Charity: It starts to become what you look for in every job after that because it's a whole different career.
Jeremy: Yeah.
Charity: Jeremy, thank you so much. This was so much fun. I am so excited that you're starting. I can't wait to see what you write in the future. I'm excited for your new job. They're so lucky to have you. And thank you so much for coming on o11ycast.
Jeremy: Thank you. This has been lovely.
Martin: Can you just let us know where people can find you? You know, you've got a blog, you're on social media. Let us know where everybody can find you and what you want 'em to find you about.
Jeremy: Yeah, I started a blog at JeremyMorrell.dev and then you can find me on pretty much all the socials as Jeremy Morrell, although I spend most of my time on Bluesky these days. And I talk about observability, about platforms and about systems.
Martin: Awesome. Thank you so much.