Ep. #29, Testing in Production with Glen Mailer of CircleCI
In episode 29 of o11ycast, Charity and Shelby are joined by Glen Mailer of CircleCI. They discuss testing in production and rethinking socio-technical systems from the ground up.
Glen Mailer is a software engineer at CircleCI and a software development consultant at Stainlessed.
Transcript
Charity Majors: When you hear the words "Testing in production," what does that mean to you?
Glen Mailer: I think it means giving up on the fiction of staging, or at least some aspect of that.
"Testing in production," to me, is about feedback.
It's about putting a little change out there, seeing how it behaves in production, and watching it as it goes. For a long time, a lot of people in software engineering would build the change, hand it off to the ops team, and that was it, done.
I think I really leveled up in my career when I started following my change all the way through to users actually using it.
Charity: "All the way out until users actually using it."
Yes, your job is not done until you've watched someone use your code in production.
I feel like this has been a more controversial topic than it perhaps should be, because when you say "You should test in production," what so many people hear is "You should only test in production."
It's like, "No. We're not actually saying that. That would be insane.
You should still do your unit test, and your integration tests, and all the building blocks.
What we're saying is that TDD was the most impactful software movement in my lifetime, no doubt.
But they made it impactful by discarding everything about reality, everything that could ever be variable, concurrent or interesting. They're just like, "It doesn't exist. Tra-la-la."
For a long time, you're right.
Software engineers just stopped there, and what we're saying is, "No. Your job's not done. Your job's not done until you actually make sure it works, and that has to mean production."
I think the phrase "Test in production," I'm not really sure who popularized using that term, but I guess it's similar to the hashtag #noestimates sort of stuff, where the phrase is designed to be contentious. It's designed to make people sit up and listen.
I think I could speak to that, because I started giving talks titled that.
And the reason that I glommed onto that term was because of that wonderful meme, the "I don't always test. But when I do, I test in production."
I was just like, "It's so good."
But I always meant it a little bit tongue in cheek, a little bit ironically, I feel like we have erred too far on the side of not paying attention.
Almost treating production like an afterthought, which, how? How do we do that?
Shelby Spees: Yeah.
Developers are so insulated from the reality of their code running and people actually using it that they no longer think about how to solve physical problems and think about the user experience. They just think about the lines of code in front of them, so it's this very synthetic experience.
I feel like it's a loss both for the quality of our work as well as our enjoyment of that work.
It's so profoundly engaging to observe and experience the impact of your work in production, to see how people actually use it, and then to learn from it.
That feedback loop. So I appreciate you bringing that up, Glen.
Because that's, I think, the most important thing about testing in prod.
We're already testing in production, every time we deploy a change it's a test.
So we might as well learn from the risks we're taking by pushing changes.
And now's a really good time for you to introduce yourself, Glen.
Charity: Yeah, Glen. Who are you, Glen?
Shelby: Tell us.
Glen: OK, hi. I'm Glen Mailer, I make computers do things for people.
Earlier in my career I was really fortunate to land at a company called Sky Bets, where I learnt a lot of things about DevOps, Agile, lean, security and business.
After I left, I became a consultant where I tried to spread those lessons.
But nowadays I am a senior staff software engineer at CircleCI.
Charity: Nice. Talk to us about CircleCI. What are your philosophies around staging and testing there?
Glen: We don't have a staging environment, which was one of the things-- I joined about eighteen months ago now and when I heard that I was very surprised, because you get used to everywhere has a staging environment.
Everyone hates their staging environment, everyone's like, "Got to go through staging." It's really expensive to maintain, it's always out of sync with production, and it never really has the right data on there.
Or maybe you've invested loads of money in this really fancy pipeline which copies data from production and then hides all the useful bits. So it was a real breath of fresh air to find they didn't have that staging environment.
I think that's been the way for a long time, like the early founders were really big on continuous deployments, so yeah.
We merge with pull requests and it goes to production.
Then in order to make that safe, there's a lot of techniques and practices we have to do so we can test in production.
We do a lot of automated testing where we test pre-production as well.
Charity: How do you do that?
I don't know what the architecture of CircleCI is, but obviously you have a bunch of customers who all have their own CI pipelines.
But are these--? Do you have as many environments as you have customers?
Glen: No, it's all one big shared thing, and I think that's a lot of how we can make it fast for customers, having this multitenant architecture.
The team I'm on is actually called "the execution team," as in executing a program.
Charity: It's important to clarify.
Glen: Yes. Our Slack channel was called "Executioners" for a while, but we decided to soften it a little bit.
So we are the engine room, we're the team that's in charge of the infrastructure that takes care of running those builds.
So it's primarily a lot of pretty big boxes running on AWS. We cram people's builds in there and run with a bit of headroom in our scaling, so we can bin pack all these jobs, which means they can start really quickly.
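To make the bin-packing idea concrete, here is a minimal first-fit sketch in Go. The resource units, type names and numbers are invented for illustration; this is not CircleCI's actual scheduler.

```go
package main

import "fmt"

// job is a build job with a resource request, in arbitrary "slot" units.
type job struct {
	id    string
	slots int
}

// box is a large instance with a fixed capacity.
type box struct {
	capacity int
	used     int
	jobs     []string
}

// firstFit packs each job onto the first box that has room, only adding a new
// box when nothing fits -- the basic idea behind cramming many builds onto a
// few big machines so they can start quickly.
func firstFit(jobs []job, capacity int) []box {
	var boxes []box
	for _, j := range jobs {
		placed := false
		for i := range boxes {
			if boxes[i].used+j.slots <= boxes[i].capacity {
				boxes[i].used += j.slots
				boxes[i].jobs = append(boxes[i].jobs, j.id)
				placed = true
				break
			}
		}
		if !placed {
			boxes = append(boxes, box{capacity: capacity, used: j.slots, jobs: []string{j.id}})
		}
	}
	return boxes
}

func main() {
	jobs := []job{{"a", 4}, {"b", 2}, {"c", 8}, {"d", 2}, {"e", 1}}
	for i, b := range firstFit(jobs, 8) {
		fmt.Printf("box %d: %v (%d/%d slots)\n", i, b.jobs, b.used, b.capacity)
	}
}
```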
Charity: I assume that it isn't a big bang, where everybody gets the same new build at once?
Glen: Sometimes, we still do that.
Charity: Really?
Glen: It depends what the change is, it depends what you mean by "At once," I suppose.
Charity: Do you do rolling deploys, or canaries? Or anything like that?
Glen: Yeah. The way our pipeline is set up, we try and minimize the time between commit and being in the hands of customers.
Because I think the longer you're in that middle state, the more you have to think about it.
Charity: Oh my God, yes. What is your average number of minutes, typically, for a change to go out?
Glen: I think our slowest repository, it's about maybe eight minutes between commit and the deploy starting.
I think that rolls over the course of about five or ten minutes.
Charity: Nice. Well done.
Glen: The fastest one is probably about maybe a minute or so to run all the tests, because we can parallelize as much as we want.
We've got the bigger CircleCI plan, which is nice, so we really go wide and shallow with the pipeline so we can get to the end quickly.
Charity: That's fantastic.
Glen: Then some of them the roll will happen over the course of a minute or so, if we're confident enough in the change.
We've got a few different levers we can apply, so when we're doing a change we can say, "OK. Right, just do a fast roll. Or actually, now we think this change needs a bit more thinking about, so we'll launch a new version of the software into production alongside the existing version and check in our monitoring, 'How does that differ from the existing software?'"
Or we can go a bit more heavyweight than that and say, "OK. Let's put some feature toggles in our code. Let's ship that out to everyone, but dormant, and then let's take our time over a couple of weeks gradually ramping that up to customers or opting in certain customers."
Or sometimes what we'll end up doing is, as we roll it out percentage-- Because basically, CircleCI's product is arbitrary code execution as a service.
So the surface area exposed to customers includes the Linux kernel, and we've got--
I don't know, I should have looked this up beforehand, but we've got many thousands of customers.
Often when you upgrade the infrastructure, some really subtle thing breaks one or two people, so being able to keep ramping up that percentage rollout but just exclude a few people until you figure out what's going on with them can be really effective.
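A minimal sketch of that percentage-rollout-with-exclusions lever might look like the following. The hashing scheme, type names and customer IDs are illustrative assumptions, not CircleCI's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rollout decides whether a customer gets the new infrastructure version.
// Each customer is bucketed deterministically by hashing their ID, so the
// same customer stays on the same side as the percentage ramps up, and
// customers who hit subtle breakage can be excluded until it's understood.
type rollout struct {
	percent  uint32          // 0-100
	excluded map[string]bool // customers pinned to the old version
}

func (r rollout) newVersionFor(customerID string) bool {
	if r.excluded[customerID] {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(customerID))
	return h.Sum32()%100 < r.percent
}

func main() {
	r := rollout{
		percent:  25,
		excluded: map[string]bool{"customer-android-builds": true},
	}
	for _, c := range []string{"acme", "honeycomb", "customer-android-builds"} {
		fmt.Println(c, "-> new version:", r.newVersionFor(c))
	}
}
```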
Charity: This is so interesting. We're a customer of Circle CI, by the way.
We're really happy with it. But build companies have been some of our earliest and best customers, because you have this characteristic of chaos.
Your customers are all a little chaos monkeys who all have these very specific environments and pipelines, and they're all different.
It's really hard, I've got to imagine, to ship a build that works for everyone.
Glen: This is something that-- I think before my time, we had the 1.0 version of CircleCI, which tried to use inference.
It would just look at your git repo and guess what build steps you wanted, and then it would run the same environment for every single build. We'd just install 10-20 different versions of every bit of software and you'd have to pick which ones.
Then when we upgraded it, we'd upgrade everyone at once. Also before my time, some people worked on the 2.0 version of our platform, which is very much about letting the customer pick what they wanted to do, what to fix and what to flex, and what base container to run their build in.
I think one of the things that allows us to make changes is that each thing we change can be for a subset of customers at any one time.
Charity: It's the original high cardinality problem. Without observability, what a nightmare to track down these long-tail problems.
I feel like the first time that I ever used canaries in a way that wasn't just by hand, deploying to one host at a time--
The first time that I used them in a more programmatic way was when we were doing the rewrite, rewriting the API from Ruby on Rails to Golang.
So every day we're shipping changes that shouldn't break, that shouldn't change anything, and God, the types.
Because Ruby would just guess and assign types and write them into the fucking database, and then Go comes along and goes, "No, no, no, no, no."
We've got mobile apps out there that could only be updated by hand every few months, and so they-- You have to get all the types right and all the data, every ordering nightmare.
So we wrote some stuff to basically let us fork the requests. The requests that came in, we'd send to an API server that would fork them to an API server written in Ruby and one written in Go, and then we would diff the outputs, write any differences out to a file, and send the Ruby result back to the user.
Then once a day an engineer would log in and look at the file, just, "Which API requests are sending us different results?"
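As a rough illustration of that fork-and-diff setup, here is a stripped-down Go sketch: every request goes to both backends, any difference gets logged, and the caller only ever sees the primary's response. The URLs and the POST-only handling are simplifying assumptions for the example.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

const (
	primaryBase   = "http://ruby-api.internal" // existing Ruby implementation
	candidateBase = "http://go-api.internal"   // new Go implementation
)

// forkHandler sends each incoming request to both implementations, logs any
// difference between the two responses, and always returns the primary
// (Ruby) answer to the caller.
func forkHandler(w http.ResponseWriter, r *http.Request) {
	payload, _ := io.ReadAll(r.Body)

	primaryResp, err := http.Post(primaryBase+r.URL.Path, "application/json", bytes.NewReader(payload))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer primaryResp.Body.Close()
	primaryBody, _ := io.ReadAll(primaryResp.Body)

	// Shadow the request against the candidate and record any divergence.
	if candResp, err := http.Post(candidateBase+r.URL.Path, "application/json", bytes.NewReader(payload)); err == nil {
		candBody, _ := io.ReadAll(candResp.Body)
		candResp.Body.Close()
		if !bytes.Equal(primaryBody, candBody) {
			log.Printf("DIFF %s: ruby=%q go=%q", r.URL.Path, primaryBody, candBody)
		}
	}

	// The user only ever gets the primary (Ruby) response.
	w.WriteHeader(primaryResp.StatusCode)
	w.Write(primaryBody)
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(forkHandler)))
}
```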
We got really good at just programmatically doing some of these things, and this is what I think that people forget when we talk about testing in production too, is we're not actually just talking about "Everyone gets the new changes immediately."
Because that would be insanity, right? They're right, it would be terrible.
But there are so many tools and knobs. Yeah, you have to invest in it, but my argument has always been, "You need to be investing in your ability to change production a little bit at a time, for a few people at a time, for controlled-- Controlling that--"
Shelby: It is the blast radius.
Charity: The blast radius, exactly. Because we all have a limited number of development cycles.
It's the scarcest resource in our universe, and so many people are just sinking all of these resources into trying to get staging right, so they don't have anything left over to invest in production.
My argument has never been that nobody should have staging, there are some very legitimate arguments, and industries that need staging environments.
My argument is that the bulk of their resources should go into production, and then staging should get what's left over, not the reverse.
Shelby: I can't tell you how many times I've had something work perfectly fine in production, but because of the way we had to roll out changes--
Especially on the configuration side, the infrastructure side, it works great in production and then we find out that it's buggy in QA or in staging, and then I have to spend all these cycles--
Charity: Or vice versa, and it's just a total time sink for every engineering team, just the difference between the two.
Glen: When I think about these things, I like to put my product business hat on saying, "OK. There are tools and techniques that can be used to formally verify that software is completely correct to the specification."
And then you spend all your time making sure the specifications are right.
Now, those tools are really expensive and almost nobody in the industry uses them.
So that effectively sets the scene: there is this gradient.
There's this gradient of correctness in what we actually do, and none of us is actually shooting for 100% correctness. We're aiming for confidence.
Charity: Right.
Glen: "How much time do we want to spend making sure this is right?"
Versus "Putting it out there and being confident enough that this is a net positive change we're making?"
How much does it cost to create a staging environment? How much does it cost to maintain a staging environment?
Where could that money go, where could that time go?
I think there's some really interesting stuff when you start testing in production. You flex that muscle of getting feedback from production, and that has some really amazing network effects and knock-on virtuous cycles.
Then when you look at the research from the Accelerate or DORA people, they were saying, "These elite performers who are just pulling away from the rest of the pack--"
Charity: Those are the teams that are investing in production, I guarantee you.
Glen: Exactly.
Charity: Absolutely. It's not any one tool.
There's a suite of tools and techniques and everything, and we were just talking about this, I think, in our last recording.
I firmly believe that if you're asking an engineer to really change something about the way they work, to adopt a different tool or a different technique or something, it has to be an order of magnitude better than what they have in order for it to be worth training everyone and changing-- Change has a lot of costs, it's very costly.
All the unexpected things that happen, whatever, it has to be an order of magnitude better for me to confidently say, "Yes. This is worth your time."
I feel like the whole testing in production, reducing that amount of time between when you write the code and when it's live, users are using it.
That is-- I think for everyone out there it is now an order of magnitude better, like 2 or 3 or 4 years ago I don't think it was.
Even when we started Honeycomb, I think it was maybe twice as good, maybe three times as good.
So you've got some people who will adopt it and some people--
I couldn't confidently say that everyone needed observability because it honestly wasn't-- But now I think you're absolutely right.
That percentage of teams that are elite, they're just reaching escape velocity because they're so much better than everyone.
Those are the teams that are investing in exactly this.
Glen: It's that whole thing about "The future is not evenly distributed."
So I was fortunate enough really early in my career before I picked up a bunch of bad habits, I was exposed to some of these ideas and they've been there at the back of my mind for a number of jobs in different aspects.
Then the last couple of years, it's really all come together.
Charity: As the tools have been maturing to meet them.
Shelby: I really love that CircleCI dogfoods your own platform and your own tooling the way Honeycomb does.
I almost think it's a requirement to build a great product, you have to use it and you have to love using it and really care about it.
Charity: Especially for developer tools.
I think that it's not a coincidence that so many of us-- We've all just got it stuck into our heads, and "we all" meaning the industry, that building and supporting software just has to be miserable.
It just has to be awful and shitty, and we'll bitch about it and complain about it and that's just the way it is.
There's nothing you can do about it. There are so many things that are bad about this.
Anyone who is over the age of 30 doesn't want to get woken up all the time, anyone who has kids is just like, "Oh my God."
There are so many terrible side effects for this, but I think that what Circle CI and Honeycomb and other teams are showing is that this can be a humane industry after all.
But you really do have to radically rethink your socio-technical systems from the ground up, and there isn't a recipe book.
You can't just stamp it out, because each one of these systems is a snowflake, every single one of them is unique.
You have your own business requirements and your own customer thresholds, what is painful for your customers is not the same as what is painful for my customers.
There's no substitute for actually understanding what you're trying to do.
Glen: I think the thing you said there about each system being different, I think that's really key to me. I am allergic to the phrase "Best practice."
There is no such thing as a best practice, there are practices that work in contexts and there are costs and there are benefits, everything is a tradeoff.
That's not particularly useful, but it's more of a truism, I think.
I think the dog fooding, using our own product is really beneficial.
Especially for that feedback loop, there's some really interesting occasional downsides of that.
For instance, if we have an outage, especially when it's down to one of the suppliers that we're connected to and hooked up to.
But then we're like, "OK. There are things we can do to try and mitigate the impact, but currently we're down, so we need to make sure we've got break glass procedures in place or we need to maybe--
We'll have a version of our enterprise product deployed on the side that we can then connect up to our production and ship stuff out." So, that's a really interesting thing.
The other one is a thing where, when you work on a product day in and day out and you know how it works, you get used to certain quirks of it. Then there's this delta that happens between the way we use our product and the way a lot of our customers use our product.
I spend all of my time thinking about CI. I do a lot of builds a day.
I'm looking at that pipeline, I'm saying "How long is this going to take? I don't want to wait for it. I'm going to spend time optimizing it. I know all of my product features, so I know how to optimize it."
Whereas some of our customers will take 15-20 minutes to run a deployment, and that's OK for them. But it means that with our product, we can't just myopically focus on us.
We can't build it only for us.
We have to go out and we have to do that user research, we have to go out there and talk to people and make sure that the things we're building work for them as well as working for us.
Shelby: Totally. Just meeting people where they're at and making sure that you keep in mind that we are always--
Maybe not always, but we're likely to be the most sophisticated or the most knowledgeable users of the tool we're building.
Making sure we step back and look through other people's eyes and observe their experiences and understand the context they are coming from, where I'm like, "OK. Here's my side project.
I want to build something, so wait, how do I configure the build config again? How's that work?"
I think about it once and then I never think about it again.
I deploy once a week on a weekend or whatever, just whenever I make a change because it's a side project it's not going to be something I optimize all the time.
That's OK, but also, how do we help people like that?
And how do we also help people who are deploying 50 times a day like you probably are?
Glen: This is the thing, from where I'm sitting I can see all of these builds.
I can look at them, I can inspect them, I can see the trends, I can see the aggregates.
I can drill down with high cardinality and go, "OK. What's the median build time? What's the P95 build time?"
But when you're at your company doing your builds, you have no frame of reference, all you can see is your build.
This is something which we're trying to figure out a way of doing, it's a very careful balance because we're in a position of high trust with our customers.
They trust us to run their sensitive material, so we don't want to go poking around in their builds and certainly not telling everyone else what's going on.
I think we published a data-driven report, maybe last year, where we took this aggregate information and those percentiles, and said, "Look. I think something like our fiftieth percentile build time is three minutes."
That might not be accurate, read the report and triple check me on that, but those insights and comparisons are something that's really interesting.
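For readers who want the mechanics, the median and P95 Glen mentions are just percentiles over per-build durations. A tiny Go sketch, with made-up numbers:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of a list of durations.
func percentile(durations []float64, p float64) float64 {
	sort.Float64s(durations)
	rank := int(math.Ceil(p / 100 * float64(len(durations))))
	if rank < 1 {
		rank = 1
	}
	return durations[rank-1]
}

func main() {
	// Made-up per-build durations, in minutes.
	buildMinutes := []float64{1.2, 2.5, 3.0, 3.1, 4.8, 6.0, 7.5, 12.0, 18.0, 25.0}
	fmt.Printf("p50=%.1f min, p95=%.1f min\n",
		percentile(buildMinutes, 50), percentile(buildMinutes, 95))
}
```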
Charity: When your customers are developers, and this is a thing that we ran into at Parse.
We were running mobile back end as a service and running databases for people, it was a crazy business model.
But I was constantly complaining about our customers, like "Can you believe this query they're running? It's doing a five x full table scan."
Just like, "Motherfuckers." But in reality, we weren't surfacing that to people.
They were just using the SDKs and trying to make something that worked, they had no visibility into how they were using my poor database.
I feel like you can only hold people accountable insofar as you give them the tools to understand the consequences of what they're doing, and code is such a powerful tool that it's very easy to do terrible things.
Glen: So this anecdote might be a bit niche, so maybe it won't make the final edit.
But a little while ago, I think Ubuntu and AWS agreed to change their kernel LTS policy, and the net result of that was we released an image which should have had a pinned major kernel version, but did not actually have a pinned major kernel version.
So there's a major kernel version upgrade, and we're like "Oh no. We've upgraded the major kernel, what's going to happen?"
And actually almost nothing happened, the majority of builds were fine and that's why it got through canary testing and that's why it got to the level of rollout it did, because everything looked fine.
Then we start to get these Zendesk tickets trickling in of people saying, "I'm getting 'Out of memory' errors that I wasn't getting before."
I'm like, "That's strange."
So we look at one of the builds with the "Out of memory" error and we see, "Yes. That is using too much memory. That is correct. You're getting killed by the kernel, that's what's supposed to happen."
So I spent a couple of days combing through the kernel changelogs trying to understand what had changed, and there was this ridiculously subtle behavioral change where I think basically somebody had looked at the kernel and gone, "This is a really weird heuristic. Let's remove it."
The heuristic in question was, "If you are a process using too much memory and you have some child processes, we will kill a child first."
Someone saw that and went, "That's ridiculous. Why would anybody want that?"
But it turns out the Android build system benefits greatly from that optimization, so when you use too much memory you don't actually blow the limit.
Charity: That's amazing.
Glen: Just being able to identify, "These are the customers. Here's the commonality. Here we can reproduce that."
And then we eventually pinned it back to the old kernel version, and said, "That was too strange."
Charity: This is one of those areas where observability, and by that I mean "The technical definition of observability," which I will preach until I die.
But the ability-- What it lets you do is say, "These things are weird. What do they have in common?"
You can't fucking do that when you have a monitoring tool. You need to have those arbitrarily wide structured data blobs, one per request per service, and they all need to be high cardinality and high dimensionality, and all this stuff.
Because otherwise, trying to stitch together that narrative when you're just spewing logs the old-fashioned way is almost impossible, and trying to do it with metrics is literally impossible, because you discarded all the connective tissue of the event at the time that you wrote them out.
This is a constant struggle for us. It's like, I don't want to spend all the time lecturing people about why their monitoring tool doesn't do observability and why their data formats aren't correct, and everything.
That's really obnoxious. I feel like we need to find a different set of language that is more focused around "Here are the things you can do, and if your tool doesn't meet the bar of 'Can you do these things?' Then it isn't observability."
I feel like that might be a little bit friendlier.
Glen: I think I've had a similar scenario, when you try to explain the way a tool works to someone, and you go through "OK. Which ecosystem are you from? Ruby, Python, JavaScript? OK, here is the tool you've heard of that is most similar to this."
It's like Shelby was saying earlier, trying to meet people where they're at.
One thing I've been saying is that we have this internal observability group at Circle, where we evangelize these concepts to people who are less familiar with them.
As I said, I was really fortunate. Back in 2012 I was working at a place where we were doing structured logging, and then we had a process which tailed the structured logs, parsed them and produced metrics.
So when you say to me, "Metrics are just derived logs," I'm like "Yeah. Of course they are, what else would they be?"
That just fits straight away. So what I've been saying to people mostly is these wide events that you talk a lot about are just good logs.
Most people have not seen good logs, so they don't know what good logs look like.
That's the way I've been trying to sell it, which I guess makes it sound more similar than it is. But no, they're really good logs.
Charity: They're really good logs.
Shelby: What made it really click for me, and what I've been trying to convey when I help people connect the dots, is sending disparate pieces of information that you're trying to match up later on, versus sending your context up front as this blob that you can then slice and dice later on.
With metrics and with traditional flat logs, it's very expensive within even the same method to connect the dots between a log line that you're sending at the beginning of that method and at the end, when it's successful or something like that.
Versus with your contextual events, that's already in the event that you're sending. So then it's just better structured data.
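As a minimal, made-up illustration of that contrast in Go: two flat log lines that have to be correlated after the fact, versus one wide structured event that carries all the context for the unit of work. The field names are invented.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

func main() {
	// The "piece it together later" style: disparate flat log lines that a
	// human or a query has to correlate after the fact.
	log.Println("starting checkout for user 42")
	log.Println("checkout finished status=200")

	// The "send the context up front" style: one wide, structured event that
	// carries everything about the unit of work, emitted once at the end.
	event := map[string]interface{}{
		"name":        "checkout",
		"user_id":     42,
		"cart_items":  3,
		"status_code": 200,
		"duration_ms": 87,
		"timestamp":   time.Now().Format(time.RFC3339),
	}
	json.NewEncoder(os.Stdout).Encode(event)
}
```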
Charity: It's hard to describe.
Shelby: Yeah, maybe I just described it in a more confusing way, but that's--
When I've explained it to people like this, sending it up front as a unit versus trying to piece it or glue it back together later on, it just makes so much more sense.
Charity: The thing that helps me describe it sometimes is reminding people that this all became very much more mandatory when we started having microservices.
Because with a monolith, if you really wanted to understand what was going on, you would just trace it.
But now it hops the network, which means that whole model is just broken.
You can't attach GDB to it now, so you have to have a way of passing all that context around with the process as it hops around, so it's really more like applying GDB to your systems.
It's just your responsibility to tack that context together, to ship it along with the request as it goes bouncing around.
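A bare-bones sketch of what "passing that context around as it hops" can mean in practice: a trace or request ID carried in a header on every outbound call, so events emitted by each service can be stitched back into one request. The header name and service layout here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

const traceHeader = "X-Trace-Id" // illustrative; real systems use W3C traceparent or similar

// callDownstream forwards the caller's trace ID on every outbound hop, so the
// events each service emits can be stitched back into a single request.
func callDownstream(r *http.Request, url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set(traceHeader, r.Header.Get(traceHeader))
	return http.DefaultClient.Do(req)
}

func handler(w http.ResponseWriter, r *http.Request) {
	traceID := r.Header.Get(traceHeader)
	// Every event this service emits carries the same trace ID...
	fmt.Printf("service=frontend trace_id=%s path=%s\n", traceID, r.URL.Path)
	// ...and the ID travels with the request to the next service in the chain.
	if resp, err := callDownstream(r, "http://billing.internal/charge"); err == nil {
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
}
```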
Glen: This is actually really interesting to me because I'm not going to disagree with that, but I'm going to say something slightly different.
Which is I think when I first started hearing about Honeycomb and reading about Honeycomb, I think it might have even been before you had the trace waterfall implementation.
And it feels like the tracing was the thing that really made people go, "OK, now I see it."
And tracing, those trace waterfall graphs, had this really visceral "My code is doing what?" We got so much of that.
Charity: "The time is going where?"
Glen: Once you've got over the initial hurdle of tidying up those traces, then actually traces, that trace waterfall becomes way less useful.
Not because it's not showing information, but because you're not doing such weird things anymore.
Charity: It's only useful to you when you know what to look for or where to find it.
This is where we differ from Lightstep. We're very similar, but we've heard it described as "Lightstep is tracing first and then high cardinality events, and Honeycomb is high cardinality events first and then tracing."
Which I think, as I would, is the correct way to approach it.
Because it's always about trying to figure out what's wrong, and then you want to trace it to see what's wrong about it. But finding the place in your system that has the code that you need to debug is the entry point. That's the hardest part and it's the place to start.
Glen: Our adoption of Honeycomb at CircleCI, like anyone's adoption of any tool, I think is incremental and it's driven by need.
What I've actually found is, we've got a few of our services which are not really connected up to that big trace graph, but they do emit a single wide event per request.
I can get so much out of just that without connecting the tracing to anything, especially with the tracing bits we've got around people running builds.
I have a single-- Actually, I have two events for every build that runs.
I have an event when it starts that tells me how long it queued for and what the aspects of it were, and I have an event when it finishes which tells me which node it ran on and what version of the kernel was there, how long it was waiting, what it was connected to, what type of executor it's on.
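Sketched in plain Go, that two-events-per-build pattern might look like this; the field names and values are invented stand-ins for whatever CircleCI actually records, and the events go to stdout rather than an observability backend.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// emit writes one wide, structured event; in practice this would go to an
// observability backend rather than stdout.
func emit(fields map[string]interface{}) {
	fields["timestamp"] = time.Now().Format(time.RFC3339)
	json.NewEncoder(os.Stdout).Encode(fields)
}

func main() {
	// Event 1: the build started -- how long it queued and what was asked for.
	emit(map[string]interface{}{
		"name":           "build.started",
		"build_id":       "abc123",
		"queue_ms":       1800,
		"executor_type":  "docker",
		"resource_class": "medium",
	})

	// ...the build runs...

	// Event 2: the build finished -- where it ran and how it went.
	emit(map[string]interface{}{
		"name":           "build.finished",
		"build_id":       "abc123",
		"node":           "i-0abc123",
		"kernel_version": "5.4.0",
		"executor_type":  "docker",
		"duration_ms":    94000,
		"exit_code":      0,
	})
}
```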
Just from those two events I can get so much value, and so one thing we're grappling a lot with at the moment is dynamic sampling.
We have a lot of data. I think our raw logging system is in the terabytes a week. That's very expensive.
We don't want to spend the money, we want to keep the signal and we want to lose the noise.
But there's this visceral, emotional attachment to my data, it's like, "No. I did some work. Don't let me throw away my logs, I want to see it."
The thing I really like about those wide events, especially if we're running a build, is that's the thing we bill our customers for, is running builds.
I can say "I'm charging someone this much and I'm storing two events."
I can definitely afford two events per build, so I can 100% sample those.
But when you start talking about, "OK. We don't bill for use of our API, so if someone calls our API a million times a week, how much of that can I afford to keep? How much of that am I willing to afford to keep?"
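One common way to square that circle is a sampling decision keyed on what the event is worth: keep every billable build event, head-sample the high-volume unbilled API traffic, keep errors at a higher rate, and record the sample rate so totals can be re-weighted later. The rates and event types below are invented for illustration, not CircleCI's policy.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleRate decides how many similar events each kept event stands for.
// Build events are what customers are billed for, so keep every one (rate 1);
// high-volume, unbilled API traffic is kept at 1-in-N, with errors kept more
// aggressively than successes.
func sampleRate(eventType string, statusCode int) int {
	switch {
	case eventType == "build":
		return 1 // 100% of build events
	case statusCode >= 500:
		return 5 // keep most errors
	default:
		return 100 // 1-in-100 of routine API successes
	}
}

// keep makes the probabilistic decision and reports the rate, so the backend
// can weight each kept event back up when computing totals.
func keep(eventType string, statusCode int) (bool, int) {
	rate := sampleRate(eventType, statusCode)
	return rand.Intn(rate) == 0, rate
}

func main() {
	events := []struct {
		typ    string
		status int
	}{{"build", 0}, {"api", 200}, {"api", 503}}

	for _, e := range events {
		kept, rate := keep(e.typ, e.status)
		fmt.Printf("type=%s status=%d keep=%v sample_rate=%d\n", e.typ, e.status, kept, rate)
	}
}
```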
Charity: Those wide events also, in a way, help replace the need for staging.
Because if you can see, "These 300 or 400 attributes are the context in which this job was executing," that often tells you everything you need to know about how to reproduce it.
You don't actually have to go reproduce it, you can just see, "These are the outlier things. These are what all the errors have in common."
Secondly, I love that you said that because I have increasingly been thinking that another differentiator between the infrastructure monitoring thing, your Datadogs, your SignalFX, your Wavefronts versus observability tools is that metrics are the correct tool to use for infrastructure.
Because you care about provisioning, you care about capacity, you care about "Do I need to speed up more of these because I am running out of some resource?"
There's been this division between dev and ops, where dev is responsible for the code and ops is responsible for making the services that the code runs on.
Infrastructure is the code that you run that you don't want to have to run, but you have to run in order to get to the code you want to run.
I feel like observability is the right tool that you need for the code that you are writing that is your business. It is your differentiator, it is the stuff that you're changing constantly, it is the stuff that your users are interacting with, and it is the stuff that you don't just care about in aggregate.
You care that each and every request succeeds, and whether it can get the resources that it needs or not.
We found that the thing that most predicts whether or not a company is a good customer of Honeycomb is, do they have dollar values attached to the quality of their service?
If they're a media company that's just ads, they're just spraying ads or they don't actually give a fuck whether or not a particular user sees an ad at a given time, they're not a good customer for us.
If you have dollar numbers attached to--
Like, if you're a delivery company for example or a CICD company or something, you actually care about every single one of those jobs, and it needs to be able to run, or at least fail for something that is not our fault.
Then you need something like observability.
I feel like it's interesting, I've been thinking about this and realizing what an enormous percentage of the engineering at every single tech company goes to things that are not their core business differentiators.
Even for the software engineers, so much of it goes into writing the libraries or just getting the stuff to make it so that a few of them can work on software that is actually what their customers are paying for.
Shelby: Like, teams full of yak shaving.
Charity: Even at Honeycomb, we've had 9 or 10 people writing code for the past year, for all of it.
From the storage engine up to the UI/UX integrations, we had 9 or 10 people.
We've been making about one or two engineers' worth of progress on our core business objectives, and we're incredibly high performing.
Serverless is the only example of, I think-- I hate this about them, that they are always like "Less ops."
And I'm like, "No. You motherfuckers, you're doing ops better."
Glen: Less infrastructure.
Charity: You're doing less infrastructure because you've successfully created the right level of abstraction to make most of your ops somebody else's problem.
It doesn't mean that there's less ops, it means that it is being done better and probably not by you.
Glen: I see this a lot in people I interact with, with the term "SRE," where the terms "sysadmin," "ops," "infrastructure engineer," "platform engineering," "SRE," and "DevOps" all just get glommed into one, but there are a lot of different disciplines within that.
Charity: Yes.
Glen: I started my life as a dev, and then I worked in some ops teams as well, where the people sitting next to me literally had the job title "Sysadmin," and we were in a DevOps team.
Which apparently you shouldn't do, but I've seen it work really well if you get it right.
But one of my colleagues at the time, I think I was still-- I think we were learning about the DevOps movement.
We were a PHP shop at the time, we were looking at Etsy a lot to hear what they were saying, Etsy were putting out some really good stuff around that time.
I don't know who coined it, but someone once said to me that DevOps is ops, but measured on business outcomes.
In the same way that you don't generally measure devs on lines of code or code coverage, you measure on "Are they moving the business metric?"
I think on a recent cast you were talking about value stream mapping, like "How does the work we do actually map to the outcomes of the business?"
Charity: I love that.
Glen: In the bad old days, we had the ops team measured on uptime, and the dev team measured on new features.
So one team wanted to change and one team didn't want to change, and I think to me, the best framing of DevOps that I saw was "We're going to take the ops team and we're going to measure them on business outcomes."
So now they're incentivized to promote change, and if they're incentivized to promote change, they're incentivized to make change safer and more continuous. Smaller.
Charity: I love that. I just wrote this very long essay about this, about how infrastructure and operations used to be synonymous and they've been increasingly diverging around this fault line of infrastructure.
If you want to do infrastructure, God bless, go join a company where infrastructure is their mission.
There's still a lot of work for people who have expertise in ops to do, but their expertise is increasingly joining a company and helping them figure out how to make as much infrastructure and ops stuff as possible someone else's problem.
Because it turns out this is a really challenging and difficult and sticky and hard and fun set of problems. Figuring out how to ship software efficiently is just mind-blowingly challenging and interesting, and intricate.
It's really hard to do unless you have a very strong grounding in what has traditionally been called ops or DevOps or infrastructure.
Glen: For many years, I used to have a Post-it on my desk at all times that said "It's probably disk space."
I have not had to think about disks for about five years now.
Charity: Or the network, it's probably the network's fault.
Glen: It's always DNS, right?
Charity: It's great, it's liberating. But you have to, there's a certain amount of mastery that you have to get in order to help teams actually level up and improve at the stuff.
Shelby: I think there's an important space there for just helping developers.
This happened to my friend who is very much a back end developer, never touched infrastructure, doesn't know how to spin up an EC2 instance or whatever, but lives in production.
He got-- He sent me a screenshot of this recruiter email that was like, "You're an expert in SRE" and stuff like that.
And I was like, "OK. You might not have lots of experience with that SRE title, but you've been living in production for years and years and you've gone out of your way to make your services more reliable, and I think learning the domain of what is infrastructure and what is ops--"
And like you said, Charity, we don't want to be thinking about infrastructure but we want to be thinking about operating our systems and making them available and reliable.
I think there's-- You've even said before, the next ten years of DevOps is teaching devs how to live in production.
I think that's the convergence of everything that's happening right now, is help devs live in production and break down those walls.
You will have not only DevOps teams and SREs with a sysadmin background who really care about making your services reliable, but also you'll have developers thinking about "How do we reduce infrastructure costs? How do we reduce spending on third party tools? How do we make our code more efficient so that we can exchange that for more business value?"
Things like that.
Charity: My favorite website on the internet is WhoOwnsYourReliability.com.
Glen: I guess it says "You," right?
Charity: "You." You do. All right.
I think that we're about out of time here, but this was super fun.
Thank you so much for coming, Glen.
Shelby: Yeah, thank you so much.
Glen: Thank you for having me.