Ep. #30, Harder Conversations with Beau Lyddon of Workday
In episode 30 of o11ycast, Charity and Liz speak with Beau Lyddon of Workday about the recent sea change in observability, the shortcomings of metrics, the role of senior engineers, and the responsibilities of corporations.
Beau Lyddon is a Senior Engineering Manager at Workday. He is a writer, speaker, and podcaster, and was previously Co-Founder and Managing Partner at Real Kinetic.
transcript
Beau Lyddon: Anywhere Amazon has a lead, I would never pick somebody just to come and eat their lunch.
That said, the engineering part of me loves parts of Google Cloud.
The architect/business owner/responsible for making money side of me has conflicted views with Google Cloud, I guess, is the way I put it.
I think they've improved there, and it's interesting because you can almost see the trade offs in real time, with the part of the engineering and technology part of it that I love trading off with the business part.
They're trying to figure out where their calibration is.
Liz Fong-Jones: It's almost like this thing that people complain about Google, "The engineering is like it's from the moon, but also the sales process and the actual connecting the business value also looks like it comes from the moon."
Beau: I think to a lot of business outsiders that's the obvious statement, but I think one thing that's lost and that I didn't hit, really, until I left consulting--
Because when I started at Workiva, they were on Google App Engine from beta.
What I learned there and what we learned there is you had to buy into their vision, and if you bought into their vision, "Holy crap. It's amazing."
And that vision actually was ahead of its time.
Liz: So, this is the pre Kubernetes days. Like, Google App Engine was at roughly the same time as EC2 before container engines and lambdas.
Beau: We were actually using both Amazon and Google App Engine, so we were using EC2 for our very high compute, but 90% of our work went through App Engine, which is actually a pretty good pairing.
Liz: You're the legendary Multi-Cloud user?
Beau: Yeah. The good version of Multi Cloud, not the lowest common denominator version of Multi Cloud that I can't really stand.
To me, it's like "Take advantage of the best systems. Or, the best parts of the systems."
The interesting thing is our time on Google App Engine actually led to us working quite a bit with Google on what became Kubernetes and voicing a lot of our input.
So it was an interesting time, we got to work closely with a lot of those folks being one of their early adopter customers.
But that was a learning experience too, and I think the interesting thing about if you really bought into the Google mindset is it set you up for the big technical challenges you would hit as you grow your company.
You were just already there, you're so constrained by the environment.
Liz: Right, exactly. It'll scale you to the moon, but it also has a cost to get onto it in the first place.
Beau: Yep. We definitely took that engineering-- Especially at that time, so few engineers had spent time with non-relational databases.
The amount of time we spent just getting people to learn datastore and eventual consistency, I've presented eventual consistency to so many people at this point that I can do it in my sleep.
Even literally leadership, of "Why can't we do this thing that we've been able to do in technology for 20 years? Well--"
Charity Majors: That's the thing. Yes, you can scale to the moon. But you also have to build the rocket that is shaped exactly to spec, and there's not a lot of flexibility there.
Beau: The interesting thing is those constraints allowed us to really build some pretty powerful stuff.
I think constraints can be good, and they certainly helped me understand what would eventually become observability, just because GAE was the definition of a black box.
All we had was logs, so we had to get really creative to understand what was going on behind the scenes.
That was back to the struggles with Google, that was part of it.
They didn't truly dog food, and nobody-- Like, Amazon doesn't truly dog food, but they certainly dog food more than Google did.
They just were oblivious to our struggles, they were like "We don't know, we have nothing here. We need help."
Liz: It's just some corp internal IT applications, nothing user facing and nothing hyperscale.
Beau: Yeah, and so we got creative to figure this out.
But it also allowed us to just be really smart, because we had less than 10 people in ops when we went public as a company.
Our R&D unit was focused on the business.
We weren't thinking about infrastructure, or any of that.
Liz: Which is super powerful.
If you have Google's SREs being your ops team and scaling team, it takes a lot of load off of your business.
Beau: So much of what we were doing was learning how to interact.
And same with them, by the way. I definitely have a lot of stories of some painful interactions.
Liz: Every SRE at Google has a sticker that says "We are a Snapchat SRE."
Beau: I remember going to conferences with the Snapchat folks and watching them sitting, just like us, sitting in the hallways trying to get their app to stay up.
Liz: Now would be a good time for you to introduce yourself.
Beau: My name is Beau Lyddon, I am an engineering manager at Workday.
I have done a lot previously from consulting, to being in leadership, to just being an engineer.
I just love to solve hard problems, mostly.
Charity: I feel like there's a sharpening up of the abstraction layer that needs to be happening between-- I feel like this is the thing that Serverless has done really well.
They've made it very easy for you to reason about what's yours and what's mine, and I feel like the Amazons of the world, they're clear on their boundaries and the abstraction level is clear, but it's also very bumpy and uneven. It's not always where you would expect it to be. Sometimes it's right up in your face, and sometimes it's down racking servers.
I feel like this is what I hope and expect over the next 10 years of technology, is that-- So, I was a systems generalist.
I started out doing mail and DNS and file systems and operating systems, literally everything.
But I don't rack servers anymore, it's been so long since I've gone to a Colo, I've forgotten how to flip the power button. And I like it that way.
You had to sell me on it at first, I was not stoked about it. But now I'm like, "Yay. Those brain cells can be recycled."
I feel like this level of a threshold of what can be recycled is creeping up the stack, and there's this uncanny valley there for a while, where we call it "Outsourcing" because it's really obvious and hard to us.
That gives outsourcing a bad name, but as soon as it's done well it's no longer outsourced, it's just something we don't think about anymore because somebody else does it better than we do.
Liz: Exactly. It's this build versus buy thing that people struggle with, and we see that with cloud ops and we see this with observability.
Charity: The angst about it. They don't actually struggle with it, they angst about it while it's in that uncanny valley of being so close yet so far away.
Beau: The thing that's maybe a little bit unique about me is I've always been a product person.
What little bit of nerd fame I have on Twitter comes from some ops stuff I've done, mostly because Charity promoted some of the talks I did.
But I've never actually worked on ops, although I am also old enough that when I wanted to build a website I had to go stand up servers.
The first job I had was enterprise IT, and I had to work with the IT group in installing the servers, racking them and putting the OS on them.
But to double down on your story there, we-- My old business partner, Robert, when we were still at Workiva and we were presenting at a conference together, he goes "Everybody here who's racked a server, raise your hand."
This was to 400 engineers, and I think 3 hands went up.
Charity: Whoa.
Liz: That's a sea change.
Beau: And Workiva was unique, because they were on Google App Engine from day one, so they didn't have that need.
The interesting thing, the reason we even ask that question, is because we were feeling constrained by Google App Engine and we were looking at leveraging more of these AWS services.
And engineers were coming-- It was so weird, because they were coming from this world of constraint.
So all of a sudden they're logging on at Amazon and they're like, "Oh my God. It's so much. I want to do it all. I want to play with it all."
Charity: Right.
Beau: It's like going back to my old days of being on Microsoft, when they'd send you the MSDN stack of CDs.
You'd be like, "Cool. I'm going to go--"
Liz: Whoa, yeah. You can pick anything from this range of 100 different tools, but which one is actually the right one for me? I don't know.
Beau: Probably none of them. But from a sales perspective, brilliant, because this engineer goes "Shit, I'll install this thing."
Next thing you know, you're trying to convince him to buy it.
I think this is why Amazon wins the business side, because they're like "We're going to give you all of that. We're going to let the engineers use their own damn credit card to pay for it to get started."
Charity: You're not going to have to make any hard choices up front. They'll pay for them down the line.
Beau: Google is like, "You have to completely change your mindset to even be able to use this?"
Which is one of their struggles, is you have to understand their software and their systems to really get the advantage of it.
And by the way, Amazon has the same problem if you actually want to save money.
Charity: Yes.
Beau: That's one of-- When we were consulting that was Amazon's biggest thing, is they were like "We need you to help us. Because so many people think using the cloud is just taking the same architecture and just putting it on EC2 systems, and then they're mad that it costs twice as much."
It's like, "It's because you're not getting any advantage of being on the system."
Liz: Also, the reliability is not necessarily there.
Because if you are counting on your servers staying powered on for years, that's not going to work, because that's not how modern data centers work.
We had to hire ops people now because we needed all the same stuff, and their job was a little easier but we still needed them.
Which is why ops people always liked Amazon, because they're like "I get to still do most of what I normally do, it's just instead of me having to rack and stack I can actually just click a button and get my server. Then I can still go configure it the same way I want to." Where Google is like, "No. You don't get to do any of that."
At the Google App Engine, obviously Google added more just because they were trying to compete with Amazon, and they did need some of that flexibility.
It was constraining. So which approach do you think is going to succeed?
How did we wind up with these two different paths of the Google, "Build it exactly our way because it works at Google scale," and the Amazon, "You can mostly run your previous architecture, but it's going to cost you."
How do we wind up at somewhere happy?
Beau: I think I'm a big fan of Simon Wardley, who does the mapping and showing how things become--
Liz: I just spoke at Map Camp 2020. It was super fun. Geeking out about maps with other Wardley mappers is super fun.
Beau: I think this gets back to Charity's point, each of those layers eventually becomes the standard of the market, the baseline.
Charity: Right.
Beau: The same way Kubernetes is basically becoming that, they are the market share owner, and then that becomes an abstraction that nobody will care about again.
Charity: Right.
Beau: At some point, nobody is going to give a shit about Kubernetes.
But we had to for a while, just like we had to care about Linux for a while, and we're going to keep moving up these stacks.
There will still be people who go back and do that, but there will be-- I always use beer as my example.
There's the Budweiser and there's the microbreweries.
There's a world where both exist, and eventually Kubernetes will be like the Bud Light, where it's like "I can go get it."
But there will still be space for the craft breweries who want to go do their custom stuff, and that's how I think about software.
That's how we thought about it at Workiva.
80% of our crap can be on App Engine, but we do have some high compute stuff where we need to go get that muscle, so we're going to go do it.
But I wouldn't want to run my entire infrastructure on that either.
That to me is the beauty, because I want the flexibility.
This is why I was saying earlier, "I don't like the boring Multi Cloud where it's like, 'We're going to try to standardize across all of these providers so you can just hop around.'"
Because you're giving all that up.
You're basically saying, "No. We just get the parts these guys all agree on."
Which to be fair is basically the fact I was saying earlier, it's the part that everybody's already standardizing on.
It's the Kubernetes that everybody's already standardizing on. "Great. OK, you're now using the same stuff everybody else is. You're not even taking advantage of the stuff that Amazon, Google and Microsoft have determined to be the differentiators that they're invested in."
Charity: There is a cost to diversity.
If you're investing in that cost, you want to reap the benefits of it.
I feel like this is the thing that we're struggling with a little bit with observability, which is how much do we force people to learn up front and understand up front?
Ideally, they wouldn't have to learn or understand anything.
But in fact, we want to change the way people do things because it's better for them--
Liz: Which means that we have to be opinionated, we have to say that "This is the way to do it."
Charity: What I think has pleasantly surprised me over the past few years is how far the market has moved to meet us, because when we started talking about this and when I was flying around the world just constantly giving this talk over and over, I was met with these blank stares.
So many people were like, "It's a solved problem. Datadog is going public. There's nothing left to be done in the space, what are you even doing here?"
I feel like it's gotten easier for us, not because we've changed what we've been talking about so much as a lot of other people are now echoing it.
Now they've started thinking about wide events, noticing the shortcomings of metrics, and thinking about instrumentation more.
But I also don't want to turn into the boundary of our time, who's just like "They were so great. They were so ahead of their time. Sure miss them."
Liz: I've told this story on previous O11ycasts, but startups run in my blood.
My uncle founded the first photo sharing and printing startup in the year 1998 or 1999.
They were ahead of their time, and they went under.
Beau: I'm reading a book now, it's called Meet Me in the Bathroom. It's about the music scene in New York in the late 90s, early 2000s.
They talk about how The Strokes were the first band that came in and broke up nu metal and boy bands, but they weren't the ones who made all the money.
It was Kings of Leon and The Killers, or whatever. It's always the people who are like, "They found something. I'm going to go do that better, or scale it, or whatever."
Charity: Look at the fucking market right now.
There are just so many database companies, monitoring companies, logging companies, APM companies, and they're all just like "We do observability, too."
That's what keeps me up at night, is that we will have blazed the trail but then we won't be the ones to find out the easy way to do it, or we won't be able to leverage our market size or our capital.
We've got our, what, 10 developers? They've got their hundreds?
Beau: The funny thing is, I think going back to Google, I think they did this with Google App Engine.
It was the first version of Serverless.
I've been working on Serverless since 2011 or 2010, but they just didn't win the market because of other reasons that had nothing to do with the technology.
It was other reasons. Amazon's like--
Charity: Sometimes you need a catchy name to get people to argue about it and be pissed off about it.
Beau: In Google's case, I think they were legit reasons.
I just don't think they could empathize with the enterprise customer, and I don't think they realized how much of the market was actually going to be--
This is what Microsoft figured out, they were like "Amazon is already eating into the startup world."
Liz: The money is migrating to enterprise customers, the money is not betting on there being 10 or 100 Snapchats.
There are not going to be 10 or 100 Snapchats.
Beau: Even then, that's all luck. At that point, you're an investor.
You're like, "I hope Snapchat is the one who makes it because we need them to be the one that makes it."
I think some of it was just it wasn't Google's culture to go meet them in the middle, Google's culture is "We're out here. You come to us."
Like I said, that appealed to me as an engineer who was like, "I'm aggressive. Yes, take me there."
But for a lot of people, "No."
Liz: I guess to Charity's point, how do we make sure that observability is not the "We're here, come to us" and instead we're meeting people halfway?
I'm really curious on your thoughts there, Beau.
Beau: I think the future of all of this, both Serverless and observability, start to come together.
Because I think this stuff will eat its way beyond the engineering staff and into the rest of the business.
This is the biggest thing, once people realize that everybody is becoming a tech company and everything is going through your technology and your bits, so that same data that business analysts were looking at manually collecting, they need that from your systems.
They need it just as much as the engineers do.
This is always why I say Amazon eats everybody's lunch, they can tell you the cost of everything.
To me, it's like "That data has got to come from the same data that the observability is."
So there's a way where you can even see this becoming--
I've always wanted to do this, like this data is the foundation that-- It's the data businesses have always wanted.
They've always wanted the ability to measure and have all this, and now they get it because almost everything they're doing is digital. It's just there, it's just nobody's providing it.
Liz: If I'm going to restate what you're saying, it's that we've siloed operational data as a separate category, when instead this is something that CFOs should be caring about just as much as CIOs, and they should be using the same data sources.
Beau: That's the presentation that Charity would tweet of mine, I basically talk about it because that was my job, because I would have to go take that data and go talk to everybody else in the org.
Investors, and everybody. Be like, "This is the data I'm basing my decisions on. I can literally present it to you and show you, this is real stuff. I'm showing you the cost."
Once again, back to the Google App Engine, most of our time was spent tracing down costs, because it would just scale.
We had a customer where, every time they clicked a button, it cost us $15,000.
We were like, "Oh God. OK, what are they doing?"
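As a rough illustration of the kind of cost tracing Beau describes, here is a minimal Python sketch that sums an estimated cost per customer and per user action from request events. The event fields, customer names, and dollar amounts are invented for the example; nothing here reflects Workiva's or Google's actual tooling.

```python
from collections import defaultdict

# Hypothetical request events: each one records who triggered it, which
# user action it belonged to, and an estimated cost in dollars.
events = [
    {"customer": "acme", "action": "generate_report", "cost_usd": 15000.00},
    {"customer": "acme", "action": "open_document", "cost_usd": 0.02},
    {"customer": "globex", "action": "open_document", "cost_usd": 0.03},
    {"customer": "globex", "action": "generate_report", "cost_usd": 12.50},
]


def cost_by(events, *keys):
    """Sum estimated cost grouped by the given event fields."""
    totals = defaultdict(float)
    for event in events:
        group = tuple(event[key] for key in keys)
        totals[group] += event["cost_usd"]
    return dict(totals)


def most_expensive(events, *keys):
    """Return the (group, total_cost) pair with the highest total cost."""
    totals = cost_by(events, *keys)
    return max(totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    group, total = most_expensive(events, "customer", "action")
    print(group, total)
```

Grouping by both customer and action is what turns "the bill went up" into "this button, for this customer, is the expensive one."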
Liz: That is why all these companies, like CloudZero and all of these DevFinOps companies, are going to be really great.
Charity: The talk that Beau is talking about is the one called What Is Happening: Attempting to Understand Our Systems.
I love it, it just starts with the quote that we have no idea what's going on in our systems.
We have no fucking clue, and nobody does, and we're starting to be held to account for this by society and by the government.
We can't really even defend ourselves until we start to understand ourselves, and then it goes into a lot of great best practices.
It's very ranty, it's very funny. We should all look at it, we should put a link to it in the show notes.
Beau: It was prescient. I slam Zuckerberg and Facebook right at the beginning.
Charity: It's the best, dude.
Beau: I was ahead of the curve on that one.
Charity: It's the way to my heart.
Beau: That was my point of it, that companies forever have wanted to be able to do this.
This is why they want us tracking how much time we spend on X, this is why they've got all these annoying things we never want to do.
They've always wanted it to make real decisions, and once you've been in leadership you understand why they wanted it.
When you're not in leadership, you're like "Oh my God, they don't trust me."
Or whatever, you think of all the worst reasons they're asking.
When really, it's just like "I want to be able to make intelligent decisions about what we do next."
Liz: This is why I am so excited about service level objectives, this concept from SREs is not just a Google thing.
It's an actual thing that people are adopting, because it enables them to control business outcomes.
"How do I actually concretely measure it, how do I even collect the data from my systems? And how do we make sure that the executives and the engineers are using the same source of truth and using the same data to make these decisions?"
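To make the service level objective idea concrete, here is a small Python sketch of an error budget calculation, one common way to turn an SLO into a number that both engineers and executives can read. The function name and the sample figures are assumptions for illustration, not anything described in the episode.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget.

    slo_target is the required success ratio, e.g. 0.999 for "three nines".
    1.0 means the budget is untouched, 0.0 means it is fully spent, and a
    negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failures the SLO tolerates over this many requests.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% target over one million requests tolerates about 1,000 failures.
    print(error_budget_remaining(0.999, 1_000_000, 250))
```

The same figure can back an engineering alert and a business dashboard, which is the shared source of truth Liz is asking for.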
Beau: This is back to why I don't think Amazon is going to lose anything any time soon, I think their company culture understands all of this.
I'm not saying they're perfect, I'm just saying they're so ahead of the curve compared to most companies you walk into who are just like, "Wait. What?"
Like Charity said, companies are getting better, but even then they're just getting better within the R&D unit.
Even thinking about it throughout the rest of the business to them is still so far away.
But I would get it going back to Robert, his wife at the time was one of Workiva's business analysts.
She'd keep coming to us, like "I need this." I'm like, "I don't want to keep having to answer that question. I just want to give you the data so you can go to it." She knew how to do everything, she didn't need me to build the actual data dump, she knew how to traverse the data dump. She just needed access to the data dumps. It's like, then she could tie all that together.
She could do projections based off history, she could actually project out and go "Our Google costs--" Back to the $15,000 thing.
We had a different thing where once again Google made it so easy, Workiva's data was very seasonal.
They did SEC filings, and come beginning of the year all of a sudden our traffic went nuts.
So we would go in and we'd just start tweaking the knobs to increase the system so it would scale better with it, but one of the problems we found is our CFO comes and goes "Our bill was $100K more than expected. What happened?"
Thank God at this point we were pretty comfortable with this, and this is what I mentioned in my talk too, we had to start thinking like economists.
We're going through this data trying to figure out "Why all of a sudden did our bill spike?"
We couldn't-- What shipped? All of this, and it ended up being simple.
Our version of an ops person just went and turned the knob up to 11, he was like "We had customers that were getting blocked in queues. I just cranked it up."
He forgot to turn it back down, and it was that simple.
That's what I'm getting at, this is where we're headed.
The companies that are going to be good at this, they're going to beat you on margins every single day.
They're just going to kill you. That's what Amazon does to everybody, they beat you on margins.
They are willing to take everything down to less than a penny because they know exactly what it costs them to do it, and most companies can't even get it down to $1,000 margins.
They're just so lost.
Charity: This is the other thing about Serverless, is that it is so much cheaper.
This is the other reason that I feel like the Serverless movement has been more successful than I expected, is because they're so fucking cheap and they can show it.
They show all these presentations about, "Here's how much it cost to run this app."
Your hundreds of thousands of dollars. "Here's how much it costs to do it on demand," and it's pennies.
Beau: It was funny, when I started back at Workday it was the first time that this had happened to me in a decade.
They go, "You're releasing a new service. I need some estimates so I can provision."
I'm like, "Oh God. I forgot how to do this. I'm used to having elastic services."
The advantage of that is I don't just have machines running, not doing anything and wasting money.
Our service is very on demand. It's pretty simple.
It should only do what it needs to do, it doesn't need to sit there waiting for requests to come in, just wasting money.
Once again, that's back to a mental shift. That's a cultural change for everybody to even think that, and that's--
Like I said, I think that's what you're running into and that's what Amazon and Google were asking us to help with when we were consulting them.
They were like, "You need to go help them understand why this matters."
And what we'd end up, we'd always come in because it was some VP of R&D who was like "We need the cloud."
Whatever that meant, "We need the cloud."
And then they talk to Amazon and Google, they're like "Cultural shift here. You should talk to these guys."
We go in there, and we wouldn't spend even half our time with that VP.
We spend our time with the executive teams, and we're like "You need to understand what's happening here. This is a cultural shift for you."
Liz: It's two different cultural shifts, almost. You have the shift to cloud and then you have the shift to Serverless and on demand.
Those are almost two separate conceptions, and in fact we strove some of that because we had an accidental $10,000 Lambda.
We set something running and we didn't realize how much it was going to cost, and we didn't have alerts set up on that kind of dynamic spend.
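An alert on dynamic spend can be as simple as projecting the current burn rate forward and comparing it to a budget. The Python sketch below shows the arithmetic only; the function names, the 730-hour month, and the thresholds are assumptions, and a real setup would more likely use the cloud provider's own budget alerting.

```python
HOURS_IN_MONTH = 730  # average month length, an assumption for the projection


def projected_monthly_spend(spend_so_far, hours_elapsed):
    """Extrapolate spend linearly from the burn rate observed so far."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return spend_so_far / hours_elapsed * HOURS_IN_MONTH


def should_alert(spend_so_far, hours_elapsed, monthly_budget):
    """Return True when the projected monthly spend exceeds the budget."""
    return projected_monthly_spend(spend_so_far, hours_elapsed) > monthly_budget


if __name__ == "__main__":
    # $500 in the first 48 hours against a $1,000 monthly budget.
    print(should_alert(500.0, 48, 1000.0))
```

A check like this, run hourly, would flag a runaway function within a day or two instead of at invoice time.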
Beau: So many stories about that.
We would have interns that would cost us $10,000 on a weekend because they'd start some service on Friday and forget to turn it off.
Literally, to give Google credit, they would actually be watching our systems close enough that they'd be like "One of your non production environments is churning through compute. Are you sure you want that to happen?"
And we're like, "Really? Oh, God. No, please kill it. Kill it."
But yeah, it wasn't that we accidentally ran out of memory.
It's that we accidentally spent thousands of dollars because the system would just do it for us, which was on one level brilliant.
Liz: The cloud is infinite.
Charity: I love that we're starting to get to a place where it's no longer OK for engineers to just engineer and not have some awareness of what business they are in and what they're trying to do.
I feel like at first, it feels like a drag. Those stodgy things and people, but it actually adds meaning to your life.
I feel like it actually has the potential to make you so much more invested in what you're doing and care so much more, and we care about autonomy and mastery and meaning.
What is business, if not the meaning of what we're doing and why we're doing it?
Beau: I agree. This is why I like product.
I got into product because I like helping people, I like building things that people like to use and seeing the impact on their day to day life.
Because a lot of these folks are stuck in crap jobs using crappy software that some IT person shoved down their throat, and it's like "No. I want to give them something they actually like so they don't hate their life."
But also the engineers, when you see that, it's like "This is great."
And I think especially the lower in the infrastructure stack you were, the less you saw that.
Charity: Yes. It's a shock to me, it's a very novel revelation to me.
Beau: But at least for me, I'm not sure if everybody is this way, but once you've done it you're like "This is amazing."
Charity: You can't go back.
This is my hope with observability and Serverless and everything, it's like, "Yes. It's a hurdle. No doubt, any time you're asking someone to change what they're doing with a tool, I feel like it has to be an order of magnitude better than what they've got in order for you to look them in the eye and say, 'Yes. This is worth it, yes, we should try it.' Once they've tried it, once they've seen it once, you can't go back. You cannot unsee."
Beau: That was always my struggle, both consulting and even coming back to Workday, which they're just an older company.
They're trying to get through this process too, they're in the same state, they just started sooner than Workiva did so they started pre cloud.
So, the same things. It's legitimately hard, I feel so constrained.
This is so weird, because once you get into high level leadership--
Then I went back to doing and writing software, and I was like, "Oh my God. This is so slow."
Liz: We have to have empathy for these people in order to actually be able to change their behavior.
You just can't say, "Here are my stone tablets from the mountain."
Beau: For me, it was like I used to view entire teams as functions of code.
Because I'd be like, "They're going to go get me this thing--"
I was at a high enough level, that's just how I saw the world.
So when I was down, I'm like "I just want this service to exist."
Then I'm like "Shit, I have to actually go build this."
Which once again is why I love the cloud, because I'm like "I can just go turn it on? You mean I don't have to go install RabbitMQ?"
By the way, installing it's bad enough but maintaining it is the nightmare that you don't even-- That's the other hard part about this, since we don't measure this stuff very well.
Once again, we measure the actual costs of running RabbitMQ but we never measured the cost of maintaining RabbitMQ.
So you don't even have a baseline to compare it to, that's the other difficulty is when you're trying to sell this, "In this new system we'll be able to measure it."
We used to hit this just literally on the management side of things with things like OKRs.
We'd be like, "You should look at doing this to help you understand."
Literally leaders would be like, "But nobody else is doing it. If I'm the only one who's measuring my work, I'm the only one who can be held accountable."
It was like, "This is true."
Charity: This is the thing, with engineering managers who are often trying to make the case for build versus buy, or for buy instead of build.
But because no one has ever asked them to see their job in monetary units before, they don't know where they're at so they can't make an argument for why one is better than the other.
This is a huge problem, and it's endemic in our industry right now.
Lots of people, they're sold.
Let's say they're sold about observability or buying a tool or whatever, but they don't know how to craft the arguments because they haven't been expected to measure anything all along.
Beau: I'm actually very jealous of a presentation at AWS re:Invent, and generally I find all of those types of conferences pretty worthless.
The hallway conversation is the only thing valuable, but there was a presentation and it was an actual Amazon engineer, not an AWS engineer but an Amazon product engineer, and he was giving a presentation about a service he had to build.
He literally just walked you through his development process, and he literally pulled up the AWS cost calculator.
Like, "This is how we work. Here's the first architecture I thought I had."
Then he plugs it in the calculator, "Oh my God this is going to cost us a million dollars a month. I can't possibly justify this. I'll never get this through."
So he re-architects his system based off the cost, not based off literally anything else, just the cost of running that service.
By the time it's done, he's got it down to very cheap.
One of the brilliant things that I learned, and this is the part that I was so jealous of.
Once again, back to the margins part of it, they passed customer configuration data with every request so they can dynamically respond.
"Oh, you're actually abusing your quota. We're going to put you in the bad queue."
I'm like, "Oh my God, that's brilliant. That's exactly how you do this. This is how it dynamically adjusts to your customers," and then me as a consumer when I go flip the switch and say "Give me more power," I don't have to go provision anything.
It just happens. It's like, I'm jealous of people who did that.
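The pattern Beau describes, where customer configuration travels with every request so the service can react to quota abuse on the spot, might look roughly like the sketch below. This is only an illustration under stated assumptions: the class names, the `used_this_minute` counter, and the queue names are hypothetical and are not taken from the Amazon talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerConfig:
    # Hypothetical per-customer settings carried along with each request.
    customer_id: str
    quota_per_minute: int


@dataclass(frozen=True)
class Request:
    config: CustomerConfig  # travels with the request, so no lookup is needed
    used_this_minute: int   # hypothetical usage counter attached upstream


def choose_queue(request: Request) -> str:
    """Send over-quota callers to a slower queue instead of rejecting them."""
    if request.used_this_minute > request.config.quota_per_minute:
        return "throttled"
    return "standard"
```

Because the limit rides along with the request, raising a customer's quota changes routing on the very next call, without anyone provisioning anything.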
Liz: I had these interesting conversations with Amazon, though.
Amazon has been reinforcing, to us at least, that they would rather have us do the right thing architecturally as long as it's sensible, and not have to worry about the cost.
They'll fix it up on their end, rather than going into contortions to game the Amazon pricing.
That's been one of our architectural struggles with our Kafka, with our Ingest pipeline, that Amazon data transfer fees are very expensive.
We don't want to have an outage because we skimped on some level of reliability, but also the costs were just not sustainable for us.
Beau: I believe that's a perspective thing, that's the AWS folks telling you that given how much they've probably been burnt by other people saying, "I did this as the cost cutting thing."
They're like, "I don't want you to come back at me and be angry."
Liz: Yeah, that's exactly it.
I'm glad that they're so responsive to the feedback, it's this interesting--
Going back to our earlier conversation about constraints, constraints can be helpful creatively but they can also hurt you.
So, figuring out what's a good constraint and what's a bad constraint is really hard.
Beau: This is why I have so many engineers that report to me, and they're like "I want to become an architect."
I'm like, "What is an architect?" When I think of what an architect is, it's literally that.
It's the type of people who think about those type of things and make those kind of decisions, and you've just over time built up enough expertise that you start to have an intuition for those type of things.
You're like, "This is a really poor design, even though it's a cost savings I think I need to go--" and then you need to go sell it.
You need to go explain to somebody, "This is why we need to make this more expensive decision."
That takes experience and pain that you've been through, and understanding.
Liz: That's what a senior engineer is.
The senior engineer is capable of designing a system and defending the choices, and explaining not just why this is technically pretty, but what is the problem being solved for the customer.
Beau: Back to Charity's point earlier, it's not just the product.
I think over the last twenty years our industry has done a much better job of catering to customer needs instead of just telling them what the hell to do. It's tragic what we did the first however many years of IT, where it's like "No. You're going to use our crappy software."
I will forever be grateful to the iPhone, because I was in an enterprise industry when the iPhone came out and the C levels were like "I want my iPhone."
And the IT departments are like, "No. You're going to use our Blackberries."
And they were like "No I'm the executive. You're going to support my iPhone."
And then as soon as the iPhone started coming in and everybody's like, "This is good software," and the entire enterprise world just started to change.
Because all of a sudden people are like, "I work with good software every day on my phone. Why am I using the worst possible software that has ever been created?" It's because it was created by people who never cared about the customers. They weren't empathetic to them, they just built the thing they thought-- They didn't even talk to them, let alone observe them.
So I think we've done better there, but the next thing and the next step is the cost side of it.
It's not just about building the best experience, it's understanding the cost.
As we're seeing with companies like Facebook and Twitter, externalities.
There is stuff, we have not-- There's no way Zuckerberg or any of them thought they'd be in a world where they're impacting literal global outcomes.
They're like, "Wait. I didn't sign up for this."
Liz: Accidentally created a genocide in Myanmar. Whoops.
Beau: This is the counter to Andreessen's "Software is going to eat the world."
If we're going to be part of that industry, we have to understand the responsibility of that.
It's all of us, and we've just been slowly working up our chain. Finally, we're not abusive to our customers.
At least just in the product experience, but we might be abusive to them in other ways.
We need to understand that. I'm not saying there's always good solutions to that, these are hard decisions.
This is also why you're in leadership and get paid more to be that person.
Charity: The whole attention economy, what do we do with that?
I think that we all have this ideal in our heads that if anyone is going to be making these decisions that it would be a democratic process.
Instead, we've got these unelected corporations that are making these incredible decisions about the future of our species and A, we didn't consent to that. B, they didn't either.
I don't think they ever wanted to sign up for that, and we've just ceded this space to them because they were there first.
I feel like we have to rebuild our trust in government so that we can take that back somehow, because now we're way above our pay grade.
Beau: But this was the point of my presentation. This is where I think this is all going.
This is the point, if we can't even understand our own systems, how are we possibly going to be able to have these conversations? This is what I mean.
Charity: How can we even look ourselves in the--?
Beau: We're just guessing. It's no different, and I get it.
Facebook, Twitter, those companies are at such scale.
It's easy for me to intuit why they are struggling, but every business has got to go through this and be able to explain these things, and we are so integrated into life.
Charity: So, what do you think is a reasonable percentage of your infrastructure budget to spend on observability?
Beau: I don't know. I'll tell you what we did at Workiva.
As I mentioned, Robert, he ended up becoming the VP of all infrastructure, ops, support, everything.
He set a hard limit of our entire R&D spend for all of operations at 10%.
Because he went and researched other companies, like Google and all of them, he was like "They're 50/50 at best. But we don't need to do that, we're on the cloud. Why should we need to do this?"
Charity: What?
Beau: "We don't need people stacking servers. We don't need this many people in ops."
And we didn't, but what we did spend our money on was things like that.
Literally observability, it's like "I don't need to spend-- If I put the money into Honeycomb, I don't have to pay for somebody. And it's always more expensive to pay for the labor."
So his view is, "I want software to solve my problems."
His view from the ops world is "I want the majority of ops to be software, and that includes the observability."
Liz: It's a challenge though, getting people to treat their ops budget as being the combination of SaaS and headcount.
Most people see them as two different buckets, and are like "I'll give you all the headcount you want, but you can't have money to spend on outside staff."
Charity: Because the budgeting process has completely different parts of the org at stake.
Beau: There's a lot of factors.
There's also the control side of it, we're still not totally comfortable handing things over to software and we want there to be a person I can just go strangle.
Even those things I talked about with the budgeting thing, it was always difficult for us when they'd come and be like, "I need to know this specific thing."
It's like, "These are complicated systems. There is no--"
Charity: I've heard your very complicated answer.
Now just pull a number out of your ass, what percentage of the overall infrastructure budget would you say is reasonable?
Beau: It depends. Are you including all your system costs?
Charity: Yeah.
Beau: It should definitely be, if you were doing 10%, I think it should be at least in that 1% range of that.
It should be 10% of your operations costs in that realm to me, because that is the tool I use for everything.
If you have good ops, that is at some level outside of the provisioning system--
Charity: The only thing you need.
Liz: I would almost counter and say you're starting with the wrong denominator.
When you start at the percentage of your infrastructure budget, rather than saying that "Observability is a developer productivity expense--"
Beau: Yes.
Liz: I think that that's how we change the conversation, is that it's not about improving the cost of operations for your boss.
It's about making your devs be able to go faster.
Beau: I don't know, obviously, what your guys' sales world is, but that is a tool for engineers not a tool for ops.
Ops had this struggle constantly, because they're like "No. You're going to use Splunk."
Because IT and ops people love Splunk, and we're just like "No."
Charity: I feel very offended the way people talk about ops, because that's not been my experience of ops.
Operations engineering is excellent in my world, but yes, this is a tool for people who write and ship code.
If you call that engineering, fine.
Beau: I should probably call this "Enterprise ops."
It's more the ops you would see in the enterprise type companies, versus what you'd maybe see at a Facebook.
Charity: I get that.
I think that what I was trying to get at is that I feel if you're spending $100,000 a year on your infrastructure, then I do think there's something proportional to that where you need to spend just on understanding that infrastructure.
I don't know that it scales with the team, I think that it probably scales with the complexity of the infrastructure itself, because it shouldn't really matter if you have 10 people using it or a thousand people using it if they can understand it.
But I feel like 30% is a pretty reasonable expectation, it's going to cost somewhere between 20 and 30% of what you're spending just to understand what you're doing wrong.
Beau: This is probably where I was a little off on that, because when I'm saying "Ops" I really mean everything, not just writing software for the business.
I'm including people owning CI/CD, any of that. To me, you should be spending just as much on the observability as you are anything on CI/CD systems.
Liz: 100%. People have overfocused, "DevOps means CI/CD," and the answer is "No. That is not all that DevOps does."
Beau: Anybody who's been in this world knows, you can't write enough tests to cover everything.
You're always going to have this other stuff, and as our systems get more complex the testing becomes a little--
It's still valuable at the lower level, but that's not where your hard problems are.
Charity: You cover the basics so that you don't ship any dumb shit.
I think if you look at the DORA report, that percentage of teams that are elite or whatever, it was 7% in 2018 and it's 20% in 2019.
That bubble is getting bigger and going higher, and I feel like I don't have any data to support this, but I feel like that bubble consists mostly of teams that are doubling down and shifting their center of gravity to production.
Whether that's chaos engineering, observability and feature flags, and progressive deployments.
All of this energy that we've just been futzing around in the preproduction environments, and people spend months--
You can have one environment per developer and they do all this elaborate shit, and then they have no cycles left over for putting guardrails around production.
I'm just like, "I'm not saying there's no value to staging. There's some, but invert your priorities please."
I feel like the teams that have made that switch to production first, they are so much more effective and they are so much more able to just react quickly to identify what's going on and to be really powerful where it counts.
Liz: It's the mentality shift. It is 100% the mentality shift, it's easier to do an incremental change versus doing a dramatic change.
I think that's the hill that we have to push the rock up.
Charity: I'm not an incrementalist. I am a "Go to the end motherfuckers, burn it down, and--"
Liz comes along after me and makes it actually work.
Beau: I've been burned enough times, that I want to be the aggressive type but now I'm much more politically savvy about it.
But part of it is because I spent so much time having the same argument, and the unfortunate one with that is that argument was with devs.
Honestly, at Workiva even the leadership was like "No. That is where the problems are."
They were the ones pushing for chaos rules, and this gets back to the black box stuff at Google.
We were so constrained we had no choice, and at that point Google App Engine mocked out our local environments.
It was nothing, it wasn't even multi threaded. Just completely not realistic.
The non production environments, you would change the configuration so much that once again it was basically useless.
Then all of our problems happen in production in these certain environments, and it's like "Why can't we be there?"
And then it's like, "OK. We're building a SaaS product, and if we're building a SaaS product by default we support isolated user behavior. So if we can't have our own Workiva dev version running in production along with everybody else, what the hell does that say about our system and what we are doing to our customers?"
It's back to, again, empathizing with our customers.
If we don't even trust being able to run potentially risky code through our system, because your customers are putting risky stuff into your system every day and you have no clue what the hell they're doing.
My God, the amazing things they will do.
Charity: Customers are the original chaos monkeys, and they're so much better than anything you'll ever--
Beau: This is-- I mentioned my interaction with Ben Treynor, this actually gets to that.
Because it was their version of us doing that to them.
Where we, and this is why I got into observability, we were hitting them using their task queue system.
Where they had pull queues which we were basically treating like Kafka, and it was an abuse of the system.
It was not meant to work this way, but we were on Google App Engine and we had no Kafka available.
We investigated running Kafka, we weren't going to do it. Way too expensive, and we didn't want to hire entire devs just to maintain Kafka.
We're like, "All right. But we want this capability."
So we hacked the task queues to make this work, where we would build our own cleaner pull queue agent that we would insert with a task, basically a runtime, abusing the task naming system to do this.
Liz: Of course, Google had its own pub/sub system that they eventually released as Google Cloud Pub/Sub, but it wasn't wired to Google App Engine at the time.
So, you were using a system designed for a completely different thing.
Beau: This is the brilliant thing with it, because I'm doing this and we're debugging it, and we're like--
What was happening is we were noticing that our thing to pull data off, those tasks were not running right away.
Google is like, "If you put a task in it'll run as soon as it can."
And generally they are pretty good.
We could measure that out and say, "Usually within a second or something, and depending on how you configure the queues, that would be faster or slower."
So we had a pretty good idea, but randomly certain tasks we were just seeing were taking forever.
Just forever, and then we actually started getting in and looking at our measurements and the weirdest thing started happening.
We go, "All right. So if it doesn't happen within a 2 second window, it's at 60 seconds. And if it wasn't within 60 seconds, it was 120 seconds."
So I literally built a whole tool to prove this to Google and I sent it to their devs, I'm like "My guess is you have some sort of background cleaner system that has come through looking for tasks that are stuck, and reruns them. I'm guessing that runs every 60 seconds."
That's basically what happened, and from the outside using these tools and measurements, I guessed their architecture from the outside.
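The measurement behind that guess can be sketched in a few lines: bucket each observed task delay as either "fast" or as landing near a multiple of a suspected sweep interval. The 2-second window and 60-second interval come from the story above; the function name, the tolerance, and the sample delays are hypothetical, and this is not a reconstruction of the tool Beau actually built.

```python
from collections import Counter


def bucket_delays(delays_s, fast_window_s=2.0, sweep_interval_s=60.0, tolerance_s=2.0):
    """Count delays that are fast, near a multiple of the sweep interval, or neither."""
    counts = Counter()
    for delay in delays_s:
        if delay <= fast_window_s:
            counts["fast"] += 1
            continue
        # Nearest whole number of sweep intervals for this delay.
        multiple = round(delay / sweep_interval_s)
        if multiple >= 1 and abs(delay - multiple * sweep_interval_s) <= tolerance_s:
            counts[f"{int(multiple * sweep_interval_s)}s"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical sample of enqueue-to-run delays, in seconds.
sample = [0.4, 1.1, 0.9, 60.3, 59.8, 120.5, 0.7, 33.0]
print(bucket_delays(sample))
```

If nearly every slow task falls into the 60s or 120s buckets and almost none into "other", a periodic background sweeper becomes a plausible explanation for the gaps.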
But the beauty is that at one of these Google conferences with some other folks from Workiva, Ben Treynor walks up and he goes, "Who are you all?"
And we say, "We're Workiva."
He goes, "I know you. We've been having some interactions with you."
And I go, "I think I'm that person."
And we're walking through it and he goes, "I tore into my team after that."
Because he goes, he walked in there and he's like, "Why is this still a problem?"
They're like, the devs basically said, "They're using our system wrong."
And he goes, "It's not your fucking job to tell your customers how to use your system. It is your job to support how your customers use your system," and I don't know.
That could be apocryphal, I don't know. I was just like, "He is correct."
Liz: That is exactly the mentality. Your job is to make your customers happy.
Charity: I remember being at Parse and being on the other side of one of these, where we had engineers and we gave them SDKs so they could build mobile apps, and I ran all the databases on the other side of it.
They did the most horrendous things with queries.
They would construct these queries just doing a 5x full table scan just to return a single--
Just terrible, and I used to get super pissy at them. Just like, "I hate our customers."
Then I realize, "They've got SDKs and it is in no way apparent to them what is happening in our API layer, in MongoDB.
We don't feed any of this information back to them, they have no way of seeing that it's doing anything but taking an unpredictably long time."
I feel like it's our job when we're dealing with engineers as customers, especially, to support and empathize but also to try and feed back enough information so that they can make the decisions, and so they can form a mental model of how the system works, so that they aren't just-- It shouldn't be a black box.
I feel like I have often said the part of the world that is on the cutting edge of how to instrument for observability is Serverless, and I guess Google App Engine.
Because it's the idea that you may not have any access to what's going on under the hood, you may not have the access to the system metrics.
You don't give a fuck, all you care about is "Can your request execute from end to end? Can it get the resources that it needs? And if not, why not?"
That's all you need to know, and you should be able to tell all of that with the instrumentation that you can write for yourself.
Beau: We learned that even at Workiva ourselves, because we built basically an Excel-like product.
Any time-- That's an IDE. Excel is the largest IDE in the world, and we're building a tool like that.
We allowed them to do even more powerful stuff where they could create formulas that cross this out.
We'd end up with calculation chains that were just millions of pieces of data, and the stuff our customers would do was astounding.
They would use formulas to do language translations. Are you kidding me?
The more they did that, the more they complicated our system, the more we would react to that.
They'd overcompensate, they would just keep pushing us.
This is the world we're all in, and this is literally how the world works.
Somebody pushes you, you build your system to allow a little more space, and then somebody new comes along and pushes it further.
This is what you're seeing on the cloud, each group is just pushing them further.
Liz: There's this talk made by my former boss, Dave Rensin, of Google Customer Reliability Engineering.
He basically argues, "Every product eventually becomes a self-contained API. If you make it available to customers, people will develop an API around it and people will use it for things you didn't anticipate. You need to be in dialogue with your customers or else your product is going to be a non starter."
Beau: The tools we built for ourselves to understand what our customers were doing to us, we had our own customer conference with our customers and we did this nerdy thing.
We didn't even think anybody would like it, but we just had our back end engineers show up because we knew there were some customers who really got into it.
Like, "I do that to your system? That's what it looks like?"
We just set up a booth and we had all these charts and all showing, and we built diagrams.
They looked like starburst diagrams of the complexity of their system, and they became obsessed.
They were like, "Could you build these tools in our product? I want to know."
It was just like an engineer going, "Yes. You're finally letting me see what's happening."
Liz: This is what we really love at Honeycomb about sharing our graphs with people.
We're a very transparent company because we know people really value that transparency, and if we can help them get better performance they're happy to change. They just need to know that they can get better performance if they just tweak one thing.
Beau: It's un-intuitive for engineers, because I think we came from this mindset of "We should fix everything for you."
Honestly, we can screw too many things up with that where we fix it in the wrong way.
Where more ideally, do as much as you can but give them the tools to fix it themselves. Let them have some autonomy.
Charity: We're probably over time here, we might have to snip some of this out.
But I was hoping that the last question on our list, "How observability is eating into the corporation shouldn't stop in your software," maybe you can for just a minute or two talk about that?
Beau: It goes back to what I was talking about earlier, as much as software starts to eat into the rest of the world in our enterprise, observability has got to come with that because we need to understand it.
To me, observability is just a way of saying "Understanding our systems."
The more it eats into the rest of the enterprise, the observability, by default it's going to have-- Something is going to have to come with it, because all of this is becoming too complex.
Part of this for me was at the same time I'm working on these systems at Workiva, we're growing as a company.
I'm moving up, I'm having to build out org charts, manage -- All of this is connected, everything is connected and I have no way--
I remember bosses coming in and being like, "Can you give me an architecture diagram?"
And I'm like, "It changes every day. We're shipping literally every day, we're changing it. It needs to be dynamic. It's another system that is just visualizing things."
Liz: Anything you put on paper is going to be stale 3 hours from now, 3 days from now, 3 years from now.
Beau: Even the business itself is changing that dynamic. Teams are changing that fast.
Like, by the time you get an org chart there, somebody is moving across to another team.
It's the same types of systems that are going to be more and more common.
Charity: Everything live and in real time. That's what helps you win.
Beau: The hard part is, I think maybe from your guys' perspective is, how generic do you go versus how specific?
And that's back to what you were saying at the very beginning.
Charity: Oh, boy. All right. We're going to have to put that off until the next podcast.
Thank you so much for joining us, Beau. This was super fun to have you. It's nice to get to talk.
Beau: Of course.