O11ycast
34 MIN

Ep. #64, Shared Language Concepts with Austin Parker of Honeycomb

about the episode

In episode 64 of o11ycast, Jessica Kerr and Martin Thwaites speak with Austin Parker of Honeycomb. This talk explores how observability within an organization provides a shared language between teams to discuss reliability. Other topics examined include numeronyms in the tech space, OpenTelemetry, and Kubernetes.

Austin Parker is Director of Open Source at Honeycomb. Austin is also an OpenTelemetry community maintainer and has been with the project since the beginning. He was previously Principal Developer Advocate and Head of DevRel at Lightstep.

transcript

Jessica Kerr: We were talking about o11y and not everybody knows that that is pronounced Olly and means observability. I observe that also OpenTelemetry could be abbreviated to O-11-Y.

Martin Thwaites: Ooh, that's interesting.

Jessica: Yeah. 13 letters, so you get the O and the Y and there's 11 in between.
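
The numeronym pattern Jessica describes is mechanical enough to write down. A tiny illustrative sketch (not from the episode, just for readers who want to check the arithmetic):

```python
def numeronym(word: str) -> str:
    """First letter + count of middle letters + last letter."""
    if len(word) <= 3:
        return word  # nothing worth abbreviating
    return f"{word[0]}{len(word) - 2}{word[-1]}"

print(numeronym("observability"))  # o11y
print(numeronym("OpenTelemetry"))  # O11y
print(numeronym("accessibility"))  # a11y
print(numeronym("kubernetes"))     # k8s
```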

Martin: Interesting.

Austin Parker: So what you're saying is we can action someone here, right? We came up with O-11-Y and so now we can be like, "Oh, they're just following in our footsteps."

Martin: I mean, does anybody really come up with those things though? Because realistically the whole idea of concatenating and all of that kind of stuff and using those numbers, it's not an innovation, it's not something that we've come up with. It comes back from the I-18-N.

Austin: Yeah. It's a shorthand, a useful shorthand for how to say really long words without saying really long words. Actually, this brings me to one of my favorite little OpenTelemetry stories, which is the CNCF asked and we were like, "No, we actually don't want a trademark on OTel." We didn't want the abbreviation trademarked because we wanted to reserve that for the community and vendors and whoever else to be able to use OTel as a shorthand without having to go through trademark BS.

Martin: Interesting.

Jessica: Oh, that's useful to know.

Austin: Yeah. That's why we can do OTel unplugged and not have to mess with anything because OTel is just four letters.

Martin: I mean that would be O-2-L.

Austin: Yeah, O2L.

Jessica: It pronounces well.

Austin: O2L, it pronounces well and it sounds like a European synthpop band.

Martin: Yeah. I found out recently as well that even just A11Y which would be-

Jessica: Accessibility?

Martin: Accessibility. These are not things that are as widely known. I think it's that whole bubble idea that, to me, o-11-y and "olly" is just a thing. I was talking to somebody recently, not only did they not know that o11y meant observability, but they didn't know how to pronounce it either. I was like, "Yeah, it's Olly. You just say Olly."

Jessica: Whereas all of our listeners know that already.

Martin: Exactly.

Austin: Because you're listening to the o11ycast.

Jessica: Dun-dun-dun.

Martin: But yeah, it was just an interesting thing about that bubble idea that there's this, as far as I'm concerned, a big, massive thing called o11y and so many people don't know it.

Jessica: Insider knowledge.

Austin: Yeah, I always joke, any individual thing that happens in OpenTelemetry, there's only like 50 people in the world that actually care about it.

Martin: Why am I always one of that 50 as well?

Austin: Because you care a lot, Martin.

Jessica: It sounds like a you problem, Martin.

Austin: Yeah. That also kind of sounds like a you problem. I care about a lot of them, but I am definitely one of those 50 so...

Jessica: Okay. But collectively, as a whole, the OpenTelemetry project is useful to thousands and thousands of people.

Austin: Oh yeah, absolutely. And that's the wild thing about it, right? How do you balance-

Martin: The needs of the many over the wants of a few?

Austin: Well, there's that. I think it hit me a while ago and it wasn't even something that I did but it was something my partner did, is she typed OpenTelemetry into the settings app on her iPhone and it popped something up because there are applications that run on iOS that use OpenTelemetry libraries and so they have under the library section or the dependency section-

Martin: Copyright notices, yeah.

Austin: Yeah. The copyright notice, the whatever. And it's just like, "Oh, this thing that I have been a part of is running on millions and millions of phones." It's a scary thought.

Martin: Yeah. I have the same with Mono. I committed to the Mono project, and Mono is in MonoGame and all of these things.

Jessica: And Mono is like .Net on Android?

Martin: It's .Net on anything but Windows, way before .Net Core was a thing and Microsoft weren't really involved in trying to make things cross platform. There was a splinter group of people that decided to try and make something that would compile C# code into something that would run on Linux and that kind of stuff.

Jessica: And when you want something to be really cross platform, you get the situation that is an extreme in OpenTelemetry of just the breadth of the project, and it takes a community to even care about, much less write, all the different libraries that have automatic instrumentation in all the different languages and frameworks. Yeah, sometimes for all the different devices. That is OpenTelemetry.

Austin: Yeah.

There's an interesting sidebar note here too, which is one of the things that I like to talk about when I talk about observability as a concept is that I think observability gives us, as a community, a shared language to discuss reliability, really. When we get down to it, right?

Martin: I agree.

Jessica: Okay. What is that language?

Austin: So it gives us nouns and verbs, it gives us parts of speech. I think actually Martin and I had a really interesting conversation earlier today where we went back and forth for about 100 Slack messages.

Martin: And we nearly broke Slack.

Austin: Violently agreeing with each other, but missing the point because the language that I use and the language that he used are just slightly disjointed and I think that's very common. It's common in language, certainly, and it's especially common in technical language and even more common in technical language that refers to non-academic pursuits. OpenTelemetry is a good example. A trace means something in OpenTelemetry, it is a semantic, it is a thing.

A trace means a collection of spans and those spans are connected in this way and there's a data model, there's a specification, there's all this. But then there's the logical concept of a trace and I find when I talk to people who don't maybe have a strong observability background or are kind of new to this, they think depending on what language they come up in they'll think of a trace as something different. People will think of it as a stack trace, they'll think of it as trace level, line level information about what's going on.

Jessica: A trace of operating system calls.

Austin: Right. Like a kernel level trace or whatever. So what I think is actually very powerful at OpenTelemetry is it gives us as a community of practice the ability to redefine a lot of terms and get everyone onto the same footing when it comes to saying, "When I say trace and you hear trace, what do I mean? We mean the same thing, we're talking about the same thing."

Jessica: And that's because OpenTelemetry has a spec and the spec defines a trace.

Austin: Yeah.
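
As a minimal sketch of that spec-defined shape (deliberately simplified, not the real SDK types): a trace is nothing more than a set of spans sharing a trace ID, linked parent to child.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str  # shared by every span in the same trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None  # links spans into a tree

root = Span(name="handle-request", trace_id=uuid.uuid4().hex)
child = Span(name="query-database", trace_id=root.trace_id,
             parent_span_id=root.span_id)

# The "trace" is simply all spans with one trace_id, connected
# parent-to-child; that is the semantic the spec pins down.
assert child.trace_id == root.trace_id
assert child.parent_span_id == root.span_id
```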

Jessica: Yeah. So I completely agree with you that OpenTelemetry gives us a common language. Observability on the other hand, everybody has their own idea what that means and for some people it's kernel level traces.

Martin: I meant something different when I was talking about the shared language that observability gives us.

Jessica: See? Words are hard.

Martin: What I was meaning was that observability inside of an organization provides a shared language between teams, because once we start talking about how we observe our systems we start talking in a shared language. We start talking about HTTP routes and we don't talk about the ASP.NET Core package, we don't talk about Flask, we don't talk about that kind of stuff. We talk about the service and the routes and the API calls it makes, the database calls it makes-

Jessica: Because we can all see them in the same trace, and look at the same trace and it's not my logs versus your logs.

Martin: Yeah. And hopefully the semantic conventions are part of that whole idea about us using shared language to say how our systems work, which means it transcends the operating system language, the frameworks, the coding languages that we use.
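
For illustration, here is roughly what those shared names look like on a single span. The keys follow OpenTelemetry's HTTP and database semantic conventions (exact key sets vary by semconv version); the values are invented:

```python
# Whatever the framework (Flask, ASP.NET Core, ...), the attribute
# names are the same, which is what makes the language shared.
span_attributes = {
    "http.request.method": "GET",
    "http.route": "/api/orders/{id}",  # the route template, not the raw URL
    "http.response.status_code": 200,
    "db.system": "postgresql",
}
```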

Jessica: It's almost like if you had a standard way to define infrastructure.

Austin: Well, to that point, I agree that, yeah, those are the... OpenTelemetry is providing the nouns and verbs, right? And observability is giving you the sentences and the way you talk about it. It leads you to things like how do we communicate reliability or expectations of reliability across these different organizational silos or teams or whatever, and it gives us SLOs as an example.

It gives us certain types of visualizations that are useful and those are influenced by the nouns and verbs OpenTelemetry gives us, so OpenTelemetry gives us histograms, it gives us traces that encode duration in a certain way which is very useful to put into a heat map. That is a better way of visualizing and exploring it.

If you think about Prometheus and you think about a lot of extant logging and metrics frameworks, they are leaky abstractions over the way the data is stored and queried, and that drives how people actually think about the data that goes into observability. So by having OpenTelemetry designed to provide a native framework for observability data, it influences those... OpenTelemetry is the notes and observability is the composition, right? They build on each other.
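
A toy sketch of the heat-map idea Austin mentions: bucket request durations into a histogram per time window, and stacking those windows side by side gives the heat map. The bucket boundaries and synthetic durations here are made up:

```python
import random
from collections import Counter

boundaries_ms = [5, 10, 25, 50, 100, 250, 500, 1000]

def bucket(duration_ms: float) -> str:
    """Return the histogram bucket label for one request duration."""
    for b in boundaries_ms:
        if duration_ms <= b:
            return f"<={b}ms"
    return f">{boundaries_ms[-1]}ms"

# One window's worth of synthetic, roughly log-normal latencies.
random.seed(42)
durations = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]
histogram = Counter(bucket(d) for d in durations)
```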

Jessica: And I see Kubernetes as supplying that kind of language for the infrastructure the code runs on.

Martin: Because everything is a CRD. Everything is a Custom Resource Definition.

Jessica: No, no. Lots of things are built in resources.

Austin: Everything is in objects.

Jessica: But there's power when you combine Kubernetes and OpenTelemetry because then we have some standard infrastructure concepts being expressed in a standard way in our traces and metrics and then what do we get?

Martin: This is that how do we... We've talked about correlation a lot: how do we correlate logs with metrics, with traces? When we bring in infrastructure stuff, now we've got this common language of pod, because that's a unit of deployment, a bounded idea of how we deploy something. Now, because we're using, say, OpenTelemetry, we're using their semantic conventions, we can have that shared language to ask which metrics relate to this particular pod or this unit of deployment that we've done, versus this other unit of deployment that we've done.

What about this customer centric stuff like the tracing ideas? How do we relate that down to some of these infrastructure components which we didn't have when we were deploying things to, say, EC2 instances or Azure VMs because we didn't have those shared language concepts at that point.
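
As a sketch of those shared-language concepts: OpenTelemetry defines k8s.* resource attributes that ride along on every signal from a workload, which is what makes traces, metrics, and logs from one pod joinable. The values here are invented:

```python
# Resource attributes attached to every span, metric, and log record
# emitted by this workload; keys follow the k8s.* semantic conventions.
resource_attributes = {
    "service.name": "checkout",
    "k8s.namespace.name": "prod",
    "k8s.deployment.name": "checkout",
    "k8s.pod.name": "checkout-7d4b9cfd-xk2p1",
    "k8s.node.name": "node-2",
}
```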

Jessica: Speaking of shared language concepts, hey, Austin, you want to introduce yourself?

Austin: Yes, I'd love to. So hi, everyone. My name is Austin Parker and I am Director of Open Source here at Honeycomb.

Jessica: Hooray. Tell us about your work.

Austin: The short story is I am an OpenTelemetry community maintainer, I've been a part of the project since pretty much it was formed. I was a maintainer on OpenTracing and I have been there since the jump for OpenTelemetry. Mostly I work on community-facing things, so I help run our communications SIG, which is responsible for documentation and the website.

I am one of the maintainers on the OpenTelemetry demo which helps people get started to see how OpenTelemetry works, and I work on a lot of our events and how we show up in the bigger ecosystem. So if you've been to an observability day at KubeCon recently, that's something that I've helped put together.

Outside of that I have written a couple of books on this topic. Distributed Tracing in Practice is one of them from a couple years ago, and then I've got another one coming out next year called Learning OpenTelemetry, so please look forward to that. And if none of that rings a bell, then you might know me as the Animal Crossing DevOps Conference Guy.

Jessica: True, true. I always wanted to speak at that conference because I have an Animal Crossing character.

Martin: Did you create characters for people or did they have to create their own?

Austin: People made their own.

Jessica: You have to have a Switch.

Austin: You had to have a Switch.

Jessica: You can't speak at Deserted Island DevOps unless you have a Switch and you have Animal Crossing because you have to go to their island.

Martin: Do you not supply them?

Austin: No. What kind of budget do you think I was working with for that? For the last one we actually did have the option if you didn't have a character, then we would create you one because we did the speakers in person but everyone had one. There was actually one person that didn't, but some of the other speakers banded together to help them make a villager. It was very cute.

Martin: I think that should be the way that it works, that somebody else builds the character for you so you get a picture of what other people see you as.

Austin: One of the themes behind the conference is that the great thing about Animal Crossing or virtual events in general is that you have so much more control over who you want to be, right? It kind of puts everyone on an even footing because you can only be who you are within the very careful constraints as created by Nintendo of America.

Jessica: Kind of like your app can only run within the constraints of a container in a pod.

Austin: Great transition.

Jessica: All right. So what do you get when you combine OpenTelemetry and Kubernetes?

Martin: That sounds like a great starter for a joke.

Jessica: Well, first of all, we'd have 11 plus 8 is 19.

Austin: I think it's actually more interesting to talk about what changes. One thing that I've seen is talking to teams that are bringing Kubernetes in late versus building on top of it, and there's a whole spectrum of opinion about should you build on top of Kubernetes from the start, should you wait, whatever.

But especially organizations that are a little larger, that are going through their cloud transformation, what I see is people will come in and it's like, "Okay, Kubernetes. This is an ops thing so we're going to put this over here in our cloud center of excellence. We're going to put this over here in some out of the way place and we're going to make sure the developers never have to think about it, other than writing a bunch of YAML, and actually we'll get someone else to write the YAML for them anyway."

What invariably happens is it turns every single incident into this horrific data being thrown over a literal firewall, in some cases, between these two teams: the app devs who are suddenly in a situation of, "Oh my gosh, my application is running and Kubernetes is just doing stuff to it that's influencing what's happening in the system."

And the ops people that are like, "I don't really understand why these things are happening. I'm trying to treat this like a VM or I'm trying to treat this like a traditional infrastructure component, and these worlds just do not have the right..." There's an impedance mismatch, to use a Microsoftism, between the needs and wants of the people building applications that run in Kubernetes, and the people that are running Kubernetes for those developers and for end users. That, I think, is the real what are you missing? You're missing that connection, right?

Martin: I think there is this, and I wrote a post for The New Stack on this recently, which is people treating Kubernetes as if it's done and it's out of the way. But actually, Kubernetes isn't like how we used to use VMs. Like you say, these things are now so much more intertwined. A pod is so much closer to my application than a VM was, even though a pod is kind of that same thing, it's what hosts my application. I think that is where things really differ nowadays.

Austin: Yeah. I think there's also a conceptual side. If I'm trying to develop software, then it's a shorter path, mentally at least, from, "I'm running this on my laptop," to, "I'm putting this in a VM." And these things mostly work the same, conceptually. But then when I shove it into a container and I shove the container into a pod and I shove that pod into a node, everything is different, from DNS resolution to how limits are applied to resource contention to storage. Pick your poison. Basically everything is going to change somewhat running in Kubernetes versus running it locally, unless you're also running Kubernetes locally, and then-

Jessica: How different is it if you're running in a container locally?

Austin: I would say it's slightly less different, but still different enough. There are things about the lifecycle of a pod or a deployment in Kubernetes that are unique to Kubernetes, that aren't just, "I'm in a container."

Martin: Scheduling.

Austin: Yeah. The scheduling.

Martin: So when we talk about the scheduling, just for the people who don't know what we're talking about. Scheduling is this idea, and bearing in mind I'm not a Kubernetes expert, which may make me the best person to explain this or the worst person to explain this, so tell me if I'm wrong here. But the idea is that there is a whole ecosystem inside of your Kubernetes cluster of different applications working together, and they will decide where your individual pod is going to run. It might be on node one, it might be on node two, it might be on node three.

Jessica: There might be six of them.

Martin: It might be on all of them, it could be two on one node and one on another. As things change over time, it can decide that actually, no, we were running two on this node and none on this node, but actually now I can reschedule one of those pods onto this node and it will decide to change that.
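
Martin's description can be sketched as a toy placement loop, not the real kube-scheduler (which weighs many more predicates and priorities): pick a node with enough free capacity for each pod, and placement can change as pods come and go. Node names and capacities are invented:

```python
nodes = {"node-1": 4.0, "node-2": 4.0, "node-3": 2.0}  # free CPU cores

def schedule(pod: str, cpu_request: float) -> str:
    """Place a pod on the fitting node with the most free CPU."""
    fitting = [n for n, free in nodes.items() if free >= cpu_request]
    if not fitting:
        raise RuntimeError(f"{pod} is unschedulable: no node fits")
    chosen = max(fitting, key=lambda n: nodes[n])
    nodes[chosen] -= cpu_request
    return chosen

print(schedule("web-1", 1.0))  # node-1 (ties go to the first node seen)
print(schedule("web-2", 3.5))  # node-2 (the only node with 3.5 cores free)
```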

Jessica: Okay. So as a dev, I don't want to worry about any of that. I don't want to care how many of my services are running where, I don't want to care about limits, I don't want to care about storage, I don't want to care about DNS. Does OpenTelemetry help me care about that stuff when I need to?

Austin: It gives you the framework, right? To Martin's point, that's a pretty good explanation. The scheduler can reschedule things for all sorts of reasons, and depending on... This is one of those points where it gets complex the more you integrate Kubernetes into your architecture.

I would say there's a distinction between even saying cloud native and Kubernetes native. Kubernetes itself is an object database with a reconciliation loop and you give it a bunch of objects. Those objects can be pods, which are containers with metadata about those containers.

They can be an ingress, which is how do I get traffic from the outside world into various pods. It can be a custom resource, it can be something that you just create and say, "Hey, I want to manage this." But consider, I'm going to make up a guy to get mad at real quick, but imagine I have some sort of SaaS service and I have free users and I have paid users.

Paid users get higher performance, so in my cloud I have my standard nodes and I have my high performance nodes. With Kubernetes I can pretty straightforwardly say, "Hey, when this service runs, if it's being run for a paid user, it needs to run on these nodes over here and these are called taints." So we can apply taints to a Kubernetes manifest and Kubernetes will figure out, "Okay, I need to run stuff over here and I need to run stuff over there."

Like maybe a lot of paid users are signing on: "I need to actually provision and spin up more nodes for those users to run on." And it can do all this and that actually makes my life as a developer a lot easier because I don't have to, as a developer, then say, "How am I going to code all this?" I just give it to Kubernetes and say, "Figure it out."
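
Strictly speaking, in Kubernetes the taint lives on the node (repelling pods that don't tolerate it), and the paid workload's pod spec carries a matching toleration plus a node selector to land there. A hedged sketch of the two manifest fragments, written as the dicts the YAML maps to, with invented labels and values:

```python
# The node is tainted so ordinary (free-tier) pods are repelled from it.
high_perf_node = {
    "metadata": {"labels": {"tier": "high-performance"}},
    "spec": {"taints": [
        {"key": "tier", "value": "paid", "effect": "NoSchedule"},
    ]},
}

# The paid workload tolerates the taint and selects the labeled nodes.
paid_pod_spec = {
    "nodeSelector": {"tier": "high-performance"},
    "tolerations": [
        {"key": "tier", "operator": "Equal", "value": "paid",
         "effect": "NoSchedule"},
    ],
}
```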

But what happens if there's an incident? What happens if it's like, "Oh, there's some weird interaction between my code and maybe certain paid users are getting scheduled on the free nodes due to some weird interaction between the Kubernetes API, the manifests I'm writing, the position of the sun and the phase of the moon."

I need a lot of data and I need it to be all contextually linked together and I need it to all be visible in the same place so that I can ask interesting questions about the data and find the answer. That's really hard to do with not only traditional tools, but also the traditional way that organizations seemed to think about monitoring and observability because they do silo those off.

They say, "These are infrastructure problems. The infrastructure people are handling those. You should focus on your dev problems." But the devs do need more than just what they're getting out of app logs, app metrics, whatever.

Martin: So all of that is amazing, and there's loads and loads of terms in there that, if you're not familiar with Kubernetes, you'd be tearing your hair out from taints and all of that kind of stuff.

I think that's part of the problem, is as a dev I don't want to care about those but what I do want to care about is when my stuff bleeds into infrastructure. Not completely, like I don't care specifically about whether there is too much network traffic happening between nodes. What I do care about though is when something's affecting my app, as in I'm getting slower requests, I'm getting more errors. What's the commonality in the infrastructure for those metrics? Because like you say, it becomes this he says-she says type thing.

He said, she said, they said, that the infrastructure is the problem. No. You can't just pass it over the wall. That's where we were, we were there before DevOps where, "Yeah, it's an ops problem. Ops will solve it." What we really want is for the engineers to have just a little bit of information that allows them to go, "No, it's distributed evenly across all of the nodes and, yeah, the pods were running fine. Right, go and sort your stuff out. Don't try and blame infrastructure."

That's where I think there's this idea now because we're closer to the infrastructure, because our applications are getting closer to it, like you say we're more native inside of that infrastructure, we can do a lot more things. We can reschedule things, we can have pods that are labeled, have an annotation, or a label of paid versus unpaid. We can do all of those interesting things really easily but we need to be able to say, "This particular request was served to a paid user, it was on a paid pod, which meant it was on this particular node, which had these particular taints attached to the deployment that came from that particular thing."

That's where I think this whole idea of using OpenTelemetry to observe Kubernetes and observe our applications allows the engineers who write the applications for customers to reach into the platform world: platform, infrastructure, whatever you want to call it. The people who are building these large-scale clusters for us to deploy our applications to.

It can act as this interface between the two. Your point earlier around the shared language, if we're both using OpenTelemetry, the infrastructure people are using metrics because that's what they've got, but they're using infrastructure metrics with specific OpenTelemetry semantic conventions around them which gives them that shared language.

We're using those same conventions inside of our OpenTelemetry tracing and potentially logs and even application metrics. If we start using that shared language we can talk in better ways to those platform engineers and say, "Look, I'll tell you now. We're having some problems and it just so happens that every single one of those problems, they're on this node.

Honestly, this node, this node name, I can give you the node name, I can give you the times and I can give you the pods that it was happening on. I need your help to look into it." I think the other really cool thing OpenTelemetry lets us do is it gives us a shared and unified pipeline for getting that data right. Things like the OpenTelemetry Collector, the various OpenTelemetry APIs and SDKs.

You can instrument with OpenTelemetry at the app level, you can send that data to an OpenTelemetry Collector which can talk to the Kubernetes API and ensure that all your telemetry from the container, from the pod, from the Kubelet, from all these various components is consistently and accurately tagged with the right metadata, the right attributes, the right resource attributes.
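
A sketch of the wiring Austin describes, written as the dict a Collector YAML config maps to. The k8sattributes processor (from opentelemetry-collector-contrib) queries the Kubernetes API and stamps pod, namespace, and node metadata onto every signal; the endpoint is illustrative, and exact field names follow the collector-contrib documentation:

```python
collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}}}},
    "processors": {
        # Enriches telemetry with k8s.* resource attributes.
        "k8sattributes": {
            "extract": {"metadata": [
                "k8s.pod.name", "k8s.namespace.name", "k8s.node.name",
            ]},
        },
    },
    "exporters": {"otlp": {"endpoint": "api.honeycomb.io:443"}},
    "service": {"pipelines": {"traces": {
        "receivers": ["otlp"],
        "processors": ["k8sattributes"],
        "exporters": ["otlp"],
    }}},
}
```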

All this, so that when I'm looking at a screen and the ops person is looking at a screen and whoever else is looking at a screen, we're all looking at the same raw data and we're maybe using different tools to interpret that data or using different lenses and views, but the data itself is the same. That also makes it a lot easier to configure and roll this stuff out because now I don't need a billion different proprietary agents or whatever, I just need OpenTelemetry and then I take my OpenTelemetry data and then I just split it out wherever.

Jessica: Hooray.

Martin: And that enables the weird questions that we want to ask, because you're not going to have dashboards for that, you're not going to have prebuilt metrics that correlate all of those things. It comes down to those weird and wonderful questions that you want to ask, which is all about the observability side.

Jessica: Okay. So OpenTelemetry can help application developers connect what's going on in their apps with what's going on in Kubernetes and that can help break down the silo that constantly tries to reemerge between dev and infra.

Martin: Hopefully.

Austin: Sounds good to me.

Martin: That is the hope because we don't want to end up where we were with dev and ops.

Jessica: Right. Right. Okay. So question for Austin, and maybe Martin, if people are like, "Okay, that sounds pretty good," where should they go to learn more about how to put this stuff together?

Austin: There's a lot of really good docs on the OpenTelemetry website that are getting better at a regular cadence. If you want to just see a very neutral version of this, the OpenTelemetry demo has a Kubernetes version you can use where it will deploy and show you like, "Hey, here's how you would set up the Collector to pull all of this stuff in," then you can send that to Honeycomb or wherever else.

Martin: I think that it's an area that's lacking at the moment, let's be honest. The use-case side is lacking, and we're trying to plug it. We've got a load of docs on OpenTelemetry, so if you want to go and build these things up, looking at the different processes, we've got an entire section in the OpenTelemetry docs around Kubernetes which will tell you all of the different tools that you can use, why they're useful and how to make them work.

The progression from there, as Austin said, is to use the OpenTelemetry demo. It will bring a load of those together to give you an example. I struggle to find a lot of the docs at the moment that tell us the opinionated use case of how do we take all of those things and run them. I think it's an area where we need to do a lot of work, which I know the OpenTelemetry community is working on, so maybe by the time this comes out there'll be some nice docs that do all of those things.

Austin: Yeah. I think one challenge, and I tend to agree with you, the problem with OpenTelemetry in general isn't necessarily that the information about how to set it up and use it isn't there, because it is. The problem is more it's a framework for data, and yes, we need better docs about how do I actually use this data to solve a problem.

But a lot of the how do I use this to solve a problem winds up looking really different depending on how you're analyzing the data, right? What tools are you using? So someone could sit down and say, "Here's how I solved this problem in Honeycomb, and that's great." But I'm not sure how useful that is for someone that doesn't use Honeycomb, right?

Jessica: So we need lots of different examples.

Austin: Right. So we need lots of opinionated examples about this is how to do this with Tool X, and I also think there's a little bit in there of, as a project, we need to say, "This is the type of data you need. This is the resolution of the data you need. You should have these kinds of traces and these kinds of metrics and these kinds of logs, and once you have all that stuff then you're able to ask those interesting questions, and the way you ask those questions is going to differ based on what tools you use." But the questions themselves should hopefully be pretty universal.

Martin: Yeah. I think that OpenTelemetry has tried to be completely unopinionated, to a point where it might be becoming a little bit of a drawback right now because you can do all of these things, have all of this data, but like you say, how do you use that data and what's the right level of resolution for those things?

It's something I know the OpenTelemetry project, at least in its early days, was trying to stay away from, trying to maintain the data models and the protocols and the SDKs instead of going into the, "No, no. You need a trace for this area. You need a span around this. This is where you should write a log file, and the log file should contain these bits of information."

And maybe we're now getting to a point, like you say, where, right, we've got all of that stuff and now there is a bit of a gap where we need more people to tell us how are you doing it and what is your opinion on what you should be doing.

Jessica: And maybe this is where the community widens from OpenTelemetry into observability.

Martin: Yes, 100%.

Austin: I think one of the things that's really exciting about being at Honeycomb is I believe quite a few of the people that we serve use OpenTelemetry. Honeycomb has really been a super early adopter of OpenTelemetry and so there's a lot of people actually using it in anger, so I'm really curious to hear from people that use Honeycomb, what have you found that works?

What are you finding is the right level of detail and resolution and whatever else to use for OpenTelemetry to answer those interesting questions? I think it's a good general question as well, right? Even if you don't use Honeycomb, if you are using OpenTelemetry, bring those lessons in to the community and write about them, blog about them, whatever. If your problem is that you have a great story and you don't know how to tell it or don't know where to tell it, look me up and I promise I will help.

Jessica: Where can people get a hold of you, Austin? Ooh, wait, wait. I know one place. If you're in North America and going to KubeCon in Chicago for 2023, come find us. Find us at the Honeycomb booth. Well, okay, you can find Austin anyway, and you can find Wren and you can find a lot of other people who will be thrilled to talk OpenTelemetry and observability and what the heck questions are you asking of your observability and how did you tell it to give you those answers?

Austin: Yeah, and if you're looking for me at KubeCon, I will be at observability day, I will be all over the place. Be sure to check out the OpenTelemetry Observatory which is going to be our cool little lounge space we've got. Check OpenTelemetry out in the Project Pavilion. If you want to find me on the internet, I am at @AustinLParker most places. LinkedIn, Twitter, so forth.

Jessica: Great. Well, Austin, Martin, thank you for talking about observability in Kubernetes or o11y in K8s.

Austin: Martin had a great numeronym here. KO19C. KO19C.

Jessica: That looks like a super in-club license plate.

Austin: Yeah, it kind of does look like a license plate.

Jessica: So thank you very much, and tune in next time for way more talk about observability.