Ep. #70, Evangelizing Observability with Dan Gomez Blanco of Skyscanner
In episode 70 of o11ycast, Jess and Martin speak with Dan Gomez Blanco of Skyscanner. Dan shares his expertise on evangelizing better observability practices at Skyscanner and offers insights from his experience on the OpenTelemetry governance committee. Discover how observability can minimize organizational costs, the future of auto-instrumentation, and valuable advice on detecting and avoiding over-instrumentation. Plus, learn about the importance of tail sampling and OpenTelemetry’s semantic conventions.
Dan Gomez Blanco is currently Principal Software Engineer at Skyscanner and a Governance Committee member for OpenTelemetry. He is the author of Practical OpenTelemetry.
Transcript
Dan Gomez Blanco: I work with many teams to adopt best practices in observability. And there are many patterns and many themes that I see, that just come up over and over again. And there's a lot of experience that I've gained from working with those teams to adopt those best practices.
And Skyscanner as well has been going through a journey of adopting new standards for observability, adopting OpenTelemetry. So I just wanted to share that with everyone else, and also to share that with all the engineers, not only at Skyscanner, but across the world, that may be adopting OpenTelemetry and going through a similar journey.
So that was the reason. Also, I had a five-week sabbatical that I decided to spend focusing on that book, so I could basically get it done.
Jessica Kerr: In five weeks!
Dan: Well, it was not done in only five weeks, but I did take five weeks to really focus on it, which helped a lot. Otherwise I would probably still be writing it.
Martin Thwaites: Yeah, there is that idea of, you know, you need to get a little bit of focus to at least get yourself over that initial, like what's it going to look like? And then you get your motivation and then you can start doing it. I have that problem quite a lot.
Jessica: So when you were convincing teams about best practices, I imagine it added a sense of authority to be like, here's the book, someone published it.
Dan: There is that as well, yeah. I just wanted to put that into one single place that people can refer to, basically as in like, okay, these are the things people can identify. Hopefully some other end users as well can identify with some of the themes and some of the little pieces of advice and little notes that are across the book.
So yeah, we face that problem with cardinality explosion, or we face that problem with really high costs on logging. So that's what I was trying to hint at throughout the book and say, well, this is how you may solve it with OpenTelemetry. This is how OpenTelemetry helps.
Jessica: Nice. Would you introduce yourself?
Dan: Yeah, my name is Dan Gomez Blanco. I'm a principal engineer at Skyscanner and I work in the production platform tribe, so platform engineering, but I work across the organization to drive the strategy of our observability and operational monitoring. And recently I've been elected to be part of the OpenTelemetry Governance Committee, to bring a bit of an end-user perspective to the community as well.
Jessica: How did you become in charge of observability? Like what got you excited about that?
Martin: Who did you hurt that made you in charge of that?
Dan: It's been a journey for me as well. Before I joined Skyscanner, I used to work for a startup that was acquired by Skyscanner. It was related to JVM tuning, but when I moved to Skyscanner, I started to work on RUM, on real user monitoring, basically improving the performance of our websites as users see it, right?
And I started to work on pipelines for that. Back in those days we didn't have some of the other instrumentation that we've got now, so I was basically working on pipelines to ingest all that data from our users' devices, from the browser.
And then from there I moved on to distributed tracing, which at the time we were doing with OpenTracing, and then it just evolved into wider observability, including metrics, logging and so on, being able to integrate all that. And then as part of that, driving the strategy to adopt OpenTelemetry, and the changes happening in how we simplified our infrastructure, our pipelines, our telemetry back end and so on.
Jessica: So it started with RUM and with trying to improve the experience for real users.
Dan: Indeed, it started with client side, with RUM, and correlating the impact of client-side performance on things like user conversion or even revenue, right?
Jessica: So it started directly from value.
Dan: Yeah, I mean that's basically where it started from, like this is why speed matters. And when I say speed matters, there was that Google post, I forget the exact name of the blog post, but when Google came up with Core Web Vitals it said something around why speed matters, why it matters to users and why it matters to your organization as well.
Martin: Yeah something like, is it like an extra half second is worth X amount in conversion or something like that.
Dan: Yeah, it was that one.
Martin: It's the one that everybody cites, and I can't remember who it was that pointed this out, but one of those surveys was very specifically from 20 or 30 years ago, and things have moved on, so there's a question as to whether Core Web Vitals is actually the right metric nowadays.
But you know, it is a metric and it's a good number that we can start to improve on and I think everybody can accept that we hit websites, they run slow, we get annoyed, we either move on or we wait and come back the next day and eventually we'll just move on anyway.
So I think it is very visceral to everybody, they know that slow means bad experience, and how slow, you know, there's diminishing returns. Did you have targets that you set yourselves, where you were monitoring those conversions and saw that half a second, that one second, for particular journeys? Was that where this was born out of, or was it just that Google says so?
Dan: There's two parts to that. The first one is, Google says this and then you get ranked on it, like your SEO ranking may depend on those Core Web Vitals. So that's a direct impact, right? But then there are also other areas, not just related to Core Web Vitals, in terms of the performance of different aspects that are custom, that are specific to your application.
For example, at Skyscanner we have done experiments that show that if we take too long to load results for flight searches, then that affects how people interact with the product, and whether they continue to book or they drop off. We can see a direct correlation there. There's normally a threshold, and that's where you normally set the threshold for your SLO, right? You're saying, okay, after this threshold people start to get angry.
Martin: And I suppose is that the reason why you went for tracing first? So it sounds like you went from RUM into OpenTracing?
Dan: Personally, yes. When I found distributed tracing, I tried to find that correlation all the way through. I was already starting to think of the concepts that we're talking about now in OpenTelemetry with the client-side instrumentation working group, like how do you connect what happens in the browser with end users, how do you connect that to your back end services?
But yeah, what drove me to OpenTracing and to the other areas was more of a change of teams at the time, but then trying to see how what happens in your front end, in your backend-for-frontend, correlates to all the back end services and their dependencies, all the way down to the infrastructure at the end of the day.
Jessica: So you had RUM and you started with the generic Core Web Vitals, then you moved on to stuff that was specific to your application, and you're measuring that, and then when you want to make it faster, you get into tracing.
Dan: Yeah, well that was part of why we started to rely more on tracing at Skyscanner is to understand what is happening within a complex distributed system, right? To serve a particular flight search, for example, you'll be hitting many, many microservices within our service mesh.
So how do you know what's actually contributing to that slowness that you can see, right? Give me the slowest samples of that, like the slowest traces, and then start to identify patterns in that. That is why we started to rely more on that, and then that's why I was trying to sell it as well to the rest of the organization and say, actually this is important.
This is why we need tracing. What we were missing at the time was that link between what happens in the browser and what happens in the back end, which is something that OpenTelemetry is now helping to bridge that gap between RUM and your distributed tracing in the back end.
Jessica: Yeah. And for that you need the application-specific, hey, this is what matters measurement that you created. So it sounds like you've got a direct path from value to the organization and then you had to roll this out across teams.
Dan: Yeah, I think the reason why this is coming from a platform engineering team is that at Skyscanner we have invested quite heavily in developer enablement. As I've said in a few talks already, what we're trying to achieve is that the golden path should be the path of least resistance, right?
It's making it super easy for every engineer. We've got around 80 teams that would otherwise need to do the same thing over and over again, right? Which is instrument your back end, set up the OpenTelemetry SDK, configure how you want to emit your metrics or your traces and where to emit them to. All of that configuration around the defaults is something that we integrate with the rest of our applications in a way that it just comes out of the box.
Jessica: Do you have like a wrapper library?
Dan: Yeah, exactly. It handles more than telemetry, but we implement that in that library and in the Docker base images that we maintain at Skyscanner, so that service owners get a lot of this out of the box. We don't use a specific OpenTelemetry distro, for example.
In a way that's our distro, right? We set the defaults as an enablement function within the organization and then we sort of like promote those over different channels to the rest of the company.
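To make that concrete, here is a minimal sketch of the kind of wrapper an enablement team might ship so tracing comes configured out of the box. It assumes Python and the standard OpenTelemetry SDK; the init_telemetry function name and the collector endpoint are illustrative, not Skyscanner's actual library.

```python
# Hypothetical enablement-team wrapper: company-wide OpenTelemetry defaults,
# so service owners never configure exporters themselves.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


def init_telemetry(service_name: str) -> None:
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    # Default destination and batching are chosen once by the platform team.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(provider)
```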
Jessica: Nice. So you have, I mean, effectively a distro, but it's not only a distro, it does a bunch of other things, right?
Martin: Yeah, exactly. I think the idea of "make the right thing the easy thing" is true not just in telemetry. It's true everywhere. Nobody's going to choose to do the hard thing if it's the wrong thing as well.
Jessica: Well, a much smaller number of developers.
Martin: Yeah I mean there are some people who do like that kind of environment, but what I'm saying is, I mean I think inherently developers are lazy people. We want to do things that make our lives easier. We want to do--
Jessica: We want to get a particular thing done and that thing is not typically configuring an OpenTelemetry exporter.
Martin: Yeah, we want to do the thing that's fun for us. We want to do the thing that's on our plate. We don't want to have to care about the things that are around that thing. We don't want to have to go and do all the yak shaving.
Like that is not something we want to do. But if something doesn't exist, we're going to go and do that yak shaving and we're going to try and make it interesting.
Jessica: Then we're going to get distracted, and then we're just going to focus on observability, and then we're going to have to move to the platform team.
Martin: So you know, that idea of just get all of that out of their way, make it easy for them so that well oh I need to do that thing, I could build it all myself or I could just add this one line into my code and then I can get back to do my job. I think that's what a lot of people miss when they start to build platforms.
Jessica: But do the teams at Skyscanner, do they use that telemetry data? Do they go look at traces?
Dan: Yeah, all the time. And that's something that is specific to each team. Some teams will rely on distributed tracing. So every team gets it out of the box basically. Some teams use it all the time, they find it super valuable and some other teams don't use it that much.
That's the caveat of having things out of the box: it's super easy, but then you need to make sure that people use it, that people see the value, that people use the tooling and the observability data that is provided to them to change those legacy practices that they may have had for years or sometimes decades.
Not in our case, but just imagine, for example, one team that has always relied on logs and metrics, right? Then you give them traces out of the box, but their runbooks still point them to run this particular query for logs, or to look at this.
Jessica: This, oh the runbooks still point to the logging tool.
Dan: Exactly.
So there's not an incentive to use tracing if you've never used it and nobody explains to you the advantages of it. So I think that's one of the most difficult things of observability adoption: the people and the practices of different teams, rather than the tooling.
With the tooling, you can generate a lot of data out of the box, and super valuable data.
Jessica: So when teams get observability out of the box, then their services' traces fit into the wider picture, and that can help everybody.
Dan: Yeah.
Jessica: But then we want that development team to get the value too. So how do you go about evangelizing tracing?
Dan: One of the things that I've found recently that is really, really powerful is to do hands-on exercises with the OpenTelemetry demo. In the demo you've got feature flags and you can create incidents, or you can create regressions, and then use your tooling, whatever tooling that is, because it integrates with many vendors and open source solutions.
So whatever you're using in your production environment, you can drive that through teams and say, okay, I'm going to put that in there. Go and debug it as if you don't know anything about this application, about the OpenTelemetry demo. Go and debug that particular regression.
And then when you explain that here's a happy path to debug this, they will see that the way to get to the regression, to the root cause, is normally through distributed tracing first. So that is the entry point. And then when you divide it up, you gamify it a little bit.
Jessica: Okay. Because they don't have runbooks for the OTel demo that point to their logging.
Dan: Exactly. So without runbooks they start from scratch, and then you normally divide it into teams. What we do is divide them into teams and they compete against each other to see who finds the root cause faster.
And then the one that finds it first gets a little bit of a prize as well. So it's gamifying the incident root cause analysis, and what we normally see is that the ones that are using tracing will always be the ones that find it first.
Martin: Yeah, we had a customer recently where lots of incidents were going on and one team was always the people who found the root cause first. They said they were cheating because they were using distributed tracing, that they were finding it a bit quicker. Well, they're using distributed tracing, they're obviously going to be faster.
Like, okay, could you just say that again in your head and listen to yourself, you know? So what's going on with the governance committee then? You've done all this amazing stuff, and one of the things that I was pushing for with the committee was to get somebody like yourself on there, somebody who's there for the end users and not for the vendors, because you don't work for a vendor, you have no affiliation with a vendor.
So getting somebody like that on the committee I thought was one of the things we needed to do anyway, so I was very happy when you were elected. What are your plans for the committee? What is it you're trying to do with the community around observability?
Dan: Yep. So I think I'm still trying to get my feet on the ground, but I'm starting now to think about the future. I think part of the reason why I was chosen to be there is to represent end users within that group, right? And my idea is to try to incorporate that end-user feedback and all the fantastic work that's being done in the end user working group, and then try to close the feedback loop.
One of the things that I sometimes see that in OpenTelemetry we're not that good at is linking the user feedback that we're getting, the challenges that end users have, with the actions and with the features that are getting delivered, right?
So I think it's there, it's just that we're doing things that matter, but how does that relate to end users? Sometimes it's just a matter of clearly articulating that this is helping all these people in their day to day, and trying to get that feedback loop to be closed so we can iterate over it.
Jessica: Sometimes as developers we get really fascinated by the how and we lose touch with the why.
Dan: Yeah. That is one of the key themes for me. End users are heard and they're listened to, but I want to try to see how things relate, basically.
Jessica: I'm really glad you're on that committee. Also thanks for writing the book.
Dan: Oh thanks for reading it.
Jessica: You mentioned something earlier that I wanted to dig into. You were talking about patterns for dealing with different challenges of observability. What are the patterns for controlling costs?
Dan: That is probably one of the things that drives-- I mean, at least in my case, I've seen that it drives some of the moves towards better observability, believe it or not.
What we've done is give users an idea of how much data they're producing. Because at the end of the day, one way or another, the amount of data that you produce, that you store, or that you transfer over the wire will have an impact on cost, right?
So one of the things that we do at Skyscanner is put that data in front of the engineering teams themselves, not just for telemetry but also for cloud costs and all that, split by team.
So each team is responsible for their own costs, and what we've seen, for example, the moment that we started doing that with telemetry, was teams saying, oh, the majority of my costs are coming from logging, right? It's very verbose. And the moment that they saw that, they said, well, actually maybe we should look into this tracing thing where we've got tail sampling. And with tail sampling we can keep only what matters.
Like we can keep the traces that matter, right? I can go into more detail on how tail sampling works, but in general we're keeping what matters: the slowest transactions, the ones that contain errors. And then when they did that, they said, okay, now that we've got that we can stop all that debug-level logging, or the access logging. And then some of them even got to save 90% of their costs on logging.
Jessica: Wow, okay.
Dan: That's quite a lot of money in the long run. And then basically: stop very verbose logging, add attributes to spans, get the traces to be tail sampled, and then get better at observability, because they get more context with a distributed trace and the services that are part of their dependencies, and they also save costs. So that's why I keep saying that more data is not always better observability. That's one of my mottos.
Jessica: Nice. So by giving people visibility into their costs, and not just their observability costs, but using observability to give them visibility into AWS costs or cloud costs, you were able to reduce costs across the board.
Dan: Indeed. Yeah.
Jessica: And I love that it actually helped drive people toward the practices you already wanted them to use, because they also get more value.
Dan: Yep, indeed. And I would say that one of the things that helped us tremendously here is OpenTelemetry's semantic conventions, because it makes it super easy to split things by, well, service namespace for example, and we can match a service namespace to the owners of that namespace.
So then we can say, okay, all the telemetry that's produced by this particular service will be annotated with a service name and a service namespace. So it's just a matching exercise later to attribute it. Before we were using OpenTelemetry, there was just a log, a data point, but you wouldn't know where it came from, right?
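As a rough sketch of how that matching works with the Python SDK, the resource attributes below follow OpenTelemetry's semantic conventions; the service, namespace, and environment values are made up for illustration.

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span, metric, and log emitted under this provider carries these resource
# attributes, so telemetry volume and cost can later be attributed to the owning team.
resource = Resource.create({
    "service.name": "flight-search",         # hypothetical service
    "service.namespace": "search-platform",  # matched to a team in an ownership catalogue
    "deployment.environment": "prod",
})
provider = TracerProvider(resource=resource)
```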
Jessica: Where did they come from?
Martin: And then you go to the team and they say, yeah, yeah we tag it. It's like, okay, so how do you tag it? It's like, yeah, yeah, we use NS equals.
Dan: Exactly. Yeah.
Martin: It's like, oh, oh right, okay. And then somebody else says, oh yeah, we use "names", where we just abbreviate "space" to "s" but keep "name".
Jessica: Oh yeah, I put initials on all my Debugs.
Dan: Or someone else will be like, oh yeah, we call it SVC underscore name.
Martin: And it's like, oh wait, which environment is it? Well it's in pre-prod or staging or UAT or pre-production or PP or like can you just use the same names please?
Jessica: Right, yeah. There's a consistency there that we need. And OpenTelemetry has done a lot of work in figuring out what we need to be consistent on, stuff like service name and some standard HTTP fields and so on, while allowing any kind of attribute that might be specific to your service or application. It's not restrictive.
Dan: Yeah, it's a mix of best of both worlds I guess.
Jessica: And then you didn't have to tell people not to log, you just like showed them what it was costing them.
Dan: Oh no, we did, we did as well. We have a carrot, but we also have a stick. I mean, there's a little bit of governance as well, right? To make sure that we're producing valuable data, we try to help people with visualizations and how to get the most out of it, and then there's the evangelizing of best practices, and the training, and so on.
Jessica: I love that training. Yeah, because I think of it as: as a platform team, you don't have to tell people exactly what to do, but you can give teams an API to meet, and that API includes things like, put your service name in the service name attribute.
Dan: It makes it easy for platform engineers as well. If you're supporting infrastructure, if you're supporting pipelines, everyone that's done that at a certain scale will know that it's very easy for a company to DDoS themselves.
We've seen it, like, oh yeah, someone decided to add a new metric tag or something that has basically infinite cardinality, and then everything goes boom. But you need to make sure that you get the accountability for that as well. So it's easy for us with OpenTelemetry conventions to say, okay, well, this is where it's coming from. We know this, we can block it, we can limit it.
Martin: So semantic conventions are the Git Blame of the OpenTelemetry community.
Dan: No way.
Jessica: Some of them. And you mentioned people started adding attributes to spans instead of logging, and yet with metrics you have to be careful what attributes you add.
Dan: Yeah, and I think that's one of the things. When we were talking about how enablement teams make it super easy to follow the golden path, the other side of that coin is that when we've got all these layers of abstraction, and everything comes out of the box, it's quite easy for teams to forget the best practices, forget what matters.
So then when they do their custom instrumentation, they may not be completely aware that adding a new attribute to a metric that has the unique user ID for a particular endpoint will have a detrimental impact on your pipelines and on your cost and so on.
Jessica: Right, because with metrics, for every piece of information you add, the amount of data you need to store expands geometrically.
Dan: Yes. Yeah. One of the things that we've got in our sort of like guidance as well and where OpenTelemetry helps, is try to make people understand that you don't need that high cardinality metric if you can sort of like aggregate it and then correlate it to high granularity spans and traces, right?
And this is where it all comes into like the same sort of context and the same stream of correlated data, which is what OpenTelemetry tries to provide, right? So if you've got lower cardinality metrics that are correlated via exemplar, for example, to high cardinality or high granularity spans and traces, then you get the best of both worlds. You can then go and aggregate your metrics as something that is sensible that you can graph that you can put on a dashboard or that you can alert on.
And then when there is a regression in that, you want to be able to see what are the individual samples that basically correlate to that particular regression. So that's basically the sort of advice that we're trying to give is like use the telemetry, each signal for its best use case and then correlate them all. So then what you end up, if you do that normally you end up with better data, lower cost, but also better observability in general.
Even for people that do not rely on metrics, or don't want to rely on metrics that much. There is, as I was saying, the legacy of using metrics. It's difficult to get people out of that as well, right?
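A small sketch of that split between signals, using the Python API: the attribute names are illustrative, and whether exemplars are actually attached depends on the SDK and backend in use.

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("flight-search")
meter = metrics.get_meter("flight-search")

# Low-cardinality metric: bounded attribute values only, safe to graph and alert on.
search_duration = meter.create_histogram("search.duration", unit="ms")


def record_search(user_id: str, market: str, duration_ms: float) -> None:
    with tracer.start_as_current_span("flight-search") as span:
        # High-cardinality detail belongs on the span, not on the metric.
        span.set_attribute("app.user.id", user_id)
        # Recording while a span is active lets exemplar-capable pipelines link
        # this data point back to the individual trace.
        search_duration.record(duration_ms, {"app.market": market})
```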
Jessica: Right. So use the signals for what they're best at.
Martin: Yeah, yeah. And you know, I've been accused more than once of hating on one particular signal and the reality is they all have a use and it's like anything in life they have a use. It's when that-- The whole sort of hammer and nail, you know, the screw and the nail. If all I've got is a hammer, the entire world is a nail.
Jessica: Yeah, I can fall in love with tracing, but then I go try to apply it to a front end application and nope, it doesn't work. But logs do, correlated logs, right, with session ID and trace ID on them. Because like you said, hooking the signals together lets you get the strength of each.
Martin: Yeah. And it is, you know, using those signals for the right things, but knowing when you're using a signal for the wrong thing. I think that's the hard part. Is this the right signal for the sort of task that I'm trying to do? Am I using metrics when I should be using tracing? Am I using tracing when I should be using logs? Am I using logs when I should be using metrics?
Jessica: Do you get questions from teams about that?
Dan: Yeah, all the time. Questions from teams around instrumentation, basically, what should they be using? Absolutely. And I think there's another one as well, as Martin was saying, which is what they should not be instrumenting themselves, which is another topic entirely now with auto-instrumentation, right?
We're moving from a world where, if you wanted to instrument a metric or tracing for a particular library, you had to do it yourself, to a world where we've got so much auto-instrumentation out of the box with OpenTelemetry that you should not instrument something that is already instrumented by either library authors or instrumentation authors.
Martin: Yeah. So don't add a metric for HTTP count. Don't add a span for an HTTP call, because those are already there. Augment the HTTP call with more data. Don't add a new span for it if you don't need to.
Dan: Yeah, we even used to have wrappers for Redis clients, for example. The only thing that a wrapper would do for a Redis client was try to capture some metrics, back in the day, in our custom protocol or custom SDKs and all that. And now it's just: use the Redis instrumentation.
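As a sketch of that advice in Python, with a made-up attribute name: enrich the span that auto-instrumentation already created rather than wrapping the client or starting a new span.

```python
from opentelemetry import trace


def search_flights(flight_number: str) -> None:
    # The HTTP/Redis client spans already exist via auto-instrumentation.
    # Add business context to the current span instead of creating another one.
    trace.get_current_span().set_attribute("app.flight_number", flight_number)
    # ... call the already-instrumented client as normal ...
```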
Jessica: Okay, so you said you have an internal library that does some wrapping of things, right?
Dan: Yeah, we're trying to move away from those.
Jessica: Well, now all your library has to do is bring in OTel.
Dan: Yeah.
Jessica: And you don't have to be responsible for making that up.
Dan: Also there's a point of like if you've got something that you're interested in instrumenting and it's not there, instead of doing it yourself, you can contribute that back to the community. So you can contribute back and then say, well this is not instrumented by this library, it doesn't exist yet.
We're going to propose that, work on that, and then basically instead of like building something that's only specific to your company, you're sort of like sharing that as well, right? And then getting the benefits from it.
Martin: The great thing about that is now you've got 400 people that are arguing about naming instead of the three in your organization.
Dan: That is also true.
Jessica: But once it's in, then it's an industry standard and no one in your organization can argue about it anymore.
Martin: I mean there's benefits and drawbacks.
Jessica: So if you want a different set of people to argue with...
Martin: And arguing with people on the internet is so much fun. I think one of the things that you get out there is over-instrumentation. How do we stop people from doing over-instrumentation? How do we stop people from creating more data than they need?
We did that with logging, you've alluded to that at the start, where 90% of that logging was something that people weren't using. That was over-instrumentation. You fixed that by moving to tracing and having fewer of those spans kept, by reducing that over-instrumentation.
But I think it's also prevalent in tracing, it's prevalent in metrics, and there are a lot of organizations that are now taking that metrics idea, looking at the metrics that you do use and getting rid of some of the stuff that you don't, which is a really interesting concept around this idea of over-instrumentation, where--
Can your back ends tell you that you're over-instrumented, rather than you actually looking at the code and saying you're over-instrumented? I think that's an area we're going to be seeing a lot more of over the next year or two: how do we detect over-instrumentation?
Dan: Or detect instrumentation quality, in a way, right? That's a really interesting concept, if you could measure how well your system is instrumented. Are you correlating things? Are you propagating context correctly? Have you got duplicated things where, as you said, you've got too much logging but that's already covered by your traces, and so on?
So that's a really interesting area and a really interesting topic at the moment. Basically, what we've got is enablement teams whose purpose is to come up with those defaults, right? What's the minimum instrumentation that you need to operate something, and then build on top of that.
That approach I think has worked quite well for us. We start from the minimum and then we build up on top of that, rather than trying to cut back later, where you say, okay, do you need this? Do you need this instrumentation? Do you need this metric? Is it used in any dashboard or alert, or are you just adding data for the sake of adding it?
Jessica: And if you have visibility into cost, then you have some sort of signal for when it's time to look at the metrics you don't need anymore, that kind of thing.
Dan: Yeah.
Jessica: I like the part where your team has this expertise and shares it with other teams as they need it because while we want every developer to implement a tiny bit of instrumentation and to use the traces, you don't need to be an expert. We can't all be an expert in everything.
Dan: I think there's another group of people that helps if your organization is a particular size, where you've got, for example, dozens of teams, right? Having one single team that is responsible for making sure best practices are followed does not scale that well, because they might be the experts in telemetry instrumentation, they might be the experts in OpenTelemetry configuration, but they're not going to be the experts in the custom things that people care about monitoring and instrumenting in their applications.
So this is where having a cross-organization group comes in. We call them observability ambassadors, or champions, or heroes, there are multiple names you can have for that. A group where they may not be experts in observability, but they're interested in the topic, they can learn, and they can share that knowledge within their teams, and then they become the point of contact, right? Because they know the context around what matters to them.
Jessica: Yes. Okay. I love that structure where you have someone in the team who is the designated observability kind of liaison. They don't have to be experts. If they have the interest, they will become experts, as they go to your group and get their questions answered and take that back to that team and integrate their specific application knowledge with the more general practices that you can share.
Martin: I think that's one of the keys of observability versus the older monitoring school: it requires your context, it requires that team's context. It's not something where you can rely solely on auto-instrumentation.
Jessica: You don't just tack it on.
Martin: Exactly. You need that team to say, oh yeah, what would be really interesting here is the duration of the flight when they're away. That's actually a really nice thing for us to know about. Like there's no auto-instrumentation package that's going to take the flight number from the HTTP request and add that to a span.
That might be important in the context of one team. It might be important in the context of one company, but it's not going to be important in the wider community. So you need to provide that context and I think that's where that difference comes in.
So having those champions in your organization that know that domain and say that's going to be really important. You should probably think about putting that in and this is how you might do it and I know who I can talk to about making it the most efficient. Bringing all of that together, that's gold.
Jessica: Yeah, I appreciate that we're starting to think more systemically now in the sense of if you want quality, you don't get that by hiring a quality team and tacking it onto the organization. And it's the same with observability. A lot of the work is with the people, is integrating the practices into every team. It sounds like you are deeply engaged in that.
Dan: Yeah, absolutely. I think they can help on the other side too. We've talked mostly about instrumentation here, but they can help on the other side as well, on the alerting and the best practices. So if you start to go through the SRE practices, like, are you suffering from alert fatigue, they can be the ones that can give advice as well.
Or what is a good SLO, or what is a good SLI to set a target on, that level of knowledge, which they can apply to their domain. Someone that is outside of a team will not be able to tell you what a good SLI is for your service. That's so specific to each service.
Jessica: Do you help them create the SLI once they define it?
Dan: Yeah, I mean, in a way they are responsible for that, right? If they've got an SLI, they would be responsible for that as well.
Jessica: Martin, there was an invitation earlier to talk more about tail sampling.
Martin: Yes. Sampling is amazing, and it's one of those things that's done wrong so many times.
Jessica: So how do you do it? Does your team take charge of sampling?
Dan: In our case, we use our vendor's capabilities for tail sampling. The reason for this is that we have a multi-cluster service mesh. One of the caveats of tail sampling, and it is amazing, and you can do it in the OpenTelemetry Collector yourself, it's got a tail sampling processor, is that all your spans for a given trace need to end up in the same Collector replica.
With a multi-cluster service mesh, it is a bit more challenging, because communication can go across clusters. The way we do it is we use our vendor's capabilities to be able to do that. But the idea here is the same, right?
What we end up storing, more or less, is around 5 to 6% of the traces that we generate, because we're keeping every single trace that's got an error in it, plus a random sample of all of them, I think it's around 1%.
And then we're keeping the 1% slowest traces, based on the root span, right? The entry point of the system: what are your slowest transactions? That's the beauty of tail sampling, and that comes out at around 5%.
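For reference, those policies look roughly like this in the OpenTelemetry Collector's tail sampling processor; the thresholds and percentages here are illustrative, not Skyscanner's actual configuration.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # wait for a trace's spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slowest
        type: latency
        latency:
          threshold_ms: 2000      # illustrative threshold for "slow" transactions
      - name: random-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```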
Jessica: Nice.
Dan: It was really difficult originally. I remember when we did tail sampling in the past, we used to use another vendor, right? And it was a really hard sell originally, because the moment that you tell someone we're only going to keep 5% of your data, they're like, oh yeah, but what if I miss something?
And then the moment they start using it, they realize during an incident that no, everything that you need is there. You don't need that other 95%, which is all 200s, all good, nothing slow, blah, blah, blah. So the more they used it, they were like, actually no, you're right. We don't need that other 95%.
Martin: I mean, that is where the magic is, trying to work out exactly where the right point is to draw the line, and also being able to move that line and actively change it. I think a lot of people, like you say, have the trauma of, oh, we'll just take 5% of the traces. They're like, yeah, but that other 95% I might need.
Jessica: But with tail sampling you get to be careful about which 5%.
Martin: Exactly. It's like, I only get the 5% that I need. It's like, but how do you know? I'm like, tell me what it is that you need. Well, if something's going to be slow, I'm going to need it. Okay, well, let's keep the slow ones. And they're like, but I can't do that. And yes you can. That's what tail sampling is. That's what using those rules is all about.
And if you then maintain that context, you can still get back to those real values. And it's that idea of how do you group traces together? How do you say that things are similar and it's complicated. It's not as easy as just tweaking a little bar that just says keep 5%, keep 4%, keep 3%.
Jessica: Or debug, info, warn.
Martin: Yeah.
Jessica: It's not that simple, but it is way more effective.
Dan: Yeah. I'm just trying to imagine what it would be like for logging to do any type of sampling. You just wouldn't be able to do it. You'd probably be keeping some debug logs, or keeping a percentage of the debug logs. I mean, you can do sampling on logs, of course you can, but then you don't know if you're missing something important or not.
Jessica: You could include a request ID in all your logs and then sample based on that, so that you'll get the entire story of a request. But which ones were slow? Good luck figuring that out from logs.
Dan: There's a lot more logic you need to build. So you're saying you put a trace ID in your logs, and then you're sampling your logs, and then you're basically reinventing tracing.
Martin: What you need to do is, on your logs, put this trace ID, and give each one a unique ID. And then what you need to do is say how long it took for that log to actually run. So put a duration on it.
Jessica: Yeah.
Martin: And then, oh, what would be really good is if you say which log was happening before it as well, like maybe its parent. If you put all of that stuff on your logs, it'd be amazing.
Dan: I mean, what we did, and what many, many organizations did previous to tracing, was implement their own ways of propagating context, their own ways of propagating request context, with a transaction ID or a correlation ID.
And the pain of maintaining your own custom instrumentation for HTTP clients or gRPC clients, and then propagating that, is something that I think is undervalued. The amount of pain that OpenTelemetry has taken away from companies by having a standard way to propagate context is one of the most impressive things.
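A small sketch of what that standardized propagation looks like from Python, assuming the default W3C Trace Context propagator; the header value in the comment is just the format example from the W3C spec.

```python
from opentelemetry.propagate import inject

headers: dict[str, str] = {}
# With the default W3C Trace Context propagator this adds a "traceparent" header
# for the current span context, e.g.
# "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
inject(headers)
```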
Jessica: OpenTelemetry is the second biggest project in the CNCF by contribution. That's a lot of contributions. You don't want to recreate all of those.
Martin: Yeah, exactly. I mean, how many different languages do you use? I would probably say for the majority of organizations it's one, maybe two. Although there are the outliers with acquisitions, like you were saying your company was acquired, and oh yeah, now we've got Java, and we've never done Java before.
All of those custom things that we created to propagate correlation. Like, correlation isn't new, the trace ID concept isn't new. Correlation has existed for so long. What we've done is standardized it, standardized the format, standardized the names, standardized all the rest of it, so that every language can say, oh yeah, traceparent, I know that.
Jessica: Yeah.
Martin: If this was a video call, I would've put up the meme from Jurassic Park. It's like, oh yeah, correlation. I know this.
Jessica: We'll provide a link to that meme in the show notes. Speaking of show notes, Dan, how can people get in touch with you or find out more?
Dan: Well, they can get in touch with me on the CNCF Slack. I'm always there. There's Dan Gomez Blanco, you will be able to find me. That's the main one. And then of course on LinkedIn as well.
Jessica: And they can buy your book.
Dan: And yeah, of course as well.
Martin: You said that you've been doing some talks and stuff. Have you got any more talks coming up this year?
Dan: Yes. I'll be at KubeCon in Paris. We're doing a talk at Observability Day related to some of the stuff we're talking about today: how do you actually get value out of observability, and how you can migrate to OpenTelemetry and still not have an observable system. Basically, you can do it badly as well.
And then a panel as well at Observability Day where I've put together some folks from other end users, basically. This is completely driven by end users, and we'll be talking about some of the challenges of adopting OpenTelemetry at scale, some of the best practices, what's working for people, what's not working, and so on.
And then as well on the maintainer track, we'll be there with the rest of the governance committee.
Jessica: Great. And for people listening to this after KubeCon in Paris in 2024, look for the videos. Thank you so much for joining us.
Dan: Thank you very much.