Ep. #44, Examining OpenTelemetry with Vincent Behar of Ubisoft

Guests: Vincent Behar

In episode 44 of o11ycast, Liz and Charity speak with Vincent Behar of Ubisoft. Together they pull back the curtain on OpenTelemetry, exploring its popularity and standardization, as well as Vincent’s continuous journey to better understand the developer experience.


About the Guests

Vincent Behar is a software engineer with 14+ years of experience. He currently works for Ubisoft and was previously principal engineer / software architect at Dailymotion.

Transcript

Vincent Behar: Yes. The choice was pretty straightforward to use OpenTelemetry because it's clearly defined as the new standard.

Everybody is working on it from multiple companies, so you can see everybody going in the same direction.

There is not really a discussion about whether to use it or not, but yes, exactly.

Once you start using it, specific parts of the API are still moving and it can be a bit difficult to upgrade.

Well, not difficult, but it takes time.

Every time there is a new version, if you're still on something that's not GA, not version one something, you need to set some time aside to update your code and so on.

The benefits are much more important than the little time you will lose upgrading your code to adjust to the new API because there are exporters for every system out there.

It's a good benefit.

Charity Majors: I've been really surprised and delighted by the rapid uptake of OpenTelemetry.

I think that it seems to me to be rolling out and getting traction much faster than I can think of any other new standard.

Certainly, thinking about OpenTracing and all those, they were not nearly as prevalent two years in as OpenTelemetry has become, which is pretty cool.

Vincent: Yeah.

I think it was very slow at the beginning because I've had it on my radar for a long time, and for a while it seemed like it was under development but nobody knew about it or really used it. At some point it really started to be used more and more, and people are talking about it, writing blog posts, talking at conferences.

Charity: Yeah.

Vincent: A lot of publicity around it.

Now we have the GA, so general availability, and support from a lot of different vendors to integrate directly with any kind of tool or back end you might be using.

Charity: Yeah.

Vincent: I think with that much availability, everybody starts using it so you get all of a sudden lots of people using it and you don't really have a choice.

Charity: Right, right.

Vincent: Everybody's working on it, everybody's using it, so you're like, "Do I stay on my old API that nobody really supports?"

Liz Fong-Jones: Sounds like a kind of virtuous cycle where once something reaches the appropriate point, it rapidly gets better because it is getting exercised often enough.

Charity: It's definitely an idea whose time has not just come, but should've come a long time ago.

I've been complaining for years about just how we're so far behind where we ought to be as an industry in terms of best practices for instrumentation and so forth because it's just been these crappy one-offs and hacks that everybody does slightly differently.

I kind of thought that the revolution would come much more looking like a structured log format, but it came in OpenTelemetry form, and that's pretty cool.

OTel supports these arbitrarily wide structured data blobs and that's great.

Vincent: Yeah. Even if people are starting to use it, there are still lots and lots of people who are not using it and who are still stuck with the old way of managing their monitoring and observability systems.

They're pushing their logs to one side, traces to another if they have traces.

Most likely it's metrics to another side, no correlation.

I think that's a really big advantage of OpenTelemetry, to be able to build correlation between the different sources we have.

I see too often people who don't see the benefit of correlation and who are still treating their logs and their metrics as very different systems, pushing one to Prometheus and Grafana, for example, and the other to Kibana, with no relation between them.

I think that's the real benefit of OpenTelemetry is to bring that correlation as close as we can to the instrumentation of the code.

Charity: And to eliminate vendor lock-in because theoretically, once you've instrumented your code with OpenTelemetry, you can just switch from one vendor to the next pretty seamlessly without having-

Vincent: Yeah.

Charity: Which is huge because vendors, they haven't been competing on the value delivered to their users so much as they've been trying to get people stuck within their walled gardens.

I like this new world much more where users can switch and it's not that big of a deal.

They don't have to go and reinstrument their entire system, but it just feels like it's good for users, it's good for everyone.

Liz: I guess that's the bet that Vincent was telling us at the beginning, that it's a small amount of work to update your code as Otel advanced towards a 1.0, but that it's still much easier than having to rip everything out and replace it if you were to change vendors.

That's the appeal of the system.

Vincent: Yeah, exactly. Yeah.

Charity: Right.

Vincent: I totally agree. I wouldn't change my few hours of updating my code each time there is a new version for what we had before.

Liz: And now it's 1.0, so hopefully you shouldn't need to do that in the future.

Vincent: Yeah, exactly.

Liz: So now would be a good time for you to tell us about yourself and how you came to use OpenTelemetry at your workplace and who are you and what you do.

Vincent: Yeah. My name is Vincent.

I'm from France, as you can tell from my lovely French accent.

I'm now working at Ubisoft, the French game company, but it's been only one week.

Previously, I was working at Dailymotion, the French YouTube, and lately at Dailymotion I was focused on continuous delivery and observability.

I managed to bring these two together. I'm a contributor to the Jenkins X open source project, which we were heavily using at Dailymotion.

I've been working quite a lot lately on bringing some observability features to Jenkins X so that we can use them for ourselves at Dailymotion, of course, and so that they can benefit other people using the open source project.

Liz: It's really great to see people investing in their tooling for developers, that too often people think about a CD system as a second thought like, "Oh, it's fine if it's slow," but at Dailymotion it sounds like you cared a lot about that developer experience.

What motivated that?

Vincent: The first thing is that it's a critical part of our system.

Continuous delivery is a critical part of our workflow, how we develop and mainly how we push our application to production.

So that's our flow. If it's broken, it's going to impact everybody.

We want to release often, release early and so on.

That's a critical system, that's why we were investing in it and we are still investing in it.

One thing we wanted to have is some visualization on our workflow, if it's good, if it's slower, what's impacting us, what makes the developer slower to push their code in production and so on.

That's how we wanted to get some business metrics on the classical DevOps metrics, do a deep dive, and so on so we started with that.

Because we're using an open source product, we wanted to give that back because open source is only good if you can contribute back to it what you gain from other people that have been building it. It started like that.

We were also using observability for our own applications in production, and we wanted to bring all the value we gained from being able to understand our production system to our continuous delivery systems, so our pipelines and so on, because most often, when you have newcomers in the team, it's very difficult to understand what's going on, what is taking time, where it's executing and so on.

I think distributed tracing is a very good tool or practice to help you visualize your pipeline and what's going on, all the steps and what's going on underneath.

How is it executed? So for example, you're new to a company that's using Jenkins X.

You don't know how it works, the pipeline and so on.

You just have to click on one button in the UI.

Next to the original pipeline, you can see that it's executed in a Kubernetes cluster, it's going to start some pods and so on.

You quickly understand what's happening.

Liz: That makes a lot of sense.

When you're talking about making the build pipeline faster and more reliable and more visible, what were some of the numbers when you started the project?

Was it taking days for code to reach Dailymotion's production?

Was it taking hours? Was it taking minutes? What order of magnitude?

Vincent: On our project, before we really implemented continuous delivery, it was two weeks.

Charity: Two weeks.

Vincent: Two weeks. The duration of a sprint.

Charity: Two weeks to.

Vincent: Because we were doing it, so it was years ago, but we were doing it old school.

We were doing a release at the end of the sprint every two weeks and pushing that to production.

And it was even worse because once we had a release, it went into a two-week window of validation before it could go to production.

It would be one month at some point. That was one of the reasons.

Charity: How did you start hacking away at that?

What was the first thing you did and how much time did you recover?

Vincent: The first thing we did was try to apply all the good practices that other people in the industry have already defined and are using. We get to use them, and we know they're good practices.

Every time we have a pull request now, before it merges, we are doing more tests at the pull request level using what Jenkins X calls a preview environment.

We are asking for a specific environment for the pull request to run our tests and so on. We are shifting left.

All the tests that were previously executed after the release are now executed before the pull request is even merged to master.

Liz: We did the same thing at Honeycomb and it was hugely impactful.

The idea of setting up a front end binary per pull request, or a telemetry ingest binary per pull request. It made it so much faster to catch things.

Vincent: Exactly. Once you start using it, you can't go back.

Once the pull request is merged, it's going to automatically create a new release and deploy it to staging.

For the moment, we have a manual deployment to production, but for some applications, it can be automatic.

Liz: That shrunk it from two weeks to a month to how long? Basically a few hours?

Vincent: Yeah. Or minutes. Yeah. If we have a quick fix, it can be minutes.

When it's merged, it's a few minutes to be deployed in production.

Charity: How long does it take if you need to just ship one line of code? How fast can that go out?

Vincent: That depends on the project because we have some that have a very small compilation time, like Go applications, which are very fast to compile, and some others like C++, where we have a big C++ application, so it can be much slower.

Charity: Yeah.

Vincent: Yeah. Smallest time will be I'd say 10 minutes, between 10 minutes and half an hour, I'd say.

Charity: Do you automatically deploy after each merge?

Vincent: Yes. To staging, but for production it's still manual.

Liz: You don't just let things pile up. The idea is it goes to staging, you look at it, and then you press the button, right?

Vincent: Yeah. Most of the time, yes. That depends on the change.

Sometimes, we have changes that are not really related to our production application.

For example, somebody changing tests, unit tests, or a README or something like that.

We don't really care about deploying that.

Charity: Are you planning to hook it up to production as well?

Vincent: Yes. It was planned.

Charity: Excellent.

Vincent: It just takes a bit of time because it's a big mindset change to automatically deploy to production, but for the moment, it's one click because we're doing it GitOps style, so it's creating a new pull request in a Git environment repository.

It's just merging the pull request and then automatically deploying.

Charity: Let's talk about the cultural changes here. Were people scared?

Vincent: Yes.

Charity: Yeah. Say more about that.

Vincent: It's mainly people taking care of the platform because all of a sudden, they're not necessarily the one pushing the button.

It's having somebody else pushing changes inside their platform that they're responsible for.

That can be difficult, but it takes a few months I'd say before people are fully adapted to it.

Charity: Just gaining trust.

Vincent: Yeah. Exactly.

Charity: Yeah.

Liz: The change really wasn't overnight though. I think Charity asked this towards the beginning.

What were some of the intermediate things that you first got people doing?

You said you did the pull request based shifting that testing left. That must have gotten you from two weeks to a day or two.

What got that frequency to minutes to hours? What was that next step after the pull request?

Charity: Did you have to do a lot of parallelizing of tests and refactoring your test stuff to run faster?

Vincent: Yes. That was pretty much a technical phase. It was not very difficult to do.

Where it's difficult is when you have people's mindsets to change.

Changing your tests, like integration tests or end-to-end tests or however you call them, to run in a different environment takes a bit of time because it's not something that's top priority usually.

For us, it was, because we set some time aside to make sure we could have a smooth workflow and be able to push quickly to production, but that's not what the real difficulty is. The real difficulty, as you said, is changing people's mindset.

Charity: How did you get consensus?

Was there any difficulty between products who wanted engineers to be spending time on features, and I assume other people who wanted them to be spending time or--

Or did you actually do all of this work for them?

Vincent: No. There was a big consensus in the team because it was a new project.

We were building a new project, so for the first year, year and a half, it was only building, so nothing in production.

Our way of working, releasing every two weeks, was not an issue then. It's bad, and I wouldn't do it now even for something that's not in production, but at the beginning it was not impacting us.

When we started to be in production and we wanted to have quick fixes and it took two weeks or one month, that's when everybody felt the pain, both the developers, the product team, the management.

Everybody was aligned to say, "We have a big issue and we need to fix it."

There was no real issue of that not being the priority for everybody.

Liz: It also sounds like you were able to reduce the scope by trying to only change one team first before you started rolling out that practice across the entire company.

Vincent: Yes. The other teams in the company were doing things a bit differently, so they didn't have the same issue.

They already had a smoother workflow. It was not really a challenge.

They have different challenges, but not that one.

Liz: So you decided that you wanted to make things faster.

Where did the idea of instrumenting the Jenkins workflow specifically start?

What were these things that you discovered for the first time when you implemented it?

Vincent: It was a bit later when that workflow was well implemented.

I'm sure you have the same experience of after weeks, months, years, you take a pipeline, build pipeline or whatever, and it's going to be slower and slower and slower, seconds, a few seconds and so on, until at some point people are going to stop and say, "Okay. This used to take one minute and now it's taking 10 minutes. What's wrong? We need to do something now."

Liz: Right.

Vincent: We had a few applications where the pipeline slowly took more and more time.

That's when we wanted to measure it to be able to put alerts on them so that we can react before the developers get fed up with it.

We wanted to have a way to debug and to understand what's taking time.

That's how it started, for debugging purposes.
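The kind of alert Vincent describes could look roughly like the sketch below. It is only an illustration, not what Dailymotion ran: the function name `is_regression`, the threshold `factor`, and the sample durations are all assumptions made up for the example. It flags a run when it is noticeably slower than the median of recent runs.

```python
from statistics import median
from typing import List


def is_regression(history: List[float], latest: float, factor: float = 1.5) -> bool:
    """Return True when the latest run is slower than `factor` times the median of past runs.

    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not history:
        return False  # nothing to compare against yet
    return latest > factor * median(history)


# Hypothetical pipeline durations in seconds for recent runs.
past_runs = [60, 62, 58, 65, 61]

print(is_regression(past_runs, 64))   # a normal run
print(is_regression(past_runs, 600))  # "used to take one minute, now takes ten"
```

Comparing against a median rather than the previous run means a single slow build does not move the baseline, so a slow drift still stands out.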

Liz: So basically you had gone from two week--

If your build takes two hours, after two weeks it's not a big deal, whereas once you're pushing every hour, it taking 10 minutes as opposed to one minute starts mattering. I see.

Vincent: Exactly.

Liz: So then, that was your incentive to instrument it to prevent it from regressing and to make it faster.

Vincent: Yes. So that was one of the use cases we had.

We had sometimes a few pipelines that were slow, taking like half an hour, but everything else was taking 10 minutes, and we had to understand why.

Sometimes, it was because maybe it was scheduled on a new node that had to pull all the container images and so on, so we had to shed some light on what was happening under the execution of the pipeline, because people usually see just the logical steps, but when it's running in a Kubernetes cluster, you have lots and lots of pieces underneath, and it can be very difficult to understand everything that can happen.

Liz: Right. It's easier to understand a monolithic process, but the instant you make a distributed system, it has all the problems of a distributed system.

Vincent: Exactly. It's using Tekton, it can schedule a new node.

You don't have enough available resources on existing nodes. Lots of things can happen.

Liz: It wasn't just that people's tests were getting more complex.

It was that the underlying infrastructure executing it was experiencing problems that you needed to surface in some way.

Vincent: Both. We had both issues, but it's true that some pipelines were getting more and more complex.

Yeah, exactly. That was our main incentive for doing that.

As we did that, we got another benefit which I think is very important: when you start to shed light on how your pipeline is running and what's happening inside of it, I think it's very interesting for newcomers in the team to understand what the pipeline is and how it's running.

That's something we didn't expect at the beginning, but it's the same when you have distributed tracing for your production system.

It helps you understand how the system works.

You don't only use it for debugging, but you also use it to understand how the system works.

How requests flow between different microservices.

Liz: Right. It's almost a form of documentation that your system is generating on its own.

Vincent: Yeah. We can say that. Yes.

Dynamically generated documentation, so you don't need to maintain it.

You get the benefit of documentation without the pain of documentation.

Liz: Now, I guess one of the questions here is hey, so you've mentioned that some of the pain came from Kubernetes.

Was the benefit from Kubernetes worth the pain?

What are some of the trade-offs someone should use when deciding, "Should I use Kubernetes to run my build system or should I keep it hosted on one box or a small set of boxes?"

Vincent: For the first question, is the benefit more important than the cons, I'd say yes without hesitation, but maybe I'm biased because I've been working on Kubernetes for a while now.

Of course, once you are used to it, you can say, "No, Kubernetes is easy."

Nobody says that, but at least you have a basic understanding of the main components you're using.

You feel at home inside of it. Somebody that doesn't know Kubernetes will have a hard time understanding it, sure.

As for your question of whether you should use Kubernetes for your build system, I think the answer depends on whether you are already using Kubernetes or not.

If you are already using Kubernetes to deploy, yes, absolutely.

You should use it too for your build system.

You already have all the knowledge and the infrastructure, so you already use it, and you will see all the benefits.

Charity: You should make it match as much as possible just generally speaking.

The idea that you could have a staging that matches prod is a total myth, but you should try to make it be composed of the same elements as much as possible.

Vincent: Yeah, exactly. I totally agree with you.

Having your staging or pre-production environment which is 100% the same as production is impossible, but yes, if you could try to get with a minimum effort, to get it as close as you can, and don't take too much time trying to make it exactly the same. You can't.

Charity: Yeah.

Liz: Yeah. That's super interesting.

We're going through this same transition right now in that we're starting to run some of our production workloads on Kubernetes for the first time and it's a lot of knowledge for people to keep in their heads of how does Kubernetes work and how does the old virtual machine based workflow work.

We wouldn't want to do that in the long term, but in the short term, we're in this in between state.

That sounds like you're saying, "Don't stay in the in between state forever."

Vincent: Yeah.

If you're not using Kubernetes for your production system, I wouldn't recommend using Kubernetes for the build system because it's going to be a lot of complexity for something which is critical, but which is not your production system, unless you are a big company and you have lots of energy.

If you're in that case, you're already on Kubernetes anyway.

Liz: Tell us a little bit about the work you did to integrate OpenTelemetry and Kubernetes, because some of that sounds like it was Jenkins specific, but some of it was more generic.

How did you come to the decision to instrument Kubernetes rather than instrument in Jenkins X code itself?

Vincent: Yes. The challenge? The idea was to be able to get distributed traces.

Well, it was not distributed at the beginning, but traces for pipelines, so just a visual representation of the different steps.

That was the beginning. And then, when talking about it, we said, "We should be able to get tracing for everything that's happening underneath, so all the components."

Jenkins X is a continuous delivery platform, it's open source and so on, based on Kubernetes, and it's using Tekton, another open source component, which is used to execute the pipelines.

What we wanted was to be able to see what Tekton was doing and what Kubernetes was doing: pulling pods, scheduling pods, scheduling a new node and so on.

We could have instrumented all the components in Jenkins X, all the dependencies of Jenkins X like Tekton and so on, and Kubernetes, but that would have taken a long time. And you would still have a few components like the Cluster Autoscaler that you would need to instrument too and so on. It's a lot of energy. When you have a platform like Kubernetes, where the main benefit is the API, you start to think, "Maybe I can use the API."

What we did was use the API to retrieve all the different custom resources we wanted, like Jenkins X pipelines, Tekton pipelines and task runs and so on, Kubernetes pods and Kubernetes events.

We receive all the information about everything, and with that, we've been able to build the tree from the pipeline to everything that's related to that pipeline.

The pod which is linked to the task which is linked to the pipeline and so on, all the events related to that pod, and generate traces using OpenTelemetry so that they can be pushed to whatever backend you prefer.
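The tree-building step Vincent describes can be sketched roughly as follows. This is a minimal illustration in Python, not the Jenkins X implementation (which is written in Go). The resource records, their field names (`kind`, `name`, `owner`, `start`, `end`), and the `Span` class are hypothetical stand-ins for objects that would come from the Kubernetes API and for spans that would be handed to an OpenTelemetry SDK and exporter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a trace span (not the OpenTelemetry SDK type)."""
    name: str
    start: float
    end: float
    parent: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def build_trace(resources: List[dict]) -> Span:
    """Turn each resource into a span, attach it to its owner, and return the root span."""
    spans: Dict[str, Span] = {}
    for r in resources:
        key = f"{r['kind']}/{r['name']}"
        spans[key] = Span(name=key, start=r["start"], end=r["end"], parent=r.get("owner"))

    root = None
    for span in spans.values():
        if span.parent is None:
            root = span  # the pipeline itself has no owner
        else:
            spans[span.parent].children.append(span)
    if root is None:
        raise ValueError("no root resource (one without an owner) found")

    # Order siblings by start time so the rendered tree reads chronologically.
    for span in spans.values():
        span.children.sort(key=lambda s: s.start)
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Render the span tree as indented lines with durations in seconds."""
    lines = [f"{'  ' * depth}{span.name} ({span.end - span.start:.0f}s)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up records: pipeline -> task run -> pod -> events, with times in seconds.
resources = [
    {"kind": "Pipeline", "name": "build-42", "start": 0, "end": 600},
    {"kind": "TaskRun", "name": "unit-tests", "owner": "Pipeline/build-42", "start": 5, "end": 590},
    {"kind": "Pod", "name": "unit-tests-pod", "owner": "TaskRun/unit-tests", "start": 10, "end": 585},
    {"kind": "Event", "name": "Scheduled", "owner": "Pod/unit-tests-pod", "start": 10, "end": 12},
    {"kind": "Event", "name": "Pulling", "owner": "Pod/unit-tests-pod", "start": 12, "end": 200},
]

trace = build_trace(resources)
print("\n".join(render(trace)))
```

In this made-up data the image pull accounts for most of the pod's startup delay, which is the kind of underlying cost that the logical pipeline steps alone would not show.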

In the case of Jenkins X, for example, we are shipping with the Grafana stack by default because it's open source too so it's easier, but if people are already using a different backend or whatever, they just have to switch a configuration flag and it can be pushed somewhere else.

Liz: Yeah. It sounds definitely like essentially, if you instrument the app, you're not going to have visibility into the control operations that Kubernetes is doing.

Vincent: Yes.

Liz: So therefore, you may as well just instrument the control operations to begin with so you don't have missing time where it's just spinning, waiting for the pod to spin up.

You'd rather get the data about the pod spin-up before the application code in the pod starts running.

Vincent: Yeah. It's not 100% perfect, but it's something that you can do in a few hours and you get quick results because the Kubernetes API is really awesome.

It's very easy to do, and you get huge benefits from it. That's exactly what we wanted.

Liz: Yeah.

It's one of those promising things about automatic instrumentation, is that the more ubiquitous it becomes, the more likely people are going to be to then start digging in and adding more instrumentation later.

Vincent: Yes, exactly.

Liz: What was your experience with the Grafana ecosystem for getting the baseline levels of observability?

Vincent: I really like it.

I really like the fact that you get a full platform where you can get everything in it, your logs, your metrics and your traces, and you get correlation between them.

It can be Grafana, but other people are doing it too.

Elastic is doing it too with Kibana and other vendors.

So you have a lot of ways to do it, but I think the big benefit is correlation, and you can build a dashboard where you have your logs, you can filter on a specific time range, and you can see the log errors, you can see the graph with your metric, you can jump from that to the trace.

Liz: Right. The friction of switching tools is just so high, right?

It is definitely a lot better to see all of those things in the same place and named consistently.

Vincent: Yes.

When you get benefit from both OpenTelemetry and being able to do correlation by inserting, for example, some tags or labels between the logs and the trace and the metrics, your backend, be it Grafana or whatever, is able to reuse that to build all the visualization part for it.

It's great. Well, it's great. It's so much better than what we had a few years before. For the moment, I think it's great.

Maybe in a few years, we're going to say, "Yeah. Well, it was nice."
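The correlation by shared tags or labels that Vincent mentions can be illustrated with a small sketch. The records below are made up, and the `trace_id` field name is an assumption for the example. A real backend such as Grafana would do this join itself once logs and traces carry a common identifier.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical log entries and spans that share a trace identifier.
logs = [
    {"trace_id": "abc", "level": "error", "msg": "image pull timed out"},
    {"trace_id": "abc", "level": "info", "msg": "retrying"},
    {"trace_id": "def", "level": "info", "msg": "tests passed"},
]
spans = [
    {"trace_id": "abc", "name": "pipeline build-42", "duration_s": 1800},
    {"trace_id": "def", "name": "pipeline build-43", "duration_s": 600},
]


def correlate(logs: List[dict], spans: List[dict]) -> Dict[str, List[str]]:
    """Group log messages under the span that carries the same trace_id."""
    by_trace = defaultdict(list)
    for entry in logs:
        by_trace[entry["trace_id"]].append(entry["msg"])
    return {span["name"]: by_trace[span["trace_id"]] for span in spans}


for name, messages in correlate(logs, spans).items():
    print(name, "->", messages)
```

Without the shared identifier, the slow pipeline and the image pull error would sit in two separate tools with nothing to connect them.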

Liz: Right. Exactly. From what we were talking about before the podcast, you were using an APM vendor before that.

Vincent: Yeah. Both of them at the same time for different use cases.

Charity: Well, switching gears a little bit, you recently switched jobs. What was it like switching companies mid pandemic?

Vincent: Well, it was more at the end of the pandemic.

Charity: Oh, is it over where you live?

Vincent: No, not really over. No. It's not over yet. Well, it was everything remote.

Everything was already remote. All the interviews and everything were remote, but I think now we are used to it.

It's great to be able to see new people, new ways of working. It brings some new challenges.

Charity: What do you love about the engineering culture of your new job so far?

Vincent: It's so big and there is so much energy and lots of different projects and lots of people with a lot of different backgrounds.

I think it's something that's good.

It's to be able to see new people with different backgrounds, different experience, and being able to compare that and to change yourself.

I'm used to doing things one way because for the past four years I've been doing it that way, and other people are doing it differently.

It can challenge you.

Liz: I love that particular Charity blog post where she talks about the idea of if you stay in one place too long and become prematurely senior there, you get into a rut just by being the person who's been there the longest rather than challenging yourself and learning new things.

Vincent: Yes, exactly.

Charity: Well, thank you so much for being with us, Vincent.

I feel like I learned a lot about Otel today.

Vincent: Thank you for having me.

Liz: And about Jenkins. Don't forget Jenkins.

Charity: And Jenkins.

Vincent: Jenkins X.

Charity: You know what? Honestly, I have blocked out most of my memories of Jenkins.

I remember the little Java console and it's all dark after that. It's just erased.

Vincent: Just one little detail. It's not Jenkins. It's Jenkins X.

It's a completely different platform. It's a project. It's totally written in Go.

Charity: It's cool now. Is that what you're saying? It's cool now?

Vincent: Yeah. Totally new code base written in Go.

Charity: It's cool now.

Liz: Awesome. Thank you very much. It was a pleasure having you.

Vincent: Thank you.