Ep. #46, Guiding Observability Teams with James Deville of Procore Technologies
In episode 46 of o11ycast, Liz and Charity are joined by James Deville of Procore Technologies. James shares the insights he’s gained guiding o11y teams and the difficulties tech-forward teams face in non-tech sectors.
James Deville is a principal software engineer at Procore Technologies. He was previously Principal Lead Developer at Malwarebytes.
Transcript
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering work, because we are a sales-heavy organization and we have a significant presence in education.
And we have an entire arm of the company that's dedicated to that education, to reaching out to the construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward with technology, and not just with our technology, with anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, this industry is very technology-phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good time for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the things that defines it for me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards an observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability. It's captured best with the whole idea of known unknowns versus unknown unknowns, and being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring speak and mindset.
One of my team's major focuses is to shift the culture of Procore to start thinking about that whole observability space in a new way that isn't just falling into the old mindset, but is actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team: driving that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an incident and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an incident, and hope that gives us the same information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward: we're growing, we're trying to move more towards distributed systems.
And, when I read about observability, when I've learned from other people, from Charity, from yourself, from Ben, the places I see the most value are those places where you can't redeploy to find out a problem, because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production ownership and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to shift your attitudes towards instrumentation as part of this?
Jim: We're actively working on that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect."
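As a rough illustration of that guidance, here is a minimal Ruby sketch of the distinction: an expected failure like a bad password becomes context on the request's telemetry rather than an error report. The controller, the `telemetry` helper, and the attribute names are hypothetical, not Procore's internal library; `Bugsnag.notify` is the real Bugsnag Ruby call for genuinely unexpected errors.

```ruby
# Hypothetical Rails controller sketch: expected failures become queryable
# context on the request's telemetry instead of error-tracker noise.
class SessionsController < ApplicationController
  def create
    user = User.find_by(email: params[:email])

    if user&.authenticate(params[:password])
      telemetry.add_attributes("login.outcome" => "success", "user.id" => user.id.to_s)
      sign_in(user)
      redirect_to root_path
    else
      # Expected condition: annotate the event so aggregate trends stay visible,
      # but do not notify the error tracker for something we anticipate.
      telemetry.add_attributes("login.outcome" => "failure",
                               "login.failure_reason" => "bad_credentials")
      render :new, status: :unauthorized
    end
  rescue StandardError => e
    Bugsnag.notify(e) # genuinely unexpected errors still go to Bugsnag
    raise
  end
end
```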
Liz: Yeah. That feels like one of the challenges of having the wrong data abstraction that's really easy to write data into.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical debt.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still do better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this was meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And we're also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, and configuration, in order to try to have a consistent story for developers outside of our platform organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your old tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking to with a pipeline: being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
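For a concrete sense of that dual-write idea, here is a minimal sketch using the OpenTelemetry Ruby SDK, since Procore is a Rails shop; the endpoints and service name are placeholders, and in practice this fan-out would more likely live in the collector or Vector pipeline itself rather than in each application. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp gems.

```ruby
# Sketch: keep feeding the incumbent tool while sending the same spans to the
# tool under evaluation, so teams can compare without losing their old views.
require "opentelemetry/sdk"
require "opentelemetry/exporter/otlp"

OpenTelemetry::SDK.configure do |c|
  c.service_name = "procore-monolith" # illustrative name

  # Exporter 1: the existing vendor's OTLP endpoint (placeholder URL).
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(endpoint: "https://old-vendor.example.com/v1/traces")
    )
  )

  # Exporter 2: the new tool, receiving the same spans in parallel.
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(endpoint: "https://new-tool.example.com/v1/traces")
    )
  )
end
```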
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, which is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially since it's one of those interesting things where Vector started off as a vendor-neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribl, because Cribl seems to be pretty committed to going the vendor-neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribl does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs that are the source of observability tooling.
Well, Cribl is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blobs you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand those old systems that have a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important at any given time.
Jim: I agree.
My blog post, Better Monitoring and Observability at Procore, that was the idea there. It's supposed to be a long-term vision that is-- We call it a North Star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway there.
We have a lot of debt. One of our areas of debt is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest, I think, is that we have teams who will come to us asking directly that they want to start dropping in a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
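A minimal sketch of what "attach the data point to a trace" can look like with the OpenTelemetry Ruby API; the span name, attribute keys, and the stubbed application work are illustrative, not Procore's actual instrumentation.

```ruby
# Instead of emitting a free-standing counter, record the value as an attribute
# on the span doing the work, so it carries duration, errors, and user context.
require "opentelemetry/sdk"

tracer = OpenTelemetry.tracer_provider.tracer("documents") # illustrative component name

tracer.in_span("export_project_documents") do |span|
  exported_count = 42 # stand-in for the real application work
  span.set_attribute("documents.exported_count", exported_count)
  span.set_attribute("project.id", "proj-123")
end
```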
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No," quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gained teammates.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be one of the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial, can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing is, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really tall ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do aggregate queries over SLIs and SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people. On one of my previous teams, as we were standing up a new service, my director really leaned in on it and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces: if you're bringing up a new service, we, being the platform organization, want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what it means to define them in a way that doesn't just measure random numbers, but really answers the question of what it means to be reliable to your user, and identifying who that user is.
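One way to make that "reliable to your user" framing concrete is the usual good-events-over-valid-events calculation. The event shape, the 300ms threshold, and the 99.5% target in this Ruby sketch are illustrative, not Procore's actual SLOs.

```ruby
# Sketch of an SLI computed from structured request events, compared to an SLO.
Event = Struct.new(:route, :status, :duration_ms, keyword_init: true)

events = [
  Event.new(route: "/projects", status: 200, duration_ms: 120),
  Event.new(route: "/projects", status: 200, duration_ms: 480),
  Event.new(route: "/projects", status: 503, duration_ms: 30),
]

# SLI: "a request is good if it succeeded and finished within 300ms".
good  = events.count { |e| e.status < 500 && e.duration_ms <= 300 }
valid = events.size
sli   = good.to_f / valid

slo_target = 0.995
puts format("SLI: %.3f (SLO target %.3f) %s",
            sli, slo_target, sli >= slo_target ? "ok" : "burning error budget")
```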
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO, an SLI, and dig into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SREs across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
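A rough sketch of the kind of wrapper Jim is describing, assuming an OpenTelemetry backend; the `Procore::Telemetry` module and its method names are hypothetical. The point is that product teams call a stable internal interface while the platform team can swap what sits behind it.

```ruby
# Hypothetical internal library: teams say "collect this piece of information",
# the backend (OpenTelemetry here, or another vendor) stays an implementation detail.
require "opentelemetry/sdk"

module Procore
  module Telemetry
    module_function

    # Wrap a unit of work in a span with standard attributes.
    def trace(name, attributes = {}, &block)
      tracer.in_span(name, attributes: attributes, &block)
    end

    # Record a data point in the context of whatever span is currently active.
    def record(key, value)
      OpenTelemetry::Trace.current_span.set_attribute(key, value)
    end

    def tracer
      OpenTelemetry.tracer_provider.tracer("procore-telemetry")
    end
  end
end

# A product team only thinks "I need to collect this piece of information":
Procore::Telemetry.trace("submittal.create", "project.id" => "123") do
  Procore::Telemetry.record("attachments.count", 4)
end
```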
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate: the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed-systems thinking, service-oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it often came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate metrics tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners at the company that are really interested in that space, that talk about it, that share articles, that try to drum up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see this playing: it's a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land on in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important to any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there, that's supposed to be a long term vision that is-- We call it a North star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway to that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And yeah, one conversation we're having actively is the idea that we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, you have to revisit them regularly to make sure they're still serving their purpose and answering the right questions, and revise them.
And then, as you said, being able to really go from an SLO or an SLI to digging into what's causing the problem there, I think, is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SREs across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And it even goes back to, like I said, the libraries we're talking about building.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is, "I want to collect this metric, this trace, this log entry."
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And I can relate; the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams, versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed-systems thinking, service-oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it often came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate metrics tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us playing: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is when they just bumped up latency between systems and saw cascading failures from just a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification and container identification are another high-cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to querying traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of our journey as well. Like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitored fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then you said, being able to really go from an SLO, an SLI indicator, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and during teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we it with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us slaying is, this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something as seeing who screams.
The things that I find fast thing about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon think?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land on in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards: when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, "you build it, you own it, you run it" mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation as part of this?
Jim: We're actively working on that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect."
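To make that distinction concrete, here is a minimal sketch assuming the OpenTelemetry Python SDK; `handle_login` and the credential check are hypothetical names, not Procore's actual code. The expected failure becomes an attribute on the event plus a counter for aggregate trends, rather than an error report:

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("auth")
meter = metrics.get_meter("auth")
# Counter for aggregate trends; expected failures never become error reports.
failed_logins = meter.create_counter("auth.login.failed")

def check_credentials(username: str, password: str) -> bool:
    return password == "correct horse battery staple"  # stand-in for a real check

def handle_login(username: str, password: str) -> bool:
    with tracer.start_as_current_span("handle_login") as span:
        ok = check_credentials(username, password)
        span.set_attribute("auth.login.success", ok)  # context on the event itself
        if not ok:
            failed_logins.add(1)  # shows up as a trend, not a page or a Bugsnag
        return ok
```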
Liz: Yeah. That feels like one of those challenges where, if you have the wrong data abstraction, but it's really easy to write data into that abstraction...
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical debt.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still do better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this was meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And we're also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, and configuration, in order to try to have a consistent story for the developers outside of our platform organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking to with a pipeline: being able to run a pipeline that can continue to send data to our old tools, while we explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now. OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, which is why we're also looking at Vector at this point.
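As a rough illustration of that dual-write idea (in practice the fan-out often lives in an OpenTelemetry Collector configuration rather than application code), here is a sketch using the OpenTelemetry Python SDK; the endpoints are placeholders, not Procore's real setup:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Keep feeding the existing backend during the migration...
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://legacy-gateway:4317", insecure=True))
)
# ...while a candidate backend receives the same spans for evaluation.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://candidate-gateway:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```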
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where Vector started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribl, because Cribl seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribl does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs that are the source of observability tooling.
Well, Cribl is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blobs you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
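That reshaping step looks roughly like this, sketched as a Python toy (an illustration of the idea, not Cribl's actual API or configuration): fold a messy, unstructured log line into one wide, structured event that an observability tool can slice by any field.

```python
import json
import re

LINE = "2023-04-02T10:15:01Z web-17 ERROR user=1234 checkout failed after 3 retries order=abc-99"

def to_wide_event(line: str) -> dict:
    # Split off the parts every line shares, keep the rest as the message.
    timestamp, host, level, rest = line.split(" ", 3)
    event = {"timestamp": timestamp, "host": host, "level": level, "message": rest}
    # Promote embedded key=value pairs to first-class, queryable fields.
    event.update(dict(re.findall(r"(\w+)=([\w-]+)", rest)))
    return event

print(json.dumps(to_wide_event(LINE), indent=2))
```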
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand your old systems that have a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So the thing is, with the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more as: you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time, you cover most of the territory.
And, you're certainly covering the territory that is most important at any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there. It's supposed to be a long term vision. We call it a North Star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps guide us towards, "Well, what do we want to do here to move us in the right direction?" and helps define one way over another.
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you could if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of indexing what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway there.
We have a lot of debt. That's one of our areas of debt: understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest, I think, is we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric. Or we're trying to encourage...
And this is a newer initiative: to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
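For example (a minimal sketch with the OpenTelemetry Python SDK; the cache names are made up for illustration), the measurement rides along on whatever span is already active, instead of becoming a free-floating metric:

```python
from opentelemetry import trace

def record_cache_lookup(hit: bool, lookup_ms: float) -> None:
    # Attach the data point to the current span so it carries the request's
    # full context (user, endpoint, deploy, and so on) for free.
    span = trace.get_current_span()
    span.set_attribute("cache.hit", hit)
    span.set_attribute("cache.lookup_ms", lookup_ms)
```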
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No," quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me, and it's gained teammates.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial, can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think I got set on a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing is, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really large ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do aggregate queries, SLIs, SLOs over time, all these faceted ways of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams. As we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces: if you're bringing up a new service, we, being the platform organization, want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
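The arithmetic behind that conversation is small; here is a back-of-the-envelope sketch where the 99.9% target and the request counts are made-up numbers, purely for illustration:

```python
TARGET = 0.999                      # SLO: 99.9% of requests are "good"
total_requests = 1_000_000
good_requests = 999_200             # e.g. fast enough and not a 5xx

sli = good_requests / total_requests                      # 0.9992
allowed_bad = total_requests * (1 - TARGET)               # 1,000 requests of budget
budget_spent = (total_requests - good_requests) / allowed_bad

print(f"SLI = {sli:.4%}, error budget consumed = {budget_spent:.0%}")  # ~80%
```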
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get traction behind it, right?
Even with all of Google's backing, until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, one conversation we're having actively is this idea: we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know, in my previous experience with SLOs before this team and this recent push, it was missed a lot: you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO, an SLI, to digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SREs across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way. It's that, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And it even goes back to, like I said, the libraries we're talking about building.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple, unintrusive, standard way to configure it.
And then, all you need to think about is, "I want to collect this metric, this trace, this log entry."
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
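A rough sketch of what that kind of internal facade might look like, assuming OpenTelemetry underneath; every name here is hypothetical, not Procore's actual library:

```python
from contextlib import contextmanager
from opentelemetry import metrics, trace

_tracer = trace.get_tracer("internal.telemetry")   # illustrative instrumentation name
_meter = metrics.get_meter("internal.telemetry")
_counters = {}

def count(name: str, value: int = 1, **attrs) -> None:
    """Record a counter; which metrics backend receives it stays centralized."""
    if name not in _counters:
        _counters[name] = _meter.create_counter(name)
    _counters[name].add(value, attrs)

@contextmanager
def operation(name: str, **attrs):
    """Wrap a unit of work in a span without the caller knowing the vendor."""
    with _tracer.start_as_current_span(name) as span:
        for key, value in attrs.items():
            span.set_attribute(key, value)
        yield span

# A product team only says what to collect, not where it goes:
# with operation("export.report", project_id=42):
#     count("export.report.rows", 1500)
```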
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate; the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed-systems thinking: service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company, that talk about it, that share articles, that try to drum up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us playing: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is when they just bumped up latency between systems and saw cascading failures, just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
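As a toy sketch of that kind of experiment (this is not Netflix's tooling, and the probability and latency numbers are arbitrary), the point is that the injection itself gets recorded on the span, so any resulting cascade is traceable afterwards:

```python
import random
import time

from opentelemetry import trace

tracer = trace.get_tracer("chaos")

def call_dependency(send, request, inject_probability: float = 0.01):
    """Occasionally add a second of latency to a dependency call, and mark it."""
    with tracer.start_as_current_span("call_dependency") as span:
        if random.random() < inject_probability:
            span.set_attribute("chaos.injected_latency_ms", 1000)
            time.sleep(1.0)
        return send(request)  # `send` is whatever client the caller already uses
```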
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
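A small illustration of why that hurts in a metrics store but not in an event store; the counts are invented for the example:

```python
# Invented counts, purely to show the multiplication.
users, companies, projects, hosts = 50_000, 5_000, 200_000, 3_000

# As metric tags, every combination is potentially its own time series:
worst_case_series = users * companies * projects * hosts
print(f"{worst_case_series:,} possible time series")   # 150,000,000,000,000,000

# As attributes on a span or structured event, each request simply carries its
# own identifiers; high cardinality is paid at query time, not at write time.
span_attributes = {"user.id": 1234, "company.id": 42, "project.id": 9876, "host.name": "web-17"}
```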
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to querying traces?
Do you think that that's going to be a very soon thing?
Or do you think that that's something where people just need help with the very basics to begin with, before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well. In our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important to any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there, that's supposed to be a long term vision that is-- We call it a North star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway to that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitored fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then you said, being able to really go from an SLO, an SLI indicator, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl, and different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us laying the groundwork: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is, they just bumped up latency between systems and saw cascading failures from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
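(Purely illustrative: latency injection of the kind described in those Netflix experiments can be as small as a Rack middleware that delays a sampled fraction of requests. The class name and parameters below are hypothetical, and you would not run something like this without safeguards and good observability already in place.)

```ruby
# Hypothetical Rack middleware: delay a small sampled fraction of requests by a
# fixed amount, then use your observability tooling to watch whether downstream
# services absorb the latency or cascade into failure.
class LatencyInjector
  def initialize(app, options = {})
    @app = app
    @delay_seconds = options.fetch(:delay_seconds, 1.0)
    @sample_rate = options.fetch(:sample_rate, 0.01)
  end

  def call(env)
    # Inject latency on roughly sample_rate of requests only.
    sleep(@delay_seconds) if rand < @sample_rate
    @app.call(env)
  end
end

# Illustrative wiring in a Rails app:
#   config.middleware.use LatencyInjector, { delay_seconds: 1.0, sample_rate: 0.01 }
```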
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification and container identification are another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because in some cases we don't have tools that handle high cardinality well.
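(A common pattern for this cardinality problem, sketched here as an assumption rather than as Procore's practice: keep user, company, project, and host identifiers off metric tags and attach them as attributes on the active trace span, where high cardinality stays queryable. The helper and attribute names below are illustrative, using the OpenTelemetry Ruby API.)

```ruby
require "socket"
require "opentelemetry/sdk"

# Hypothetical helper: annotate the active span with tenant and host context
# instead of emitting those identifiers as metric tags.
module TenantContext
  def self.annotate_current_span(user_id:, company_id:, project_id:)
    span = OpenTelemetry::Trace.current_span
    span.set_attribute("app.user_id", user_id.to_s)
    span.set_attribute("app.company_id", company_id.to_s)
    span.set_attribute("app.project_id", project_id.to_s)
    span.set_attribute("host.name", Socket.gethostname)
  end
end

# In a controller or background job (illustrative):
#   TenantContext.annotate_current_span(user_id: current_user.id,
#                                       company_id: current_company.id,
#                                       project_id: params[:project_id])
```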
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently has value; it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it, as well as our journey. Like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and during teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we it with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us slaying is, this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something as seeing who screams.
The things that I find fast thing about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon think?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land on in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important at any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there. It's supposed to be a long term vision that is-- We call it a North Star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway there.
We have a lot of debt, and one of our areas of debt is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric. Or we're trying to encourage...
And this is a newer initiative, to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
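A small sketch of what "a metric in the context of a trace" can mean in practice, using the OpenTelemetry Python API; the handler and attribute names are hypothetical. Instead of incrementing a standalone counter somewhere else, the data point is recorded on the active span, so it arrives with its surrounding context.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical handler: the names are placeholders, not Procore code.
def export_report(project_id: str, row_count: int) -> None:
    with tracer.start_as_current_span("export_report") as span:
        # Record the data point on the span itself, so it carries context:
        # which project, which request, how long it took, what else happened.
        span.set_attribute("report.project_id", project_id)
        span.set_attribute("report.row_count", row_count)
        # ... do the actual export work here ...
```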
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to, including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No," quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me, and it's gained teammates.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial, can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really large ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do aggregate queries of SLIs and SLOs over time, all these faceted ways of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
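As a rough illustration of the kind of structured event being described, here is one hypothetical wide event per request; the field names are invented rather than Procore's schema. A single event like this can back latency and error-rate SLIs, link to its trace, and answer faceted questions without defining a new metric for each one.

```python
# One arbitrarily wide, structured event per request (illustrative fields only).
event = {
    "timestamp": "2021-10-04T12:31:07Z",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # link to the trace
    "service": "monolith",
    "endpoint": "POST /projects/:id/rfis",
    "status_code": 201,
    "duration_ms": 187,
    "db_time_ms": 92,
    "user_id": 1843,
    "company_id": 977,
    "project_id": 52314,
    "feature_flag.new_exporter": True,
}
```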
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people. On one of my previous teams, as we were standing up a new service, my director really leaned in on it and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces that, if you're bringing up a new service, we, being the platform organization, want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
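A minimal sketch of the arithmetic behind that kind of user-focused SLO, with invented numbers: the SLI is the fraction of requests that were good for the user, and the error budget is whatever the SLO target leaves over.

```python
# SLI: fraction of "good" requests (e.g. status < 500 and fast enough for the user).
good_requests = 998_200
total_requests = 1_000_000

sli = good_requests / total_requests     # 0.9982
slo_target = 0.999                       # "99.9% of requests are good"

error_budget = 1 - slo_target            # fraction of requests allowed to be bad
budget_spent = (1 - sli) / error_budget  # 1.8 -> budget overspent by 80%

print(f"SLI={sli:.4%}  budget spent={budget_spent:.0%}")
```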
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO or an SLI, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
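A toy sketch of the kind of internal library being described, assuming nothing about Procore's actual implementation: product teams record telemetry by intent, and the backend behind it can be swapped without touching their code. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Backend(Protocol):
    def record(self, kind: str, name: str, value: object, attributes: dict) -> None: ...


class StdoutBackend:
    """Stand-in backend; a real one might wrap an OTLP exporter or a vendor SDK."""
    def record(self, kind: str, name: str, value: object, attributes: dict) -> None:
        print(kind, name, value, attributes)


@dataclass
class Telemetry:
    service: str
    backend: Backend = field(default_factory=StdoutBackend)

    def metric(self, name: str, value: float, **attributes) -> None:
        self.backend.record("metric", name, value, {"service": self.service, **attributes})

    def event(self, name: str, **attributes) -> None:
        self.backend.record("event", name, None, {"service": self.service, **attributes})


# A product team's view: no vendor names, just "collect this piece of information".
telemetry = Telemetry(service="rfis")
telemetry.metric("report.export.duration_ms", 187, project_id=52314)
telemetry.event("report.exported", user_id=1843, row_count=4200)
```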
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate; the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed systems thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl, and with different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate metrics tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us playing: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is, they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification and container identification are another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
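One hedged sketch of that metadata point, using OpenTelemetry resource attributes in Python: host and container identity get attached once per process, so every signal it emits can later be narrowed down by host.name or container.id. The values below are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attach infrastructure identity once, at the resource level (placeholder values).
resource = Resource.create({
    "service.name": "example-service",
    "host.name": "ip-10-2-14-7.ec2.internal",
    "container.id": "b59c2a8d41f0",
    "k8s.pod.name": "example-service-6d9f7c5c9b-x2x4v",
    "deployment.environment": "production",
})

trace.set_tracer_provider(TracerProvider(resource=resource))
# Every span emitted by this process now carries these fields, so a query can
# group by host.name or container.id when one host is the one misbehaving.
```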
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of our journey as well. Like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, even though we're still focused on this pipeline side of it, if we start now, working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important to any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there, that's supposed to be a long term vision that is-- We call it a North star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway to that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitored fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then you said, being able to really go from an SLO, an SLI indicator, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and during teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we it with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us slaying is, this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something as seeing who screams.
The things that I find fast thing about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon think?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It's inherently valuable, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of our journey as well. Like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is that if we start working on tracing and instrumenting our code better now, even though we're still focused on the pipeline side of it, and especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
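A minimal sketch of that "library where we can swap out the back end" idea, using the OpenTelemetry Python SDK, under the assumption that application code only ever imports this wrapper. The module layout and the TELEMETRY_BACKEND environment variable are hypothetical; the point is that the vendor choice lives in configuration, not in application code.

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

def configure_tracing(service_name: str) -> None:
    """Set up a tracer provider whose exporter is chosen by configuration."""
    provider = TracerProvider(
        resource=Resource.create({"service.name": service_name})
    )
    backend = os.environ.get("TELEMETRY_BACKEND", "console")  # hypothetical setting
    if backend == "otlp":
        # OTLP can point at a collector in front of whichever vendor is chosen.
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
        exporter = OTLPSpanExporter()
    else:
        exporter = ConsoleSpanExporter()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

def get_tracer(name: str):
    # Application teams call this instead of naming a vendor SDK directly,
    # so the back end can change without touching their instrumentation.
    return trace.get_tracer(name)
```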
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important to any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there, that's supposed to be a long term vision that is-- We call it a North star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway to that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort? Has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams. As we were standing up a new service, my director really leaned in on it and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
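As a rough illustration of framing an SLO around what reliability means to the user rather than a random number, here is a small Python sketch of an availability SLI and the error budget it implies. The 99.9% target and the request counts are made-up examples, not Procore's.

```python
# A hedged sketch: an SLI measured from real request outcomes, compared
# against an SLO target. Numbers and window are illustrative only.

def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were good (successful and fast enough)."""
    return good_requests / total_requests if total_requests else 1.0


def error_budget_remaining(good_requests, total_requests, slo_target=0.999):
    """How much of the allowed unreliability is left in the window."""
    allowed_bad = (1.0 - slo_target) * total_requests
    actual_bad = total_requests - good_requests
    return 1.0 - (actual_bad / allowed_bad) if allowed_bad else 0.0


# Example: 10,000,000 requests in the window, 9,994,000 of them good.
sli = availability_sli(9_994_000, 10_000_000)           # 0.9994
budget = error_budget_remaining(9_994_000, 10_000_000)  # 0.4 -> 40% of budget left
print(f"SLI={sli:.4f}, error budget remaining={budget:.0%}")
```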
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO or an SLI to digging into what's causing the problem there, I think, is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way. It's that, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And it even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is, "I want to collect this metric, this trace, this log entry."
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
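Here is one hedged sketch, in Python over the OpenTelemetry API, of what such an internal library could look like: teams call a single wrapper and attach context, while the platform team decides where the data ultimately lands. The module and function names are invented for illustration and are not Procore's actual library.

```python
# A hedged sketch of the internal-library idea: product teams import one module
# and only think "record this", while the platform team decides which backend
# (Datadog, New Relic, Honeycomb, an OTel Collector, ...) receives the data.
from contextlib import contextmanager

from opentelemetry import trace  # vendor-neutral API; a no-op until an SDK is wired in

_tracer = trace.get_tracer("internal.observability")  # hypothetical library name


@contextmanager
def operation(name, **fields):
    """Wrap a unit of work in a span and attach team-supplied context fields."""
    with _tracer.start_as_current_span(name) as span:
        for key, value in fields.items():
            span.set_attribute(key, value)
        yield span


# A product team writes only this; where the data lands is configured centrally.
def sync_project(project_id):
    with operation("project.sync", project_id=project_id, team="scheduling"):
        pass  # business logic goes here
```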
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate. The struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams, versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed systems thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl, and differing teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in, more than having this glossy thing that will "show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun. At least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners at the company that are really interested in that space, that talk about it, that share articles, that try to drum up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us landing: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems, they saw cascading failures from just a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
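For a sense of scale, here is a back-of-the-envelope Python sketch of why tenant and host identifiers hurt in metrics systems, where every distinct label combination becomes its own time series. The counts are made up, purely for illustration, and are not Procore's real numbers.

```python
# A back-of-the-envelope look at why per-tenant and per-host labels hurt in
# metrics systems: every distinct label combination becomes its own time series.
# All counts below are made up, purely for illustration.
companies = 10_000
projects_per_company = 50
hosts = 2_000
routes = 300

# One counter tagged with {company, project, host, route} could fan out to:
worst_case_series = companies * projects_per_company * hosts * routes
print(f"{worst_case_series:,} potential time series for a single metric")
# ~300 billion, which is why these fields fit better on wide events or traces,
# where high cardinality is just another queryable column, not a new series.
```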
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to querying traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of our journey as well. Like I said, we're building a pipeline right now.
Our next step in that journey is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
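One way to picture "instrument once, swap the backend later" is a setup module like the following Python sketch using the OpenTelemetry SDK, where only the exporter choice changes while the instrumentation calls stay the same. The environment variable and exporter choices are assumptions for illustration, not Procore's actual pipeline.

```python
# A minimal sketch of "instrument once, swap the backend later": the code that
# creates spans never changes; only this setup module does. The environment
# variable and exporter choices are illustrative assumptions.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def configure_tracing():
    provider = TracerProvider()
    if os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"):
        # Ship to a collector/pipeline, which can fan out to old and new tools.
        provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    else:
        # Local development: just print spans to the console.
        provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
```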
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitored fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then you said, being able to really go from an SLO, an SLI indicator, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and during teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we it with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us slaying is, this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something as seeing who screams.
The things that I find fast thing about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon think?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land on in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production ownership and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation as part of this?
Jim: We're actively working on that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect."
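For illustration, here is a minimal sketch of that idea, assuming the OpenTelemetry Python API; the function, event names, and attributes are invented for this example, not Procore's actual instrumentation.

```python
# Minimal sketch: record an *expected* failure as a metric plus trace context,
# instead of firing an error report (e.g. a Bugsnag) for every occurrence.
# Assumes the OpenTelemetry Python API; names and attributes are hypothetical.
from opentelemetry import metrics, trace

meter = metrics.get_meter("auth")
tracer = trace.get_tracer("auth")

# Counter captures the aggregate trend of failed logins by reason.
login_failures = meter.create_counter(
    "login.failures", description="Expected login failures, by reason"
)

def handle_login(user_id: str, password_ok: bool) -> bool:
    with tracer.start_as_current_span("handle_login") as span:
        span.set_attribute("user.id", user_id)  # context for later debugging
        if not password_ok:
            login_failures.add(1, {"reason": "bad_password"})
            span.set_attribute("login.failed", True)
            return False
        return True
```

The point of the sketch is the shape of the signal: an aggregate counter plus attributes on the surrounding span, rather than a one-off error report that everyone learns to ignore.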
Liz: Yeah. That feels like one of those challenges: if you have the wrong data abstraction, but it's really easy to write data into that abstraction,
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical debt.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still do better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this was meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And we're also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, and configuration, in order to try to have a consistent story for the developers outside of our platform organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking at too, with a pipeline: being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
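That fan-out idea can be sketched even at the SDK level: instrument once, send the same telemetry to both the incumbent tool and the one being evaluated. A minimal sketch, assuming the OpenTelemetry Python SDK and OTLP exporter packages; the endpoints are placeholders, not Procore's actual configuration.

```python
# Minimal sketch: one stream of spans, fanned out to an old and a new backend.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc are installed;
# endpoints are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Keep feeding the incumbent tool so existing dashboards and alerts keep working...
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://old-vendor.example.com:4317"))
)
# ...while the same spans also flow to the tool being evaluated.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://new-vendor.example.com:4317"))
)

trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("demo"):
    pass  # application work happens here
```

In practice this routing usually lives in a shared pipeline such as an OpenTelemetry Collector or Vector rather than in application code, but the idea is the same: instrument once, route to many backends while you migrate.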
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now; OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, which is why we're also looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribl, because Cribl seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribl does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs that are the source of observability tooling.
Well, Cribl is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blobs you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
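As a rough illustration of that reshaping, the toy sketch below parses one messy, unstructured log line into a single wide, structured event. It is plain Python invented for this example, not Cribl's actual pipeline or configuration, and the log format and field names are made up.

```python
import json
import re

# Toy example: turn an unstructured log line into one wide, structured event.
# The log format and field names are invented for illustration.
LINE = '2024-03-02T10:15:04Z WARN checkout user=1234 company=acme latency_ms=842 msg="card declined"'

PATTERN = re.compile(
    r'(?P<timestamp>\S+) (?P<level>\w+) (?P<service>\w+) '
    r'user=(?P<user_id>\d+) company=(?P<company>\w+) '
    r'latency_ms=(?P<latency_ms>\d+) msg="(?P<message>[^"]*)"'
)

def to_wide_event(line: str) -> dict:
    match = PATTERN.match(line)
    if not match:
        return {"raw": line}  # keep unparseable lines instead of dropping them
    event = match.groupdict()
    event["latency_ms"] = int(event["latency_ms"])
    return event

print(json.dumps(to_wide_event(LINE), indent=2))
```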
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So, the work of moving to observability-- you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important at any given time.
Jim: I agree.
My blog post, Better Monitoring and Observability at Procore, that was the idea there. That's supposed to be a long term vision that is-- We call it a North Star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway in that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative: to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No," quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gained team members.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be one of the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial, can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing is, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really large ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries and aggregates of SLIs and SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort? Has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people. On one of my previous teams, as we were standing up a new service, my director really leaned in on it and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
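To make that concrete, an SLI phrased around the user might be "fraction of checkout requests that succeed and feel fast," compared against an SLO target. The sketch below computes such an SLI and the remaining error budget from a batch of structured events; the field names, thresholds, and the 99.9% target are hypothetical, not Procore's actual numbers.

```python
# Hypothetical example: an SLI computed from structured events, compared to an SLO.
# Field names, thresholds, and target are invented for illustration.
SLO_TARGET = 0.999  # 99.9% of checkout requests succeed in under 2000 ms

events = [
    {"route": "/checkout", "status": 200, "duration_ms": 180},
    {"route": "/checkout", "status": 200, "duration_ms": 2400},
    {"route": "/checkout", "status": 500, "duration_ms": 90},
    {"route": "/checkout", "status": 200, "duration_ms": 310},
]

def is_good(event: dict) -> bool:
    # "Reliable to the user" here means: the request succeeded and felt fast.
    return event["status"] < 500 and event["duration_ms"] < 2000

good = sum(1 for e in events if is_good(e))
total = len(events)
sli = good / total

error_budget = 1.0 - SLO_TARGET            # allowed fraction of bad events
budget_spent = (1.0 - sli) / error_budget  # > 1.0 means the budget is blown

print(f"SLI={sli:.3f}, target={SLO_TARGET}, budget spent={budget_spent:.1f}x")
```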
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively: we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO, an SLI, to digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way. It's that, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And it even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
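The shape of a library like that might look something like the sketch below: a team installs it, configures it once, and then only thinks "record this piece of information." This is a hypothetical illustration, not Procore's actual internal library; the module name, functions, and configuration are all invented, and it leans on the OpenTelemetry Python API underneath.

```python
# procore_telemetry.py -- hypothetical internal wrapper, invented for illustration.
# The goal: teams think "record this", not "which vendor am I sending this to?"
from contextlib import contextmanager
from opentelemetry import metrics, trace

_METER = metrics.get_meter("internal-telemetry")
_TRACER = trace.get_tracer("internal-telemetry")
_COUNTERS = {}

def configure(service_name: str) -> None:
    """One standard, unintrusive place for opinions: exporters, sampling, defaults.
    In a real library this would wire up the SDK; here it is a stub."""
    print(f"telemetry configured for {service_name}")

def count(name: str, value: int = 1, **attributes) -> None:
    """Record a metric without knowing which backend receives it."""
    if name not in _COUNTERS:
        _COUNTERS[name] = _METER.create_counter(name)
    _COUNTERS[name].add(value, attributes)

@contextmanager
def span(name: str, **attributes):
    """Wrap a unit of work in a trace span, with attributes attached up front."""
    with _TRACER.start_as_current_span(name) as s:
        for key, val in attributes.items():
            s.set_attribute(key, val)
        yield s

# Usage from a product team's code:
# configure("invoices-service")
# with span("generate_invoice", company_id="acme"):
#     count("invoices.generated")
```

The design choice being illustrated is the narrow surface area: product teams call `count` and `span`, and the platform team can change exporters, sampling, or even vendors behind `configure` without touching application code.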
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate; the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed systems thinking: service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in, more than having this glossy thing that will "show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us playing: this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is, they just bumped up latency between systems and saw cascading failures from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
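One common way around that limit is to keep high-cardinality identifiers as fields on wide, structured events, where an event-based backend can handle millions of distinct values, rather than as metric tags, where each unique value multiplies the series count and the cost. A small sketch with hypothetical field names, emitting the event as JSON as a stand-in for a real backend:

```python
import json
import time

# Hypothetical wide event: high-cardinality identifiers (user, company, host,
# container) live as fields on one structured event, not as metric tag combinations.
def emit_request_event(user_id: str, company_id: str, host: str, container_id: str,
                       route: str, duration_ms: float, status: int) -> None:
    event = {
        "timestamp": time.time(),
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        # High-cardinality context: fine on an event, expensive as metric tags.
        "user.id": user_id,
        "company.id": company_id,
        "host.name": host,
        "container.id": container_id,
    }
    print(json.dumps(event))  # stand-in for sending to an event-based backend

emit_request_event("u-1042", "c-acme", "ip-10-0-3-17", "3f9c2a", "/projects", 412.0, 200)
```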
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribble does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs, that are the source of observability tooling.
Well, Cribble is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blocks you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and.
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand that your old systems that have got a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that, the work of moving to observability, you can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important to any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there, that's supposed to be a long term vision that is-- We call it a North star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps us guide towards, "Well, what do we want to do here to move us in the right direction, and help define one way over another."
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you found there, in terms of, indexing to understand what's out there, and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway to that.
We have a lot of debt, that's one of our areas of debt, is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No,"quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teams.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition that is due equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think, I got set in a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really law ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do queries of aggregate of SLIs, SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort, has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people on one of my previous teams, as we were standing up a new service, my director really leaned in on it, and encouraged us to define SLOs, SLIs.
But at this point, we are making it one of our CORE foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitored fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively is the idea of, we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs before this team and this recent push on it, it was missed a lot, that it was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then you said, being able to really go from an SLO, an SLI indicator, digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And I can relate; the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch teams and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams, versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to just fix the libraries once and be done, or do you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed systems thinking: service-oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl, and with different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it often came down to choosing things for an individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in, more than having this glossy thing that will "show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun. At least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners at the company that are really interested in that space, that talk about it, that share articles, that try to drum up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see this playing in: observability is a foundational data source, and how we handle incidents is going to be tied into it.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is when they just bumped up latency between systems and saw cascading failures from only a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
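As a toy illustration of the experiment Jim is describing (not Netflix's actual tooling), the sketch below shows how a single second of injected delay at the bottom of a call chain blows the timeout budgets above it, which is exactly the kind of cascade that only cross-service visibility makes legible.

```python
# A toy illustration (not Netflix's tooling): service A calls B, B calls C,
# each with a tight time budget. One injected second of latency at C blows
# B's budget, and the failure propagates up to A, failing the whole request.

import time

INJECTED_LATENCY_S = 1.0  # the "chaos": one extra second at the lowest layer

def call_with_budget(fn, budget_s):
    start = time.time()
    result = fn()
    if time.time() - start > budget_s:
        raise TimeoutError(f"{fn.__name__} exceeded its {budget_s}s budget")
    return result

def service_c():
    time.sleep(0.2 + INJECTED_LATENCY_S)  # normally ~200ms, now ~1.2s
    return "rows"

def service_b():
    # B allows C half a second; the injected delay blows straight through it.
    return call_with_budget(service_c, budget_s=0.5)

def service_a():
    # The TimeoutError raised below propagates here, failing the whole request.
    return call_with_budget(service_b, budget_s=1.0)

try:
    service_a()
except TimeoutError as err:
    print("cascading failure:", err)
```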
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith. Is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification and container identification are another high-cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics. It's hard to narrow down to the area that is actually seeing a problem, because in some cases we don't have tools that handle high cardinality well.
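For illustration, one common way to handle identifiers like these is to put them on spans rather than on metric tags. The sketch below uses the OpenTelemetry Python API; the attribute names and values are invented, not a real Procore schema.

```python
# A small sketch using the OpenTelemetry Python API: high-cardinality
# identifiers (company, project, host, container) go onto the span as
# attributes, where they can be queried later, instead of exploding a
# metrics backend's tag cardinality.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("projects-service")

def handle_update(company_id, project_id, host, container_id):
    with tracer.start_as_current_span("update_project") as span:
        # Unique-per-tenant and per-host values are cheap on an event or span.
        span.set_attribute("app.company_id", company_id)
        span.set_attribute("app.project_id", project_id)
        span.set_attribute("host.name", host)
        span.set_attribute("container.id", container_id)
        # ... business logic ...

handle_update("company-48213", "project-991042", "ip-10-0-7-33", "c0ffee12")
```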
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to querying traces?
Do you think that that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of our journey as well. Like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is, if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
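One way that "instrument once, swap the backend later" idea could look, sketched with OpenTelemetry: application code touches only the vendor-neutral API, and a shared helper picks the exporter from configuration. The environment variable name and collector endpoint below are illustrative assumptions, not anything Jim describes.

```python
# A sketch (not Procore's pipeline) of swapping the backend without touching
# instrumentation: the shared helper chooses the exporter; app code is unchanged.

import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def configure_tracing(service_name):
    # Lives in the shared library; product teams never see this wiring.
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    if os.getenv("TELEMETRY_BACKEND", "console") == "otlp":
        # Point OTLP at whichever collector or vendor wins the evaluation.
        exporter = OTLPSpanExporter(endpoint="collector.internal:4317", insecure=True)
    else:
        exporter = ConsoleSpanExporter()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)

# Application code stays identical no matter which exporter is configured.
tracer = configure_tracing("billing-service")
with tracer.start_as_current_span("generate_invoice"):
    pass  # business logic goes here
```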
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SRE's across every team.
What is a platform team means your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way, and it's, as you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of either servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And even goes back to, like I said, we're talking about building libraries.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is I want to collect this metric, this trace, this log entry.
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allow them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: That's something I don't think that I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate that struggles between jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a rail shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams versus is, "Yes, we share rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main rails app is a large monolith.
And so, we do have a level of consistency on how things are done across that system, where the challenge is going to be coming in is as we grow, we are looking to embrace more distributed system thinking, service oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl and during teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual feature.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we it with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there was a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us slaying is, this is a foundational data source, how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something as seeing who screams.
The things that I find fast thing about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems and saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality, that you found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are to users, or companies, or projects, becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification, container identification is another high cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and then also just, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well, in some cases.
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to query and traces?
Do you think that, that's going to be a very soon think?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well as our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land on in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.
Jim Deville: So, the thing that struck me the most when I joined Procore was that, it is currently over 2000 people.
When I joined it was around 1500 people.
It was roughly the same size as GitHub, but so much of the company was dedicated to non-engineering focus, because of the fact that we are a sales heavy organization, and we have a significant presence in education.
And we have an entire arm of our company that's dedicated to that education, to reaching out to construction industry, to try to introduce them both to the benefits of our stack, but also just to moving forward as a technology with technology, actually not just as a technology, anything.
And so, it struck me at one of our orientations.
We have a week long orientation and the majority of that orientation is not just about the company, and HR, and things you'd normally expect from a normal tech company orientation.
A large part of it was, "Hey, here's how the construction industry works and here's how we fit into that."
Liz Fong-Jones: Right. How to have empathy with your users.
Jim: Yeah.
Charity Majors: Yeah. So, we're talking about, what it's like to be a very tech forward team in a non-tech sector.
And, it strikes me that this all comes back to mission, right?
In the tech industry we often get so like, "Our mission is to do technology."
And, we can almost forget sometimes that the only point of technology is to enable other people to have other missions that are much more people facing.
Jim: Yeah. And our mission overall, it is about connecting people across the globe.
It's very people focused. But again, in this industry that is very technical phobic and has been for a long time. So, it's different.
Charity: So, it's not turtles all the way down is what we're saying.
Jim: Yeah.
Charity: It's not necessarily turtles. There are other industries in there too.
This feels like a really good type for you to introduce yourself.
Jim: Yeah. I'm Jim Deville, I'm a principal software engineer at Procore, currently the tech lead on the observability team.
Charity: Woo hoo.
Jim: And I've been at Procore since 2019.
Charity: Nice.
Liz: So, we've had these discussions on O11ycast before about observability teams.
What's your definition of an observability team? Right?
What should an observability team be doing itself, versus working with other people to do?
Jim: I think one of the thing that defines it to me is, an observability team isn't just using the tools, or looking aside and saying, "Yeah, we have some tools over here that help."
It's about guiding the company towards a observability vision.
I think one of the things that really comes into play there is that nuanced distinction between monitoring and observability, it's captured best with the whole idea of known unknowns, versus unknown unknowns. And, being able to really understand your system from the outside. But it's really easy to get caught in that trap of traditional monitoring, speak, and mindset.
One of my team's major focus is, is to shift the culture of Procore to start thinking about, that whole observability space in a new way that isn't just falling into the old mindset, but it's actually using these tools to the best of their abilities and gaining the benefits from them.
And I think that's what makes a good team is to drive that conversation forward at a company.
Charity: This is so fascinating. Be more specific, what are some of the questions or the problems that you were falling into that you weren't able to deal with your old tool set, and what really impelled you to start driving the conversation forward?
Jim: It's interesting, because I almost came in a backwards fashion to this.
I started on a team that had a very massive dashboard, and we'd just check it almost on a daily basis, and it would help us identify what a sense of normal looks like.
But then we'd go into an instant and we'd still fall into the same pattern of, "Well, that dashboard shows us there's a problem, but doesn't really help us dig into what that problem is."
And sometimes, it ended up meaning that I'm jumping between my metric dashboards, my logging system, and a tool like Bugsnag, trying to find a period of time that I'm seeing elevated errors in my metrics.
And then, logs and stack traces from those other two tools, which are completely distinct tools at this point.
Charity: Right.
Jim: Trying to make them match up, so I can identify what's going on.
Other times it meant I'm looking at a dashboard, I have no other signals, and I'm doing the Cowboy coding of, "Let's go add some metrics here where we think the problem is, deploy it on the fly during the middle of an instant, and hope that gives us the save information we care about."
Charity: Right.
Liz: It's so much of a pain when you have to deploy new code, and if you don't get it right, then you're just introducing all this churn, and all this churn, and all this churn.
And what's even worse, it's like the Heisenberg uncertainty principle, right?
You perturb the system and it stops doing the bad behavior.
Jim: Definitely. And that's one of the things that's driven me forward is, we're growing, we're trying to move more towards a distributed of systems.
And, when I read about observability, when I've learned from other people, from charity, from yourself, from Ben, the places I see the most value is, we're going to get into those places where you can't redeploy to find out a problem because by redeploying you maybe lose the problem, or just exacerbate the problem.
Liz: Yeah.
So, you mentioned having done these investigations yourself, to what extent are other people at your company practicing production, ownership, and owning their own code?
Or does it fall onto the platform team every single time?
Jim: One of our values is ownership. And so, it's not just me that's doing this.
I'm just speaking from my personal experience of what made me see these in that light.
There are a lot of teams that do their own production investigations.
We have traditionally fallen towards when it's an infrastructure problem, you call in the SRE team and they own that.
And we're trying to shift that mentality more towards a DevOps, you build it, you own it, you run it mentality.
Charity: Right.
Jim: But again, to do that, we need better visibility.
Charity: Yeah. So, have you had to switch your attitudes towards instrumentation's part of this?
Jim: We're actively working that, to be honest.
We are identifying places where we have high noise because nobody's--
For example, I mentioned Bugsnag, we have a lot of Bugsnags firing off, and many of them get ignored because they're becoming expected.
And so, we're working towards driving better instrumentation practices, both with internal libraries to try to be opinionated, and also educational efforts to be like, "Hey, let's not log a user failing to log in as a Bugsnag.
That should just be, maybe, a metric or just a general contextual signal, so that you can identify aggregate trends there, but not fire off a Bugsnag that you expect.
Liz: Yeah. That feels almost one of the challenges of, if you have the wrong data abstraction, but it's really easy to write data into that abstraction.
People are going to do it.
Jim: Yeah.
Liz: And it's going to generate garbage, right? Garbage in, garbage out.
Jim: Absolutely. Yeah. And we want to try to make sure we get a better abstraction there.
It actually reminds me of, I've been recently talking a lot to my team about dashboards, and about how for certain areas of monitoring and observability they are useful, and for certain areas they can be--
Well, I think Charity put it best, perfidy.
Liz: Yep. Dashboards are technical dead.
Jim: Yeah. And, I liked one of the points she made in there, where, when you're starting from pure nothingness, a dashboard's a great signal, it feels like a great improvement.
And I just want to make sure we don't get caught in that rut of like, "Oh yeah, yeah, yeah. We got an improvement. That's better."
Well, we can still go better.
Liz: Right. Local optima versus global optima.
Jim: Absolutely.
Liz: Where these things that worked in a monolith don't necessarily work as you go to a distributed service.
Jim: Right.
Charity: Yeah. And, if you come to your dashboard with a belief or with an assumption and you get it confirmed, it's too easy to go, "Aha!"
Jim: Yes.
Charity: That's what's happening now.
When in fact, you might be observing a symptom, you might be observing one of many symptoms, you might be observing an effect, you might be--
All you know is that a graph went like this. You don't actually know.
What humans do is we bring meaning to things, right? And, for good or for bad, right?
The machine can tell you if the data's going up or down, but only humans can say, "Ah, this meant to happen."
Jim: Right.
Charity: So, I wanted to go into the people piece of this and the team aspect, which is, you were describing earlier how you were running into these barriers with your previous monitoring tooling.
And across probably many teams at your company, what caused you to get funding to create an observability team to tackle the problem across the entire organization?
What was that process like?
Jim: It's been almost organic. It was organic until it wasn't even, to be honest.
We had some internal conversations, I had them with my director, and we'd talk about tools like Honeycomb.
And, moving away from just monitoring towards this idea of better observability, something more.
And then, we started to bring up new teams and combine the focuses of that team.
They'd own these other things and observability.
And then, we had some new leadership come in as we grew, and they had experience running organizations at the scale we're trying to go towards.
And, they straight up said, "We'd like to have a team that focuses on this."
Charity: So, what's the value of your team?
Do you write code that interfaces between your developers and your third party vendors?
Do you define standards? Are you on call?
What is the mission of your team and how is that different from the other teams nearby you?
Jim: So, the mission of my team is both the educational-- It's broad, unfortunately.
And, we're trying to manage that and bring in partner teams to help us share that debt.
We're both owning the technical debt of the existing tooling.
We are also working towards building better systems.
We're trying to build a pipeline that will help us manage our various collection agents, and try to streamline, and make our data more consistent.
And that we're all also pursuing a developer side, which is building libraries that encapsulate collection techniques, opinions, configuration, in order to try to have a consistent story for the developers outside of our platform or organization.
Charity: Right. You can't just go in and say, "We're shutting off all of your own tools."
Right? You have to offer a migration path.
Jim: Right.
And that's one of the things we're looking too, with a pipeline, is being able to run a pipeline that can continue to send data to our old tools, while we then explore newer tools and explore changes to those tools.
Charity: You must be using OpenTelemetry.
Jim: We're looking at OpenTelemetry and Vector right now, OpenTelemetry's our favorite in this.
And, we just want to do our due diligence, is why we're actually looking at Vector at this point.
Charity: Yeah.
Liz: Yeah, that's fair.
Especially, it's one of those interesting things where they started off as a vendor neutral project, and then they got acquired by Datadog.
And, now there are some questions about their roadmap.
Jim: Yeah.
Liz: Another one that I think is really interesting in that space is Cribble, because Cribble seems to be pretty committed to doing the vendor neutral route.
Jim: Yeah. I've seen it come up a couple times. I haven't had enough time to really dig into it, but it's a very interesting technique.
Charity: The cool thing that Cribl does is...
Well, as you know, observability really rests on these arbitrarily wide structured data blobs that are the source of observability tooling.
Well, Cribl is great at taking all of the unstructured logs, all the messy logs, all the Splunk stuff, the stuff you would just fire off into logs and forget about.
And then, reassembling those into arbitrarily wide structured data blobs you can then feed into an observability tool.
Which is pretty sweet, right? It's pretty dope. Cool.
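(To make that concrete: a hedged, hand-rolled Python illustration of the reshaping Charity describes, not Cribl's actual pipeline, where one messy log line becomes a single wide, structured event that can be sliced on any field.)

```python
# Illustrative only: the reshaping Charity describes, done by hand in Python
# rather than inside Cribl. One messy, unstructured log line becomes a single
# wide, structured event that an observability tool can slice on any field.
import json
import re

raw = '2023-04-02T10:15:32Z host=web-07 "POST /api/v1/projects 500 1243ms user=8812"'

match = re.search(r'host=(\S+) "(\w+) (\S+) (\d{3}) (\d+)ms user=(\d+)"', raw)
host, method, path, status, duration_ms, user_id = match.groups()

event = {
    "timestamp": raw.split(" ")[0],
    "host": host,
    "http.method": method,
    "http.route": path,
    "http.status_code": int(status),
    "duration_ms": int(duration_ms),
    "user.id": user_id,
    # ...plus any other context worth keeping, all on the same single event
}

print(json.dumps(event))
```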
Jim: Yeah. No, that's very sweet. It's one of the things we're trying to get to ourselves, and--
Charity: You want to refactor, you want to redo your instrumentation.
But in real life, you're only ever going to be able to redo so much.
Jim: Right.
Charity: And yet you still need to understand your old systems, which have a lot of data coming out of them.
And so, it's a way to just reconstitute that stuff.
So that the work of moving to observability-- You can't just stop what you're doing and do it.
Jim: No.
Charity: I think of it more you put a headlamp on, and everywhere you go, everywhere you have to look, you add observability first, such that, over time you cover most of the territory.
And, you're certainly covering the territory that is most important at any given time.
Jim: I agree.
My blog post, The Better Monitoring and Observability at Procore, that was the idea there. It's supposed to be a long-term vision that is-- We call it a North Star internally.
And generally, this isn't just hand waving or anything like that.
The goal is, "Here's where we want to be three, five years down the road. We want this vision to be our truth."
So then, as we're looking around with that headlamp, as you say, that document helps guide us towards, "Well, what do we want to do here to move us in the right direction, and help decide one way over another?"
Liz: Yep. You don't snap your fingers and get there overnight.
Maybe you can if you were working greenfield, but certainly not in a brownfield environment.
I think another important piece of that is understanding the history.
Jim: Yes.
Liz: How did we get to where we are today? Right? Why did people pick these tools?
And, how can we show them that there's a better way?
So, what are the strategies that you've found there, in terms of indexing to understand what's out there and how people are using this stuff?
Jim: So, our journey was very unguided. And so, we've had a lot of leeway there.
We have a lot of debt--one of our areas of debt is understanding how teams are using these tools.
And then, going that next step and starting to try to encourage them to start using tools in certain ways to help shift the conversation if you will.
But it's been hard honestly.
And, what I continue to foresee as one of our hardest journeys, is understanding, changing minds, changing that mindset at Procore.
I think, that's the case for any company though.
Liz: Can you give some examples of teams that you've worked with and challenges that you've seen and helped them through?
Jim: Yeah. The biggest I think is, we have teams who will come to us just asking directly, they want to just start dropping a certain metric, for example.
And, we have to have a conversation with them to have them step back and talk to us, "Well, what metric are you trying to get? Why?"
And, often what we're seeing is that if we engage them in a conversation about better SLOs and SLIs to begin with, to lay a foundation, that often eliminates the need for a random metric, or we're trying to encourage...
And this is a newer initiative to attach those metrics in the context of a trace, of something larger that gives you that data point in context.
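(A small sketch of that idea, assuming the OpenTelemetry Python SDK and hypothetical field names: the data point is recorded as an attribute on the span that's already in flight, rather than as a free-floating metric.)

```python
# Hedged sketch: record the data point as an attribute on the span already in
# flight, instead of emitting a free-floating metric. Field names are made up;
# assumes the OpenTelemetry Python SDK.
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")

def generate_invoice(project_id, line_items):
    with tracer.start_as_current_span("generate_invoice") as span:
        # The number the team wanted to graph, now carried with its full context:
        # which project, which request, which trace it belongs to.
        span.set_attribute("invoice.line_item_count", len(line_items))
        span.set_attribute("project.id", project_id)
        # ...actual invoice generation goes here...
```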
Liz: That context is so important, because it gives you actual correlation, not just temporal correlation, but actual correlation to a real event that happened in the system.
Jim: Yeah.
Liz: I also love the thing that you said about getting people to stop treating you like a service provider. Right?
I've seen so many teams fall into the trap of, "We run the ELK stack and we do whatever people ask us to, including handling absurd volumes that they shouldn't be sending in the first place."
Jim: Right. Yeah, we take it very seriously that one of our main missions is to run these stacks, manage these stacks, and educate the company on how to best use them.
That means pushing back and saying, "No," quite often. Even with SLOs, we'll have teams come and say, "We just want this."
And it's like, "Well, no, that's not really answering that core question of what does it mean to be reliable," for example.
Charity: Right. Have you ever succeeded in actually deprecating and getting rid of a tool, for as long as you've been there?
Jim: So, my team's been fully staffed since May.
And so, the answer to that directly is no.
Charity: Yeah.
Jim: However, we are on the journey to doing that right now with one of our smaller tools.
Charity: Congratulations.
Jim: Our major tools are going to be a bigger headache, of course.
Charity: Well, so I hear that you just got promoted actually to principal engineer.
Jim: Yeah, I did.
Charity: Congratulations. Can you talk about-- I assume it was on the strength of some of this work that you were put up for that promotion?
Jim: Yeah. A large portion of selling me as a principal engineer was on my work of basically building the observability team.
The idea for it came from one of our leaders, as I've said.
But, the vision that we've had, which is covered largely in that blog post, and in our other internal visions has come from where I've been trying to encourage the company to go.
And, the team's been grown by me and it's gotten teammates.
Charity: We'll put the blog post in the notes. And that's fantastic.
But, that's so refreshing, because I think that observability, operations--
Operations, in general, tends to be one of the less flashy parts of the company.
And especially, for a company that isn't about infrastructure.
You're about a customer facing thing.
And I think it's so great to see that these skills, which are so crucial, can be seen, can be rewarded, can be part of the developer promotion path, and that there's recognition due that's equal to the impact that you have on the company, which I believe is probably immense.
Jim: Yeah. I was really happy to see that as well.
It's something I've noticed in other teams, in other companies in the past, where the infrastructure team is taken for granted sometimes, or they have to work harder to get the same recognition.
I think I got set on a good path in my last two companies.
Liz: The other really interesting pattern there, and that I think is really neat is getting rewarded for creating a roadmap to turning off tools. Right?
Because so often, people get rewarded for, "I built this shiny new thing."
Jim: Right.
Liz: Right? But the thing is, we don't need more things to run. Right?
We need fewer. And I think that's a very positive thing.
Charity: One less software.
Jim: Absolutely agree.
I've been really thinking hard about how I want to take these next steps as we really start to introduce, like I said, that distinctive idea of real observability into the company.
And, at one point I did have this idea of, "Oh, we need to cover all the use cases that our existing tool covers."
Which is a really large ask. And more and more lately, I'm like, "No, we actually don't necessarily need to."
And we can shift people to thinking about it, just saying, "I don't necessarily need all this. I can answer those questions just by having really good structured events, having the links between them, having the ability to do aggregate queries of SLIs and SLOs over time, all these faceted forms of looking into systems. I really believe we can get away with less."
And that's something I'm really wanting to push.
Liz: Right. The data signals are not Pokemon. You don't have to collect them all, right?
Jim: Right.
Liz: It's instead about the quality of the signals that you have, rather than trying to lather your data everywhere.
Jim: Yeah.
Liz: So, I heard you talking about service level objectives and service level indicators.
How much momentum for that was there at your company even before your observability effort? Has that been a core part of your story?
Jim: Not really. Before the team really formed, it was talked about by a few people. On one of my previous teams, as we were standing up a new service, my director really leaned in on it and encouraged us to define SLOs and SLIs.
But at this point, we are making it one of our core foundational pieces that, if you're bringing up a new service, we being the platform organization, we want to see SLOs and SLIs in place.
And in addition, we're trying to have that discussion about what does it mean to define them in a way that doesn't just measure random numbers, but really answers the question, what does it mean to be reliable to your user, and identifying who that user is?
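(A hypothetical worked example of framing the SLI around the user rather than a random number: count the fraction of good requests, set the objective, and derive the error budget. All figures are made up.)

```python
# Hypothetical worked example: turn "what does reliable mean to your user?"
# into an SLI, an SLO, and an error budget. All numbers are made up.

# SLI: fraction of user-facing requests that were "good" (successful and
# fast enough for the user not to notice a problem).
good_events = 9_968_212
total_events = 10_000_000
sli = good_events / total_events                     # ~0.9968

# SLO: 99.9% of requests good, measured over a rolling 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60

# Error budget: the unreliability the objective allows in that window.
error_budget_minutes = (1 - slo_target) * window_minutes   # 43.2 minutes
budget_consumed = (1 - sli) / (1 - slo_target)             # >1.0 means the budget is blown

print(f"SLI = {sli:.4%}, error budget = {error_budget_minutes:.1f} min, "
      f"consumed = {budget_consumed:.0%}")
```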
Charity: Resiliency is not about making sure things don't break.
Jim: Right.
Charity: It's about making sure lots of things can break without impacting users.
Jim: Yeah. And then, we want to use that as this foundation that we can then build the rest upon.
We have to have this in place, we feel like, in order to make a bigger observability story work.
Liz: And also, I think that does a good job of addressing people's instincts to page on more things. Right?
More monitoring good, more paging good, right?
There's a certain point where there's alert fatigue. It burns you out.
Jim: Oh gosh, yes. And then, you have people not paying attention to it.
And I've seen that. Like I said, I started on a team that had a massive dashboard.
And, we'd review it all the time, but I also know we didn't review every chart on that dashboard, and we ignored a whole lot of charts in that dashboard.
It's the same idea. It's not alert fatigue, it's monitoring fatigue or metric fatigue, but it's the same general idea.
Liz: So, that's really interesting to see that joining together of, "We should care about observability. We should care about SLOs."
It feels like those two pieces-- I've been pushing on SLOs for six, seven years at this point. Right?
But I didn't really get the traction behind it. Right?
Even with all of Google's backing until people understood how to debug their SLOs in addition to measuring their SLOs, right?
The SLO is meaningless if people just, again, treat it like another dashboard.
Jim: Right. And, yeah, that's one conversation we're having actively: we're getting teams to think about SLOs, but that has to be a living process.
And that's something that is, I think often missed.
I know in my previous experience with SLOs, before this team and this recent push on it, it was missed a lot. It was, you define SLOs and then, "Yay, they're set, move on."
But no, revisiting them regularly to make sure they're still serving their purpose, and answering the right questions, and revising them.
And then, as you said, being able to really go from an SLO, an SLI, to digging into what's causing the problem there, I think is a really important detail that's also often missed.
Liz: So, you mentioned earlier also the idea of a platform team, right?
We're definitely starting to see this as a trend, right?
Where you don't have disconnected SRE teams, or you don't just sprinkle SREs across every team.
What does a platform team mean in your organization?
How did that arise at your company?
Jim: I think my organization isn't necessarily special in that way. As you grow to a certain scale, you start to realize that you can't expect everyone to be operating at the metal level of servers, or Kubernetes, or AWS.
You need to give them an abstraction so that they can start to think about their business logic, but also deliver new services as seamlessly as possible.
And so, for us, it's about just building a set of tools that help abstract that and make these systems more and more self-service.
Again, without requiring the entire company to become an expert on Kubernetes, or AWS, or pick your other orchestration platform.
Liz: Right. People at the end of the day, want to write code that moves the business forward.
And your job is to help get all the other concerns out of their way.
Jim: Yeah. And it even goes back to, like I said, the libraries we're talking about building.
The way I think about that library is, I want to make it so that as a team, you can install the library.
You have a simple unintrusive standard way to configure it.
And then, all you need to think about is, "I want to collect this metric, this trace, this log entry."
You're not needing necessarily to even think anymore about, "I want to get something into Datadog, or New Relic, or Splunk, or whatever."
You're just thinking, "I need to collect this piece of information."
And then, we can go on from there to put it into the right tool and have good documentation to say, "Here's how to access your metric. Here's how to retrieve it."
And by doing that, we're allowing them to focus, again, on the business value that they can provide as that team.
Instead of having to think about all the intricacies of observability tooling at the same time.
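(A sketch of the kind of thin internal facade Jim describes, with names and API entirely hypothetical, not Procore's actual library: product teams call one neutral interface, and the platform team decides, and can later change, which backend each signal lands in.)

```python
# Hypothetical sketch of an internal instrumentation facade: product teams call
# one neutral API, and the platform team controls (and can later swap) the
# backends behind it. Not a real library; all names are made up.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Telemetry:
    service: str
    default_tags: dict[str, Any] = field(default_factory=dict)

    def metric(self, name: str, value: float, **tags: Any) -> None:
        # Today this might forward to a vendor SDK; tomorrow to an OTLP pipeline.
        self._emit("metric", {"name": name, "value": value, **self.default_tags, **tags})

    def event(self, name: str, **fields: Any) -> None:
        self._emit("event", {"name": name, **self.default_tags, **fields})

    def _emit(self, kind: str, payload: dict[str, Any]) -> None:
        # Single choke point: routing, sampling, and naming conventions live
        # here, out of sight of the product teams.
        print(kind, payload)  # stand-in for the real export path

# What a product team writes:
telemetry = Telemetry(service="estimates", default_tags={"env": "production"})
telemetry.metric("pdf.render_duration_ms", 182.4, project_id="p-123")
telemetry.event("pdf.rendered", page_count=14, project_id="p-123")
```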
Charity: You're also making it easy for people to move around within the company, from team to team, because there's a consistency there, a shared language so that they don't have to relearn an entire new way of developing--
Which is a trap that I think a lot of companies get into, and that makes it less likely that people will move around, which means that it's less likely that you'll retain your best engineers.
You can only really work on a project for two or three years before you get bored, and itchy, and you want to do something else.
Well, companies should be interested in retaining those people by giving them opportunities to move around.
Because that also prevents these dark holes from getting created where the way everything's done is different, the conventions are different, and it's just bad for everyone, right?
Jim: Yeah.
Liz: Yeah. It's one of those things where I had the joy of working with a great developer platform, when I was working at Google for 11 years. Right?
I was on nine or 10 teams in 11 years at Google. Right?
I think what made that possible was the interchangeability, that you could jump into a new team and use the same exact debugging tools, use the same exact build tools.
Right? To understand the code base from day one. That's really powerful stuff.
Jim: Yeah.
Charity: Yeah.
Jim: I don't think I've been at a company that's quite gotten to that level, but that's something I would absolutely love.
And, I can relate--the struggle of jumping across projects has often been a barrier to me staying at a company.
And so, I'll choose to go pursue something new, instead of choosing to switch and grow.
Liz: Absolutely. So, your company is mostly a Rails shop, I think you said.
Jim: Yeah.
Liz: How much of that is standardized across teams, versus, "Yes, we share Rails, but we're doing completely different things?"
How much have you been able to fix the libraries and be done, or is it like, you have to get every team to integrate your library?
Jim: No, our main Rails app is a large monolith.
And so, we do have a level of consistency in how things are done across that system. Where the challenge is going to come in is, as we grow, we are looking to embrace more distributed-systems thinking, service-oriented architectures, macroservices, microservices, et cetera.
And, maintaining that consistency as we expand is where I think that challenge comes into play for us.
Liz: How did you wind up with this tool sprawl, with different teams using different tools, if the core app is a monolith?
Do people just wire in their favorite APM tool and it just picks up everything everyone else is doing?
Jim: A mix of that. And also, with that favorite APM tool, it also came down to choosing things for individual features.
We have multiple tools that handle the so-called three pillars of observability, but we only use a slice of each of them.
And, sure, they may excel in some of those slices better than the others, but there wasn't a cohesive strategy to say, "Hey, instead of bringing in another vendor to handle this other area, let's really see how far we can get with what we already have."
Liz: Yeah. That can definitely be challenging because, yes, you do want to use the best tool for the job.
But on the other hand, there is a cognitive cost when you have to switch tools.
And I think you were describing that at the very beginning, right?
Jim: Yes. Absolutely.
Liz: Jumping between your logging tool, and a separate massive tool, and just--
Jim: Yeah. In my blog post, I speak to it, and I try to call that out particularly.
And I think I have a line in there about the idea of minimizing juggling of tools.
And, as I wrote that line... I originally wrote that as single pane. I wanted it to be a single pane.
I had somebody push back and point out, "Single pane, it's a beautiful concept. But let's be realistic. We probably aren't going to be able to get to a single pane for a long time."
And I thought about it, and I'm like, "Well, what am I really trying to get to with single pane? I'm really trying to get to avoiding that juggling and jumping around between tools."
So, yeah, I refer to it as minimizing juggling.
Liz: Right. It's the ease of use and the ability to dig in more than having this glossy thing that will "Show you everything you need."
But then, it's non-interactable. Right?
I think that's the trap that a lot of folks in our space fall into when they are designing these single pane experiences.
Jim: Yeah.
I definitely appreciate the tools that have really good integration points or ability to link out of themselves and support working even with competitors instead of tools that try to just tie you into everything.
Liz: Yeah.
It's been really fun, at least on the Honeycomb side, we've been doing some fun stuff with Grafana Labs recently, and it's been exciting to see the new ways that people find to integrate our product with the other different data sources that they have.
Jim: Yeah, I bet.
Liz: Cool. So, the last point that I wanted to talk about while we had you here was the idea of resilience engineering, and is it hype?
Is it not hype? How much of it does your company do and are they ready for it?
Jim: It's on a similar journey to observability, I'd say.
We are aware of the fact that we need to do better at it.
We have some great practitioners that are really interested in that space at the company that talk about it, that share articles, that try to draw up the idea of it, including myself.
And one of the things I'm trying to do currently is, we have this observability team that I'm working with.
We have related efforts in the resilience engineering space, and I'm trying to see about, how we can make them work more closely together to drive forward that idea as a whole. Because I see observability as a supporting piece of resilience engineering. You can't have really good resilience engineering if you don't know what's going on in your system.
And so, that's where I see us playing: this is a foundational data source, and how we handle incidents is going to be tied into this.
If we ever get to the point of doing chaos engineering, that would also be another point.
But again, chaos engineering isn't just unplugging something and seeing who screams.
The thing that I find fascinating about reading some of Netflix's early documents on their Chaos Monkey project is, when they just bumped up latency between systems, they saw cascading failures just from a second of increased latency.
But to be able to see those cascading failures, you actually really need really good observability that shows you the system map and these errors flowing between systems.
Liz: As it's said, "Chaos engineering without the engineering piece is just chaos."
Jim: Right.
Liz: So, yeah, I personally find that people really need to get their fundamentals of observability down, because if you can't even understand the chaos you have in your system, you have no business injecting more chaos.
Jim: Oh gosh, yeah. Don't worry, your users will do that enough for you.
Liz: You mentioned that you have a monolith, is it a multi-tenant monolith? Is it operated as SaaS?
What are some of the trickier things to debug with cardinality that you've found in your architecture so far?
Jim: Two of the biggest areas actually are related to user counts, because it is multi-tenant.
So, working with large numbers of unique tags that are tied to users, or companies, or projects becomes an issue.
The other thing is, at our scale we run a decent chunk of infrastructure.
And so, things like host identification and container identification are another high-cardinality area that has been biting us sometimes.
Liz: Oh yeah. Because a lot of folks charge you by the host, because they know that the host is a unit of cardinality for them and will impact their cost metrics.
Jim: Yeah.
Both that, and also, because we have so many hosts, we've run into situations where we don't have the right metadata attached to our metrics, and it's hard to narrow down to the area that is actually seeing a problem, because we don't have tools that handle high cardinality well in some cases.
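(One hedged illustration of that metadata problem, assuming the OpenTelemetry Python SDK with placeholder values: attach host and container identity once, as resource attributes, so every signal can be narrowed down to the slice of infrastructure that's actually misbehaving.)

```python
# Hedged sketch: attach host and container identity once, as resource
# attributes, so every span can be narrowed to the slice of infrastructure
# that is actually misbehaving. Assumes the OpenTelemetry Python SDK; the
# values are placeholders normally read from the environment.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "monolith-web",
    "host.name": os.environ.get("HOSTNAME", "unknown-host"),
    "container.id": os.environ.get("CONTAINER_ID", "unknown-container"),
    "deployment.environment": "production",
})

trace.set_tracer_provider(TracerProvider(resource=resource))
```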
Liz: Yeah. So, what does the journey look like for you in terms of starting to introduce people to tracing, starting to introduce people to querying traces?
Do you think that's going to be a very soon thing?
Or do you think that, that's something where people just need help with the very basics to begin with before they even start down the tracing path?
Jim: I think it's a very soon thing.
It's something that my team's talking about actively, trying to get started on in the next few quarters.
And I have other teams expressing interest that they want to explore these ideas more.
Because I think that that can be a really good proxy for the distinction between just monitoring and observability.
I'm hoping to drive those conversations forward together to use better tracing examples, to show the value of something more than just our current monitoring story.
Liz: Yeah. And I think the other thing that I've seen recently is that when you add instrumentation, that very act is like adding comments or adding tests, right?
It inherently is value, it's not busy work.
It helps you get a better understanding of your system.
Jim: Yeah.
Liz: Even before you run those first queries.
Jim: Absolutely. And that's the other part of it as well. In our journey, like I said, we're building a pipeline right now.
Our next journey of that is to take a look at our tool suite and really challenge ourselves on, "Do we need these tools? Do these tools serve us well? Are there other tools that we should be migrating to?"
And one of the things we're realizing is that if we start now, even though we're still focused on this pipeline side of it, if we start working on tracing and instrumenting our code better, especially if we do that in the context of a library where we can swap out the back end, then that's beneficial work no matter where we land in terms of vendors in the future.
Liz: Yeah.
Jim: Because the instrumentation itself is valuable.
Liz: Yeah. Time to first value, right?
DevOps tells us to shift value left, right?
So, the sooner you can do this experiment without waiting for the whole pipeline.
Jim: Right.
Liz: The quicker you'll be able to validate what you're doing.
Jim: Yeah.
Liz: Well, thank you very much for joining us today, Jim. It was an absolute pleasure.
Jim: Yeah.
Charity: Yeah. Thanks.
Jim: Thanks for having me.