Ep. #63, Observability in the Database with Lukas Fittl of pganalyze
In episode 63 of o11ycast, Charity and Jess speak with Lukas Fittl of pganalyze about database observability. This talk explores relational database management systems (DBMS), the open source library Sqlcommenter, query plans, and insights on supporting application teams.
Lukas Fittl is the Founder of pganalyze. He’s an engineer by trade, an advocate for Postgres, and was previously Principal Program Manager at Microsoft.
Transcript
Lukas Fittl: Yeah. I think that's a very typical, frustrating developer situation. As you said, typically I'm just thinking about my app, I have all my activity in my app, I have my functions, I have my API calls. But then where the application monitoring usually stops is at the point where you have, for example, a SQL query that goes to a database.
Unfortunately, a lot of times at that point application developers are like, "Well, I don't know what's going on," throw their hands up in the air, and pass it on to, for example, an infrastructure team, a DevOps person, or a DBA if they have one. I think that separation is ultimately not healthy.
I think that compartmentalization, where anything that's a database call gets passed to somebody else, is not helpful for us to actually have a complete understanding of application performance, ultimately. So the reason that I personally care about OpenTelemetry and observability in general from an application perspective is because if you think about this... Let's think of the slow experience, ultimately.
What we're trying to improve is that somebody sitting there, waiting for your webpage to load, and we want to know what's going on behind the scenes. Why is it actually slow? Is it slow because you write inefficient code? Is it slow because the database chose a bad query plan? Is it slow because somebody forgot to add an index? All of these are possibilities but if we ignore the database we only have half the picture.
Charity Majors: I think you're right, and this is something that's always bugged me too. I can't tell if developers have been trained into this feeling of helplessness or if they've just been shocked too many times, but the database is scary. With your code, you're more or less in control, but with data you have the code that you're running and you don't necessarily know what you're running it on.
That SELECT * could return instantly or it could take a week, based on the data that it's scanning and the order it was inserted in and whoever built an index. Sometimes you wake up and maybe there was a security thing going on overnight, and suddenly the data is no longer like you remember it being and your entire performance profile changed.
But like you, I think that that separation is not healthy because in the same way that your code is nothing without production and you have to care about production, your code is nothing without the data either. These are integrated. There's room for specialization, of course.
But you can't hope to understand and debug your app without a reasonable understanding of the trips and traps of data and how it affects your performance, and how you need to modify your code in order to handle it. This feels like a good time for you to maybe introduce yourself.
Lukas: Sure. I am Lukas Fittl. I am the founder of pganalyze. I am an engineer by trade, and I started using PostgreSQL 15 years ago. PostgreSQL is my main database of choice. I run a small company called pganalyze and we offer performance optimization, performance monitoring, and observability for PostgreSQL. We're happy to be here and talk about observability and how you can integrate the application perspective as well as the database perspective together.
Jessica Kerr: So what does it look like? What does it look like when you can see out of the application and into the database in your same trace?
Lukas: So I think the hard part is that even with pganalyze today, I'll be honest with you, that solution is kind of still not as nice as I would like it to be. This is not me pushing our product, actually this is me pushing Google's product. Google actually tried something here which I think is interesting. It's unfortunately not a great user experience but it is, I think, interesting.
What they did with Cloud SQL is they actually offered an integrated tracing perspective with Stackdriver and Cloud SQL. They actually have a way to combine those and link those traces together. What I'd love to talk a little more about also is that they actually created a project to support that. They created a project called Sqlcommenter, and Sqlcommenter is now part of OpenTelemetry.
But the idea is you essentially annotate queries in a way that you can look at it both at the database side, you can see what's the query plan for a particular trace. There's still that step though where it's not really seamless.
If you think of a trace and with the different spans, the way it still works is that you're making it jump from the application perspective to the database perspective. So it's more about linking things together, versus necessarily-
Jessica: Oh okay. So it's a link between traces?
Lukas: That's right. It's a link between traces, and more specifically it's a link to the execution plan of the database. What they implemented, and what we also have in pganalyze, is essentially a way to say, "Here is an execution plan for that particular part of a trace." So if you see that query with that span or with that trace ID, you're actually able to say what the execution plan was on the database side for that slow experience, essentially.
Jessica: Okay. So currently you still have to go to another tool?
Lukas: That's right. That's right.
Jessica: But you're able to get to the right place in that tool from your trace?
Lukas: That's correct, yes. Exactly. The big value here is in being able to drill down. With most observability tools today, if you look at it from the application side, the SQL is where it stops. You know the text of the SQL query but you don't know anything else. You know how long it took from the application side, including the network round trip, but you don't know whether it used an index or what it did on the database side. That's really where I think this combination is so important.
Charity: Yeah, the pattern where anytime you're jumping between application and database, you might get a completely different answer about what's happening if you're inspecting it from the database logs than you would get if you were inspecting it from the side of the application. The application might go, "That took 11 seconds."
And the database might be like, "Well, we ran it in .3 seconds. What the hell?" So yeah, I think that this has always been an interesting gap. I remember at Parse with MongoDB, we were very much cutting-edge MongoDB users and we did some janky-ass shit to get what we needed out of MongoDB. We turned on full query logging for all queries for all MongoDB clusters and then we did some intelligent sampling which would dump out things like number of rows scanned and indexes used and all this stuff.
Then we would dump that into Scuba, which is like a precursor of Honeycomb. It wasn't connected to the application. But that was the only way that you could go in and see what query plan it's using, and why it jumps around and uses different query plans depending on which shard it picks, and giving it nudges and all these things. Yeah. So you're saying that this basically integrates that, so that in one trace you can see both sides of that interaction?
Lukas: That's ultimately the idea, exactly. The part where it's challenging, and I think part of the reason why we haven't really seen a fully integrated solution yet... Let's imagine a Honeycomb that has the database side just integrated and shown directly in Honeycomb, for example, or let's say Datadog or New Relic.
They all don't do this integrated view, and the reason that doesn't really exist today is because on the database side you usually don't have exact measurements. That level of analysis doesn't really work the same way, so what we usually recommend people do is enable query logging, as you mentioned. You could of course turn everything on, but that's not good for performance reasons.
Jessica: That's the thing. That's the problem. You can't do that for performance reasons. You can't capture that level of telemetry without basically TCP flooding your entire node.
Lukas: Exactly. So in PostgreSQL, which we primarily work with, there's an extension called auto_explain and that essentially logs execution plans, but only if they match the sample rate. So you can say, "Take 1% of all queries and log that to the log." Or do something like, "Everything that's slower than a certain threshold."
And so our recommendation is essentially to do that for your database, so you have some way of logging the slow statements along with their execution plans; that's usually what we recommend. That way it is safe enough. Now, the problem is, where it gets expensive is the timing information. So if you think of an execution plan, oftentimes these execution plans have plan nodes, and you might wonder, "How do we represent an execution plan in a trace?"
Ideally each of these plan nodes would be part of the trace, so it would be the query, then it says, "Index scan took five milliseconds. Hash join took this many milliseconds." The problem is that that is actually expensive for a database to do. In PostgreSQL, for example, we always recommend people turn off that timing information because it's expensive to do these gettimeofday calls very frequently. That's where it gets challenging.
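For readers who want to try this, here is a minimal sketch of the auto_explain setup described above. The threshold and sample rate are illustrative values, not recommendations:

```sql
-- A minimal auto_explain setup (illustrative values, not recommendations).
-- For all sessions, add 'auto_explain' to shared_preload_libraries instead
-- of LOAD, which only affects the current session.
LOAD 'auto_explain';

-- Only log plans for statements slower than one second.
SET auto_explain.log_min_duration = '1s';

-- Of those, sample roughly 1%.
SET auto_explain.sample_rate = 0.01;

-- Collect per-node row counts, but skip per-node timing: timing is
-- the expensive clock-call overhead discussed above.
SET auto_explain.log_analyze = on;
SET auto_explain.log_timing = off;
```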
Jessica: Okay. So specifically the What Time Is It is very expensive?
Lukas: That's right.
Charity: Gettimeofday is known to be very expensive.
Lukas: It's just like asking the operating system what time it is is expensive, and so there's various tricks to how you can make it less expensive but it's-
Charity: Yeah. Because for complicated reasons, but it's one of the slowest things a kernel can do because it has to be precise.
Jessica: Interesting. I want to get some definitions in here because we've been throwing around a bunch of database terms. Let's tell everybody what a query plan is.
Lukas: Sure. So when you tell the database "SELECT * FROM table", ultimately the job the database is doing for you is figuring out how to get that data. In most databases, the part that figures that out is called the planner; in PostgreSQL it's also known as the optimizer.
It essentially looks at that statement, the parse tree, and says, "You're looking at this table and you want these columns and you're filtering by this information. So the best way for me, as the database, to give you this data is to do an index scan," for example. The database then decides, "Do an index scan on this particular index," or, if there are two tables involved, make a join between those two tables.
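As a concrete illustration of what a planner decision looks like, here is a hypothetical EXPLAIN; the table, index, and cost numbers are all made up:

```sql
-- Hypothetical table: users(id, email, created_at) with an index on email.
EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';

--                                 QUERY PLAN
-- ---------------------------------------------------------------------------
--  Index Scan using users_email_idx on users  (cost=0.42..8.44 rows=1 width=72)
--    Index Cond: (email = 'alice@example.com'::text)
```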
Jessica: You can see how this becomes really relevant when you've got really complex queries, right? When you're selecting, say, a bunch of different columns, or doing joins or whatever, because you've got indexes all over the place, you've got all these different data types, and the database is working from the parse tree.
There are ways that, if you do certain things in certain orders, it could be a lot cheaper. I don't know if this is typical, but I remember with MongoDB, it would cache the query plans, and the way it would do it is, one out of every 10,000 times a query gets run, it would run all of the possible plans for that query, pick the shortest one, cache that, and always use that plan until it refreshed.
The reason you have to refresh the cache is because sometimes the shape of the data will change and what was fast will no longer be the fastest way any more, when you've got compound indexes. Say you've got a data type that used to be a couple of integers and all of a sudden then it's a bunch of strings or something. Well, then your query plan needs to change in order to be efficient.
Lukas: That behavior is generally good, you want the database to adapt to what you're doing with your data, but the part where I found it to be really problematic is that plans, in PostgreSQL at least, are very particular to the values passed in. So if you search for something really frequent, it will often use a different index or different join strategy than if you search for something really infrequent.
So the problem is when things become unpredictable. Sometimes your queries are fast and sometimes they are slow, because the database decides to use a certain plan that's actually not the best plan. That's where that level of introspection becomes so important, because you need to know what the database did and, ideally, why.
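You can see this value sensitivity for yourself by comparing plans for a common versus a rare value. A sketch, assuming a hypothetical jobs table where most rows are 'active' and status is indexed:

```sql
-- Most of the table matches: scanning it sequentially is cheapest.
EXPLAIN SELECT * FROM jobs WHERE status = 'active';
-- -> Seq Scan on jobs
--      Filter: (status = 'active'::text)

-- Only a handful of rows match: the index wins.
EXPLAIN SELECT * FROM jobs WHERE status = 'failed';
-- -> Index Scan using jobs_status_idx on jobs
--      Index Cond: (status = 'failed'::text)
```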
Charity: But I would kind of argue though that these should not be problems that your average developer is running into every day.
Jessica: But when you run into them it's really important to know, "Oh, I just did a full table scan."
Lukas: The missing index case in particular, so if there's really no index at all, that's I think a case that the regular developer will run into. But maybe the plan falling over, that's more of an edge case.
Charity: Yeah. But the planner stuff, it should be magical. The point is that the database should be able to pretty much do this the right way most of the time. I guess I'm just wondering, proportionally, at your normal SaaS startup, developers shouldn't need to spend much time thinking about this at all because it should mostly work most of the time.
Where is the threshold of complexity or size that you find that teams start to run into this stuff regularly enough that it's worth expending a significant amount of resources on capturing this data?
Lukas: What we've found is that if you've got larger companies where there's multiple application teams, usually what starts to emerge is some kind of central data platform team or data engineering team or whatever you want to call it.
Jessica: Or a DBA?
Lukas: Well, they don't really like calling themselves DBAs these days.
Jessica: Oh, I was going to ask why nobody has a DBA anymore.
Lukas: It's not cool anymore.
Charity: Administration is out of fashion.
Lukas: Exactly, yes. System administrators are gone, database administrators are gone.
Jessica: Okay. So what are they called now?
Lukas: Well, probably data platform engineers. Or sometimes application DBAs, just adding application in front of the DBA to make it seem like they're more on the application side, not just sitting in their castle, accepting index changes and stuff like that.
Charity: So you're saying basically around the time that a company gets large enough to have a dedicated data platform team?
Lukas: Yeah. And I think what we've seen a lot is that these teams often struggle because ultimately they get passed all the problems, but the ones creating the problems, so to say, are the people introducing new features. So the application team works on a new feature, releases it into production, and then they realize they forgot the index, or maybe they didn't benchmark it well enough, then the database side falls over and suddenly the data platform team gets this surprise: "Oh my god. All hands on deck."
Jessica: And the whole point of platform teams is not supposed to be like, "Okay, pass us your problems." It's, "Okay, we're creating solutions so that you can self serve and understand your own problems." So this would mesh very well with that philosophy.
Lukas: That's right.
I think beyond just OpenTelemetry and observability, one of the other things that we've done in that space to support application teams better is to give them ways to get index recommendations. We're not the first to do this; SQL Server, for example, has had this for a very long time.
But one thing that we've found is that even just getting a starting point, saying, "Hey, you're doing all these queries and you don't have any supporting index for the WHERE clause you're passing in there." Even that, telling that to an application team in a very forward way, is a very big time saver, because the data platform team enables the application engineers instead of getting looped in all the time for all changes.
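The shape of such a recommendation is simple to illustrate. A hypothetical example of adding a supporting index for a hot WHERE clause (all names invented):

```sql
-- A query the application runs constantly, currently a sequential scan:
SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';

-- A composite index matching the WHERE clause; CONCURRENTLY avoids
-- blocking writes while the index builds (column order is illustrative).
CREATE INDEX CONCURRENTLY orders_customer_id_status_idx
    ON orders (customer_id, status);
```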
Jessica: So with the right tooling, can a small company postpone having a data platform engineer for longer?
Lukas: Probably, hopefully.
Jessica: Charity is nodding.
Lukas: What's challenging is that that skillset is still valuable. Even if you have the right tools, you still want somebody thinking about these things. But maybe it's a backend engineer who has a tendency towards being interested in databases, and so that person becomes one fourth of a data platform engineer until you really have the need for a full person or a full team.
Jessica: Yeah. Because if you can see what's going on, you know when you need to take action.
Lukas: Right.
Charity: Yeah. I think the way I would characterize it is less that you can put off needing a data platform team, and more that once you need one you can leverage a much smaller number of people for a much longer time.
Instead of having to grow proportionally with your application engineering teams, because the more people you have, the more problems you're getting sent, you could have a very small, limber team that knows which problems they need to help developers solve for themselves. They're so much more scalable, right?
I love that we're in this post-zero interest rate world where we now have to think about engineering efficiency again. I think this is fantastic, because so many of our problems in the past we would just solve them by throwing bodies at them. We said that we weren't, but we still did, right?
And ultimately that leads to creating busy work for people, instead of letting people actually solve the problems using automation, which frees them up to solve better problems. I really love that Adam Jacob, one of my personal heroes... He was at Chef and they were going around evangelizing automation to all of these companies that still had system administrators.
The anxiety was palpable because it's their jobs, it's how they put food on the table. Adam would always say to them, "Look. Everyone here, your jobs are safe. The point now is how do we make it so that the same number of people can do more and more and more, so that you can do more and more and more with your time. Instead of having to scale linearly with the number of customers or the size of your data centers."
This comes up over and over again, how do we scale people not linearly, but so that people can be more and more and more powerful and do more and more things with fewer of them?
Lukas: Yeah. I think we've seen that exact pattern in the database world, in the PostgreSQL world specifically, where that data platform team, or the single database administrator if they call themselves that, is just busy fixing slow queries. That's their day in, day out.
Charity: That's all they do.
Lukas: An engineer comes by and says, "Hey, I need help on this." The next engineer says, "I need help on this." And they keep just looking at query plans, doing all this manual work.
Charity: And they're so busy doing that, and they need to hire more and more people to just fix queries all the time, and that's basically the job, right? Build indexes, fix queries, build indexes, fix queries.
Jessica: Or they're looking at the databases, at logs, and they see these slow queries and they have to chase down the developers and tell them to stop doing it that way.
Charity: But they're so busy, it's not like they have time to do other things, right? Because you always hire just enough. The whole point of this is to get ahead of that and to start building tools and free up time for these people, which is why you should really want your database teams to be using as many of the same tools, same languages, same patterns, same product development as your other engineers. The less of a silo it feels like, the better off everyone will be.
Lukas: Yeah. Fully agree.
Charity: What does OpenTelemetry have to do with databases these days?
Lukas: Well, unfortunately just a little bit these days. I think part of why I'm interested in talking about it is because I think they should have more to do with each other. I think where it's challenging is that database people are usually not OpenTelemetry people. We don't talk enough with each other.
In the PostgreSQL world, the way most people interact with OpenTelemetry is they would have a PostgreSQL exporter that exports some basic high level metrics, and that's how they send that information. These days that's OpenTelemetry; it used to be whatever custom format it was called before.
But that's usually where OpenTelemetry ends. Really the other side of that coin is the Sqlcommenter story, which is really a niche project that Google contributed to OpenTelemetry. But I think it starts telling the interesting part of the story, which, as mentioned earlier, is how do we get the trace information together?
Jessica: How does it get the trace information into the database tooling?
Lukas: So the way it works with Sqlcommenter is essentially an additional library that you add to your application. The same way that you would add auto instrumentation with OpenTelemetry, you would essentially add the Sqlcommenter library as well and there's bindings for Ruby, Python, Java, popular frameworks essentially.
What it does is it automatically adds a comment to each query that gets executed by the ORM or the database driver, and that comment can include a couple of things. But most importantly in the context of tracing, it will include the trace ID, and sometimes the span ID as well, for where in the application you're at, essentially.
What happens then is, imagine on the database side you have a slow query, the slow query gets logged into the database log, and then that slow query will have that trace ID and that span ID next to it in the log. That way you have that connection when you look at the database side of the house, where you get more details. You know the query plan and all the other stuff, and you also know which trace it belongs to, so that way you can combine it.
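Concretely, a Sqlcommenter-annotated query carries key/value pairs in a trailing SQL comment, including the W3C traceparent; the key names follow Sqlcommenter's format, but the values below are placeholder examples:

```sql
SELECT * FROM users WHERE id = 1
/*application='checkout',
  traceparent='00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'*/;
```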
Jessica: So could we take that database log and export that as a span?
Lukas: We could potentially, yes. I think that's ultimately one of the things that we're looking at at pganalyze is to optionally allow that to happen, because what we do today is we do a lot of log parsing. We look at these error logs and we're like, "Hey, this is log event such and such, and here's the query plan and it's a trace and blah, blah."
So we already get that data, and so one of the things we're exploring is how we can send that back into another system, let's say Honeycomb for example. If you saw some information there, then you could actually have a link again between those two systems. In a sense, with OpenTelemetry, or OpenTracing as it used to be called, the idea is that different services can send to the same trace, so you should be able to just add additional spans that give more context.
Jessica: And on the new span, you could provide a link directly into the pganalyze tool?
Lukas: For example, yeah. Exactly.
Jessica: And that would be a place where your data platform engineers and your developers could communicate?
Lukas: Yeah. Right, exactly. And then they have a shared thing to talk about also, because oftentimes that's super helpful. If you're looking at the same thing, you're talking about the same thing. You're not just sending around huge text files or something.
Jessica: Yes. It's not My Logs versus Your Logs.
Charity: Our logs.
Jessica: It's our traces.
Lukas: Or a screen share from somebody.
Jessica: One thing I wonder about these days is new developers. When I was young I learned SQL really early, and I made friends with a DBA in the company I was working at, and I got a book on SQL. He showed me, what was the tool in Oracle? Precise. It was Precise, and in there he could show me the query plans, and then he gave me access to be able to do EXPLAIN PLAN and ask the database, "Hey, if I give you this query, how are you going to run it?"
Charity: Superuser.
Jessica: This was normal, this was what I needed to do my job as a developer back then.
Charity: It's heady. It's exciting.
Jessica: It is, databases are actually really cool.
Lukas: They are.
Charity: They are.
Jessica: I don't feel like most devs these days get that opportunity, even to play around with SQL prompts. What's up with that?
Lukas: Or ORMs, right? People don't write SQL anymore, people write function calls.
Charity: ORMs ruin everything.
Lukas: Well, they do. But that's the reality, right? If you asked anyone working on PostgreSQL directly, "How do your users, the people that use PostgreSQL, actually write their queries?" they'd say, "Of course in SQL." There's this huge disconnect, where most people that use databases like PostgreSQL don't write SQL. They write ORM calls.
Charity: They really don't write SQL?
Jessica: And that gets to the point where, if the DBA comes to you with, "Look at this terrible SQL," you're like, "I don't know where that came from." And that's where tracing is really important, because there's auto instrumentation for your ORM so that you can see what you did in code to cause that query to come out. And then you need the ORM wrestler, and those people are rare.
Charity: It is a little unfortunate though, because it's so much easier to spot potential problems in your query if you're looking at it in SQL than in ORM code. At least for me, code just looks like code, but in SQL you're like, "Ah, that doesn't look smart. Oh, look at that last join, that's probably not going to end well."
Jessica: And that's a plus you get from tracing: at least you can see the SQL that's generated. Also you can see the N+1 SQLs that were generated.
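The N+1 pattern is easy to recognize once the generated SQL is visible in a trace; schematically (tables invented), it looks like this:

```sql
-- One query for the parent rows...
SELECT * FROM posts WHERE author_id = 7;

-- ...then one lazily-loaded query per post, instead of a single join:
SELECT * FROM comments WHERE post_id = 101;
SELECT * FROM comments WHERE post_id = 102;
SELECT * FROM comments WHERE post_id = 103;
-- ...and N more of these.
```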
Lukas: Yeah. Definitely still a problem. The one thing that I would add, to be complete in the picture: I do think there is a bit of a trend back towards SQL, actually. For example, in Go there is a library called sqlc. What sqlc does is it actually has you write SQL in your Go code, and then it analyzes that at, I think, compile time, extracts type information from that query, and auto generates a binding of sorts to that query.
So you're always writing SQL, but you essentially get typed SQL in your code, and that's really neat because I think it ultimately gets us back to writing SQL, which I think is a good thing, probably. But it abstracts it better so that the application can use it more effectively.
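For a flavor of the approach: sqlc queries are plain SQL annotated with a comment that drives code generation. This example follows sqlc's documented annotation style, but the query itself is made up:

```sql
-- name: GetUserByEmail :one
SELECT id, email, created_at
FROM users
WHERE email = $1;
```

From this, sqlc generates a typed Go function (roughly GetUserByEmail(ctx, email)) so the application never builds SQL strings by hand.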
Jessica: Right. And in Java it's called jOOQ. J-O-O-Q, which I find ironic because ORMs are for object oriented code, and jOOQ has O-O in the name and it's actually not an ORM, so the O-O in it confuses me.
Lukas: But it's good. If I was using Java I would definitely look at jOOQ. I'm a huge fan of the blog posts that the author publishes regularly. Definitely a good one to look at.
Jessica: Great. So SQL is coming back, but also we're blurring the lines between developer and data platform engineer. We actually don't have that at Honeycomb and it worries me.
Charity: We have a team of Ian, Doug and Hazel.
Jessica: We write our own database.
Charity: That's our data platform team.
Jessica: Right. So our own database we understand really well, and it has a lot of tracing.
Charity: It has a lot of tracing. I feel like the other reason databases are so opaque to software engineers is that we can't really instrument them. They have logs and stuff, but... You've got the infrastructure code, the stuff that you have to write in order to get the stuff you want to write. Then you've got your crown jewel code.
Databases are this really important, really complicated piece of software that you're not shipping code to or instrumenting to see what's happening. So I feel like that's contributed to some of the priesthood aspects of it over the years. There's been so much knowledge that you can't instrument, you can't see; it becomes something you know in your gut or you don't, because you've broken it so many times and learned the hard way, or you haven't.
Jessica: They're also really mature.
Lukas: Yeah. PostgreSQL is really old. It turned 27, I think, this year.
Jessica: Wow. Yeah. Databases are magic for reasons, because there's many, many, many layers of expertise built into that database.
Charity: Yes and no. There are some old, stable databases. Stable databases, ha, ha, ha, ha. Data loss is not if, it's when. The DBRE book is way overdue for an update because we wrote it, what? Seven-
Jessica: What's DBRE?
Charity: Database Reliability Engineering. The O'Reilly book.
Lukas: It's a good book.
Charity: We wrote it like seven years ago, and there's nothing about cloud databases in it. There's nothing about columnar stores, or even Google's databases or Amazon Aurora. There's nothing about how to run them.
Lukas: One thing I worry about with cloud databases and all the managed services is that you're really losing even more introspection ability. For example, there have been multiple people doing work on eBPF tracing with PostgreSQL, so essentially not just looking at PostgreSQL from a "Let's run a stats query to see which queries have executed" perspective.
But just using eBPF to see which functions are being called. That actually is very valuable because you can annotate things and trace things that you otherwise don't have data on. I worry that if we're in this future where everybody uses a cloud database that they don't have any actual, true root access to, or true understanding of either, they can't look at the source, then we're losing that ability to really dive into the details when we need to.
Charity: I feel like this kind of comes back to the principle of if it's your core differentiator as a company, you need to know it intimately. For most companies, they don't really need to push their databases that hard. They can just kill it with money, provision a little more, be a little sloppy, whatever.
Jessica: The trick is don't kill it. The trick is if you need to push it harder, ah, split it up into two databases or something. Stay within the parameters.
Charity: Yeah, exactly. It's incredibly easy but once it becomes your core differentiator or when you reach a certain inflection point or threshold as a company, then you're going to have to bring it in house because you're a star and it's going to be a specialty.
Jessica: Hopefully then you're big enough.
Charity: And you're big enough, yeah.
Lukas: But I think that's where it's important early on to choose something that has longevity. If you choose the startup that creates their own database technology and then they go out of business, suddenly your database doesn't exist anymore. That's the one risk.
Jessica: Yeah. That's why we don't let anyone else run our database. We run our database.
Lukas: Right. Yeah, exactly. If you have that.
Charity: It's very much about right-sizing your risk. At Parse we made a lot of risky and interesting choices. We launched with MongoDB which was at 1.0, one lock per replica set at the time. We were running Ruby on Rails and some of those things came back to bite us. We grew up with MongoDB pretty well, but it was the Ruby on Rails one that really hurt us, because of the fixed pool of Unicorn workers with no real threading support.
Once we had hundreds of thousands of mobile apps and a whole bunch of backing stores, as soon as anything got slow, everything went down. So we had to rewrite it, the entire thing, in Go. But you almost never go out of business as a startup because of your technology. If your choices were good enough to keep you alive long enough to survive, they were the right decisions.
Lukas: I'm curious, if you did that again these days, on the database side would you choose MongoDB after all these lessons learned?
Charity: Hell no. Absolutely not. I mean, it depends on what for, right? It's great for some things. For anything super high performance, it's not. I would argue that what they really got right was, ironically, the JavaScript interface, instead of making you learn another language, and the administration aspects. The replication, the leader stuff, the way it automatically failed over whenever the primary... How is it that MySQL is still in a state where you can't? It's absurd.
Lukas: The sad truth is PostgreSQL is sometimes even worse on that side. That's the most common criticism that PostgreSQL still receives these days, and I think a lot of it is justified. The administration and the tooling for doing complex replication setups is just still not there in core. You can go purchase it, but it's not part of the core project, essentially.
Charity: Yeah. Anyway, would I choose it? We did not want to write a database. We looked at everything out there and came close to using Druid, but it was written in Java and we would've had to rewrite a third of it anyway just to get flexible schemas, and we were like, "Eh..." Honestly, if we were to start Honeycomb today, I bet we would use something like ClickHouse.
But I think ultimately that would've been very limiting for us, because there are going to be so many me-too things which are just built on top of ClickHouse; they have a pretty good UI and they get you 70% of the way to something Honeycomb does. But then you are limited by ClickHouse's built-ins, and you can't do a lot of the things that we do that are just near miraculous for observability.
Jessica: Right. It turns out that because our data store is our core differentiator, it's a big part of it anyway, we actually get a lot of benefits of having that in house. All those maintenance costs of maintaining it ourselves are a win for us because we also control change.
Lukas: I'm curious, this is kind of a tangent, but you mentioned earlier you have got instrumentation for your own internal data store. What does it actually look like?
Jessica: Heavily sampled.
Charity: Heavily sampled and traced. Ian Wilkes started writing the database the first month that we started Honeycomb. Interesting, and I know we're almost out of time, but the database started out looking like a fairly traditional columnar store, and then at some point we started serverless-izing the database.
We actually shipped the query planner and query builder and everything to AWS Lambda jobs. If you think about it, we would never have been able to run Honeycomb in any kind of cost-efficient way if we had to provision SSDs.
Because all of these resources are idle almost all the time, except for when somebody issues a query and then it needs to be super fast; you've got all these wasted resources. So moving a lot of that to Lambda jobs, and then moving a lot of the storage to S3, and then parallelizing those merges in the Lambda jobs... It was also an Ian Wilkes thing. Jess actually gave a talk about this, strangely, a year or two ago.
Jessica: Yeah. Because Ian didn't want to come, so I got to do it.
Lukas: It sounds to me like from what you're describing, one thing that you all are doing right at Honeycomb is to treat the database just like any other piece of software. I think that's part of the issue oftentimes, is ultimately the database is just software but we are treating it as something separate. What if we just think of a database as a function call? It's just another file in your source code, essentially. I think that would be much healthier.
Jessica: You still have to program it with indexes.
Lukas: Exactly. Yeah, you still do. You still pass some arguments.
Jessica: Sometimes. Well, thank you, Lukas.
Charity: This was really interesting.
Jessica: Where can people find you?
Lukas: Sure. These days I'm more on Mastodon. You can find me on the Hachyderm server as Lukas. I'm still on Twitter/X as well, though I'll probably leave soon. You can learn more about pganalyze on pganalyze.com, or we also do weekly "five minutes of Postgres" episodes, which are essentially short YouTube videos where we talk about PostgreSQL. So if you're interested in PostgreSQL specifically, definitely look at the pganalyze YouTube channel.
Charity: Well, this has been a fun little walk down memory lane for me.
Lukas: Yeah, thank you for making it. It's great.
Jessica: Bye.
Lukas: See you.