
Ep. #79, AI and OTel: Look at Your Data with Hamel Husain
In episode 79 of o11ycast, Hamel Husain joins the o11ycast crew to discuss the challenges of monitoring AI systems, why off-the-shelf metrics can be misleading, and how error analysis is the key to making AI models more reliable. Plus, insights into how Honeycomb built its Query Assistant and what teams should prioritize when working with AI observability.
Hamel Husain is an AI and machine learning expert with over 25 years of experience at companies like Airbnb and GitHub, where he led research efforts on projects like CodeSearchNet, a precursor to GitHub Copilot. Now an independent consultant, he helps teams build and evaluate AI-powered systems, specializing in observability and large language models. His work focuses on making AI more reliable, understandable, and actionable.
Transcript
Ken Rimple: So Hamel, why don't you introduce yourself?
Hamel Husain: Sure. I'm Hamel Husain. I have been working on machine learning for over 25 years. I've worked at a lot of different tech companies like Airbnb, GitHub, and a bunch of other startups.
I've been working on large language models for a really long time. I led a research project at GitHub called CodeSearchNet, which is a precursor to GitHub Copilot.
At some point I took on independent consulting, helping people work with AI, and that's where I met Phillip at Honeycomb. I helped Honeycomb with their first natural language query product, which was a lot of fun.
People know me a lot from open source, so I've done a lot of work on machine learning tools, data science tools, and infrastructure. So that's kind of a parallel thing that I've worked on as well.
Ken: Sure.
Jessica "Jess" Kerr: Also, no money in that.
Hamel: Yeah, for sure.
Jess: But some glory.
Hamel: Yeah.
Ken: I mean, Phillip's here too, so Phillip, why don't you let people know who you are?
Phillip Carter: Yes, hello. Yeah, I'm Phillip, I work for Honeycomb. I'm a principal product manager. Been at Honeycomb for a little over three and a half years now.
Prior to that, I had been at Microsoft working in developer tools for several years, and didn't really get into AI until like a couple weeks before ChatGPT launched.
I'm proud to say I wrote the document talking about how we should do AI at Honeycomb before that thing actually came out. But that was because I had been using GitHub Copilot prior. So it's not like I saw something that nobody else saw.
And then as Hamel mentioned, I worked with him and then eventually some others as well within Honeycomb to build our first AI feature called Query Assistant, which lets you write queries using natural language.
And it was sort of this very targeted feature for new users coming on board. They don't really necessarily know how to do observability queries, but they can describe what they want in natural language, and it works pretty well for that.
And now I'm heading up our AI efforts across the company.
Ken: So I guess certainly from a telemetry perspective and monitoring data, I know that a lot of these startups doing open telemetry for example, are starting to use AI in their tools.
What are some of the challenges that you run into in analyzing streams of data with AI?
Hamel: So--
As far as AI is concerned, it's a pretty nascent field. A lot of people have not been introduced to the concept of traces and spans until now. A lot of people have not been working on distributed systems or these kinds of things, they're working on AI. I would say it's so nascent that people don't really know where to begin. So a lot of folks don't have any instrumentation at all on their systems.
And usually that's the first starting point is okay, how do we instrument our systems properly?
And kind of like the first entry point to doing that is there's a lot of LLM specific observability and evaluation tools and vendors, some of them open-source, some of them commercial, and one easy way to get started is they often have a wrapper around popular SDKs like the OpenAI SDK, Anthropic SDK, whatever.
And like you change one line of code and you get a bunch of telemetry on language model interactions. And so that's where it starts. Like people then can have at least some view and record of the interactions that their application or the users are having with AI.
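To make that concrete: below is a minimal, hand-rolled sketch of what those one-line wrappers typically do, using the OpenTelemetry Python API and the OpenAI SDK. The span name, attribute keys, and model are illustrative assumptions, and an exporter is assumed to be configured elsewhere.

```python
# A minimal, hand-rolled version of what those SDK wrappers do: wrap each
# model call in a span and record the interaction as attributes. Assumes an
# OpenTelemetry exporter is configured elsewhere; span name, attribute keys,
# and the model are illustrative choices, not a standard.
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("llm-app")

def ask_model(prompt: str) -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("llm.prompt", prompt)
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical model choice
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.choices[0].message.content
        span.set_attribute("llm.completion", output)
        span.set_attribute("llm.usage.total_tokens", response.usage.total_tokens)
        return output
```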
And then the question quickly becomes, okay, like what do you do with that data? And people struggle with that.
You know, right now the most common failure mode is people reach for off the shelf metrics. So a lot of the tools and vendors out there, they have off the shelf metrics like hallucination score, conciseness score, helpfulness score, whatever.
And it's very tempting to say, let me just create a dashboard with a bunch of metrics, those off the shelf metrics. And it feels like you are doing evals or evaluating your system, and it feels like you have some level of observability, but you really don't, you don't really have anything.
'Cause off the shelf metrics for the most part don't do anything. They actually can be very harmful because they just add a lot of noise and confusion to the picture, and often they don't correlate with what's actually interesting and what should be prioritized and so on and so forth.
People don't know where to start. There's a notion that, hey, this all seems very tedious, can this just happen automatically? Can I just plug in a thing and have something magically tell me what my problems are and what I should focus on? And that's where people get into trouble with these off the shelf metrics.
And that's when people reach out to me at this point, either when they don't know what to do to begin with or they've done some instrumentation but they don't really know what to do with it.
Austin Parker: What's interesting to me is that it's in a lot of ways a mirror of what you see, I think, with most observability and monitoring tools. People will go, I am bringing in Kubernetes because my boss said we need Kubernetes.
And so we go and we sign up for whatever and they say, here's your Kubernetes dashboard and it has these dashboards and metrics and alerts and all of this stuff just packaged out of the box. And you install that and you go back and it's like, yep, we're good now.
And it's such a common pattern, and it's tough, because people still do this every single day. Like it's one of the biggest things.
As someone that's been working on observability for like 10 years now, and seeing it from almost every side of the equation, one of the biggest impediments, I think, to innovation in the observability space is that people will say, well, where's the thing that just tells me what I need to know?
And it's super challenging to kind of start with, well, that's kind of a you question, not a me question. I think people are certainly more susceptible to that pattern in AI because it is such a different kind of technology component than we're used to in the industry.
Hamel: I think that's right, yeah.
Austin: I would actually love to hear this from you, like how, you can't throw a rock without hitting some very strong opinions about AI, especially in the techie community these days.
What I want to know is like how do you, thinking about AI as like a building block, thinking of it as a component, like how do you describe that to people?
Like how do you kind of help cut through the noise as it were around AI and just be like, look, this is a tool or this is a component of a system and not like some weird world ending thing?
Hamel: Yeah, I can answer that question in multiple ways. I can start with like the observability angle actually. So like a lot of people, they have AI, they've instrumented it and it's pretty overwhelming to some people.
They're like, well what do I even look at? Do I just... What do I test or where do I begin? The outputs of the LLM are stochastic and I don't really know if I can test it.
If I use another LLM to test it, how do I trust that LLM, all these questions and quickly people become paralyzed. Now thankfully there's been a regime where we've already dealt with a lot of these things.
So in classic machine learning before LLMs we had the same issues. Like we have models that are making predictions of various kinds. Those outputs are stochastic and changing, and you have to figure out, okay, like it's not a deterministic system, how do you evaluate it?
So there's some pretty basic techniques. One technique that cuts through a lot of noise very fast is getting people to do something called error analysis, which is a very fancy term for looking at your data systematically.
And really all that means is, you start looking at your traces and you start with just writing notes on any issues that you find. You could start with like a hundred, there's no magic number, whatever.
Like, just start looking at data. We've done that in machine learning for a long time. It seems counterintuitive, like how can that be helpful? It seems like you're picking up grains of sand on a beach one at a time. How can this ever be a fruitful exercise? But it is, it's an extremely fruitful exercise, probably the most fruitful exercise you can engage in with a system like that.
And there's different ways to sample and things like that, you can get fancy with it, but to begin with, starting to look at data and interactions and like becoming really familiar with the kinds of failure modes that you're experiencing.
Jess: Are you looking for patterns, like common failure modes?
Hamel: Yeah, so in the beginning it's good to just write notes about what's wrong.
'Cause you don't even know what the common failure modes are in the beginning. You don't even know what those categories are or what those are.
It's important just to pay attention and go through some data and write down what is wrong, and then you can use like an LLM to categorize those notes later.
Jess: Ooh.
Hamel: Or you could do something like that. But whenever I go through the exercise, everything comes into focus, for example, what kinds of failures are happening the most, what kinds of things you should be paying attention to.
You start to get an idea of what metrics might be interesting and where things might be failing, and you start to get a sense or a smell of, okay, like this is what I need to work on.
And it brings everything into a lot of focus because you can't test everything. You don't want to measure everything. There's too many components and too many things. But if you start to focus on what is actually going wrong, you can get a lot of intuition from doing the error analysis.
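A minimal sketch of that error analysis loop, assuming traces have been exported to a JSONL file with input and output fields (the file name and fields are hypothetical): page through a sample, write a free-form note on anything that looks wrong, and save the notes for later categorization.

```python
# A low-tech error analysis pass: look at ~100 sampled traces, write free-form
# notes on whatever looks wrong, save them for later categorization.
# File names and the trace fields are assumptions for illustration.
import csv
import json
import random

with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

sample = random.sample(traces, k=min(100, len(traces)))  # ~100 is a starting point, not a magic number

notes = []
for i, t in enumerate(sample, 1):
    print(f"--- trace {i}/{len(sample)} ---")
    print("USER INPUT:", t["input"])
    print("LLM OUTPUT:", t["output"])
    note = input("What, if anything, is wrong here? (enter to skip) ")
    if note:
        notes.append({"trace_id": t.get("trace_id"), "note": note})

with open("error_notes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["trace_id", "note"])
    writer.writeheader()
    writer.writerows(notes)
```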
Jess: This reminds me a lot of my reading about social science research and grounded theory.
Hamel: Okay, go on, yeah.
Jess: Oh, Brené Brown describes some of her research techniques, which involve interviewing people and then going through the interview transcripts and doing what social scientists call coding, which is first noticing themes that come up again and again.
And then you go back and you like tag, we would call it tagging excerpts that match each theme and then you can like put them together and look for patterns and get new questions and then go research those questions.
Hamel: Yeah, that's it, that sounds remarkably similar to what this is. It probably is like almost the same thing. And there's a structured way to go about it. There's a way to navigate the data.
You know, like when you start to become more familiar with your data, you can segment it by different dimensions and things like that, probably same thing you do in observability space.
And just looking at data, I think amongst the machine learning practitioners out in AI right now, we're all repeating this phrase, look at your data, constantly, beating people over the head with it, because it's pretty uncool but it's probably the most effective activity you can do.
Jess: We think it's not rigorous, like as developers, we think if we have to use our human eyes to look at it, then well it's subjective.
But there are structured ways. Yeah, and the social scientists have studied this for decades, there are rigorous ways to do research on individual experiences and their descriptions.
Hamel: Yeah. And the key is to make that process like low friction. Because you can imagine looking at traces can be painful, looking at hundreds of traces, like you want to make it as smooth as possible and as enjoyable as possible so you can page through those traces easily, and write notes easily without clicking around at a bunch of stuff, without navigating through a bunch of stuff.
Jess: Without reading very large text fields in a box this big.
Hamel: Exactly. Yeah. And you want to be able to render it nicely. So like if your traces contain, let's say, I don't know, Markdown or code or whatever, it's human-readable so that you can just flip through it really fast.
Because you need to make it enjoyable essentially. And that's actually really interesting. Like at this point, as of today, I'm still building custom tools to view data quickly, even though we have observability tools for LLMs, even though we have things like annotation queues, a lot of times they're not dialed in enough.
Like they still lose us a lot of times, there's a little bit too much friction or something like that going on. They're getting better. I talk to a lot of them on a regular basis so they are getting better.
So that's a key part. And then doing some data analysis on those, on the error analysis. So like, okay, you've done a bunch of error analysis, you can take what you've learned and then you can do all kinds of stuff.
You can do data analysis on like your notes and those categories and that coding that you described. You know, there's different terms you can ascribe to it, but it's really to figure out, okay, what's most important?
You can think about writing tests, you can do all kinds of things. And that exercise requires some amount of data literacy. You have to be comfortable with manipulating data, thinking about data, have some sense of maybe navigating data.
It seems trivial, but a lot of people don't have these intuitions. Like, oh, how do we summarize these notes? Or how do we summarize these things? What do we look for?
My intuition is it's probably the same struggles that you have in non-AI as well, based upon what Phillip has told me.
Austin: Yeah, I think that's what I was trying to get at earlier in a lot of ways, right? Like ultimately the observability problems are sort of generic observability problems.
You have all of this data about system state and you need to parse it and understand it, and it's maybe a little more difficult because these are non-deterministic systems, and so that makes your life harder. I think that's the appeal of the out-of-the-box hallucination rate metric or whatever.
It's this comforting idea of like, oh there is some sort of best practice here. And that's just not the case I think in AI. And I'm not sure it'll ever be, maybe I shouldn't say ever be.
Like I think when you get down, I think as... well we've seen a trend in the industry getting towards smaller and more purpose-built task-oriented loops for AI as like a part of functionality. So presumably eventually you could get there.
The other interesting thing is like, where it does change is the type of data, or not the type of data but maybe the type of things that are in the data, like traditional observability tools tend to be--
They focus on like density and I have like all of these small points, and then when you get into AI and you're suddenly dealing with like just this tremendous amount of stuff inside each data point, like wide events or whatever.
And I'm looking at a thousand tokens or whatever in markdown format and like the tooling just isn't there to like let you display that in a conventional way.
Hamel: Yeah, definitely.
Jess: Yeah. And I love what Hamel said about once you have the categories you can ask an LLM to categorize things. So you could get the LLM to tag and like highlight pieces of the many character outputs that look relevant to the problems, and then you could spot check them.
Hamel: Yeah. And you can get LLMs to add dimensions to the data. Like for example, you want to slice and dice that data according to what's relevant to your business. So like what channel is the customer talking to you in or what kind of request is this or whatever.
And if you don't have that, you can try to get LLM to help you, say, okay, like categorize this interaction according to X, Y, Z, which can help with data analysis.
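A sketch of what that LLM-assisted categorization could look like, using the OpenAI SDK; the category list, model, and prompt are assumptions, and the outputs are worth spot-checking rather than trusting blindly.

```python
# Use an LLM to add a dimension (e.g. request category) to each interaction so
# the data can be sliced and diced later. Categories, model, and prompt are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["billing question", "scheduling request", "bug report", "other"]

def categorize(interaction_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {
                "role": "system",
                "content": "Classify the user interaction into exactly one of: "
                + ", ".join(CATEGORIES)
                + ". Reply with the category only.",
            },
            {"role": "user", "content": interaction_text},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"
```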
Jess: So that can scale your error analysis beyond looking at a hundred?
Hamel: Yeah, well it can make it easier to navigate the data. But it's important to start with, people are always extremely surprised when I tell them, okay, we're going to look at a hundred examples, and then after they're done they're like, wow, I learned so much.
Phillip: So I can make some of this concrete. When we built Query Assistant back in early 2023, I mean it's not that there's a ton of best practices for LLMs today anyways, but there were definitely a lot less back then.
So we did our best, knowing very well that what we were going to launch at first could totally fail and be a disaster. So we're like, okay, we have to be willing to rip this feature out if it's not actually doing what it should, but how do we know it's doing what it should?
Well there's like a million ways to do it, but what we ultimately landed on was like, okay, well let's not try to focus on creating the perfect query for someone yet.
Because if they're a newcomer they may not be the best judge of like what an ideal query for their question is.
And it's like more important that they engage with the product and iterate than it is that like one shot perfect query, one and done, out of the system.
So our success metric there was can we actually create a runnable query based off of their input text?
And you might imagine some people type in some bullshit and like there's no way that you can make a query, like someone asks a question about like, what's the capital of Michigan?
It's like, okay, yeah, that's going to fail. But I don't care about that, right, like I care about when you're asking about like how many things are there, but then you create a count distinct inside of Honeycomb that's like structurally incorrect.
And so then we can't actually run that query and it's like, okay, we actually failed to do the job here. And kind of to what you were saying earlier, Hamel, it was so evident just immediately. We just created groupings of natural language input, the error if it existed, and what the output was, and just grouped by that.
And as you might imagine there were a lot of different groups, but like literally at a glance, I would say within like three minutes I would already have a sense for like what kinds of things are generally failing right now.
Like it was so simple to just sort of look at it in just like a table of data, and then be like, oh cool, alright, this is a category of problem, we're going to focus on that today. And then we fix it and we just like run the query the next day.
Or like we think we fix it, because we have the full trace, like we have the exact prompt that was sent, so like we could usually reproduce what the actual bad output was so we could sort of iterate on that in dev, push it, wait like a day, run the same query and be like, do we still see that same failure case, and just repeat this process.
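Phillip ran this grouping inside Honeycomb itself; if the same fields were exported instead, an equivalent roll-up might look something like this in pandas. The file and column names are guesses, not Honeycomb's actual schema.

```python
# Group failures by (error, natural language input, generated output) and look
# at the biggest buckets first. Column and file names are assumptions.
import pandas as pd

events = pd.read_json("query_assistant_events.jsonl", lines=True)

failures = events[events["error"].notna()]
top_failures = (
    failures.groupby(["error", "nl_input", "generated_query"])
    .size()
    .sort_values(ascending=False)
    .head(25)
)
print(top_failures)
```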
And over time, like most of the time it would just work and actually do the job. And there were a lot of these concrete cases where the fix wasn't even prompting.
It was like, oh well maybe it could be prompting, but actually we have all the information in the response that the model gave us to just like change it programmatically to be correct.
Jess: The same way as a person, if I looked at that query it was trying to run, I would just type it in correctly instead of the way it said.
Phillip: Yeah, yeah. So like using the count distinct one as an example, a very common failure pattern weirdly enough was it would do a count distinct as the operator for a visualization clause, and then it would have two columns instead of one.
And so we're like, well which one do we pick? I don't know, pick one, whatever. Like what's the worst that's going to happen?
And like actually we know the type of the column and so like we know that you count distinct on a string is nonsense, at least in the current Honeycomb type system.
So we just picked the one that's not a string and the first one we see is not a string, and add that in there.
And we did this a few times and tried to make notes, and we also added into our instrumentation when we applied a particular programmatic fix so that we could track systematically how often we run into this problem, so that we could then assess, okay, is this something that continually happens, which indicates no, we really do need to use prompting to fix this, or is this just a thing that happens every once in a while, much lower priority for us to deal with?
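A sketch of what that kind of programmatic fix plus tracking could look like. The query structure, column schema, and attribute name are invented for illustration and are not Honeycomb's actual internals.

```python
# If the model emits a COUNT_DISTINCT visualization with two columns, keep the
# first non-string column, and tag the current span so the fix's frequency can
# be tracked. Data shapes and the attribute name are illustrative guesses.
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def fix_count_distinct(calculation: dict, column_types: dict) -> dict:
    columns = calculation.get("columns", [])
    if calculation.get("op") == "COUNT_DISTINCT" and len(columns) > 1:
        non_string = [c for c in columns if column_types.get(c) != "string"]
        calculation["columns"] = [non_string[0] if non_string else columns[0]]
        # Record that a programmatic fix was applied, so we can query how often it fires
        trace.get_current_span().set_attribute("query_fix.count_distinct_columns", True)
    return calculation
```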
And that was able to wipe out a tremendous amount of the errors alone. So that like the prompting that we did have to do was like very narrowly targeted on like, okay, we can't programmatically fix this and this is just a fundamental flaw in how we're asking the model to do something.
And over time the feature just got stable to the point where it's still in the product today, and when new users sign up for Honeycomb, we literally point them to it: hey, type in a thing, ask for what you're looking for. And it usually works.
Austin: It works really well.
Phillip: Yeah, it works pretty well. I wish it could work a lot better. There's a lot of regrets that I have about how we built it and all that.
But like I think it really speaks to that, it was a way of us looking at our data, but like it wasn't hard and in fact it was really, really easy because everybody internally at Honeycomb uses Honeycomb to like observe things that are happening on the live Honeycomb product.
So it was in a place where every engineer like knew how to look at that data and like page through all the different cases. If that wasn't the case, I'm sure we would've maybe even just gone with like a Google sheet or something like that.
Something just as simple as that. Like whatever is the easiest for people to look at it, it just makes the problem so apparent like immediately.
Hamel: That's great. Like you didn't have all these bullshit metrics, you had things that actually were going wrong. Specific to you.
Austin: Like corresponded to reality.
Hamel: Yeah.
Phillip: Yeah. No, no, I am actually curious though about when it is appropriate to roll something up into a metric that you do want to track long term?
And like my intuition, and I don't necessarily know if this is true, but my intuition is that this would have to come after you do data analysis and you find that like, okay we can derive a particular metric that if we monitor actually does tell us something meaningful.
Hamel: Yeah.
Phillip: But I've never done that before.
Hamel: Yeah. So when you do your error analysis, you can see failure modes that are not trivially resolved. Like, for example, there are a lot of errors where you'll see that, oh, this is just an engineering problem.
Like I don't have the right API key in here or something trivial like that. And you'll have some errors like, oh, that's actually really difficult, is this something I want to tackle over time?
And you can design a metric specific to the error. An example of that is, okay, I'm working with an AI leasing assistant for property management companies right now, and their AI assistant gets dates wrong often.
They just make mistakes in dates, like a user will ask for, hey I need something next year and it'll give you the date for this year or whatever.
That's a failure mode we want to avoid. And you could construct like a data set in some ways of measuring that, like if your prompt is hitting these different errors.
And so like there's many different kinds of errors, there's ones that you can test with like assertions, there's ones that you might need to use the LLM as a judge.
So for that date thing, you can construct a data set with a golden or expected output and you can do almost an assertion against it. There are other errors where it might be harder.
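A sketch of what an assertion-style check for that date failure mode could look like: a small golden dataset of prompts with expected dates, and a test that the assistant's reply contains the right one. The dataset and the ask_assistant function are hypothetical.

```python
# Assertion-style eval for the date failure mode: golden prompts with expected
# years, and a check that the reply mentions the right one. ask_assistant() is
# a hypothetical function that calls the leasing assistant.
from datetime import date

GOLDEN_CASES = [
    {"prompt": "I need a tour sometime next year, what dates are open?",
     "expected_year": date.today().year + 1},
    {"prompt": "Is anything available later this year?",
     "expected_year": date.today().year},
]

def check_dates(ask_assistant) -> float:
    passed = 0
    for case in GOLDEN_CASES:
        reply = ask_assistant(case["prompt"])
        if str(case["expected_year"]) in reply:
            passed += 1
    return passed / len(GOLDEN_CASES)  # fraction of golden cases that pass
```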
So for example, Phillip, when you and I were working together, we did an exploration of fine-tuning a model for the natural language Query Assistant.
You know, we had an LLM as a judge that we used to figure out, okay, is the query good or not? Because that's subjective. Like, you know, is it good? Like query can be valid but is it good?
And so LLM as a judge is a very interesting topic. It's like, oh that's interesting. You can use an LLM to judge another LLM, like how does that make sense? And the way that works is, you shouldn't just use an LLM as a judge blindly. You have to do some analysis of how much that judge agrees with a human.
So what I did with Phillip is, I had Phillip be a judge and write detailed notes about whether a query was good or not. We used a Google sheet.
And he wrote detailed critiques of, okay, this query is, it's okay, but it could be a lot better. I told Phillip to make a pass/fail decision, and then also like detailed notes.
And I used what he wrote to do iterative prompt engineering until the LLM as a judge became aligned enough with Phillip. Like there was a lot of agreement between the LLM as a judge and Phillip.
So that's an example where, okay, like you have an LLM as a judge and you can have some confidence in it 'cause you did some principled analysis of, okay, this agrees with the domain expert.
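A sketch of that agreement check, assuming the human pass/fail labels and the judge's decisions have been collected in a CSV (the file and column names are assumptions): keep iterating on the judge prompt until agreement with the domain expert is comfortably high, and keep spot-checking the disagreements.

```python
# Compare the LLM judge's pass/fail decisions against the domain expert's.
# labeled.csv is assumed to have human_label and judge_label columns.
import csv

def judge_agreement(path: str = "labeled.csv") -> float:
    agree, total = 0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row["human_label"].strip().lower() == row["judge_label"].strip().lower():
                agree += 1
    return agree / total if total else 0.0
```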
Ken: So this is interesting because the hype curve on LLMs and such is that eventually you could get the person out of the loop completely and maybe someday you can.
And certainly as you train things better, you have less and less involvement because you know roughly the conditions that happen.
So you're not, I'm sure Phillip, you're not looking at the chat windows every single minute anymore, you know roughly what you want to focus on.
But in prep for this, I was reading some of your content and I found that really interesting article on Devin, and so-
Hamel: Oh yeah, yeah.
Ken: Yeah. So I guess the point being that, without a human in the loop, at least, I don't know if there's a human in the loop with Devin, but it seems like there's a lot less human hands-on. It kind of goes and noodles on its own and keeps working on projects.
Like what did you find when you were trying to get Devin, which is basically an autonomous coding agent to do work for you?
Hamel: Yeah, I mean with Devin it didn't quite work at all. I mean I think maybe it's, for the tasks that I was trying to do, plus maybe the way the technology is at the moment, that kind of workflow wasn't working.
Jess: Like the error analysis?
Hamel: No, no. This is like a... this is a little bit of a side quest, I think.
Ken: It's a tangent.
Hamel: Yeah. So Devin is an autonomous coding, I guess I could call it agent. I don't really like to use that word because it's very nebulous, it doesn't really tell you what it is.
But Devin is a kind of like a, it's a product, it's like a SaaS that you can purchase. I think it's like $500 a month.
You can connect it to a Slack channel and you can talk to it like a human being, say, Hey Devin, please look at this GitHub repository or look at this issue and fix it, essentially.
And then it'll go off for like hours or days and it will iterate on that until it tries to figure something out, and it'll even make a pull request on your GitHub repo.
And the idea is you can outsource work to it almost like an intern, and it should be able to tackle tasks that a junior engineer would be able to do. That's the promise.
You know, in the article that I wrote on it, we just outlined like, okay, it didn't quite work for us.
So I think, at the end of the day, I know the promise is like, hey, you could take the human out of the loop. The issue is you have to trust the AI somehow. And the only way to trust something is to check it.
Now the better it is, the less you may check it, it's just like anything else in life.
Like, okay, this thing is good, I trust it. Every time I've checked, it's like done the right thing. Maybe it gives you more confidence but you still have to check it.
So I don't know if you can completely take the human out of the loop, but maybe as AI gets better, you can take it more out of the loop.
Jess: Or look at it as, it continually increases our leverage as a human.
Hamel: Yeah.
Jess: Because you do a little bit of error analysis, you teach it to do that error analysis and then you come back and you check that too.
Hamel: Yeah.
Jess: But we can make it increasingly good, and ourselves increasingly powerful.
Hamel: That's right. Yeah. That's what I was doing with Phillip with the LLM as the judges. I was trying to get basically encode Phillip in some rough sense into an LLM with prompting, like all his different rules in his head and whatnot, at least for that narrow problem.
So I could scale his judgment across all the-
Jess: We scale his judgment, I like that. We still have to have the judgment, we still have to exercise our judgment, and we can use the AI to scale that.
Austin: I mean this like little snippet of conversation here is sort of the filter function I think for people both in the industry more broadly about like what is AI.
Because at the end of the day, I mean it is not just a stochastic parrot, but it's also, it's a token prediction engine. It's not a replacement for people, but it does augment what you can do significantly.
And I think what's frustrating is that this is mostly a trivially testable fact. Like I've been playing around a lot with Cursor and various AI IDEs, and I think what's fascinating is that it's gotten good enough that, for a constrained set of tasks, you actually don't have to check it that much.
Or you can build-- the Cursor agent mode has gotten really good, I would say, at checking itself.
Hamel: It's really impressive. I really am impressed with Cursor.
Austin: It's hugely impressive, right? Like, especially if you're using something where the training data is really good and documentation is able to be fed into the context window really well.
So if you are asking it to make a pretty basic, maybe not even pretty basic, but something that's conventional, like a React web app in Node, not using the latest and greatest stuff, but again, something that's easy to write and most importantly easy to check.
Like that to me is the thing that makes Cursor really powerful is that if you are using it, the agent part, then yeah, it'll write something, it'll look at the linter, it'll look at the language server feedback, it'll try to run it with multimodal capabilities.
It can actually just look at, it can pop a browser open and then quote unquote "look at what is in the window" and continually iterate. And that is all like super duper impressive.
But as soon as you go beyond what I would maybe consider personal software, you start to see the flaws really... like it starts to break in really non-obvious ways unless you have significant domain expertise, very, very quickly.
And it is very, it feels brittle, it feels bad. Now two years ago it felt brittle and it felt bad, and we've probably had more or less exponential increases in capability over the past two years.
I might've won money on the Super Bowl last night, but I am not necessarily a betting man. So I don't know if I would put a marker down on a continual exponential increase in capability.
But I do think that it's still very, very early. But anyone telling you that Devin is a replacement for an SRE or whatever is blowing smoke up your backside.
Hamel: Yeah. Like on the Devin point, I don't think that's a replacement for the SRE.
I think actually the capability and the UX don't lend themselves to working, whereas the Cursor agent lends itself to working more, because it encourages you to guide stuff along the way a lot more, and quickly, so that even though it's doing a lot, you can still course correct it.
Whereas Devin is more asynchronous in nature, where you kind of say something and completely forget about it.
Austin: That's a good point.
Hamel: And I think... I want it to work. I think that's really interesting, like the async is interesting.
Like, oh let me just, before I go to sleep, let me think of 10 tasks and just like give it to something and see what happens. That's kind of cool. It may work for some things really well, I don't know.
You can already see like a little bit of that, there's the async nature to OpenAI operator, but also Deep Research, sorry. So Deep Research is like-
Austin: Hmm. Yeah.
Hamel: I mean it doesn't take that long, but it works quite well for at least scoped research tasks. It can take like two, three minutes, at least for that scoped task. Like I found that it often can be impressive.
Austin: And I think it goes to like the general point where, so much of AI I think is not the AI itself, it's the interaction modality and it's like how you're stitching together the functionality that you can, what the model lets you do.
It's a more or less non-deterministic natural language programming suite to use some words.
But how you interact with it is what matters and the completely asynchronous workflows, it's not a person, it doesn't have judgment.
Hamel: Yeah.
Austin: Like there was a story that went around over the weekend or like end of last week I think of someone that tried an operator and like it, they paid 30 bucks for eggs because they logged into Instacart with it.
Hamel: Yeah.
Austin: It's like, okay man, why did you log into it? it's like-- The AI can't save you from your own poor judgment, I guess.
Hamel: Yeah.
Jess: But it can amplify it.
Hamel: Yeah. A lot of times operator wants me to log into my Gmail, Google stuff, and I'm like, no.
Austin: It's like, no, no man, doesn't work for me. But I think this goes to... like the terrifyingly generic point about this is just that these are tools, they're not a replacement for humans.
They are certainly a replacement for like human work in many cases. And that has its own terrifying implications. But-
Jess: But not for human judgment.
Austin: Right. I understand why people get all het up about all this, it's just, I wish we could all be a little more intellectually honest about the field, or about what is actually happening here.
Hamel: Yeah. To answer your question from before like you asked, how do you cut through the noise?
So another angle on it is, I actually encourage all of my clients to not use jargon as much as possible in meetings, so like things like agents. I don't allow people to say the word agents.
I just say, you're not allowed to say that word, let's talk about what you're actually trying to do. A lot of times, once you introduce the word agents, it alienates all the non-technical people.
And there's like a lot of times the job to be done is not really that technical at all. It's like, hey, we need to write a prompt.
And what I found is, and I'm talking about jargon of all kinds, so with agents it's like, okay, what exactly do we need to do next?
Like what is this meeting about? Don't just say the word agents. Do we need to write a function, have some tools that do web search? Are we working on retrieval,
or are we working on search, or are we working on... or are we just working on a prompt? Same thing with terms like RAG.
So I make sure people say, instead of RAG, say hey, we need to give the language model the context that it needs to make a good decision. And just like tell people to rephrase it that way.
And then a lot of teams don't have engineers building everything. Like engineers are not the domain experts. For developer tools, you can say engineers are the domain experts.
But like a lot of cases like engineers aren't the domain experts. And so the people writing the prompts and things like that, they're not developers.
And so when you start throwing unnecessary jargon everywhere, it alienates those folks and it kind of doesn't bring clarity.
And then also I find that when people use jargon unnecessarily, that's one way to smell BS. Like you have to look into what it is that you're actually doing.
Like don't just use this term like agent, like let's drill into, what do you mean by agent?
Phillip: Yeah, we're going to unleash a swarm of agents.
And it's like, well, it's hilarious because I took statistics in college and I know when you have a failure rate and you multiply it, it takes a very short amount of time before everything is wrong all the time, always.
And like... Yeah, it's very funny seeing that kind of, I'm like, you want to put something in there that can like multiply its failure rate.
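The arithmetic Phillip is pointing at is easy to make concrete; the step counts and the 95% per-step success rate below are illustrative.

```python
# Chaining steps that are each "pretty reliable" compounds quickly.
for steps in (1, 5, 10, 20):
    success = 0.95 ** steps
    print(f"{steps:>2} steps at 95% each -> {success:.0%} end-to-end success")
# 1 step ~95%, 5 steps ~77%, 10 steps ~60%, 20 steps ~36%
```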
And actually I think one of my favorite articles to date on this topic is the Anthropic article about patterns in building agents and workflows.
And they very cleanly distinguish between like, okay, you might make several requests to an LLM in the course of like a feature doing a thing.
But if you know how many requests you're making, like that's not an agent, that's just a workflow and there's nothing wrong with that.
It just, instead of calling the LLM once you call it five times or however much, and that's like very neatly bounded.
And an agent is a thing where it's like, well you don't know how many times it's going to make however many calls it needs to make.
Jess: Because the LLM is going to decide what calls to make.
Phillip: Maybe. Yeah.
And as they work through this article, the patterns for agents are much more weird and complicated than they are for workflows, because with workflows, you could substitute the LLM with any other black-box piece of tech, like a cloud service, and it's very strongly analogous to any other software that you would build. It's just, now you have an LLM in the mix.
And if you frame it in that sense, I think a lot of people building this stuff can make more sense out of it.
But you throw an agent into the mix and, yeah, there's this thing that's just going to run and do some stuff, and we need to determine what its exit criteria is, and what success means and what failure means, and all the variations thereof.
And I would argue most teams, especially if they don't have domain experts in the room, are just not set up to actually define that.
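A sketch of the distinction being drawn here: a workflow makes a known, bounded number of LLM calls, while an agent loops and lets the model decide what to do next until some exit criterion. The call_llm helper and the tools are hypothetical.

```python
# Workflow: a fixed, known number of LLM calls. Agent: an open-ended loop where
# the model picks the next step, bounded by explicit exit criteria.
# call_llm and the tool functions are hypothetical helpers.

def workflow(ticket: str, call_llm) -> str:
    summary = call_llm(f"Summarize this support ticket: {ticket}")        # call 1
    category = call_llm(f"Categorize this summary: {summary}")            # call 2
    return call_llm(f"Draft a reply for a {category} ticket: {summary}")  # call 3
    # Exactly three calls every time: a workflow, not an agent.

def agent(goal: str, call_llm, tools: dict, max_steps: int = 10) -> str:
    context = goal
    for _ in range(max_steps):  # explicit exit criteria matter: cap the loop
        action = call_llm(f"Goal: {goal}\nSo far: {context}\nName the next tool, or say FINISH.")
        if action.strip() == "FINISH":
            break
        tool = tools.get(action.strip())
        context += f"\n{action}: {tool(context) if tool else 'unknown tool'}"
    return context
```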
Austin: So actually this does lead me, I think, to an interesting question, which is, the one piece of critique that I think is really applicable is, because AI doesn't have judgment, it can't say, I don't know.
Jess: I mean you can ask it for a confidence level and it can give it to you.
Austin: Yes. But the way that a lot of these systems are designed is-
Jess: You have to ask it or it'll just make something up usually.
Ken: It does make something up very well.
Hamel: Yeah. Yeah. I mean it can tell you it doesn't know, but it's less likely to say that I think.
Austin: Yeah, I mean the thing is, for a variety of reasons that I don't fully understand, because I am not an ML scientist, the system is designed to give you output because of the way it works, even in quote unquote "reasoning models."
And I do think that there are probably, arguably, things where chain of thought will get us towards a, like, "I don't know" state for LLMs.
But I do think that the only way to really counteract that, in production systems especially, is to go back to the start of this: observability.
Like the machine won't tell you it doesn't know, but you can observe independently that it doesn't know.
And I think one of the things that is a really underexplored area of LLM observability is being able to communicate confidence, being able to communicate these sort of observability measures back to the end user.
Hamel: Yeah, I think one way that I like that people are doing this is like citations. 'Cause you have to check stuff and so... Yeah, when you have citations it's really helpful.
Austin: Yeah, I like the citation stuff. I think chain of thought has a lot that can go there.
There was a really interesting paper that I think Phillip maybe shared with me recently about embedding confidence in the token outputs.
Phillip: Yeah. That's also like kind of an explore-- Like I'd even thought about that like several years ago of like, okay, well yeah, as Hamel was saying, very, very early on, these things are inherently stochastic. So like that implies some kind of confidence score.
Austin: Hmm.
Phillip: Whether the system is aware of its confidence of its own answer is a wholly separate question, but like there is some kind of confidence involved.
Hamel: Okay. So there is a confidence score. There's things like perplexity, which is like when a large language model is conducting inference, it's giving you one token at a time.
And when it's doing that, there's a probability associated with each token that it's given you. And it's basically sampling those tokens according to the probabilities.
And so if a large language model is giving you something with like a very low joint probability, that's something that, it's an output that would've surprised the model.
Like surprised even another LLM, maybe. And so it's an interesting bit, there's an insult where they say, hey, everything you say is low perplexity, meaning you're not saying anything interesting.
Whereas if you're high perplexity, maybe you're saying something more interesting 'cause I can't predict the tokens that are coming out of your mouth.
And so it's unclear whether or not like that correlates with, I mean it may correlate with trust, but a lot of times I want a high perplexity answer. Like within the nature of what I'm looking for.
Austin: Right. I want the temperature.
Jess: It's almost like obviousness, how obvious is this?
Hamel: Yeah. If it's like a set of words that are always in the training data, it's going to be low perplexity. If it's something that's like, wow, I didn't expect this, these tokens that's high perplexity.
So it's like, it's interesting. It's like, okay, like on one hand, yes, maybe you can say there's some correlation between high perplexity and some confidence, but on the other hand, I don't know, if you're really exploring something unique, maybe it is high perplexity.
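For reference, the quantity Hamel is describing can be sketched from per-token log probabilities like this; the numbers below are invented for illustration.

```python
# Perplexity from per-token log probabilities: exp of the negative mean log
# probability. Higher means the output "surprised" the model more.
import math

def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

likely_answer = [-0.1, -0.2, -0.05, -0.15]    # tokens the model found unsurprising
surprising_answer = [-2.3, -1.9, -3.1, -2.7]  # low-probability tokens

print(perplexity(likely_answer))      # ~1.1, low perplexity
print(perplexity(surprising_answer))  # ~12.2, high perplexity
```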
Austin: And then at some point you get into the microwave, the r/microwaves problem, where-
Jess: The microwave problem?
Austin: Turns out there's a subreddit where people just post roleplaying as microwaves. And so all of the posts are just the letter M over and over and over.
And so when you start getting too many Ms in a row, then the next most likely token is another M and just-
Jess: And the rest of the output is microwave.
Austin: The rest are like, just (humming).
Jess: People are so funny.
Austin: Which I should say has terrifying implications for anyone listening to this that is trying to do anything about training foundation models on observability data.
Jess: So it's time to wrap up then. Hamel, if there's one piece of advice you want to leave people with?
Hamel: Yeah. Look at your data.
Jess: Yeah, I knew you were going to say that. Fantastic. This has been great. Oh and Hamel, how can people find you on the internet?
Hamel: You can find me on Twitter, Hamel Husain. I also have a website, Hamel.dev.
Jess: Hamel is H-A-M-E-L.
Hamel: Yeah. H-A-M-E-L.dev.
Ken: Awesome.
Jess: If you're looking for Phillip on the internet, try the Honeycomb blog, Honeycomb.io/blog.
Ken: Yeah. Hamel and Phillip, thank you so much for both being on this show. This is a great discussion.
Hamel: Yeah.
Ken: I really appreciate you coming on.