O11ycast
40 MIN

Ep. #71, Evaluating LLM-based Apps with Shir Chorev of Deepchecks

about the episode

In Episode 71 of o11ycast, Jessica Kerr and Austin Parker sit down with Shir Chorev to delve into the nuances of incorporating Generative AI into applications and testing the outcomes. Shir shares her journey from a data scientist to founding Deepchecks, driven by her vision to enhance the quality and reliability of machine learning and AI in real-world scenarios. Gain valuable insights on how LLMs are reshaping the development landscape, making advanced projects more accessible, and strategies for mitigating adversarial use cases. Whether you're in DevOps, infrastructure, or just passionate about AI, this episode is packed with expert advice and cutting-edge information.

Shir Chorev is the Co-Founder and CTO of Deepchecks. She is a former member of the Israel Defense Forces' "Talpiot" program for technological leadership, as well as of the famed intelligence unit, Unit 8200. Shir is a passionate and experienced data scientist, skilled in machine learning and algorithmic research.

transcript

Shir Chorev: Probably one of the most prominent differences when you incorporate gen AI is that suddenly everything is much less deterministic. You do have things like temperature, but even when your temperature is zero, even if you theoretically want the output to be constant, or as constant as possible, your outputs will still vary quite a bit.

And then, of course, comes the question of, okay, if answer A was correct, and now I get a bit of variation, answer B with slightly different wording, is B correct or not? So even the basic thing of understanding "is this a good answer, or has the quality of my app changed?" is something that you usually really care about, but it's much harder now to pin down. So I would say that is one thing that changes.

And also infra-wise, of course, there are changes. Usually it means you're using a third-party API, or maybe even, let's say, a self-hosted open source model, but suddenly you have another important component that is a big part of your pipeline. So obviously, that does require quite a bit of adjustment on the infra side.

Jessica Kerr: Yeah, determinism is why I got into software in the first place because I love that it was deterministic and now that's out the window.

Shir: Yeah.

Austin Parker: So for people that don't know, when you said temperature, can you talk through that? Not everyone in our audience is going to be completely familiar with gen AI concepts, so maybe define temperature.

Shir: For sure. So let's say, just for example, you use the OpenAI API with a certain model such as GPT-4 or GPT-3.5 Turbo. Then you have a set of parameters you can choose. Specifically, the temperature parameter is a number between zero and one, and it defines how creative you want the outputs to be, or how non-creative, decisive, or specific.

So usually, if I wanted something more generative, let's say I'm writing a story, I would probably go for a high temperature. And if I wanted to answer a very specific question out of a very specific context I gave it, I would probably go for temperature zero, because I really want it to be as accurate as possible and not take the liberty of adding additional data.

Of course, that doesn't promise that it won't, but it is one of the parameters that are relevant when you're experimenting and trying to check which version you want to deploy, or how good the app you're building is.
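For readers who want to see where temperature actually lives, here is a minimal sketch assuming the OpenAI Python SDK; the model name and prompts are illustrative, not from the episode.

```python
# A minimal sketch of the temperature parameter, assuming the OpenAI Python SDK.
# The model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user", "content": question}],
        temperature=temperature,  # 0 = as deterministic as possible, higher = more creative
    )
    return response.choices[0].message.content

# Even at temperature 0, repeated calls can still vary slightly.
print(ask("Summarize retrieval augmented generation in one sentence.", temperature=0))
print(ask("Write the opening line of a story.", temperature=1))
```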

Austin: That process of tuning that and going through and trying to figure it out-- because it's not just temperature, right? There are a lot of parameters you can pass to a model as part of a prompt, as part of a question, as part of a generation. And tuning those parameters and actually understanding, not just in a single interaction but in aggregate across many, many, many prompts, what the effect of these parameters is, is super, super tricky to do.

Shir: For sure.

There are lots of moving parts in general when building software. And in gen AI, there's also the big moving part of the model itself.

I think maybe it's worth noting a difference here: when we talk about LLM evaluation, there are two types of evaluation. One aspect is the different parameters of the model itself that you're using, whether it's again things like temperature, for example, or which specific model, like if you're using OpenAI, right, there are lots of different versions of models, or which LLM provider.

So there are lots of different aspects within which LLM you're using. And, of course, there are also all of the aspects of your pipeline. Let's say you're building a pipeline for a RAG use case. Maybe I'll say, RAG is retrieval augmented generation. It basically means you're incorporating some knowledge base, like an internal company knowledge base, within the pipeline, and then you're giving the LLM the option to use that knowledge as context to answer the questions.

So let's say that's your pipeline. Of course, you have lots of different aspects there as well, right? So it's not only the model but also the knowledge base. It's also the prompt. It's all of the aspects in your pipeline.

Jessica: So in pipeline here, you're using it to mean like the path from having a question to getting the LLM to answer it?

Shir: Yes, exactly. It's everything from, let's say, the input, whether from a user or something within your internal flow, to the final output of the LLM. There may also be some post-processing. So it's basically the end-to-end aspect of it.
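As a rough illustration of the kind of end-to-end pipeline being described, here is a minimal RAG sketch. The retrieval step is a hypothetical stand-in for whatever knowledge-base lookup an application actually uses, and the OpenAI SDK is just one possible model provider.

```python
# A minimal sketch of an end-to-end RAG pipeline: retrieve context, prompt the
# LLM, post-process. retrieve() is a hypothetical stand-in for a real lookup.
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, k: int = 3) -> list[str]:
    # Hypothetical retrieval step: in a real pipeline this would be, e.g.,
    # a vector-store similarity search over internal company docs.
    return ["<relevant doc chunk 1>", "<relevant doc chunk 2>"]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()  # post-processing is part of the pipeline too
```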

Jessica: Right. So you mentioned there's two kinds of evaluation. There's evaluating the model itself, which there's like benchmarks and stuff for that, right? And then there's evaluating the performance of your pipeline, which is like each request. So in observability, we sometimes talk about requests and we make a trace for each request, and we usually mean like requests from somewhere outside to our backend service, which calls in through other services.

So when we incorporate generative AI, there are components in the service of that request that form a pipeline, including data retrieval and then asking the model. And sometimes do you go to multiple models, or back to the model? Is that part of the pipeline?

Shir: Right, so the pipeline is really use case dependent. You may, for example, have a pipeline for, let's say, a customer support chatbot. So maybe the first step will be intent detection, which may be either an LLM component or some other extraction, or more classic NLP. So that will be one phase: what does the customer need?

The second phase is, okay, let's say I identify they're interested in a change of booking, or in getting some information, or in talking with a person, and so forth. It's likely that each of these use cases will now have additional steps, whether it's LLM calls or specific data that would be queried, and so forth.

So yeah, when I talk about end-to-end, it's like all of the steps and these steps may incorporate different types of models, different types of prompts and so forth.
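A rough sketch of that multi-step shape might look like the following. The intent labels and downstream handlers are invented for illustration; the intent-detection step could just as easily be classic NLP or a small classifier instead of an LLM call.

```python
# A sketch of a multi-step chatbot pipeline: intent detection first, then a
# per-intent branch. Intent labels and handlers are illustrative.
from openai import OpenAI

client = OpenAI()
INTENTS = ["change_booking", "get_information", "talk_to_human"]

def detect_intent(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify the intent of this message as one of {INTENTS}. "
                       f"Reply with the label only.\n\nMessage: {message}",
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def change_booking_flow(message: str) -> str:
    # Hypothetical downstream step that would query booking data.
    return "Starting the booking-change flow..."

def handle(message: str) -> str:
    intent = detect_intent(message)
    if intent == "change_booking":
        return change_booking_flow(message)
    if intent == "get_information":
        return answer(message)  # e.g. the RAG pipeline sketched earlier
    return "Connecting you to a person..."
```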

Jess: Just when we think that maybe the LLMs are super complicated but magical, we add all these complications on top, which makes observability really important. But, Shir, tell us who you are and how you got into this business.

Shir: Sure. So hi. And yeah, happy to be here and thanks for having me. So I'm Shir Chorev, the CTO and co-founder of Deepchecks.

So at Deepchecks we do continuous validation of AI based applications. So what that means is helping to evaluate, test, and understand the quality and behavior of your application from the research and development phase through to when it's deployed to production.

So that includes both the decision that it's good enough to deploy, like during CI/CD, and also production monitoring once it's already deployed. The way I got to founding Deepchecks is from my experience as a data scientist and leading data science research. Together with my co-founder, Philip, I saw the huge potential and value that machine learning and AI bring to real-life use cases, suddenly having the ability to basically automatically adapt the algorithm's or the software's behavior to reality.

So on one hand, that's great power, and we all see today that it meets us everywhere and really makes applications so much more relevant to our use cases. But on the other hand, there's also this big challenge of how do I make sure that it works properly, and that it works properly over time, and how do I know which version is better, or how do I know that when I fixed a specific problem, I didn't add new problems, and so forth.

That was when we were working on our own data science projects and searching for tools that would help us solve those challenges. We felt there was something big here, both an interesting challenge and one that we'd love to solve, to help AI-based applications work better and be adopted faster in a wide variety of use cases.

Austin: Cool. Yeah, I think it's so interesting right now. I've been working in technology for most of my life in one way or another, and what's super interesting about LLMs and AI work right now in general is that over the past year and a half, maybe, we've seen not necessarily linear development, but sort of exponential growth in interest in the field, and also in terms of new approaches, right?

Like we've gone through changes in the size of models, the complexity of models, how we can train them, and how people train them, right? There's a friend of mine on Bluesky who wrote a really interesting classifier for screenshots, so that you could basically filter: hey, is this a screenshot from Twitter or is it from Facebook or whatever? And she was able to do all of this just using the cloud, using a SaaS, on her own time in a couple of weekends or something like that. I don't think we would've been able to do that two, three years ago.

Jessica: Yeah. Does your friend have like an AI data science background?

Austin: No, not really. She works in media and she's a developer there, but it's really cool how much these tools have progressed so rapidly. But conversely, it's so much new stuff, right? What would your advice be to people that are seeing all of the hype, as it were, and asking themselves, what's the first thing I should do to try to understand this, or to try to incorporate this into what I'm doing? You know, maybe in a DevOps role or on some sort of infrastructure side.

Shir: I think you touched upon a great point, which we also see a lot, which is that--

The ability to go from an idea to a working prototype, and even a working initial app, has drastically improved. It's just amazing that in a few hours you can build something that works. So I think one piece of advice from that is just try things. Today you don't even have to search for a tutorial, right? You can ask one of the LLMs, how do I do X, Y, Z?

Jessica: That's true.

Shir: And then just copy-paste. So you're like the agent, right? You're talking with it, and then you're like, oh, but this doesn't work, okay, and so forth. So it's really, really easy to experiment.

And I think the second phase, which I can say we're excited about, is also an opportunity, because the barrier to entry is really low. So you can get a sense of what's possible and what's not, or what you think may be possible. And then quite quickly you also start facing the challenges, because, okay, it usually works, but is this good enough or not? Or when it doesn't work, what happens?

In those areas, I would suggest thinking a bit about how important it is for it to work and how it should work, and all of those aspects. And the second part is to not jump from, okay, I have something working, so cool, let's now go and change everything based only on these kinds of experiments, but really to approach it step by step.

So understand what the potential is and what is important. And practically, one thing that we really believe in, and I think it can help both for understanding the capabilities and for evaluating them, is to really think about building a good demo. Not, okay, I have this and here are three examples, but showing a wide variety of use cases. Like, I have these 80 samples, and on these 60 it works great, and on these 20 it's really bad. Now, what can or should we do with that? This is more about taking it, let's say, a bit more to scale.

Jessica: That's so different from traditional, deterministic software, because there I want to write a test, and then I want to run the test before every deployment, and I want to run the test in production, and I expect the results to be the same all the time. And here, one test can't cut it, because even that one test is going to have unpredictable results.

And your point is that your test suite is not a bunch of individual deterministic paths. It's a whole bunch of things where it will succeed on some and always fail on some, and which ones can switch around. And it all comes back to the giant question of how do we know it works?
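One hedged sketch of what that looks like in practice: score a whole sample set and gate on the aggregate pass rate instead of asserting on individual outputs. The sample format, the per-sample check, and the 85% threshold are illustrative assumptions, not a prescribed method.

```python
# A minimal sketch of treating evaluation as an aggregate over a sample set
# rather than a handful of deterministic tests. The sample format, the
# per-sample check, and the 85% threshold are all illustrative.
import json
from typing import Callable

def is_good(sample: dict, output: str) -> bool:
    # Hypothetical per-sample check; in practice this might be a set of
    # properties or a judge model rather than a required phrase.
    return sample["required_phrase"].lower() in output.lower()

def evaluate(pipeline: Callable[[str], str], samples_path: str, threshold: float = 0.85) -> None:
    with open(samples_path) as f:
        samples = json.load(f)  # e.g. [{"question": ..., "required_phrase": ...}, ...]
    passed = sum(is_good(s, pipeline(s["question"])) for s in samples)
    rate = passed / len(samples)
    print(f"{passed}/{len(samples)} samples passed ({rate:.0%})")
    # Gate CI on the aggregate rate, not on any single sample.
    assert rate >= threshold, f"pass rate {rate:.0%} fell below {threshold:.0%}"
```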

Shir: Yeah, so when you know the answer to that question, please do let me know. I can suggest some ways to have a look at it, and yeah, for sure, it's really different. I'll say that one thing we saw initially, working with customers, is that many times there was a software group running ahead: they already have this app, and let's go.

And then just a bit later, suddenly they're like, okay, when does it work? When doesn't it? What are the metrics? How do we check it? Okay, maybe we should consult with other data scientists, not only the software team. And I think the more mature organizations are now finding their way through how do we define what we evaluate, how do we evaluate it, and how do we check consistency over time?

My two cents about how we know whether it's good or not, and what we check: in general at Deepchecks, we split it into two. We split it into, is it good? And good means, is it correct, is it complete? Is it, let's say, in the format or tone or sentiment you wanted? So does it adhere to all of the things that you wanted? And when I say all the things, it is really use case dependent.

So you do have to think: what are the relevant aspects, how do I weight them, how do I check them? And then make sure that you have ticked all the things that you need. So this is one aspect, and it takes some experimentation, building these metrics, we call them properties, that check all of those areas.

And the other part we call, is it not problematic, or is it not bad? These are aspects such as safety, or what happens in potentially adversarial use cases. What am I afraid of, right? For example, let's say someone asks my customer support chatbot, how do I do X, Y, Z with my product?

I wouldn't want it to offer the competitor's features, right? This is just an example. For many, that's not something relevant or troubling, and it sounds okay. I mean, it's giving an honest answer and it's good and everything, but, well, not for my company or not under my policy. So it's all really use case dependent, but it's possible to do.
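To make the split concrete, here is a small sketch of one "is it good" property and one "is it not bad" property. The specific checks and the competitor-list idea are illustrative; real properties would be chosen and weighted per use case.

```python
# A rough sketch of splitting evaluation into "is it good" and "is it not bad"
# properties. The individual checks are illustrative and use-case dependent.
import re

def correctness(answer: str, expected_facts: list[str]) -> float:
    """'Is it good': fraction of expected facts mentioned in the answer."""
    return sum(f.lower() in answer.lower() for f in expected_facts) / len(expected_facts)

def mentions_competitor(answer: str, competitors: list[str]) -> bool:
    """'Is it not bad': flag answers that recommend a competitor."""
    return any(re.search(rf"\b{re.escape(c)}\b", answer, re.IGNORECASE) for c in competitors)

def evaluate_sample(answer: str, expected_facts: list[str], competitors: list[str]) -> dict:
    return {
        "correctness": correctness(answer, expected_facts),
        "competitor_mention": mentions_competitor(answer, competitors),
    }
```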

Jessica: The is it not bad part, yeah, this is where you ask, what's the worst that can happen? It reminds me of security because developers are used to making software do what you want it to do. And so we tend to test all the happy paths and the ones we expect and the ones that should get it to do what we want it to do.

And we're not as used to checking what else could it possibly do under adversarial or just unexpected circumstances. And it sounds like with generative AI in the loop, we always need to think more about that, the what else could people do with it?

Austin: One interesting thing that you said is this idea that it's a two part equation, right? Is it correct, and then is it right, for lack of a better word. And I find, when I talk to a lot of people, especially people that have been in tech for a while, that it's a very similar journey to the one you have to go through when you start thinking about modern distributed systems or cloud native systems, right?

Because in a distributed system, it's not just always working, right? What is right? It's not a binary up or down, the server is on or off. There are so many different quasi-right states that the system can be in. And what actually winds up mattering isn't "hey, is this thing on or off?" What matters is "Are requests being completed in under however many milliseconds? Are people happy? Am I violating some contract I have with my users?"

And so it's actually a very similar problem space, I think, in terms of how you conceptualize right and wrong, or up or down, or good or bad, or whatever, between gen AI and distributed systems in general.

Shir: But I think one prominent difference, or one prominent addition, if I compare to, let's say, observability or evaluation in other areas, is that here we have the aspect of the content. So for example, if the output is toxic, that is maybe a safety-related problem with the actual output.

And there's also the aspect of whether the latency is too high, or whether it's accessing a DB that it's not supposed to, say, give an external user information from. And this is more of a classic kind of problem.

So what I find in gen AI applications is that it's a bit of a fuzzier area around what the desired behavior is, in both content and technicalities.

And you also see it in the space: there are various AI security startups doing things which are a bit more classic cyber, like checking for data leakage, so that there wasn't any internal data in the output, or checking, again, who are the users that have access.

So this is more the classic side, the way I look at it. And then there is also the area of how the content itself is formed.

Austin: Interesting.

Jessica: Yeah. So what do people do in practice with all this fuzziness? How do people set up continuous integration to be like, "yeah, I can deploy this change, I adjusted my prompt, and it's fine or it's better or it's worse."

Shir: I would say it's a bit like classic machine learning in the earlier days. I don't recommend it, but some organizations just hope for the best, which means do something initial, do bits of manual testing, which would usually mean limiting it to a very small subset, because many times labeling a specific sample can take, you know, a few minutes.

So if you want to check hundreds, and check them across versions and so forth, that would sometimes be quite a barrier to being able to quickly iterate and improve your application. So we do see this more experimental phase of building something, checking it out, and then finding the problems.

Jessica: So this is the "works on my box" strategy.

Shir: Right? Though I think here the "works on my box" is the technical aspect, right? And here it's also about the content it gives out, but yeah.

Jessica: But like for everything I typed in, it worked fine.

Shir: Exactly. Exactly. So all the questions I tried were great. And then I think as adoption of gen AI matures, and as organizations realize the different, maybe more elaborate, challenges that they face, there are quite a few strategies. The target, I'd say, is to make it automatic or semi-automatic, to be able to really evaluate and find problems at scale.

So whether it's defining the specific metrics, and then how do we judge these metrics, and how do we pinpoint and investigate different problems. There are, let's say, initial approaches such as GPT as a judge, for example. So let's just ask an LLM if it's right or not, right? This is already better than not doing anything.

Jessica: It sounds expensive.

Shir: Exactly. You are now essentially multiplying the number of calls: for every call you have, you have another call with all of the data. So that is one challenge. And also, it's not necessarily that accurate, but I would say for sure it's a good way to start.
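A minimal sketch of the GPT-as-a-judge idea, assuming the OpenAI Python SDK; the grading rubric and PASS/FAIL convention are illustrative. Note how it doubles the call volume, which is exactly the cost concern raised above.

```python
# A minimal sketch of "GPT as a judge": ask a second model whether an answer
# is acceptable. The rubric prompt is illustrative.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> bool:
    # Every production call now has a companion judge call carrying all the data.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "You are grading a customer-support answer.\n"
                f"Question: {question}\nAnswer: {answer}\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
        temperature=0,
    ).choices[0].message.content.strip()
    return verdict.upper().startswith("PASS")
```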

And then there are quite a lot of more elaborate approaches, which is one of the challenges we deal with daily. We have some cool ideas for how to really find these problems without using as many resources, by pinpointing and customizing what you care about, then checking for those things and raising a flag when there's a problem.

Jessica: Like raising a flag of hey, the rate of correct looking answers has dropped and then people go in and look at them.

Shir: Yeah, so maybe I'll give a concrete example. I mentioned the RAG use case before, where we have some information retrieved. Let's say I'll continue with the example where I'm a chatbot helping some user understand how to do something with my product.

Okay, so they're asking, I don't know, which API should we call in order to get the daily query limits? What would happen in the LLM pipeline, as we defined it before, is that probably some of the information from the company's docs or internal documents will be given to the LLM.

The LLM will process it, along with a specific prompt, and then provide some output, and it would probably output a specific function. So, for example, we have a property that we call grounded in context, which checks whether the output is actually based on those documents that were given to the LLM in the pipeline.

So for example, that really easily helps detect hallucinations of, let's say, now the LLM just made up a new function name, right? Oh, the API is-

Jessica: Oh, so you're detecting hallucinations, that's cool.

Shir: Right. And in this example, it would probably come back to improving the prompt, or changing the information that's fed in, which is the retrieval of the documents themselves. So that would be one type of flag: you have low grounded in context values, check out these samples and have a look at what the problem is.

Jessica: To evaluate whether an answer is grounded in context, do you use an LLM for that?

Shir: So you can use an LLM, that is one way to approach it. Specifically in Deepchecks, both for considerations of scale and cost, and also for performance, we use a GPU-based model, but anything with fewer than 7 billion parameters we don't call an LLM but rather, let's say, a BERT-based or transformer-like smaller model. So we do it without an LLM.

Jessica: Nice. A smaller specialized model.

Shir: Exactly.

Jessica: So it's still an AI, you're not just grepping through: okay, were these words in the answer also in the input?

Shir: Right, there are some things you can do like that. Maybe I'll give an example of a simple property that is relevant. You can imagine quite a few cases where you wouldn't want the output to be "I am ChatGPT, trained by OpenAI, and therefore I can't answer," et cetera, et cetera.

So in this case, you can have a property just checking whether you have ChatGPT or OpenAI in the output, and that would be a super simple property. You, of course, don't need a model for that, and it would also find some potential problems.
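That kind of property really is a few lines of string matching; here is a sketch. The third phrase in the list is an illustrative addition beyond the two mentioned above.

```python
# The simple string-based property described above: no model needed, just a
# check for phrases you never want leaking into user-facing output.
FORBIDDEN_PHRASES = ["ChatGPT", "OpenAI", "as an AI language model"]

def leaks_model_identity(output: str) -> bool:
    """Return True if the output mentions any phrase we don't want users to see."""
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in FORBIDDEN_PHRASES)

print(leaks_model_identity("I am ChatGPT, trained by OpenAI, so I can't answer that."))  # True
print(leaks_model_identity("You can change your booking from the account page."))        # False
```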

So not all of the properties or metrics you want to check for are necessarily GPU based. Grounded in context specifically, yeah, it does require some specialty, and it is one of the things we work on quite a lot to get good performance for.

Jessica: So in evaluating whether our LLM output is good enough, whether it's working, we're way beyond checking the equality of things in input and output in our unit tests. We have all these different aspects to check. At a conference a few weeks ago, Josh Way, who is a data science VP, was talking about how in their LLM-based application they created an entire evaluation system, with all the concepts that were required to evaluate whether their LLM was performing.

And yeah, when I just developed an app and I'm like, look at this cool thing that I made work, I don't want to do that.

Shir: Right, for sure it is. I would say, well, we talked before a bit about how you start and what you look at. Initially, you probably look at a few samples and try to find a few problems, or try to see if it works well, and do it manually usually, but that really doesn't scale over time.

So the next step is either building a full-grown, like homegrown but full-grown, system for evaluating different types of use cases, right? Because the problems differ if you have a summarization task versus a Q and A task, or a generation task like, okay, create a story or something like that.

So that is an option, and you could easily build a whole talk around whether you do that in-house or adopt an existing tool for LLM evaluation. I think that's both on the content aspect and also on, let's say, the resources aspect.

And there's also the part that Jess mentioned before about looking at the traces and at the different parts and really making sure that they function properly. Does it even give a result? How much does it cost? All of those aspects. So that is also something to take into account.

Austin: So on this train of thought, one thing I've noticed when talking to people, especially people that have been in the industry for a while and are often very skeptical about AI and gen AI, is a perception that a lot of the burst in creativity around LLMs, gen AI, things like that, is a factor of the popularity of the chat model, right?

And we've actually talked, you know, if you listen back to this conversation, we are going back to chat a lot, right? Like you have a chatbot or you're doing RAG against your knowledge base or whatever.

Jessica: It's the obvious use case.

Austin: Right. But when you sit down and think about it, the chat model is actually such a departure from how we've traditionally used ML models in the industry, right?

What I'm really interested in is, as we pull large language models, as we pull transformers and all this stuff into application modalities that aren't chat-based, for lack of a better word, that aren't just assistants, what do you see as the difference, the big thing someone should look out for? For example, if I have something that's doing sort of a supervisory role around data quality, right?

Maybe let's put it in an observability context. I have a bunch of telemetry data coming out of a system, out of a Kubernetes cluster, and I want to have an LLM evaluate that and score it, and then give me a report of, hey, here's stuff that's new, or here's stuff that doesn't look right, or here's stuff that doesn't meet what I have prompted it with. For example, tell me if you see things that don't appear in this set, right?

Does that change sort of the evaluation criteria? Does that change how you want to think about doing these kind of area evaluations? Or is it very similar, you're just kind of doing it in a different modality?

Shir: Okay, so you touched on a few great topics. I think I'll address each of them separately. Talking about chat and going back to this use case, I think it's not that all use cases are chatbots, but rather that one of the big changes that LLMs brought is making things much more accessible for everyone.

For me here at Deepchecks, it's much easier to explain what we do, even to people who before didn't know, okay, what is machine learning? But also in practical aspects, many times we're talking about a chatbot, but under the hood, it is actually summarizing financial data and giving you bullet points.

Okay, so maybe I interact with it in a chat manner. And that is something that I think is really amazing, that now basically everyone can get value, whether it's from internal organization data or from how to do different stuff. And it's in a chat because that's our programming language as human beings, not only as programmers.

So I think the chat is a bit more the external part of it. When we go into the actual internals, which may or may not be LLM based, I will split it into two. I think one thing is that we will continue seeing non-LLM-based, let's say, machine learning. Sorry for using such an old word. But for example, let's say I want to analyze logs.

If the logs are very, very diverse, different and sparse, in that case I would assume that an LLM can give added value, just because it has some context of the world, right? So it's more likely that it will understand something it hasn't seen before.

However, if I want to, let's say, classify logs into different types, then both resource-wise and predictability-wise, and for many reasons, I wouldn't rush to use an LLM. I would really stay with more classical approaches, because many times they're deterministic and they just work, and there's really no reason to use an LLM everywhere. So I think that is a bit of the balance.

Jessica: By classical approaches, do you mean like a smaller model?

Shir: So it can be a smaller model, it can be a regex-based extraction, it can be a machine learning model, like a tree-based model or gradient boosting. We could go through different use cases, but maybe I'll just give the example of a pricing model. If I want to recommend a price for a specific product, I don't necessarily need a big model that understands human language; it's much more relevant for it to understand the exact parameters.

Who is the person? What are they looking for? What have they bought in the past? You have these maybe 50 parameters, and it will take them and plug them into a mathematical formula.

Jessica: Might be a rules engine.

Shir: It can be rules and it can be a machine learning model, it just doesn't necessarily need the whole context of being trained or pre-trained on all of the world's data. Maybe it can help, but in many cases, not necessarily.
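For contrast with the LLM sketches above, here is what that classical route can look like: a gradient boosting model over a few dozen structured parameters. The features and synthetic data are invented purely for illustration.

```python
# A small sketch of the classical alternative: gradient boosting over ~50
# structured features instead of an LLM. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical tabular features: who the customer is, what they're looking
# for, what they've bought in the past, and so on.
X = rng.normal(size=(1_000, 50))
y = 20 + 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=2, size=1_000)  # synthetic price

model = GradientBoostingRegressor().fit(X, y)
print(model.predict(X[:3]))  # deterministic, cheap, and easy to reason about
```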

Austin: Interesting. One follow-up I'd ask with that, then: to your point, the advantage of the LLM is that it has this much larger training set and is capable of doing these surprising inferences, or displaying understanding of more open-ended questions, right?

I really feel like as models get smaller, to the point you can run them locally much more easily, isn't there sort of an argument that someone who isn't a classically trained data scientist or ML person would be able to take one of those models and build similar kinds of applications by leveraging the LLM's ability to interpret what they're asking for, without having to know, oh, go write this regular expression or go do Bayesian statistics or whatever?

Shir: So I think it really depends on what your use case is. If I'm a person wanting to easily build something that would classify my pictures into pictures with family, or pictures of my cute dog, which I don't yet have, or anything like that.

Jessica: You could have it add the dog to your pictures.

Shir: Yeah, yeah. So yes, in that case, maybe it's easier for me to use potentially overkill technology in order to just get something working. But if I'm a company that has lots of data and scale, and I really care about being as efficient as I can, whether it's cost-wise. I mean, LLMs aren't that expensive, but they're also not cheap.

Especially when we get to scale, and we care about determinism and being able to predict what happens, then in those cases... And as a data scientist historically, and also today, of course, I don't want data science as a profession to disappear, so I have to defend it. No, just kidding. I think it's here to stay for some use cases.

Austin: You actually hit something really cool there, which is this idea of like using LLMs to do rapid prototyping of other machine learning tasks.

Jessica: Right. Start with the giant one, see if it's useful, and then optimize to the most appropriate model size. Or even algorithms like a bunch of RegEx.

Shir: Exactly. And then you can also use the LLM to help you, you know, write the more specific solution.

Austin: Yeah. But I think that's a really great takeaway for people, right? It feels like AI, capital A, capital I, is everywhere. But really, an LLM is in a lot of ways built on machine learning principles and concepts that we've been working on, that the industry has been working on, for decades.

And it's more advanced in many ways, yes, but just because you have the big powerful thing over here, it doesn't mean you can only use that. That can be your starting point to prove out the concept or whatever and then make a smaller version, make a more focused version, you know, do the right thing once you've used the LLM to kind of say like, oh, okay, yeah we can, like this is actually possible.

Jessica: I could talk about this all day, but I want to be sure to get back to something you said, Shir, about how, as people are taking their prototypes to production and maybe moving from the more expensive model to a cheaper one, they need to evaluate the application as a whole with that change. You said that one thing they could do is use an existing tool for LLM evaluation, for "is it working?" Tell us about some of those existing tools.

Shir: Generally speaking, I think I am very objective, so when people consult with me here, it's probably a bit challenging not to elaborate mainly about Deepchecks, though there is quite a variety of approaches.

I can say that at Deepchecks, as I said, we're building a product for continuous validation of AI, and a big part of it is LLM evaluation. We mainly work on, say, two things. One is building all of these metrics, some of which we mentioned, whether it's grounded in context, whether it's understanding the relevance, the correctness, is the summary good, as one example, or is it safe?

So all of these metrics enable a company using us to really customize and understand which metrics are appropriate and which aren't, and to improve and compare between versions and so forth. So a big part is all these metrics. And another big part is how do you actually use it?

Okay, I have an application or I'm building one, I have this prototype or this idea that is in progress, and now, how do I go about it? In bigger organizations, one of the challenges is that there are different areas: let's say there's product, there are analysts, there's data curation, there are software engineers, and there are the data scientists.

And each of them has their own special knowledge or domain knowledge: okay, they build the model, they help in building the evaluation metric, they understand what is a good answer and what's not. And really putting all of these together is a big challenge too.

How do I... Let's say now I have some type of problem; I need to basically send it over to labeling to see if I fixed it, and so forth. So this is the other aspect of LLM evaluation, which we deal with.

Jessica: As a concrete example of that, at Honeycomb on the DevRel team, we built an app that works with an LLM, and we call out to the Deepchecks API to get Deepchecks to evaluate the results of the production behavior. And then we can go to Deepchecks and we can look at those results.

But also, we wrote a webhook that Deepchecks calls, and we get those metrics, those property evaluation results, into our observability system. So right in our traces in Honeycomb, we can see and aggregate over the evaluation results about two minutes after it completes, which is pretty fast in my opinion. So Deepchecks is a platform that you can use to do the evaluation, and it does integrate.
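As a loose sketch of that integration shape: a small webhook handler that takes evaluation results and forwards them as events to Honeycomb. The payload fields shown for Deepchecks are hypothetical placeholders rather than the actual webhook schema, and attaching results to existing traces would additionally require trace and span IDs in the payload; the forwarding side uses Honeycomb's public Events API.

```python
# A rough sketch of a webhook that receives evaluation results and forwards
# them to Honeycomb. The incoming payload shape is a hypothetical placeholder.
import os
import requests
from flask import Flask, request

app = Flask(__name__)
HONEYCOMB_EVENTS_URL = "https://api.honeycomb.io/1/events/llm-evals"  # dataset name is illustrative

@app.post("/deepchecks-webhook")
def receive_evaluation():
    payload = request.get_json()  # assumed shape: {"sample_id": ..., "properties": {...}}
    event = {"sample_id": payload.get("sample_id")}
    for name, value in payload.get("properties", {}).items():
        event[f"eval.{name}"] = value  # e.g. eval.grounded_in_context
    requests.post(
        HONEYCOMB_EVENTS_URL,
        headers={"X-Honeycomb-Team": os.environ["HONEYCOMB_API_KEY"]},
        json=event,
        timeout=5,
    )
    return "", 204
```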

And, like any powerful LLM tooling, it's a great place to start. You can always change your mind and build it in-house later. Please, please, please don't build it in-house preemptively. That's a statement for all of DevOps.

Shir: Right? I would say that the typical flow is that people will try building it in-house and then at some point like, okay, okay, maybe that's not our main focus and let's use a tool.

Austin: This is too much.

Jessica: Yeah, there's a lot of knowledge in like what properties to start evaluating and you have to iterate that too just as you iterate on your software.

Shir: Exactly.

Jessica: Is there anything that you want to leave our listeners with?

Shir: One of the main things I think I did mention is that--

In general, I believe if you have an idea and you want to try something, just go for it: experiment, take it in small chunks, do it in whichever way works for you. I really think one of the most amazing things about living in our time is that everything is accessible.

So I would just encourage everyone not to sit and think about how, or whether they should, but to just try. And also, personally, I'm always happy for people to reach out with questions or consultations, so they can find me on LinkedIn, feel free to reach out.

Jessica: On LinkedIn. Okay, we'll put that in the show notes. Also, deepchecks.com

Austin: Definitely get out there and try it.

Jessica: And thank you. Thank you for this insight.

Shir: Thank you very much.