Digging Deeper into Building on LLMs, AI Coding Assistants, and Observability
- Andrew Park
Are Prompting and ChatGPT Programming the Future?
If you follow the headlines, AI coding assistants such as GitHub Copilot and other tools based on large language models (LLMs), built on foundation models such as OpenAI’s GPT and open-source frameworks such as LangChain, seem like the future of software development. But will prompting and generated code replace the day-to-day business of writing net-new code for software engineers? Observability startup Honeycomb’s article on what it’s like to build on top of LLMs suggests that a futuristic utopia of computers doing all our software development for us may take a while to get here.
We sat down with Honeycomb’s product expert Phillip Carter to discuss what he observed while actually building on top of an LLM for a live software product.
What It’s Like to Actually Build Software with LLMs
Heavybit: There’s a lot of hype around modern AI tools such as GitHub Copilot, and how they’re potentially poised to change the way software developers work forever. But in your article, you covered a specific, real-world use case incorporating them into actual software development.
Phillip Carter: Yes–by way of summary, the Honeycomb product has many different features, but almost all of them hinge on the core action of querying with our query interface. It's a really powerful querying tool, and I think we’ve achieved product-market fit with forward-thinking SREs and Platform Engineers–i.e., people who are looking for powerful tooling. But when you build something for an audience who knows observability tools and knows how to squeeze as much as possible from them, you might do so at the expense of brand-new users who haven’t used such tooling before, and may struggle with working in a querying interface.
Since using queries well is an important part of our product, we wanted to figure out how to get more people querying, and doing it effectively. We set out to build our Query Assistant feature, which lets users query in natural language, to drive more product usage, particularly among newer users. And now that we’re on the other side of this project, we are seeing increased activation and engagement among users of the Query Assistant feature, but as I covered in my article, getting there wasn’t easy.
HB: What was helpful about your experience was how you identified the realistic shortcomings of modern AI models for developers. One thing that seems to make following the progress of AI for developers so challenging is that seemingly every few days, someone reports some kind of game-changing breakthrough that mitigates one or more of AI’s most notable drawbacks (such as the need for massive compute resources, the need for human feedback to reinforce learning, limitations on context window size vs. the need to engineer custom prompts, data privacy, or others). Does it seem we’re just a few algorithm tweaks away from a new world, or does it seem like we’ll be struggling with some of these challenges for some time?
PC: Actually, I do think we're going to be seeing those limitations for some time. That said, one area where I do think we're going to see quite a bit of progress is an issue I brought up in the article: chaining.
I think a lot of these chaining tasks will get much more accurate over time. However, I don't think their latency will get significantly better unless we see big jumps in model latency. And so, that will naturally constrain some of the AI use cases that require speed, which brings you back to a lot of really hard tradeoffs with prompt engineering (which is where we were). One way to consider the problem, in programming terms, is compile time versus runtime. A compile time concern, when you're programming, may take a little longer, but can benefit you a whole lot–so that's fine. But you don't want a runtime issue to take a long time because your app has to run fast.
Chaining LLM calls was unacceptable for our use case because the latency was just so enormous–the whole process becomes so slow that you might as well not even use it. That might improve. (Or not). For example, we saw that over a relatively short time, GPT-3.5 went from being somewhat slow to speeding up a whole lot. About the same amount of time has elapsed for GPT-4, but we haven’t gotten an equivalent speed boost. There are a lot of resource constraints that are preventing that. And who can even say what GPT-5 will be like?
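To make that latency tradeoff concrete, here is a minimal, purely illustrative Python sketch. The call_llm() helper is a stub standing in for whatever model client you use; the point is that each step in a chain blocks on the previous one, so per-call latencies add up rather than overlap.

```python
import time

def call_llm(prompt: str) -> str:
    """Stub for a model call; imagine each one taking a couple of seconds."""
    time.sleep(2)  # simulated model latency
    return f"response to: {prompt[:40]}"

def chained(question: str) -> str:
    # Three dependent calls: plan, execute, refine. Latency is roughly 3x one call.
    plan = call_llm(f"Plan the steps to answer: {question}")
    draft = call_llm(f"Execute this plan: {plan}")
    return call_llm(f"Refine this draft: {draft}")

def single_shot(question: str) -> str:
    # One carefully prompt-engineered call: roughly 1x one call's latency.
    return call_llm(f"Answer directly, with all context in one prompt: {question}")

start = time.time()
chained("why did p99 latency spike?")
print(f"chained: {time.time() - start:.1f}s")      # ~6s with this toy latency

start = time.time()
single_shot("why did p99 latency spike?")
print(f"single call: {time.time() - start:.1f}s")  # ~2s
```

For an interactive feature where users expect an answer within a couple of seconds, that multiplier matters, which is why the team ended up back at the hard tradeoffs of prompt engineering a single call.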
So I think there are going to be certain classes of features and products that can bypass some of these problems by having a long-running agent, or agent-like thing, that is accurate enough at constraining its inputs and outputs at each consecutive step. Then you arrive at a high enough probability that whatever output comes out is going to be meaningful for the task at hand.
I have not seen anything that indicates the same will be true for our runtime constraints. When somebody interacts with most modern software products, they expect a result within a couple of seconds. We’ve continued to see such issues even with leading foundation models. For example, Anthropic’s Claude model–which is amazing, by the way, great tech–shipped a 100K-token context window, which is incredible. There seem to be a million different techniques for managing prompts, instructions, and context to get models to do the task you want. In the future, there may be an AI methodology that can auto-select one of those techniques for you, but it's not going to get rid of the underlying problem: these models still hallucinate values, mistakenly flag facts as incorrect, or misinterpret things.
Here’s a specific example that is still reproducible: We have an internal query that checks token usage and cost–we have a way to count token usage and cost in a specific setup within our own schema. So the natural language query for that is "OpenAI cost and token usage." If I type that into the 100K model and pass in the whole schema every single time, it just fails. However, if I use a different NLP model to select a more relevant subset of the schema and only pass that in, it always gets it correct. What I take away from this is that there are inherent limitations in this tech. That doesn't mean you're totally out of luck if you're trying to do certain tasks, but it's not going to be a magic box just because the context window is now a million tokens.
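For illustration only, here is a rough sketch of the "select a relevant subset of the schema first" idea Carter describes. This is not Honeycomb's implementation: the column names are invented, and a simple string-similarity score stands in for whatever smaller NLP model actually does the selection.

```python
from difflib import SequenceMatcher

# Imagine hundreds of columns; only a handful matter for any given question.
FULL_SCHEMA = [
    "openai.tokens_prompt", "openai.tokens_completion", "openai.cost_usd",
    "http.status_code", "db.query_ms", "service.name",
]

def relevance(column: str, question: str) -> float:
    # Stand-in for an embedding or small NLP model that scores relevance.
    return SequenceMatcher(None, column.lower(), question.lower()).ratio()

def schema_subset(question: str, top_k: int = 3) -> list[str]:
    ranked = sorted(FULL_SCHEMA, key=lambda col: relevance(col, question), reverse=True)
    return ranked[:top_k]

question = "OpenAI cost and token usage"
columns = schema_subset(question)
# Only this short, relevant column list is sent to the large-context model,
# instead of the entire schema.
prompt = f"Columns available: {', '.join(columns)}\nWrite a query for: {question}"
print(prompt)
```

The effect is what Carter describes: the big model sees a short, relevant schema rather than everything, which in his experience was the difference between the query always failing and always succeeding.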
HB: OK–so if there were one limitation of AI or LLMs for software development you could immediately “fix” about the way they work for observability, what would it be?
PC: From my perspective, it's a tough choice between latency and reasoning capabilities. We have achieved good accuracy by constraining our use case, which also allowed for good latency. However, when I tried to expand its capabilities by suggesting hypothetical queries for inputs that don't make sense or providing instructions on how to make a query work, it would sometimes veer off in strange directions. This was also the case with GPT-4, but it was even slower, and perhaps my prompt engineering wasn't as effective (which highlights the importance of prompt engineering as a real challenge that cannot simply be fixed with a magic wand).
We currently have tasks in our product that GPT-3.5 is unable to handle effectively, but for which GPT-4 might be more suitable. These tasks involve reasoning about and understanding specific data streams, particularly concerning our service-level objectives (SLO) feature, which is significant in the monitoring space. The SLO feature focuses on defining the expected behavior of a service and establishing thresholds for notifications at different stages. The rate at which those thresholds are exceeded is referred to as the burn rate. Determining the appropriate SLO doesn’t have a single straightforward answer; it lies on a “spectrum of correctness.” However, we have observed that GPT-4 struggles to generate reasonable SLO recommendations, even though we have experts at Honeycomb who can quickly assess their validity. While prompt engineering could be a factor, the overall reasoning capabilities of the model are still lacking. Ultimately, what I would like to see is improved reasoning capabilities in LLMs.
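For readers new to SLOs: a burn rate compares how quickly a service is consuming its error budget with the rate its SLO allows. The snippet below is a generic, simplified illustration with made-up numbers, not Honeycomb's formula.

```python
def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """slo_target of 0.999 means 99.9% of requests should succeed."""
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio

# A 99.9% SLO allows 0.1% errors; observing 1% errors burns budget 10x too fast.
print(burn_rate(observed_error_ratio=0.01, slo_target=0.999))  # -> 10.0
```

Picking the target and the alerting thresholds is the judgment call Carter places on a “spectrum of correctness,” and it is exactly the kind of reasoning he found the models still struggle with.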
Can LLMs Write Code? Is There an AI That Can Code Yet?
HB: How do you (and your customers) feel about the brave new world of AI coding assistants such as GitHub Copilot, Amazon CodeWhisperer, HuggingFace StarCoder, and others? Will the day-to-day job of developers shift from primarily writing net-new code to prompting AIs for code, then editing and proofing the code that AI assistants spit out?
PC: Even though my title is “product manager,” I'm also a software engineer and actively write code daily. I've been using GitHub Copilot since the early private preview, and it has been fascinating to witness its significant improvement over time. From a software development perspective, there are several other tools (outside of ChatGPT) tailored specifically for developers that are currently available. One or two of them will likely become popular choices among developers.
However, for most people, Copilot is here to stay. When it comes to code editing, it performs quite well, especially for the vast amount of uninteresting and mundane code that developers often write. Copilot excels at handling coding tasks such as unit testing, which many teams struggle to find time for since it often involves repetitive and monotonous work that doesn't directly solve business problems. Copilot seems to be a time-saving tool in these cases, allowing developers to focus on the few lines of code that differ rather than the entire block. I don’t see other AI tools replacing the human element that Copilot focuses on.
I believe that Copilot will become an essential tool in every developer's toolkit. However, there is a growing need for improved developer tooling and workflows, particularly in the areas of code review and understanding. The current standard of line-by-line code review doesn't make sense when so much of the code is generated by machines. It becomes inefficient to expect someone else or even oneself to deeply understand every line of code. This approach hinders productivity and impacts the business. While speeding up one aspect of the development lifecycle, it slows down another. So it's unclear if this trade-off is ultimately beneficial.
We need better tools for comprehending code. Ultimately, understanding the purpose and functionality of the code is crucial. The code serves a specific purpose, and it's up to the developer to bridge the gap between the text they see and the actual business problems it solves. Unfortunately, we lack effective tools for achieving this unless every step of the code's purpose is exhaustively documented.
The other angle is on the runtime side, actually being able to diagnose a failure once the code is live. AI will write code that fails all the time, just like humans do. How do you know what it’s actually doing? I think a lot of software developers in our industry have not built up the muscle to diagnose systems even in the code they write themselves, and if we’re in a world where that code writing is a lot faster and the review process is expected to also be a lot faster–being able to understand what’s happening will mean you need to use your brain differently.
A lot of developers will need to adapt and learn these new approaches. They will also require improved tools to support them in this process. While observability tools may provide some assistance, they won't be the complete solution. It's difficult to determine the exact right approach at this point. However, it represents a shift in the industry. Just like 15 years ago when continuous integration was introduced, there may have been initial resistance and confusion among some developers. But now it is considered a standard practice. I see this shift as a similar phenomenon. I realize this is kind of a long-winded answer, but I see it as a large but incremental shift that potentially unfolds as a slow burn. Then one day, we wake up and all of a sudden, our jobs are completely different.
AI + LLM Software Development in Observability and Infrastructure
HB: There could be big changes, but in our experience across 20+ years of advising startups (with several of our partners having founded startups themselves), it seems like anytime there is something that dramatically increases the efficiency of developers, it usually leads to there being more developers.
PC: Yes! Our aspirational roadmap is so much larger than we can get to, realistically. And if we can just get to something like 25% more of our current capacity, that's huge. So yes, I think I would agree with that.
HB: One final question–in this exciting future of AI coding assistants, where do you see the role of observability? Is it going to become even more important as more developers have more stuff running on top of infra that may, itself, have been coded up by robots?
PC: Absolutely. We’ve published an article about working with LLMs in production that delves into this very topic. I think the main thing to think about is: Are LLMs (and diffusion models as well, really) these non-deterministic black boxes that people are going to use in wild and wacky ways that no one could ever predict? Will there be situations where ops people look at what they’re working with and just get scared out of their minds? It’s like, “Wait, you’re putting this into production??”
And for a variety of reasons, there will be new failures, with users doing things you did not anticipate, and that's going to be a bug that you effectively created. You're going to ship bug fixes and improvements that end up breaking other things. As a result, your latency and a lot of other measures are going to be all over the place.
And honestly, such problems are not unique to LLMs–plenty of modern systems have similar issues as well. Except LLMs can add another order of magnitude of unpredictability into the equation. And so the same basic principles of observability may take on more significance with prompting: Do we know what the inputs are? Do we know what the outputs of the model are? And if we don't feed the output of the model directly to the user, verbatim, which I expect most businesses would not (and would do at least some parsing or validating), then how does that fail? What do the results actually look like? When users give you feedback on whether something was correct or not, what were they looking at, at that point?
All this stuff is instrumentable with frameworks like OpenTelemetry. You can capture that information, typically as traces in your application, or as logs. From there, you can start systematically analyzing what you have: you can try to classify certain types of outputs that come from certain classes of inputs and lead to particular kinds of errors.
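As a concrete (but hedged) illustration of that kind of instrumentation, the sketch below uses the OpenTelemetry Python API. The span and attribute names, the call_llm() stub, and the validate() helper are all hypothetical, and a real setup would also configure a tracer provider and exporter; the point is simply to record the question, the prompt, the raw model output, and whether app-side validation succeeded, so those fields can be queried later.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-feature")

def call_llm(prompt: str) -> str:
    """Stub for whichever model client is in use."""
    return '{"calculations": [{"op": "COUNT"}]}'

def validate(output: str) -> str:
    """Placeholder for app-side parsing/validation of the model output."""
    return output

def answer_question(question: str, prompt: str) -> str:
    with tracer.start_as_current_span("query_assistant.generate") as span:
        # Record what went in and what came out, so failures can be analyzed later.
        span.set_attribute("llm.user_question", question)
        span.set_attribute("llm.prompt_length", len(prompt))
        output = call_llm(prompt)
        span.set_attribute("llm.raw_output", output)
        try:
            result = validate(output)
            span.set_attribute("llm.output_valid", True)
            return result
        except ValueError as err:
            span.set_attribute("llm.output_valid", False)
            span.record_exception(err)
            raise
```

With attributes like these on every request, the classification Carter mentions (which classes of inputs produce which kinds of bad outputs) becomes a query over your traces rather than guesswork.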
But when we're dealing with the output of a model, does that mean we need to do better prompt engineering, or does that mean there's work we can do without touching the model at all–just correcting things downstream? Because we know the output might be mostly correct, and if we just do some minor tweaking or fine-tuning, we may end up with something fine as far as the users are concerned (so they don't have to know that the model screwed things up). Unless you have a way to systematically track things and see how they change over time, with the ability to measure what “success” means for your service, how do you know? You might do some prompt engineering that looked all good locally, so you deploy it–but how do you know it had the right impact?
Observability tools give you, first of all, the means of instrumenting that information so you can produce that data in the first place. And then in tools like Honeycomb and others, we give you the means to track your data in ways that are meaningful for what you're trying to go after. And then you can use your data in different stages, monitor what matters and reactively make changes when things go wrong, or decide to make proactive investments upstream to improve a fundamental LLM feature you have.
So every week, you look at the patterns and decide to solve one specific problem, and move on to the next, measuring the impact over time. You’ll end up with a cycle of really good feedback about how users are interacting with things. It’s a process we’ve been using for a little while now, and I think it’s legit. I think dev teams will prefer taking this type of approach because it’s real, and not guesswork.
For more in-depth discussion on what AI means for professional developers, join the DevGuild: Artificial Intelligence event.