Generationship
26 MIN

Ep. #19, Auditability Matters with Stefan Krawczyk of DAGWorks

about the episode

In episode 19 of Generationship, Rachel Chalmers is joined by Stefan Krawczyk, co-founder and CEO of DAGWorks. They dive into Hamilton and Burr, shedding light on how they revolutionize data workflows and AI code auditability. From the genesis of these frameworks to their practical applications, Stefan shares insights on how they help data practitioners reason across complex functions and ensure transparency and accountability. Discover how Hamilton's graph structure simplifies auditability and why Burr's white-box approach is gaining favor.

Stefan Krawczyk is the co-founder and CEO of DAGWorks, a startup that empowers data practitioners with a flexible framework based on Hamilton, an open-source project he co-created. With over a decade of experience in building and leading data and ML systems, Stefan is a Y Combinator and StartX alum, and holds a Master of Science in Computer Science from Stanford University. Passionate about bridging the gap between data science, machine learning, engineering, and business, he is also an open-source contributor and advisor to startups in the data space.

transcript

Rachel Chalmers: Today, I am delighted to welcome Stefan Krawczyk on the show. With over 10 years of experience in building and leading data and ML-related systems and teams, Stefan is the co-founder and CEO of DAGWorks, a startup that helps data and AI practitioners move faster with two open source production-ready and flexible frameworks, Hamilton and Burr, that he co-created.

He's also a Y Combinator alum, a StartX alum, and a Stanford graduate with a Master of Science in Computer Science with distinction in research. Stefan's passion is to make others more successful with data by bridging the gap between data science, machine learning, engineering, and business. He built the self-service MLOps stack at Stitch Fix for 100 or more data scientists that helped generate over a billion dollars in revenue.

He also contributes to open source projects and advises other startups in the data space. Stefan, I hope you get some rest in between all of that.

Stefan Krawczyk: Thanks for having me, Rachel. Pleasure to be on the show.

Rachel: At the core of your Hamilton framework is something called a directed acyclic graph, or DAG. Can you unpack that for us? What's Hamilton actually doing under the hood?

Stefan: Yes, so directed acyclic graph, which is a computer science term, for those who don't know, effectively means a flow chart that doesn't have loops or cycles, and so it's something that you can express visually, and it's a common way to express, therefore, how processes and things are connected.

You generally hear, with those two terms, what's called nodes and edges, so nodes being round circles and edges being the things that kind of connect them, right? And that's what you call a graph, and it's directed and acyclic, meaning that there aren't any cycles or loops.

This is used everywhere, actually, from compilers to things like Terraform. Terraform, under the hood, actually creates a directed acyclic graph to know how to execute and apply things.

Rachel: If this, then that?

Stefan: Yeah, I mean, or it's more how does it know how to connect, what to do, and execute it? DAGs are also found in things like Airflow for more data-related kind of things.

And so Hamilton's kind of no different here. The way that it works to create a directed acyclic graph, this kind of flow chart, is just by interpreting how you write regular Python functions. So getting a little bit into the detail here, I assume people know Python code, but effectively, you write Python code, and the idea is you write functions.

With these functions, the name kind of becomes the name of a node, so if you're thinking of this flow chart, the function name becomes like a node in it. So you write code, you write these functions, and Hamilton will crawl a module to pull these out.

Now, how does it create the edges? Rather than you having to specify how functions connect, which is common in other frameworks, you actually just use the same names as the function input arguments. That way you can have a function named, you know, raw data set, and you can then have a parameter that says "file location" as an input parameter, right?

And then you can have another function called file location that might, you know, get it for you, or it's required as input. And so the way that Hamilton stitches together this graph is just by the names of the functions and the function input arguments.

And so that's how it figures out how to link the graph together, but then it also figures out how to compute something, so if you want to then compute the raw data set, it will know, hey, I need to find this function.

Then I need to go find what it depends on by reading the function input arguments, and then we'll go try to find, you know, in your code base or, like, what the modules you've given it, like, how do I compute, you know, the rest of the things that are required to, you know, get the raw data set. Unpacked enough?
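
To make that naming convention concrete, here is a minimal sketch, assuming the open-source Hamilton driver API; the module name, file path, and column logic below are illustrative, so check the Hamilton docs for the exact API in your version.

```python
# my_functions.py: each function becomes a node in the DAG; its parameter
# names point at the nodes (other functions or supplied inputs) it depends on.
import pandas as pd

def raw_data_set(file_location: str) -> pd.DataFrame:
    """A node named 'raw_data_set' that depends on the input 'file_location'."""
    return pd.read_csv(file_location)

def cleaned_data_set(raw_data_set: pd.DataFrame) -> pd.DataFrame:
    """Depends on 'raw_data_set' purely because the argument shares its name."""
    return raw_data_set.dropna()

# run.py: Hamilton crawls the module, stitches the graph from the names,
# and walks backwards from whatever output you ask it to compute.
from hamilton import driver
import my_functions

dr = driver.Builder().with_modules(my_functions).build()
results = dr.execute(["cleaned_data_set"], inputs={"file_location": "data.csv"})
# results holds the requested outputs, computed in dependency order.
```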

Rachel: Pretty well unpacked. Lots of obvious applications to the way we're using AI these days and assembling metafunctionality, being able to reason across groups of different functions and abstract yourself at a higher level.

Stefan: Yep, yeah, and this is really lightweight. It just runs anywhere that Python runs, so if you think of the kind of graph or DAG abstraction, like, it's very applicable to AI, machine learning, and data.

I have heard of people even using it to orchestrate, you know, a little command-line interface, so "make" is another way that some people have described Hamilton; you can kind of do the things a makefile traditionally could. It's declarative; it declares how you can actually build a DAG, right? With Hamilton, you can kind of do something very similar.

Rachel: We have to talk about names like Hamilton and Burr, obvious historical references, but, you know, I thought you were an American before I met you, and I was like, "Oh, he just doesn't know what DAG means."

Did you even hesitate about the company name, given what Australians and New Zealanders think of DAGs?

Stefan: So I want to say I didn't actually understand, because I grew up in the city, in Wellington, though I could see sheep outside my window. You know, I've been in the U.S. since 2007, in which case I definitely forgot all the colloquialisms surrounding that.

And so for those of you who don't know what DAG is, go Google it, and you can find one of the disambiguations, and you'll kind of understand, but-

Rachel: Also used as a descriptor for a person who is, how should we say, not fashionable. Un-chic, perhaps.

Stefan: That must be an Australian colloquialism.

Rachel: Maybe it is. Maybe it is. Australia and New Zealand are two countries divided by a common language, and this is an example.

Stefan: Yeah, so naming. So, you know, I went to see the Hamilton musical, actually.

Rachel: Yes.

Stefan: So learned about the American history, right? I mean, I want to say, the two hardest problems in computer science are naming things and cache invalidation, right?

Rachel: And off-by-one errors.

Stefan: Yeah, off-by-one, yeah, and so I was building, so I was on the platform team at Stitch Fix, and I was trying to help a team with one of the issues that they were tripping up over with their code base, and I was like, you know, tasked with kind of helping them come up with the solution.

I was kind of on a work-from-home Wednesday, since it was a place with no-meeting days, and I was able to think long enough and, you know, come up with this kind of abstraction, but what was I going to call it?

So the team who I was building it for was called the Forecasting Estimation and Demand Team, or the FED for short, and I'm like, "This is foundational." Who founded the Fed in the U.S.? Oh, it was Alexander Hamilton.

Rachel: The $10 founding father without a father.

Stefan: Yeah, and then they were modeling the business, so in physics, there are Hamiltonians, and we were just doing graph theory, which is from computer science, which has Hamiltonian circuits, so, like, the confluence of those three things. I was like, "Great. It's a perfect name."

And then there are, actually, other relevant, more modern Hamilton names, so Lewis Hamilton, the Formula One driver. The idea, you know, with Hamilton is that if you structure code well, and it's laps and iterations that win the race, then Hamilton kind of-

Rachel: And he's going to be back on the podium next year with his new team.

Stefan: Yep, and then, also, at NASA, there was Margaret Hamilton as well. She was all about software correctness, and with Hamilton, you have a great testing story, so I want to say those weren't the original reasons.

And then, Burr is a little bit tongue-in-cheek, so for those who don't know, Burr actually shot Hamilton in a duel. But with my co-founder, who was at Stitch Fix, we were kind of creating Hamilton at the same time, and he was like, "No, I don't like this," or, "at least I have another idea," and he tongue-in-cheek called it Burr.

The irony is that, at Stitch Fix, Hamilton won out over his, but then the Burr framework we open sourced is a somewhat tongue-in-cheek reference to that time back in the day when we were creating two different frameworks. And in this case, we actually see the new Burr framework as pretty complementary, rather than antagonistic, hopefully, with the current Hamilton one, in which case we see them living a nice, you know, life together rather than one killing the other.

Rachel: It's a fan fiction where they live happily ever after. I love it.

Stefan: Yeah.

Rachel: What are some of the cool projects that people are using Hamilton for?

Stefan: Sure, so the Olympics are happening this year, and the British cycling team, so British Cycling, is actually using Hamilton to help with, to my understanding, some of the velodrome telemetry analysis.

Rachel: Oh, wow.

Stefan: So they can actually process and understand and have a better idea of, like, how they can make a better rider or how they can make someone faster, and it's by analyzing the telemetry. Hamilton helps kind of, you know, maintain sanity in that code base by making things testable and documentation friendly, and obviously, you can always understand how data connects with some sort of result 'cause you can draw that graph picture.

More commonly, people are using Hamilton for feature engineering, so there are various companies, you know, large and small, up to enterprises, that are using Hamilton, say, on top of PySpark to help with feature engineering.

So I know consultants are bringing Hamilton in to some banks to help with that machine learning and AI type of stuff. Some are even including it in more of a machine learning platform, since Hamilton gives you a bunch of great hooks.

From a research perspective, Hamilton's also used in a joint project by Pacific Northwest National Laboratory and Oak Ridge National Laboratory where, if you're trying to predict the weather, you can give it data, and that will help better predict the weather.

So for cities, the topology of the city actually matters, and so they built, on top of Hamilton, a little pipeline where, given what are called shape files, so files describing what shape the city kind of looks like, it'll process that into the format that you can then plug into this kind of weather modeling system.

Rachel: Yeah, if people haven't been to San Francisco, we have a different climate every couple of blocks. It's kind of wild, and it's all dictated by the shape of the peninsula.

Stefan: Yeah, and then, lastly, what's more recently topical is retrieval augmented generation systems, so people are starting to build, you know, pipelines. We have one open source project called Wren AI, for example, that's using Hamilton to help structure and bring order to some of what they're pushing out in open source for, you know, people who are trying to build RAG systems.

Rachel: Yeah, and that's where my mind automatically went, just being immersed in AI 24/7, it seems like.

People have been talking a ton about observability and explainability for these black-box AIs, but Hamilton also supports auditability, which I love. Can you talk about this and why it's important?

Stefan: So, auditability is basically the ability to kind of make sense of what you've observed and to track it at a fine level of detail.

So with Hamilton, the idea is that because you can draw this graph, this directed acyclic graph, which comes from your code, then as long as you track along with it the inputs, the outputs, you know, the code versions, you have this great audit mechanism, this thing that you can tangibly go back in time to and understand, okay, given this data and this run, what did it produce?

And so with other frameworks and other things, there is a lot of effort required to, you know, instrument observability and auditability for this purpose, but with Hamilton, because of this natural graph structure and understanding how things link, you kind of get it for free.
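
As an illustration of the kind of record he's describing (this is not Hamilton's built-in tracking, just a hand-rolled sketch of an audit log keyed by run, node, code version, inputs, and outputs):

```python
import hashlib
import inspect
import time

def run_with_audit(fn, inputs: dict, run_id: str, audit_log: list):
    """Run one node of a graph and record what produced what, for which run."""
    code_version = hashlib.sha256(inspect.getsource(fn).encode()).hexdigest()[:12]
    output = fn(**inputs)
    audit_log.append({
        "run_id": run_id,
        "node": fn.__name__,
        "code_version": code_version,
        "inputs": {name: repr(value)[:80] for name, value in inputs.items()},
        "output": repr(output)[:80],
        "timestamp": time.time(),
    })
    return output

# "Going back in time" is then just filtering audit_log by run_id and node.
```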

In terms of auditability, like, why it's important, right? I mean, I think, you know, Congress was thinking about laws for compliance and governing AI, like, being able to really detail and understand how some data came in, went to a model, and made a decision, right?

If you don't understand how they connect, you're going to be stuck trying to audit and verify that, like, hey, this credit decision or whatever this AI decision made, like, what data did it come from? How did it actually make this decision?

Auditability also lends itself to good debugging, right? When something happens or something fails, root cause analysis is much easier to do if you can really understand how different pieces of code are connected and how the different pieces of data were generated, right?

And so then this can help with reproducibility and debuggability, and, therefore, it can help with transparency and accountability, so if you then tack on extra metadata, it's like, who made this change? Where did it go? When was the last time this data set was loaded?

Okay, when it was loaded, what actually generated it? Then this can help, you know, speed things up and give the stakeholders in the business more trust that the data and machine learning and AI teams are actually, you know, delivering, and then, if something's wrong, we can quickly go back and analyze and give, you know, a great response to leadership: "Yeah, we understand what went wrong and why."

Rachel: Yeah, even more important, I would imagine, in a multimodal world where your quants are ingesting results from a whole bunch of different models. At least auditability lets you know which model in particular is the one that's hallucinating.

Stefan: Yeah, I mean, if you want that, you basically have to figure out some way to track it. Like, my philosophy and thinking is, with the framework, a lot of these kinds of insights come for free without you having to have every developer or every person writing code instrument things in a very particular way.

Actually, this is where the trouble with auditability is: if people have different implementations of how they audit things, or, like, they log observability for some information but not for others, you actually end up with this kind of mishmash of information that doesn't really help you too much, because people did things very differently.

And I think this is a coming problem that's going to become more important because if you want to really trust and understand what some sort of AI system is doing, then you're really going to have to have the observability and auditability to understand, exactly to your point, which model induced this hallucination.

Rachel: Back to Burr. You recently got such a lovely comment about your Burr library that I wanted to make sure we talked about it.

A Reddit user said, "A good library should adapt to you and make your velocity go up. They have an actual concept instead of just wrapping stuff that doesn't need to be wrapped that will save you time because they do the thing they do better than you ever could. Honestly, take a look at Burr. Thank me later."

What's the concept underlying Burr that's earning it such great reviews?

Stefan: I think it's a few things. So one, it's a graph abstraction, so, to use the technical term, with a DAG there are no cycles, but in Burr, you can actually have cycles and loops.

So this means you can do conditional branching, and it just so happens that this way of modeling the world helps you think about problems; it just happens to be great for thinking about how agents or bots can kind of be designed and operate, right?

The second is it has a built-in checkpoint and caching mechanism that comes with an open source UI, so from a development perspective, you can get up, you can build this graph, you can immediately see it, and then, when it's running, you can actually see what it's doing, right?

And so, as you're developing and trying to figure out, okay, why did this agent or LLM or thing do its thing, like, that is really easy to see and introspect, and then it's very easy to recreate that kind of point in time to speed up that development process.
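
A rough sketch of the idea, not the actual Burr API (the action names and stopping condition here are made up), showing why allowing cycles matters: an agent can loop between actions until a conditional transition says to stop.

```python
def generate_answer(state: dict) -> dict:
    """Node that produces (or re-produces) a draft answer."""
    state["answer"] = f"draft attempt {state['attempts']}"
    state["attempts"] += 1
    return state

def check_answer(state: dict) -> dict:
    """Node that decides whether the draft is good enough."""
    state["good_enough"] = state["attempts"] >= 3  # stand-in for a real check
    return state

# Edges, including a cycle: check_answer can send you back to generate_answer.
TRANSITIONS = {
    "generate_answer": lambda s: "check_answer",
    "check_answer": lambda s: "done" if s["good_enough"] else "generate_answer",
}
ACTIONS = {"generate_answer": generate_answer, "check_answer": check_answer}

state, node = {"attempts": 0}, "generate_answer"
while node != "done":
    state = ACTIONS[node](state)     # run the current action
    node = TRANSITIONS[node](state)  # conditional branch picks the next node
print(state["answer"])
```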

Then, lastly, I think, which is what this comment is probably getting more at is that--

We built Burr to be a white-box framework, and what I mean by that is, like, we try to make it as transparent as possible for you to see what's going on.

So we aren't trying to bundle different concerns that will make it difficult for you to customize for your context because agents and other stuff, everything's still new. Like, everyone's, like, going to have to customize for their particular concerns because I know PDFs are very different, and you need to customize how you pull out data from them or your LLM's going to be different from someone else's, right?

And so we built it so that if you have these production or very custom concerns, the framework isn't going to get in your way. And this is where, I want to say, Burr takes a very functional type of approach, whereas object-oriented approaches generally, I think, get in your way, and that's kind of what these other frameworks run into.

So we want Burr to be, you know, the best framework to kind of help you run and organize your code and provide the hooks for observability, so that you can focus on your logic and customization and, you know, get to the core value of what you're trying to achieve, rather than trying to fight the framework to get it to do what you want.

Rachel: It's such a beautiful example of, you know, I have this philosophy that code is a reflection of its creator's personality and culture, and you're clearly both very straightforward and down-to-earth, and the tools that you've built are all about propagating that transparency and that groundedness.

Stefan: Yep, you know, after six years on a platform team, you definitely see all the problems and the tools, yeah.

A lot of tools, for example, focus on the "Hello, world" being super easy, but then, oh, actually, enterprise and real use gets complex. Like, what do you do? What do you manage? So I've definitely seen the extremes on both ends of that scale.

With both Hamilton and Burr, we're trying to build something for a platform team, but also for a practitioner, in that the operational and production concerns are the ones that generally slow you down, get in the way, and couple you to technical debt, couple you to systems. So we're really striving to make it easy, if you need to customize, to plug in the observability or the hooks that you need.

You don't have to do too much because the framework already provides a lot for you, and then, with everything being lightweight and having very lightweight dependencies, it shouldn't be difficult to manage, change, or update things, since the framework, you know, doesn't bring in the kitchen sink of stuff that you don't need.

Rachel: And we're at this really interesting inflection point where so many companies are at the scale that used to be the sole province of the FAANG companies, and the FAANG companies themselves have gone off into the stratosphere even further.

But we now have this upcoming generation of platform engineers who need to be able to reason about systems at this very sophisticated level with many, many variables, and it really does feel like a state change, like a significant step change from being DevOps or being a Sysadmin.

Stefan: Yeah, I mean, learning a bit of MLOps, I think, is something that would be useful. MLOps, I want to say, if you drew a Venn diagram, would cover everything that kind of gen AI or AIOps covers, because in machine learning, you kind of have the same problems.

Obviously, things are maybe at slightly different scales, but effectively, learning how to deal with something that's probabilistic and generative is one of the main hurdles for platform folks in general to kind of understand, since running the same prompt and data through, you know, OpenAI doesn't necessarily mean you're going to get the same response back.

Rachel: And it's so hard for you people from STEM backgrounds. You're, like, struggling with nondeterministic systems, and us humanities majors are going, "Yeah, yeah, that's what the world is like, my friend."

Stefan: Yep, yep. Yeah, so heavy emphasis on, like, how do you evaluate? Like, the interesting part for me is the coupling between guardrails, so making sure the agent or app or whatever you're building is doing the thing you programmed it to do, and then also kind of evaluating it as well.

Since, you know, did it do the thing that you said it would do? How do you evaluate, you know, something that's somewhat nondeterministic? Since you can't generally just do equals equals, you know, and kind of assert equality on things, there's a little bit of fuzziness that people have to deal with.
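
One common shape for that kind of fuzzy check, as a sketch: call_my_app is a hypothetical stand-in for whatever app you're testing, and difflib stands in for embedding similarity or an LLM-as-judge, but the structure of the test is the same.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap text similarity in [0, 1]; swap in embeddings or a judge model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def call_my_app(question: str) -> str:
    """Hypothetical stand-in for the LLM-backed app under test."""
    return "Refunds are available within 30 days of your purchase."

def test_refund_policy_answer():
    expected = "Refunds are available within 30 days of purchase."
    actual = call_my_app("What is your refund policy?")
    # Assert "close enough" instead of exact equality on nondeterministic output.
    assert similarity(actual, expected) > 0.7
```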

And so learning what the right software development lifecycle is, I think, is something that matters: how do you incorporate, you know, building those guardrails and evaluations in a way that, if you do want to make a change, you're going to be confident that you haven't, like, incredibly broken your app right before you ship it, right?

And so, along those lines, I think this kind of platform ops mentality is really a bit of the zeitgeist: figuring out how you build that into your software development lifecycle in a way that's also not expensive.

Since you don't want to, like, ping, you know, OpenAI with a million different messages just to release a product, right? And so, from what I've seen, yeah, the pendulum has been swinging back and forth as to, you know, how much do you do in CI versus how much do you do after deployment versus how much do you unit test and spot check and stuff like that?

Rachel: We're going to have to rebuild all of this, and as an investor, I couldn't be more excited about it.

Stefan: Totally. I mean, lots of opportunities.

Rachel: Yes, yeah. One of the guiding principles of this podcast is that AI should amplify or augment software engineers, not replace them and should democratize access.

As you work to make people more successful with data, is this what you're seeing?

Stefan: Yeah, I mean, I want to say from personal experience, you know, Copilot has been useful, particularly in the case of something that I would potentially spend a few hours trawling Stack Overflow for, like, hey, how do I do something? Copilot can generally cut that down a lot.

As these tools get better, I think people are finding better contexts where they can generate, because that's what these tools are really good at: generating content given some context.

You know, I hear some companies, like Anthropic internally, for example, have something that helps, you know, people create better commit messages, 'cause it can not only say, you know, what happened, but also try to guess as to, like, why you did this change, right?

Which is very important for a good commit workflow and commit hygiene: not only what you did, but, you know, why you did it, right? And so that's cutting people's time down.

Like, I'm excited or interested in the unit test case, like, given some code, can you help me generate some tests, right? And so sometimes it's a little hit-and-miss right now, but as these things improve, right?

Then it'll cut down the time that it requires so you can leverage and, you know, do more of the interesting work as an engineer versus, you know, more of the stuff that we find a little, you know, dreary and kind of repetitive.

So, for example, creating documentation, so for an open source developer tool, it's like I'd love to, you know, just be able to run it over the class and the spec, and then, boom, it could give me the example and the documentation, right?

Rachel: And have the documentation evolve as the code does instead of getting orphaned when the code changes.

Stefan: Yeah, or when a pull request comes in, it goes, "Uh-uh, you know, these things don't match anymore. You guys need to fix your docs," right? I mean, so I'm excited for dev tools to continue to go and evolve.

A lot of things that I think from that perspective are just going to help people be more productive, so pull request reviews and other things, making sure you've checked off the checklist, things like that, so definitely will make small teams of developers more efficient.

Rachel: Stefan, what are some of your favorite sources for learning about AI?

Stefan: Yeah, so there's a guy called Sebastian Raschka. He has a great newsletter, and he also has... He publishes on many mediums.

He's also got a YouTube channel, I think, so I like to follow his newsletter 'cause he does a very good job. So my background is kind of machine learning, but I want to say the content is still pretty accessible.

He basically takes research, distills it, and tries to just give you a small overview of, yeah, how models and things are working and what some of the techniques are that are happening, since he reads the research papers and tries to, you know, summarize them.

Otherwise, yeah, Andrej Karpathy, so Andrej Karpathy has a great YouTube. If you really want to know things from first principles, he really has a great series of, like, building an LLM from scratch.

Rachel: He's amazing, yeah.

Stefan: And so if you really want to understand something from a basic level and, like, you know, 'cause some of this is, you know, it's easier to understand if you can track the evolution of what was the initial thing to what it is today, and that's kind of what he walks through, so definitely plus one for Andrej's YouTube.

And then, from a practical standpoint of things, like how do you actually build apps and get it to production, I have some former Stitch Fix colleagues Bryan Bischof and Jason Liu, who tweet on X, formerly known as Twitter.

And so they definitely have some, maybe sometimes snarky, comments, but they generally put out posts and things that are around more of the practical, pragmatic side of, like, well, this is how we build apps, no, you don't need anything too complex, this is kind of what we did.

So they're useful from, like, if you're building it, and you want to, like, understand more from a pragmatist's kind of perspective how to do stuff, they're worth a follow.

Rachel: You got to get them on Mastodon. Fediverse is the future. Speaking of, if everything goes exactly the way that you would like it to for the next five years, what does the future look like?

Stefan: Yeah.

So to me, composable AI is hopefully something that's achievable, and we can then trust the systems and what they're going to output.

So for me, you know, composable AI means, like, we have AIs, and they're going to build on top of each other to do things. So if you think of, you know, I want to create a list of potential people to reach out to, I might have an agent that scrapes the web.

I then might take what they have and then enrich it by figuring out or trying to, you know, collect and augment things, and so you're going to have these little agents or AI things, and so to be able to use them, you need to be able to trust them.

And so this is where I think the introspection, observability, and the importance of auditability, to really understand how these things are operating and working, will then give you the confidence that, like, hey, if it gives you an e-mail, and this is the bio, that, like, they haven't screwed up, right?

And so the only way to do that is if you can then peel back the onion, so to speak, have a system that then you can kind of, yeah, see how things connected, where things kind of either, you know, went wrong or what decisions it made.

And so, for me, hopefully, this is, you know, built on the back of Hamilton and Burr, obviously. A little bias here. You know, Hamilton for the data-related work 'cause I think, you know, you need to ingest data somehow. A DAG is a great representation of it. Hamilton can help standardize.

And then, you know, providing the introspection, observability, and auditability, right? You then have the base building blocks to, like, you know, build that stack.

Similarly for Burr, since, you know, you can describe agents, you can build applications. Both of them, as we said, are, I want to say, complementary, in which case, you know, whoever's building these things, we should have, like, a management type of interface where we have these agents, these assistants.

They're doing some work for you, and then you can have thumbs up, thumbs down, or approve things, or say, "No, that's wrong," and it'll go back, or you can even just rewind, and go, "Oh, actually, sorry, I didn't give you enough of a specific command. Let me fix that, and then have you restart again, right?"

And so that's the kind of thing, in my mind, in five years. Like, Apple was already, you know, demonstrating things on its computers to do this, in which case, I feel like everyone's going to be having something like this, and for me, you know, hopefully, off the back of Hamilton and Burr.

But then, the one personal thing I'd love is, you know, an e-mail assistant, so something to help, you know, easily filter and then autodraft things that I can then just go tweak, change, and update. That would be lovely.

Rachel: Select All Archive is your friend. Last question, my favorite question: if you had a Generationship, if we were all on a beautiful colony ship on our way to Alpha Centauri, what would you call it?

Stefan: Yeah, good question. The way I like to name things is to kind of think of what does that spark in my mind, right?

And so I was thinking of words like, you know, foundation, core, I guess, beginning. In my mind, those words all, you know, come together, so I'm thinking of something called, like, you know, Seminal 1 or something, because if the point is to go forth and populate and spread and be the first of its kind, that probably, you know, sounds spaceship-ish to me as well, so Seminal 1.

Rachel: It's a great one. Stefan, even though, as a New Zealander, you are my natural enemy, it's been delightful to have you on the show. Thank you so much.

Stefan: Much appreciated, Rachel. Hold nothing against Australians. Happy to speak again.

Rachel: Thank you.