
Ep. #32, Structuring Data with Marcel Kornacker
In episode 32 of Generationship, Rachel speaks with Marcel Kornacker, creator of Pixeltable and a pioneer in database technology. They discuss how AI engineers can streamline data management, the challenges of founding a startup, and the future of generative AI in software development. Marcel also shares insights from his career journey and how AI is reshaping the way we work with multimodal data.
Marcel Kornacker is CTO and Co-Founder of Pixeltable, a company revolutionizing AI data management. He previously co-created Apache Parquet and founded Apache Impala while at Cloudera. With a doctorate in computer science from UC Berkeley and experience at Google and multiple startups, Marcel brings deep expertise in database systems and AI infrastructure.
Transcript
Rachel Chalmers: Today I am thrilled to welcome Marcel Kornacker to the show. Marcel is the creator of Pixeltable, co-creator of Apache Parquet, and creator of Apache Impala, which he started when he joined Cloudera in 2011.
Before Cloudera, Marcel worked on database technology at Google and several startups. He has a doctorate in computer science from UC Berkeley. Marcel, thank you so much for coming on the show.
Marcel Kornacker: Rachel, thank you for having me.
Rachel: You've had an extraordinary career in and around relational databases. How has this deep data background shaped the way you think about AI?
Marcel: It's certainly shaped the way I think about AI, and it motivated me to start Pixeltable. I'm really an outsider in the AI world, in the sense that I didn't get interested in it until about 2022.
I was an EIR at a venture firm called Sutter Hill Ventures, and I was introduced to another EIR there who had a computer vision background and then ended up participating in a number of discovery calls with computer vision engineers and managers who ran computer vision teams.
Back then, LLMs were not very much talked about. ChatGPT didn't come out until later that year. Computer vision has a longer history in the deep learning and AI space, so there are more mature companies there.
So that was sort of the background to the computer vision focus there, and it was interesting to hear them describe how they work with data.
So what I heard was: we're working on creating products with computer vision, but a lot of the work involves curating datasets or dealing with data in some form.
My background is obviously entirely in data, so Pixeltable was born out of this observation of what engineers need to work with. It's really a reflection of that, and that's how I got there.
Rachel: So tell us what you're building at Pixeltable.
Marcel: Yes.
So Pixeltable is in many ways a database system, quote unquote, for AI. We don't position it as such because AI engineers aren't really looking for database systems. But what I mean by that is that Pixeltable unifies data storage and data lineage with execution, meaning orchestration, as it's called today, and eventually also model versioning.
So the idea is that you have Pixeltable as a tabular structure, and you can overlay onto it what is typically done today in the form of scripts: scripts that open files, do some transformations, and write them back.
You overlay that onto the table structure via computed columns, so that's one aspect. And Pixeltable will then handle incremental updates. You add more data, and it runs the computational DAG that you define incrementally.
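The computed-column idea described here can be sketched in plain Python. This is a toy model of the concept, not Pixeltable's actual API: a table remembers which derived column depends on which function, so inserting new rows only triggers computation for those new rows.

```python
# Toy sketch of a computed column with incremental updates.
# Illustrative only; Pixeltable's real API differs.

class Table:
    def __init__(self):
        self.rows = []          # base data
        self.computed = {}      # column name -> (fn, materialized results)

    def add_computed_column(self, name, fn):
        # compute the derived value once for all existing rows
        self.computed[name] = (fn, [fn(r) for r in self.rows])

    def insert(self, *new_rows):
        self.rows.extend(new_rows)
        # incremental update: only the newly inserted rows are processed
        for name, (fn, results) in self.computed.items():
            results.extend(fn(r) for r in new_rows)

    def column(self, name):
        return self.computed[name][1]

t = Table()
t.insert(1, 2, 3)
t.add_computed_column("square", lambda x: x * x)
t.insert(4)                   # only 4 is run through the function
print(t.column("square"))     # [1, 4, 9, 16]
```

The point of the sketch is the `insert` method: the "data plumbing" of keeping derived data in sync lives in the table, not in the application script.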
That's one aspect. Another one is it understands multimodal data. So it also stores your videos, images, audio, documents, or at least it stores external links to them and knows how to work with them.
So a lot of the typical data plumbing that you encounter when you're trying to work with these files and modalities falls away, in essence. And it is declarative: you can create an index on an image column, you give it the embedding model you want it to use, and it will then maintain the index for you.
You don't have to think about the mechanics of the data plumbing that goes on under the covers.
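The declarative-index idea can be sketched the same way: you declare which embedding function to use for a column, and the system keeps the index in sync as rows arrive. Everything here is a stand-in. The "embedding" is just letter frequencies, and none of this is Pixeltable's real API; it only shows the shape of the contract.

```python
# Toy sketch of a declaratively maintained embedding index.
# The embedding function is a trivial stand-in (letter counts).

import math

def embed(text):
    # stand-in embedding: frequency of each lowercase letter
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - 97] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class IndexedColumn:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.items = []   # (value, embedding) pairs, maintained on insert

    def insert(self, value):
        # index maintenance is automatic: the caller never touches vectors
        self.items.append((value, self.embed_fn(value)))

    def similarity_search(self, query, k=1):
        q = self.embed_fn(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [value for value, _ in ranked[:k]]

col = IndexedColumn(embed)
for doc in ["wildfire damage report", "quarterly earnings", "burned forest survey"]:
    col.insert(doc)
print(col.similarity_search("fire"))   # ['wildfire damage report']
```

The design point is that the user supplies only the embedding function at declaration time; how and when vectors are computed and stored is the system's business.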
Rachel: So I'm imagining that I'm building an app for say, home insurers looking at a region that's been devastated by wildfire.
And I as an app developer could maybe build something where you could dump in new maps as they become available and new stories and things like that and query against it.
Is that the right kind of mental model?
Marcel: You could do that, but it's not a document repository. It's structured in the sense that when you have a database system, you have tables, and the tables have structure. Or today, the very popular in-memory framework would be pandas.
Pandas also gives you a table and lets you do things with the table. So it's not so much NotebookLM; it's really more of: here's this structure, and now you could add maps, or let's say satellite images, as an image column.
And if you have a model that can assess fire damage, you could then simply add that model invocation as a computed column, utilizing the satellite images of the fire damage, as an example.
Rachel: That's a very helpful clarification. What advantages do you get from the declarative approach?
Marcel: Well, one is it's incremental. I add more data, and the system automatically knows how to run the computational DAG that you define, create the new data pieces, and store them for you.
So all of that data plumbing that would normally be involved goes away. That's one aspect. Another one is indexing, and indexing is part of that, right? We often talk about vector indices and so forth.
And so this is sort of baked in in the sense that you as a developer, you get to tell the system what embedding models you want to use, but you don't have a choice in terms of who provides the vector index.
That's just simply taken care of by the system, just like when you use Snowflake, you don't show up there with your own B-tree index, right? Or your own hash index.
It just gives you that and you don't think about it, right? The declarative approach means you don't think about what is involved in the maintenance of these access structures.
So I think this is a big part of it that it takes the cognitive load away from the developer to have to keep constantly thinking about the data mechanics under the covers of your application.
Rachel: New satellite photos come in, I feed them to the system, my AI can query against them.
Marcel: Basically, yeah. And satellite images are interesting because they're typically very large. If they're, say, 20K by 20K pixels, you're not going to run them through a single model, because no model can handle that.
So you would typically need some pre-transformation, a tiling, and Pixeltable would allow you to do this tiling very easily and then consume the tiles as if they were another table, like basically one row per tile.
You can now run a standard model against it, it could be object detection or whatever else, and you don't have to think about the mechanics of taking an image, breaking it down into small pieces, feeding the pieces into the model, storing the model output.
All of that goes away and is handled by Pixeltable. Basically all of the things that don't really have anything to do with your application logic are taken care of.
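The tiling step described above is easy to sketch. Each tile of a large image becomes one row that a standard model could consume. The image here is just a width and height; in practice you would crop actual pixel regions. This is an illustration of the transformation, not Pixeltable's API.

```python
# Sketch of tiling a large image into fixed-size tiles,
# one dict ("row") per tile.

def tile(width, height, tile_size):
    """Return one dict per tile covering a width x height image."""
    rows = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            rows.append({
                "left": left,
                "top": top,
                # edge tiles may be smaller than tile_size
                "width": min(tile_size, width - left),
                "height": min(tile_size, height - top),
            })
    return rows

# a 20,000 x 20,000 satellite image cut into 5,000-pixel tiles -> 16 rows
tiles = tile(20_000, 20_000, 5_000)
print(len(tiles))     # 16
print(tiles[0])       # {'left': 0, 'top': 0, 'width': 5000, 'height': 5000}
```

Downstream, an object-detection model would then be applied per row, exactly as if the tiles were ordinary table data.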
Rachel: My actuarial tables are going to be so up to date and accurate, it's going to be frightening.
Marcel: Yes.
Rachel: What are some other real world applications for Pixeltable? What are people actually using it for?
Marcel: So we have one design partner who's in the computer vision space. They're basically doing traffic surveillance, let's call it.
And so this is another typical multimodal application where you have image data, video data. Oftentimes you do object detection on the frames.
You pick out particular events, and you might then need to create, let's say, a ticketing system: you're detecting a stop sign violation, and now you need to produce a small video snippet that documents the violation, maybe run a license plate reader on the image, and generate a structured output, as in, issue a ticket against the holder of this license plate for this incident.
And so this is something you can string together with Pixeltable relatively easily once you solve the algorithmic complications, but the data complications shouldn't get in the way. So this is one example.
Another one is doing analysis of e-commerce websites. And this is truly multimodal: you have screenshots, and you're asking an LLM, or an omni model, to figure out the components on the screen, like buttons and so on, and you then feed that into an LLM and ask it to simulate a user action.
Like what is the next thing you want to do? And you definitely have to string together multiple model invocations.
You need to produce an image of the screenshot with the detected interaction elements overlaid on it, outlined, et cetera. So there's a bunch of image manipulation going on there.
And so this is also something you can do fairly easily with Pixeltable without having to think about intermediate storage of my derived image files and so forth.
Rachel: Yeah, I am beginning to see how strong the name is now, it's a way of structuring and ordering any kind of visual data and making it accessible to AI. Is that an accurate characterization?
Marcel: Well, it started in the visual data field, but now we've expanded into audio, documents, and so forth.
So Pixeltable allows you easily to take a video, extract the audio, and then invoke a model to transcribe the audio, or if the audio is too long, break it up into smaller chunks and then submit to a model and so forth.
Again, all of the file handling that would normally be involved in that goes away. And you can just think about the algorithms, the logic of your application, and not so much about, like I said, file storage, running audio extraction, chunking the audio, and so forth.
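The audio-chunking step just mentioned can be sketched as a simple splitting function: a long recording is cut into fixed-length segments, with a small overlap so words at chunk boundaries aren't lost, and each segment becomes one row to send to a transcription model. Times are in seconds, and the parameters are illustrative choices, not Pixeltable defaults.

```python
# Sketch of splitting a long audio track into overlapping chunks
# for transcription, one (start, end) span per chunk.

def chunk_audio(duration, chunk_len=600.0, overlap=5.0):
    """Return (start, end) spans covering `duration` seconds of audio."""
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        if end >= duration:
            break
        start = end - overlap   # back up slightly so chunks overlap
    return spans

# a 25-minute (1500 s) recording in 10-minute chunks with 5 s overlap
print(chunk_audio(1500.0))
# [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500.0)]
```

Each span would then be extracted from the media file and submitted to the model; the bookkeeping of which spans exist and which outputs belong to them is what the table structure absorbs.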
Rachel: Taking a little bit of a left turn here, what are some of the challenges you've faced? You've been a tech leader on a lot of enormous and influential products, but now you're a founder.
How is that different, how is it similar? What was hard about leveling up?
Marcel: It's still going on. We're fairly early in the game, and I'm sure my role will evolve over the next few years, but I think one thing you have to do as a founder is, when you are a tech lead, you are solving technical problems.
You're somewhat limited often in your scope or what you think about, right? There are external parameters that are given to you, let's say the scope of a product or whatever else, you are working in a larger organization.
Other parts of the organization make decisions that impact what you work on. But here, it's completely free form, right? Now we have to define what we're going to work on and what we want to prioritize. And so you have to be, I think, a more well-rounded individual. You have to think about product; you have to think like a product manager in many ways.
You have to empathize with the user, and you have to understand how you can create a business out of what you want to do technically.
And so your technical solution is a means to an end, and you need to understand now, as a founder, what the end should be. Previously the organization told you what that was; now that's up to you.
Rachel: Yeah, no, that's an interesting framing. I often say that the great CEOs are great at triage. What you're describing is the need for a tech lead to impose discipline on themselves rather than having it imposed externally.
And there's an interesting parallel between what you're building and what you're describing in that founders also take an enormous amount of complex data and try to order it and be able to run queries against it.
So I love the ways in which the software that people create reflects what problems they're trying to solve in their lives and careers.
We are clearly in the middle of a very dramatic platform shift from the pre-LLM world to a very different future. How do you think developers should be thinking about generative AI?
Marcel: I mean, when you talk about developers, I think there are always two trajectories. One is, how do you use the technology to create new products? So you as a developer use it as a tool in your product.
And then the other one is, of course, everybody's talking about coding agents and generating code and having AI basically replace software engineers altogether.
So the first aspect, I think we're still kind of struggling to understand what is possible and what it'll allow us to do in terms of new capabilities, things we can build into our products.
And so I think there is a lot of experimentation happening right now. With LLMs in particular, chatbots have been successful and effective, and we're still waiting to see the next wave of, I want to say, LLM-enabled products. But I think those are being developed; everybody's experimenting.
And I think the impact on software engineers is simply that, rather than this being a niche area with more of a research bent, where only PhDs in AI can build products, we're going to see widespread adoption by standard software engineers, and the need for standard software engineering to utilize LLMs, and generative AI more broadly, in mainstream products.
So I think there's that trajectory that's certainly happening. And then there's of course the creating software altogether with AI side and things definitely look promising.
You see a lot of demos, often obviously selected for being effective demos, of AI generating code and maybe putting small apps together and things like that.
My personal experience has been less, I want to say dramatic in the sense that maybe if you're writing code that the LLM hasn't seen before, it's harder for it to generate something useful.
So I think we're still some ways away from an effective coding agent that can really do moderately large things on its own and sort of hit the mark there.
It's certainly very useful for line completion or smaller pieces. But I think we're at the beginning of this process, so hopefully in five years, I'll be having a conversation with my coding assistant and creating software that way.
Rachel: I'll send my podcasting assistant to do the podcast for us. What risks do you see in this widespread adoption of gen AI and what are some mitigations that are available to us?
Marcel: I think the risks are already very visible, right? Deepfakes and everything are getting better.
Take voice generation: the companies that do this now warn that security systems relying on voice verification will not work anymore, right?
You see video generation is now becoming a thing, not quite there yet, right? The examples that look good are highly selected, but things aren't static. In one or two years, things are probably going to look very different.
You already have image deep fakes that are indistinguishable from something that a camera would've produced. And so, I don't know.
Like I said, I'm an outsider in this area. I'm not quite sure what the technical solutions are to combating that, but I see very clear dangers there.
And then there's the ability of everyone now to completely select their news sources, right? To completely insulate themselves from things that don't fit into their worldview.
Rachel: Yeah, another very polarizing technology. A technology that people both have polarized opinions on, and that further polarizes our opinions about other things.
Marcel: Yeah.
Rachel: What are some of your favorite sources for learning about AI?
Marcel: I'm really enjoying the "Latent Space" podcast. Very technical, very varied. And they have interesting people come in and sort of give their story and their perspective.
I've found that useful. There's some newsletters like TLDR.
Rachel: Yeah, love that one.
Marcel: But I have to strike a balance between picking up what's out there and working on the product.
Rachel: Whenever I ask this question, I think about Donald Knuth refusing to answer email and devoting hours of every day to deep thought. Like, hashtag goals.
Marcel: Email, yeah, and now it's DMs and Slack, et cetera, everywhere. So yeah, it's actually interesting.
I mean, I feel like for us too, we have to strike a balance. We're in a hybrid office situation for engineering in the Bay Area: we have an office in San Francisco, and we're in the office two days a week right now.
And we still have to, I feel like, find the right balance: we actively want to enable in-person collaboration, but then there's also the whole toll of, I want to say, message channels outside of that.
So you basically have the ability to be bombarded by messages all day long.
Rachel: Yeah, it takes, again, a lot of discipline and structure to carve out space for focus and collaboration these days.
Marcel: Yeah.
Rachel: With so much noise in the information atmosphere.
Marcel: Yeah.
Rachel: If everything goes your way for the next five years, if you're the emperor of AI and you get to dictate how everything goes, what does the future look like?
Marcel: What does the future look like? I think my view of the future reflects my, I want to say, data background and data infrastructure background, in the sense that I don't have that much of a vision in terms of model capabilities.
Yes, I assume they will all get a lot better and will we see AGI in five years? I don't know, but--
What I really would like to see is that generative AI, AI in general, becomes a mainstream piece of every software engineer's toolbox. Maybe even the term "generative AI" will just fade away, just like "big data" eventually faded away and nobody talks about it anymore, because it's just data. And so, for everyday engineers to be able to use this technology in a minimal-friction way to enhance products, right?
Today it's still a niche area. People need specialized capabilities and so forth. There's still more of a struggle compared to a standard web app, but I think we'll see that going away.
And so my hope for the future is that this will become completely, easily accessible. As accessible as storing tabular data is today to the average web app developer.
Rachel: Last question, my favorite question. A generation ship is a giant starship that takes longer than a human lifetime to reach its destination, and so in some ways it's our legacy to our grandchildren.
If you had such a ship, what would you call it?
Marcel: That's a good question, maybe just Onward. Clearly we are leaving something behind.
Rachel: The passage of time is inevitable. We might as well face it with some grace.
Marcel: Yeah.
Rachel: Marcel, it's been a delight to have you on the show. Thank you so much.
Marcel: Thank you for having me.