Ep. #1, Introducing Open Source Ready
In this inaugural episode of Open Source Ready, Brian Douglas and John McBride embark on a technical and philosophical exploration of the current state of open-source AI. John shares his hands-on experience with building large-scale data pipelines and integrating AI to create meaningful insights for developers. The discussion revolves around the balance between innovation, data privacy, and the growing power of large tech companies in the AI space. They also touch on the open-source community's challenges, including licensing issues and the role of foundations in supporting AI projects.
John McBride is a Senior Software Engineer specializing in AI at OpenSauced. He has an extensive background working with backend technologies including Linux Technologies at AWS, Kubernetes at VMware, and Cloud Foundry at Pivotal.
In this inaugural episode of Open Source Ready, Brian Douglas and John McBride embark on a technical and philosophical exploration of the current state of open-source AI. John shares his hands-on experience with building large-scale data pipelines and integrating AI to create meaningful insights for developers. The discussion revolves around the balance between innovation, data privacy, and the growing power of large tech companies in the AI space. They also touch on the open-source community's challenges, including licensing issues and the role of foundations in supporting AI projects.
transcript
Brian Douglas: Welcome to the first installment of Open Source Ready. Open Source is something that's like, if you touch software, you have touched open source and it's hard to get away from like that's a given.
But what's actually really more in vogue than like that's happening right now is Open Source AI. What I want to do in this podcast is talk to founders, engineers, maintainers, folks who are working in this space specifically, and really just talking about like that and the fringes of open source and what is truly open source, what is Open Source AI.
With that said, I want to get ready to actually introduce our guest. John, why don't you explain to the audience who you are and what do you do?
John McBride: Sure thing. My name is John McBride. I'm a Senior Software Engineer at OpenSauced where I've been building a bunch of our backend stuff, application logic for our API, but also a bunch of AI features and things for our backend to power StarSearch, which is a feature of open source for discovering, using AI generation to discover people in the open-source ecosystems, and sort of like a bunch of different things happening in the open source ecosystem based on near real-time data.
Before that I was at AWS working deep on Linux Technologies before that on a bunch of Kubernetes stuff at VMware, before that, at Cloud Foundry at Pivotal. So been deep in a bunch of backend technologies for a number of years. Very excited to be here chatting with you about all things open and AI.
Brian: Yeah, yeah, and I probably should just mention, I'm B Dougie, Brian Douglas, and host of this podcast and John and I worked together. So I run a product called OpenSauced and we're, at the moment, we're orchestrating developer culture.
So like we shipped a CLI just recently or reshipped a CLI just recently 2.0. John, do you want to talk about like how you've been like open source as a whole? So we got the product like we're going to GitHub repos GitData and you want to talk about how we sort of adjust and build that?
John: Yeah, sure. So obviously, GitHub, you know, has just an insane plethora of code developers and people putting their homework on there. But also, you know, every day, there's new and interesting projects that are constantly popping up.
So when I first joined OpenSauced like 18 months ago or so, kind of the idea was, you know, we'd be using GitHub's APIs and directly calling those via GraphQL or just, you know, HTTPS on their APIs to just like kind of constantly be scraping data.
But we ran into a lot of problems with that and that's like was never going to scale. So one of the big first things that I took on was sort of this like big like pipeline ingestion of just a bunch of events data off of GitHub's event feed.
So today, you can go to api.github.com/events and it's this like kind of constant stream of events. You keep refreshing it and you'll see like new events pop up every single time. It's a very, very old endpoint, tons of different technologies all across, you know, Google or really any of the FAANGs that use GitHub probably use this thing.
But it's like an old API endpoint that ends up kind of feeding you constant and up-to-date information based on pull requests and all different things happening on GitHub.
So we were able to use that kind of at big scale across really every event happening on GitHub to sort of rebuild a like ginormous public data set of GitHub data that represents, again, everything happening, but then all the things that are like kind of the current snapshot of GitHub repositories, every GitHub repository at scale.
So the way we did that then was to use a time series database, specifically, timescale on Postgres to actually ingest that data, store that data, and sort of create like a data lake of kinds that our API backend can use with that data lake and with our API backend.
Then you can go to OpenSauced, you can see all the graphs and the charts, which is kind of those representations and snapshots of events of things happening across GitHub repositories or users for all of GitHub.
It's a similar approach to maybe how the GitHub archive approached this with, you know, kind of constantly ingesting data and then like taring those up in like nice little zip tarballs.
But we took an approach of, you know, really being able to do that in real-time so that, you know, in theory, you can open up a poll request and then that'll show up on OpenSauced like five minutes later or something, or however long it takes for that to flow through this ginormous data pipeline.
Brian: Excellent, so like I wanted to actually talk to you about building a, like a large scale RAG. You're actually giving a talk at KubeCon. What's the title of your talk at KubeCon?
John: Ooh, that's a good question. I think it's a spicy talk name. I think it's like Building Huge Scale RAG Applications on Top of Kubernetes or something. But, yeah, the idea is, you know, to kind of walk people through how we used a lot of that public data set or, you know, that kind of data lake you kind of can think about to actually build a huge scale RAG for StarSearch on OpenSauced.
Brian: Okay, excellent, yeah, so like should we explain what a RAG is at this point?
John: Yes, yes, absolutely. My background is definitely not in like data science or machine learning, I guess I like to masquerade as these things, but eventually, you just get good enough at them that, you know, you end up on a podcast talking about it. So here we are. You too, audience, can learn about AI and machine learning and all these things.
RAG stands for Retrieval Augmented Generation and probably, the most important part of that is the retrieval part. You know, we've all used at this point like ChatGPT or Claude or Gemini or whatever, and that's really the like generation part of it where you have some transformer model that you know, can take some prompt and then do a bunch of text transformation and, ultimately, spit out something that, you know, ultimately, is pretty good, be it code or, you know, a recipe or something.
But oftentimes, these transformer models haven't been trained with the most up-to-date information. Like it's just kind of a problem that, you know, maybe isn't being solved for. But really it seems like the industry has approached it with like, every six months there'll be like a new model that maybe has like up-to-date training information or something.
So if I wanted to ask about like, ooh, who is Travis Kelce dating or something. Taylor Swift.
Brian: I mean that's a good tongue-in-cheek question. Yeah, but it's hard to like not jump in and also like interview myself in this, 'cause I had a hand in this like development, but like one of the things I was testing for was ask like, what was the last major release of MS-DOS?
Like that's open source today. And like that data is existing on GitHub, but when you ask that on like a ChatGPT or U.COM, it's like, oh, well Bill Gates, like he was the one that was CEO and like when it was developed and like these engineers whatever's on Wikipedia and I think what we're seeing and like I think ChatGPT's actually gotten better at answering questions about code and engineering, et cetera.
Well, like code's been fine, but like who are the folks who like built the thing that's been a challenge 'cause that they don't have the debt data vectorize in like part of their dataset.
John: Yeah, or if that changed tomorrow, like that's another thing where if they cut another release of that or there was like some big new feature that showed up, you know, at its base, these base models just wouldn't have that information.
So, yeah, the big part of that though is really the retrieval where you can use big chunks of text that you put into like a system message for one of these AI models or even getting more fancy where you go and retrieve that information from a database or from some kind of vector search or you know, whatever.
There's all different kinds of techniques and things you can do for like that actual retrieval, but, ultimately, in the retrieval augmented generation, you're like taking some additional piece of information and really kind of peppering the model with like, you know, like, "Hey, by the way, here's some like additional information that you should know about or that is relevant to this user's query or question."
And then it can go and generate something like excruciatingly relevant based on that information you've given it.
This is pretty close to sort of like Perplexity's core product offering it seems, where like they can do a bunch of stuff with AI generation and text generation, but really like augmented based on retrieval from like the entire internet, which, you know, is obviously, something Google is kind of doing now as well where you can ask Google a question and, you know, Gemini will pop up with an answer right there.
That's, ultimately, like a RAG workflow where it's giving additional context and information to a model.
Brian: Yeah, so I did want to talk through like the approach for you building a RAG and like how you sort of found this context and like where were the sort of bells and whistle? What sort of technologies did you reach for to build this?
Because there's a lot of choices and I think I've had a lot of conversations about, well, when I explain like StarSearch, like what open source is and, oh, yeah, did you just use OpenAI and like use embeddings and like, yeah, sort of like we did that for a couple weekends but then what we figured out is like, that's expensive. So like do you want to share a bit of that story?
John: Yeah, yeah, absolutely. So knowing that we kind of had this big data set in Postgres and really I guess like a constantly growing data set in Postgres.
So, yeah, initially, we tried a bunch of stuff with OpenAI, you know, sort of took the initial rudimentary approach sending a bunch of stuff to OpenAI, getting it to like generate text for us and stuff.
We have this big data set basically on Postgres that we can access and do a bunch of stuff with. And we started to explore not only what a sort of vector search on top of that data would look like for obviously, very fast and accurate retrieval of relevant information, but then also how we could cut costs and save by running a bunch of inference using OpenAI models ourselves.
And that ended up being kind of a bunch of this backend infrastructure around our Kubernetes clusters that we have for our backend, but then these like little services using something called vLLM on Kubernetes.
And really what that is, is just kind of an OpenAI-like API but something we can run on our own GPUs and hardware that we get on our Cloud to then do inference.
So we have GPUs, we can do inference, we have what looks like an OpenAI-compatible API to then still use our clients to hit, you know, the various endpoints of things for being able to do this RAG-like flow. And then that's, ultimately, what gives us the capability to provide our users the StarSearch interface.
Brian: Yeah, and like StarSearch was, the goal was like, can you find folks within open source? Specifically, like, "Hey, who were the folks who were like last maintained this?"
And like what was interesting about this is like everyone focused our, like we talked about the problem of like the dataset don't have the right context to ask the question of like, "What was the last release from MS-DOS or who are the maintainers for this project in 2017."
But that was what we were trying to solve with StarSearch. So we ended up indexing a bunch of issues and PR data, which gives you like a stronger signal outside of that. But we're essentially scraping GitHub, like the event feed, it's open, it's been public for years.
That's our secret sauce and how we've been doing that. But I'm curious to get your take on something that actually shipped recently. So this is like Happy Birthday to Cloudflare.
So at the time of this recording, it's during Cloudflare's birthday, and they just announced like this Cloudflare AI blocking tool, which I know this is like this was talked about in past, so I'm not sure if this is like the official like add a Beta launch or if this is still Beta.
Actually, I missed that in the article so I dunno if you actually skimmed that or caught that announcement. But in the same vein, like Google paid $60 million for Reddit's data and I think the biggest miss is data when it comes to like building AI on AI.
So like all legacy Cloud, all legacy startups, and companies are having a field day and like implementing AI in their platform, but not everything's going to be accessible.
So I'm curious like do you see things like things like StarSearch or other projects where people are like-- ChatGPT, for example, where you have like this sort of all-knowing Oracle of a chat, like do you think that world's sort of like constricting?
John: Yeah, that's a good question. I mean personally, I do sort of see it constricting. You know, I've even seen, you know, like where these bot scrapers show up on like my blog, which you know, is sometimes like the most unhinged things.
Like I wrote this whole thing about how I've used tmux to like launch services and provide like a service mesh almost, just disgusting if you know anything about tmux or trying to like deploy services.
But, ultimately, I think it kind of ends up being maybe a few tiers where obviously, there's like the lowest tier, there's like OpenAI, there's Google, there's even like Meta with the Llama models, you know, they're trying to find like good raw text and data that they can feed into the next generation of transformer models, which will, ultimately, be, you know, GPT 4, 5, 6, you know, et cetera, et cetera. Llama 3, 4, 5, 6, 7.
But we've sort of like run out of data, like they've just like scraped the internet at this point. Yeah, and you know, they're going now to places like Reddit where there's like really good new data showing up that is like pretty consistent and of somewhat good quality.
Ultimately, OpenSauced sort of exists in like another tier above that where like we are accessing sort of these like, almost like sort of what I see as like public goods and public data inside of open source repositories that, you know, enable like a brighter future.
So maybe I'm biased because like I built the thing, but I do envision it's kind of going to be restricting or you know, like maybe there'll be like community pushback, ultimately, where people like me with my blog will say like, "I don't want that, I don't want some mega-conglomerate to be able to, you know, soak up a bunch of like my hard work from my blog or something to then, you know, be able to like kind of pass off as its own kind of thing ultimately."
So this is another interesting thing that Perplexity has been asked about in the past because they're sort of on the forefront of this where, you know, they're trying to like compete with Google or like compete on the search engine level where their, I guess sort of value add is like adding text generation via models on top of all that.
But it'll like generate a bunch of stuff based on things that it like went and read and scraped from like the top search results. But like very often, you don't find yourself being like, "Oh, cool, thanks, Perplexity, I'm going to click through into something," like maybe if it's like, "ooh, this is a really good article, you should go read this or something."
But usually, you're just like, "I got my information, I'm done." I'm not going to like give ad revenue to, you know, news publications or people who maybe have like, you know, memberships set up on their sites or something. So I guess to answer your question, I do envision it kind of restricting in the future.
Brian: Yeah, I mean I wonder if like the world becomes where like if I leave Facebook, for example, which I haven't, I haven't left Facebook, so feel free to find me. I won't add you 'cause I haven't logged in in forever, but like I had to export my data and I could have like a data set of like college years to like when I had my first kid, it's like when I kind of stopped using Facebook.
So like I have this data set, we all have these like data sets that it's like our personal, like actually a better example is like Apple Intelligence.
Like I've got so many photos, again I mentioned I have kids and pets, like of a dataset of like, "Hey, tell me about that one time when I did this thing at this one place" and I could use a geolocation feature within Apple, but if this like, "Hey, Apple Intelligence, retrieve me this like augment and retrieve this piece of data."
And I think Apple Intelligence is actually really interesting. I don't know if it's like the be all end all like the actual example of this, but I think what they're doing is really interesting.
So I guess my question to you is like, do we see more of a further lockdown of like our public data out there and then we sort of protect that so that way we have a personal dataset to do our own personal assistant copilot on our phones and et cetera?
John: Yeah, that's a good question.
I mean, I would love a future where it's like my hardware, my data, my AI models, and like I don't have to worry about these things getting kind of sent up to the Cloud and then it's like, you know, putting money in the pocket of others ultimately.
But like, I do think the Apple Intelligence case is really interesting 'cause they seem to be investing in kind of that approach a little more where they have like small local models, I think, running on your phone. I think they also have cases where it's like going to go up to OpenAI to like do stuff with that partnership.
But it'd be a beautiful future if like these things could work really well on like consumer-grade hardware that, you know, ultimately, could still serve good purposes.
There's like really good models out there that are so tiny and still powerful enough, I would say, for like most use cases, like I've been playing around with this new OpenAI Strawberry or whatever it's called o1 or something super overkill.
Like there's very few cases I actually need this thing, you know, usually, like the mini models like Llama 3 that can just run on a MacBook gets me most of like what I need or at least like most of the way there. Granted I'm not trying to get it to always like do the most correct thing, you know.
So if it's like, got to be like pretty close to 100% correct, then maybe you're off. But, yeah, I'd love if there was a future where it's a little more of the kind of data sovereignty that like individuals were, you know, concerned with and then, you know, using your hardware for like your stuff.
I don't think that's too far off from like a typical open-sourcey person's kind of view of like, how they want to run software and like exist in the world of technology. Like I'd love if I could or if I had like infinite time to, you know, like run a bunch of services at home.
And I know people who have like huge, crazy mass setups with like a bunch of local media and movies. They don't use streaming services, 'cause you know, it's that same idea where it's, you know, it's like, "Oh, I'm on a streaming service, it's not my media. They could just take that away tomorrow. And it's not my media, you know."
I don't have enough time for that, honestly. I wish I did but, yeah, I'd imagine that there's going to be a lot of those people in that camp similar to how Open Source exists today, right?
Brian: Yeah, it's like, it's not magic. There was one that like recorded your every move on your computer and then you could always like recall it.
John: Oh, I know what you're talking about. I can't remember either, but-
Brian: Or folks, and I mentioned me, @BDougieyo on Twitter when you let me know what that is when this goes live. But I was thinking like things like your own Plex server, your own home assistant.
Like I can run my own network, my own automation through my entire house. I could also run my own streaming and like have all my kids' videos and soccer games like on a server. And I don't know if it's like too unlike like I'm collecting all these hard drives and data of like the life of my family and then kids and like every conference talk I've given is in a Dropbox folder.
So like if I'm like, cool I want this to have teeth into everything, let me go ahead and like structure this into like a RAG so that I can pull up and be like, "Hey, you know, I mentioned a thing in a conference talk, but I can't remember what it was."
Or "I was on a podcast and I can't remember when I did that." And "I did a podcast" is a great example but it's also already, it's a proven like demo that most of these AI companies are already shipping. It's an interesting thing and an interesting space we're in today. And I think as the space of open source, but also how we treat open data like public-facing data, it's evolving very quickly.
John: I mean, I don't see it as being too dissimilar from like hardware races in the past either.
Like we could continue this example of like using media locally at home and like people today are running, you know, 5, 10, 15 terabyte masses at home, which for people who aren't aware, those are like locally networked banks of storage that you can have like with like mass amounts of storage for your movies and photos and things.
But 10, 15 years ago, you know, you'd be lucky if you got 100 gigabytes and you're still probably spending a lot of money to like have all that network storage. Whereas today, you could probably spin up a really powerful mass for, you know, a couple hundred bucks.
So my hope would be that, you know, the GPU craze ends or it just kind of outpaces itself and you know, Moore's laws continue and GPUs get better and better and better. And consumer-grade GPUs also continue to get better, better, better. And you know, it would not be unreasonable in the future maybe that you or I have a cluster of GPUs at home for a couple 100 bucks that you know, can run Llama 6 or something as like a local, you know, McBride household thing that just works and works really well.
I'd love that future 'cause I don't want, you know, the powers that be to continue to like hold all the power of not only training models but controlling like mass amounts of hardware and then yeah, keeping the little guys like us outside of being able to use those things at all.
Brian: Yeah, so I think there's like a, as like I did this podcast and like talk about open source and AI. I think there's going to be an evolution in this space and it'd be interesting to go from like episode one through, hopefully, we get up to 150, 200 podcasts and like really see this space evolve.
Like I think it's already been super quick. Like I only just felt like I got on the conveyor belt of AI, what, 18 months ago, and what's changed and like what's accessible i like overwhelming on what's out there. So that's what we're hoping to do.
John, normally, like at the end of the podcast, people do like picks, but what we have today is "Reads." So I guess my question to you is, are you ready to read?
John: I'm ready to read when you are.
Brian: Cool, so "Reads" will be just basically, articles, projects out there on the internet that like will have our guests, I'll bring some myself and it'll just be like some light reads that we could talk through.
And it's actually been kind of hard because like all the reads have been really relevant to the conversation we've had. I did that intentionally but first read is actually the Llama 2 license, actually, the Llama license in general.
And I know you'd spent some time digging into that and I'm curious like some people are concerned about Llama and like what Meta is trying to do like at Zuckerberg trying to like own all the future of AI. I don't know like, but you know the, you know, the Llama license 'cause I know you reviewed it.
Do you want to give us a quick rundown of what it is?
John: Yeah, disclaimer. I'm not a lawyer but I like to masquerade as one. So here we are.
You know, I think the biggest thing with the Llama license is essentially, it's an effort for them to keep their competitors from monetizing or creating a service around these things. Much like, you know, a ChatGPT thing or something or using it as like a dependent piece of something.
You know, if you make X millions of dollars or something, which is basically, almost nobody that exists really in just a category for like OpenAI and Microsoft.
But there are some like strange, you know, little tidbits of it as well where like you're basically required if you integrate one of the Llama models into, you know, your pipelines or things to have like a disclaimer that's like powered by Meta's Llama or something as kind of like a little kickback to them or something.
And I think that just keeps them kind of in the mind share of people and developers that it's not something that just fades into the background but also ensures that Microsoft's not going to put that on whatever. And it would be like an obvious legal problem if they did.
So, yeah, it's super fascinating and almost reminds me a lot of business source licenses, which are, for the open source people, are very hard copy left licenses that can almost kind of infect other pieces of code in theory. I don't know if this has really been tested legally, but a business source license and in theory, would require other pieces of software that integrate with it potentially to also become business source licenses and open source.
So these big Cloud providers that end up shipping platforms on top of Open Source. Maybe a good example would be like AWS's use of Elasticsearch or Grafana,
Brian: Which is actually very, very timely in the last couple weeks.
John: Yeah, very, very timely.
Brian: With the announcement of the Open Search Foundation.
John: Exactly. Exactly. The Meta license feels very similar to that where it's like, you know, they sort of want to ensure their own interests are intact and their biggest competitors can't, you know, just go and use it as a service or something.
Brian: Yeah, I'm happy to hear it 'cause I always wanted to pick your brain 'cause I'm thinking about this a bit with like the Mistral models.
And like everything for them is Apache too. So are they leaving too much on the table of somebody eating their lunch in the future? Or you think they have a, I guess, defensible business?
John: That's a good question. I mean, I think they've created a lot of innovation just like they were one of the first to have this, what do they call it, the mixture of experts kind of approach where, you know, it's one model but you know, it it has a lot of these quote-unquote "Experts," that kind of are distributed within it and they can, you know, do whatever to actually give you like really relevant information based on pretty small number of parameters so it can run on smaller pieces of hardware.
So they're doing tons of innovation as far as like models they're pumping out and stuff. You can maybe think, and I haven't kept up with them super well in the last few months, but I sort of almost view them as like a research institution, maybe what OpenAI started as before they are going like full-on product approach.
So maybe there'll be a point when the Mistral people go product instead of just like OpenSearch models and things. But one thing I think about, I think about this a lot actually, is this thing Jeff Bezos said, years and years ago, which is, "Your margins are my opportunity." Something like that.
Where basically, what he's trying to say is that like, businesses and opportunities that maybe seem like too thin of a margin or something that like wouldn't be a good enough business to go into or just like, "ah, we can't disrupt that, like, eh, it would be too hard or something" that really is Amazon's bread and butter.
And I think why they've seen like such massive success not only in retail but Cloud as well, like Cloud is expensive, and you know, they've done incredible success with AWS.
And I could almost see that being a similar approach with something like the Mistral models is like, yeah, our opportunity, ultimately, is to be able to like source open these and put the weights out there and give them an Apache license so that they're like pretty permissible and they're still going to be like excellent.
Like that's a good way to win the hearts and minds of developers and then maybe 10 years from now create like a really, really excellent product, I'm not sure.
Brian: Yeah, I mean there's a lot because like you mentioned Kubernetes early in your intro and like all these like big infrastructure tools that how to open source, like open source first and establish itself as like the standard.
And like Elasticsearch being one of 'em that has a business, but then obviously, their license went back and forth as it recently went back, but then you have Amazon absorbing Elasticsearch code to make it OpenSearch and now it's fully open-sourced again. Well it was open source before, but, yeah.
John: Yeah.
Brian: There's like a lot to be said, like, who knows who's going to win. Like I think my bet would be on Opensource. I think I'm biased, but I think time will tell at this point. Like OpenAI now going as of today of this recording, like a lot is happening.
So like, I don't know how topical these podcast episodes will be, but today the CTO announced that she's going to be stepping down, OpenAI is going to go into fully for-profit mode. So like the open part of AI is now API maybe is open, but like the parts that are open are not going to be, not for you and me, but we can definitely pay for it, for sure.
John: Yeah, exactly. Yeah, it's a weird time. I mean, hey, that's another thing I think a lot about is like the kind of bizarro state that like the ecosystem of Opensource finds itself in.
Like, there's a lot of movement happening obviously, in the market with people, you know, moving jobs, getting, you know, laid off, et cetera, et cetera. And like I think truthfully that the model that worked for a long time with Opensource was sort of this like kind of pay-to-win thing.
Like even Kubernetes is a pretty good example where that was a Google thing for a very long time and then other clouds got involved like VMware, like AWS because they wanted to provide a Kubernetes product that other businesses could buy, services, et cetera, et cetera.
And that even then birthed more open-source initiatives like Ranchers, K3s, and VMware had this thing that I worked on called Tanzu Community Edition, which was like a purely open-source approach to VMware's flavor of Kubernetes.
And like in theory, that would've just like continued to blossom into like more and more open source and it's like, you know, all going to work because there's like business behind it and like developers are going to use the kind of freemium thing that works really well, but then we want to scale the business and the enterprise around it.
I think maybe not obvious to everybody, but feels obvious to me, you know, with a lot of like layoffs that happen or even like massive acquisitions like Broadcom's acquisition of VMware, a big kind of hit to the Kubernetes community was a lot of people within VMware just not working on Kubernetes anymore.
Because it's like, you know, that was their job for a long time. And I think, ultimately, you know, it's a huge, huge lift to be expected to continue to work on huge-scale Cloud infrastructure technology when you're not really getting paid to do it, right?
So I worry that, you know, we're going to find ourselves in a scary situation where like the maintainers have kind of dried up partly because the business around open source is kind of dried up.
That's my sort of skewed perspective though. I definitely admit that, like, I'm probably jaded coming from, you know, a Broadcom acquisition through that, right?
Brian: Oh, it doesn't show at all. But yeah, I mean the benefit we get from folks like the CNCF who like, there's very clearly like a money operation that like funds KubeCon and funds all these other incubated projects and community around that. Like we've got that.
I alluded to a couple times, Open Source Foundation started, like initiated last week at the time of this recording. So like the hope is that we can now have perhaps an AI foundation that can help, like if it's Mistral or if it's somebody similar, like folks participate and also move progress forward, absent of like, well obviously, there's incentive of like, yeah, we're going to make a profit but we want to drive innovation.
And like if a company gets absorbed into Microsoft AI or AWS or whatnot and like everyone else is left holding the bag or when it comes to maintainers, like everyone else is holding the bag of, everyone's getting jobs at Google and Amazon, but no one's here to do the open-source side.
The hope is that these foundations can help be like, there's already a proven pattern. So like not the only solution, but that's the hope. So if anybody listening wants to start open source AI foundation or maybe there's one in the works, you're invited to come on the podcast.
John: Amen, Yeah, I think that sounds great and like you're right, ultimately, there's a lot of good stewardship and shepherding that's happening within those communities and that's absolutely essential.
Brian: Yeah, and like we talk about licensing, we didn't even get to the part about the OSI AI license about like what is open source when it comes to AI. So like the OSI, the open-source initiative is actually been doing weekly meetings like all summer and are about to announce specifically the details and like rubber stamp what Open Source AI is with a license.
So like think of similar like to Apache 2 or BSL or GPL, this will be another license that will be out there and perhaps we'll see adoption from folks like Mistral and other folks that are doing open source foundational models and adjacent.
John: Yeah, it'll be really interesting to see what happens with those 'cause I mean it's pretty hard to apply, you know, like an MIT or Apache license to these things 'cause it's very different from traditional software where there's weights, some people even release like the training set of data that they use to like actually build the models and stuff.
I think traditional licenses worked really well, obviously, for source code, but this is like a whole different domain with really like how do you license giant chunks of data and maybe those chunks of data don't always like belong of freely licensable things like copyrightable material example that kind of gets abstracted away inside of the weights and the models themselves, right? Very interested to see what will happen next.
Brian: Excellent, yeah, so if it wasn't clear for the listener, 'cause I didn't really call it out, but we had two reads, one being a Llama 2 license, the Llama license in general in open source and the OSI AI license, but we kind of blended that into like one big conversation.
So, John, I appreciate you chatting with me. At this point, that completes the Reads. And listeners, stay ready.
Content from the Library
The Kubelist Podcast Ep. #45, Live from KubeCon 2024
In this special episode of The Kubelist Podcast, recorded live at KubeCon 2024 in Salt Lake City, hosts Marc Campbell and Benjie...
How to Make Open-Source & Local LLMs Work in Practice
How to Get Open-Source LLMs Running Locally Heavybit has partnered with GenLab and the MLOps Community, which gathers thousands...
Open Source Ready Ep. #6, The Infinite Nature of Software with Adam Jacob
In episode 6 of Open Source Ready, Brian and John are joined by Adam Jacob, co-creator of Chef and CEO of System Initiative, to...