Open Source Ready
33 MIN

Ep. #2, Defining Open Source with Avi Press of Scarf

about the episode

In episode 2 of Open Source Ready, Brian Douglas and John McBride speak with Avi Press, Founder & CEO of Scarf. Together they explore the evolving definition of "open source" in the context of AI and the future of technology. They also examine broader issues surrounding licensing, business models in open source, and the challenges associated with both.

Avi Press is the founder and CEO of Scarf, a company that provides open source usage analytics and helps businesses with sales and marketing intelligence for commercializing open source software. With a background in software engineering and a passion for supporting the open source community, Avi focuses on bridging the gap between open source projects and their commercial potential.

transcript

Brian Douglas: Welcome to another installment of Open Source Ready. I got John here as the co-host. How you doing, John?

John McBride: Hey, I am doing good, Brian. How are you doing?

Brian: Excellent, yeah, you did so great the first episode, we're going to make you co-host moving forward.

John: Hey, I accept.

Brian: Excellent. So I appreciate you accepting 'cause we already hit record and I don't want to edit that out.

But today we actually have a guest to talk about, actually, we'll ask Avi what to talk about. But Avi, you're here.

Avi Press: Hello.

Brian: Avi Press, how you doing?

Avi: I'm doing great, I'm doing great. Happy to be here, thanks for having me.

Brian: Excellent. So Avi, what do you do and why are you here?

Avi: So I am the Founder and CEO of Scarf. We're a company that focuses on open source usage analytics and we help companies with sales and marketing intelligence for those that commercialize open source.

Brian: Cool, how long have you been working on this for?

Avi: It's a project that I started back in 2019, but the company's about to have its fifth birthday on Monday, which is really crazy to say at this point.

But yes, we've been at this for a while now and you know, we've gotten to the point where there's billions of software downloads happening on our infrastructure, we're quite at scale now, and have really, I think really made a dent in helping provide more visibility into software usage and practice, and the open source world, which I'm very proud of.

Brian: Excellent. Yeah, well I'm super excited to talk about open source with you here at Open Source Ready, the podcast where we talk about all things around open source, licensing, AI, but actually we're going to focus on the licensing part 'cause there's a bit of a conversation around the OSI's effort to create the open source definition, and that specifically is what we'll talk about today.

And I reached out to you, Avi, 'cause I saw you on Twitter engage, or I should say X at this point, engage with somebody else on this subject. So do you want to explain what this open source definition thing is?

Avi: Yeah, so you know, in the same way that I think a lot of folks, there's been tons and tons of discussion about the classic open source definition, and what it means to be open source and when things are and are not open source.

And you know, with the rise of all the AI technologies and with the rise of more open AI tooling, there's the question of, well, what does open source mean in this new context of AI, AI models, et cetera?

And I think what we as a community are grappling with right now is that the classic ways that we have defined open source software don't cleanly map over into the world of AI when we have data that you have to train models on, you have weights that you might need to ship with, and basically just a whole new stack here. And the definitions that we have used classically for software do not cleanly map over to this new world that we are in.

And you know, the OSI has been working for a while now to define what that means, but I think there's a lot of, there's a lot of discussion and disagreement about how to do this, if we should do this, when it should be done, et cetera, so basically every aspect of this I think is very much up for debate right now.

Brian: Yeah, honestly, I've been watching the debate happening in like the discussions in the open source forum. It feels like this has been a long tale. Like, how long are we talking about, like the discovery of a definition?

Avi: Yeah, I guess, I'm not sure. I mean the first time that I heard people within the OSI talking about this, I think was like, you know, back in, I want to say 2021, like this has been talked about for years now, that this has been in progress.

But I think that there had been a lot more like in earnest, like workshops at conferences that like the OSI had hosted for people to come in and provide their opinion and their feedback on what should go into the definition, and I think--

I believe we're at least three years into talking about this, but perhaps more, I'm not sure how long it was kind of baking within the OSI before I was privy to it.

But I think once ChatGPT came out, I think this was just on a lot more people's radar than it had been before.

John: Yeah, it looks like one of the first RFCs for the open source AI definition was this time last year, in October or September of 2023.

So even before that I'm sure they were, yeah, going back and forth with community members and trying to figure out what that definition could be.

Avi: Yeah, so I mean I guess like a lot of stuff goes in until an RFC has come out from an organization like that.

And so I think it's one of those things where there had been people thinking about this for years and years, and now I guess, yeah, 2023 when they finally start to make any kind of hard stake in the ground.

But you know, I think one of the things that came up a lot with open source software was people, like, say coming out with like a BSL license and calling it open source and everyone gets very upset with that because that's not open source, right? It does not follow the definition, and then-

Brian: Could you explain the BSL, what that is for our listeners?

Avi: Sorry, yeah, thank you. So the Business Source License, which a lot of people call BSL, others I think more officially BUSL, is a kind of category of software licenses where the only real restriction on usage is that you're not allowed to compete with the producers of the software.

And then after some amount of time, the code becomes open source, it converts into say an Apache 2 license or these kinds of things. There's a whole initiative called Fair Source now that's trying to standardize some of these things even further.

But you know, overall, there were a lot of database companies that had relicensed, you know, so like you know MariaDB is BSL, well, like Elastic had something similar to that for a while before they switched over.

But yeah, there were a lot of database companies doing this and then there was this whole discussion about like who can call themselves open source and who can't and how this could be very damaging to the open source community if it gets open washed as it were for us to kind of gray the boundaries of who can and cannot call themselves open source.

And so when, you know, for instance, Meta releases Llama and calls it an open source model, but has restrictions on how large companies can use Llama, people very rightly point out, "Hey, this is actually not open source, you shouldn't call this open source."

Which then has a lot of people, you know, rightfully saying, "Oh well, we need a harder definition on what is and is not open source AI." And so I get where it comes from.

But yeah, I think what we should do about that is very unclear.

Brian: Yeah, and I don't know if we've even crossed the chasm of like what is open source?

'Cause there is a strategic definition of that today, and honestly I can't even recite it today 'cause I'm like, I don't know, people have different ways to sort of pen themselves and like position their approach to what is open source, but like source open, open source, open core, it's pretty muddy at this point.

And you spend a lot of time like sourcing open source data for customers. Is there a clear distinction of like what is open source for you?

Avi: I very much understand what the like classic open source definition advocates will say here, and, you know, you kind of have the certain freedoms that must be protected, right?

So anyone can use the code, modify the code, redistribute the code in a permissionless kind of way, and those are really hard lines that are drawn by the open source definition and you either clear that bar or you don't. You know, things like the business source license simply do not clear that bar, full stop.

You know, in my opinion, I think it would be good if there was more gray area to this and more nuance to this because I think over time things do get messier and there's utility in having a bit more of a flexible framework to think about this.

But you know, I see where the more traditional folks come from where they say, "We fought really hard to get these particular terms to be defined and respected and meaningful, and it would be a step backwards to let people gray those lines that have been drawn over a long period of time."

And you know, I'm sympathetic to that argument, that's why there's folks like the fair source people who wanted to just make a new term.

These are software licenses, right? This is for software and what I think we're all running into right now with the AI definitions is that data and software are different. There are many places where they don't behave the same way. They're definitely not governed in the same way in most countries.

But I think what makes this even more challenging for me especially, so I think like all of the functional programmers listening to this will say, "Well, what's the difference between code and data? They're the same, they're inextricably linked."

And I'd say you're right, but these things are meaningful when it comes to open source AI because if I give you a model that you can use and redistribute but you couldn't reproduce it yourself, well, now we're kind of in a world where some of the freedoms that we started with are in jeopardy, I guess.

It's not necessarily the case that you can take something off the shelf and reproduce it yourself. And so there's a lot of really unclear things there about how we should reason about all of these new pieces.

And I think what makes it even harder is that this is a moving target because the architectures of AI right now might not be the architectures of AI in a year, and we're currently making definitions that kind of assume a particular architecture, which I would say is mistake one of probably many here.

John: I think that usage of the word "freedom" is really fascinating and probably does resonate with a lot of people, you know, in the Linux, GNU, whatever ecosystem, the more traditional open source side of things.

But how would you define freedom within AI or like freedom within AI products or AIs that happen to be open?

Avi: Yeah, I think the RFC definition does take a bit of a stance here in terms of the freedoms that it claims need to be kept here. And so I'm just pulling it up.

So yeah, you need to be free to use the system for any purpose, study how it works, and introspect its components, modify the system, and share the system.

And I think one thing that's really interesting here I think for the first time is that you might be free to do something but it's still basically impossible, right?

Like Meta could spend, you know, $15 billion training a model and say, "Okay cool, here's the data and here's the model, you can do it yourself," but like I don't have those resources and no one other than like Meta is going to have those resources. And so what good is that freedom in practice?

John: Yeah, it's funny that they released the 405 billion parameter Llama model because it's like who's running that unless you have a bunch of H100s sitting around, but like, on consumer grade hardware, no way.

Avi: Well, I guess this is maybe where I'm not so sure. So does that mean it's hard to run it or hard to like fine tune it or retrain it to that? I don't actually know the answer to that.

John: It just has so many parameters that sticking that on a consumer grade GPU or a MacBook, on like Ollama or something, it'll work, you're just like using a bunch of swap memory.

It's just wildly inefficient and it just doesn't quite work like you would expect the chat interface of something like ChatGPT to work.

Avi: Yeah, there's just so many things that are really thorny here because, in practice, the data, the volume of data required to train these models is so large that we're running into all these problems with like copyright around the data that it's being trained on.

And so, this poses a lot of challenges here because I think the more traditional open source folks say the whole stack all the way down needs to be free to use, and have all of these rights granted, and it looks like where they're landing is one where there's kind of an acknowledgement that some of the data might not be attainable.

I think they use the word unobtainable data, so I guess that's one question. Like, if you can't obtain some of the data, should it be an open source model or not?

I don't know the answer. I don't know the answer, but it seems unlikely to me that any useful model will have 100%, you know, licensable data all the time, and now we're muddying the waters between are we talking about licensing data? Are we talking about code? Are we talking about both?

Brian: You bring up a great point.

So you mentioned the issue of timing right now for having an open source definition because, like, if we were talking about GPT-2 or whatever that number was where it was still open sourced, there's a meme right now about "Closed AI" rather than OpenAI.

So like it's a closed source company, like maybe they open source some old stuff, but today everything's closed source, we got other companies that are building models that are closed source.

And then we have like a Llama that's open source. We got Mistral that's open source, Apache 2. I don't think anybody anticipated it moving this quickly.

Let's talk about some other problems that you see with the definition today outside of just like timing and trying to attach yourself to a moving conveyor belt.

Avi: Right. So I mean, just to speak to the timing piece a little bit, like, if the point of the definition is that we want to make it really clear what is and is not open source AI, well, why do we want to do that, right?

We want to do that to make sure that when people say that, it means a very particular thing, and we can depend on what that thing means.

Well, people have been saying this is open source AI now for the entire time that we've been in this AI wave, and so I claim it's too late, the ship sailed. Like people are already using the terms however they want to use it and any organization can say, "Hey, you're using this term wrong." But if people already use the terms in the way that they use it, like cat's already out of the bag, what are you going to do? Like there's nothing that the OSI can do to make people use English in a certain way. Unfortunately, they do not have a trademark on this term open source as much as they wish that they had done that, they don't.

And so it's a little bit tricky now and so what I would've liked to see is urgency to, you know, if this is really so important, well, let's get a definition out as soon as possible and work to improve it.

But I think where we've landed is it wasn't around when I think it was really needed and we still don't have a definition that seems very good quite honestly.

Brian: Yes, I guess my question would be like, are the right people working on this?

'Cause like the OSI is an organization, I believe like nonprofit, folks are participating and donating their time to it, but if we had Microsoft like tell us the open source definition, would that be the right move?

Or like, I actually just honestly earnestly curious, like, are the right people working on this thing?

Avi: I'm not sure, yeah. And you know, no disrespect to the folks at the OSI, I think this is a very noble thing to take up and lead the charge on. But the OSI is a really small group of people.

There are not very many AI experts on that board, and the people who are actually doing a lot more of the concrete work on AI, yeah, are at Microsoft and OpenAI and Google and all these other companies, and I'm not seeing a whole lot of participation on that front from these companies and I don't really see why they would really care to, I guess, I'm not sure.

So I guess I think it's a great question, Brian. I don't know if the OSI is the right group. I think some people are arguing that they should stay in their lane more towards software, and now we're talking about stuff that is beyond software.

It's an interesting question and I think I'm not convinced that this is the correct setup, but I also am not exactly sure what would be better. I just know what I'm seeing, I see a lot of issues with.

It's easy to criticize, for sure, it's a lot harder to do something, but the OSI is a really small team. There's not a lot of resources for this.

Brian: I'll propose a different angle too 'cause like, John, you spent a lot of time with the CNCF and Kubernetes, and like foundation. Is there a world where we see like an Open Source AI Foundation spin off instead?

John: I think that's great, I would love that. And I think especially given the number of things that encompass what ends up being the end product of like a quote, unquote "AI model."

You know, it's not just the bits of software that train it, the underlying dependencies, something like PyTorch, or NumPy, or whatever, the actual training contents which may, like we were talking about, could include copyrightable material. You know, that's complicated in itself.

So it's like almost you would need this like bigger foundation to encompass more of those things, the software dependencies, the software itself, the actual weights, the training data, the release cycle of some of that stuff, or even like how you would attribute risk to some of these things.

Something I'd love to ask about as well is how people who use an open AI model, or I guess a model that happens to be open, assume risk from it, where it's like, "Oh, maybe it said something to a user about eating rocks" or whatever Gemini was telling people last year.

You know, like who assumes the risk for that? Is it the open consumer of that model? Is it the person distributing it? Is it the person who's like a quote, unquote "license holder"? Very hard to tell right now.

Avi: Right. And I think, I mean this is one of those spots where I do think that the blueprints that we have from traditional open source are very clear, which is that this is provided as is with no warranty.

And so it's just, you know, you use it at completely your own risk.

But ultimately, I think with a lot of these models is that it's pretty hard to run it on your own machine, sometimes you need one of these services to kind of assume that risk and it poses challenges and questions about like how decentralized and federated and permissionless can all this stuff really be in the limit if everything with AI gets computationally more intense over time where it kind of prices out individuals who are really doing anything useful?

There's so many questions here and so few answers. I do think one thing to mention though as well is that I do know that the Linux Foundation does have a sub-foundation called LF AI & Data and I know that there are some initiatives around some aspects of open source AI from these foundations.

I don't know all the details of precisely what they are doing and how these relate to the things that the OSI is doing, but a lot of different people are working on this, and so it'll be interesting to see what comes out of these other initiatives.

Brian: Cool. Yeah, I honestly didn't know about the LF AI & Data foundation, but yeah, sounds like I should be reaching out to them and have them on the podcast as well in the future.

Avi: Yeah, I would recommend.

John: I did have a question, and maybe this is a bit of diversion from talking about AI, but I am very curious from Scarf's perspective, how the direction of the business of open source is going.

You know, maybe a personal worry of mine is that it was more of a 0% interest rate phenomenon where there was all this good tech and good engineering resources going into open source, and Scarf has the numbers.

So I'd be very curious for you to tell me like, "No, it's great, it's alive and well. The business of open source is okay and there's still lots of companies putting resources into these things."

Avi: Yeah, I guess so from the customers that we work with, and the projects that we work with, I think there's different ways of interpreting this. I think with any amount of data, that when you really get into scale, there's so many different lenses that you can look at all of this stuff.

And so, on one hand, like we even work with really large companies that I didn't even really realize have a very open source forward strategy and they very much do and are continuing to invest in it.

That includes AI companies and chip manufacturers and these other companies that really do continue to invest really significantly in open source.

On the flip side to that, I think is that what we do see in the data as well is that there's so many companies that make software that have such a huge impact and have so much adoption throughout the Fortune 500 and the public sector, and their businesses are struggling at the same time.

And you know, I feel very fortunate to have been able to make really meaningful progress on helping these companies do better by giving them more visibility into where their commercialization opportunities really are.

But it remains really hard and I still don't think there are great playbooks for the right way to run open source forward businesses.

Like, it's just not as well-traveled a path as it is for traditional enterprise software, where there's business case studies and all the classical resources that business owners have. Like, those playbooks still don't really exist for open source.

I don't think we really know what the best ways are, although people are starting to write about this stuff more and more, which I think is great.

And interestingly like, you know, we've talked to some companies that have a lot of open source traction and they have like planned to move away from open source and invest more in their cloud offering or these kinds of things.

And then once they start really seeing, once they actually have data visibility, we've seen them actually walk back on those choices because there is such a rich gold mine of data there that actually can help them run their businesses more effectively.

And so that's a very wishy-washy answer to your question, but I think that there is very much still a bright future in open source for business, but I think it's going to remain very challenging.

I think that it's one of those things that it's definitely more effort to do, but I think that in the long run, it will lead us all to a better place and better software and more freedoms for everybody, but I don't think it's going to be an easy road.

Brian: Yeah, so I did want to get us to the reads, but honestly I think this is, there's a lot of conversation here and I'm actually looking forward to what this ends up turning into for the OSI and the definition.

I'd love to see something out there in the open, but it definitely seems like perhaps OSI might be behind the mark on the timing and like what it's actually representing.

But yeah, I think I would love to have the conversation with OSI and folks involved with this in a future conversation. But for now, I got to ask a question, are you all ready to read?

Avi: Ready.

Brian: Excellent. So yeah, these are going to be reads, good reads, things that we've been seeing around the internet, appreciate having the definition conversation.

Switching gears a bit and we've got a couple reads, John, you've got some reads as well. Did you want to explain what you have on the list?

John: Yeah, I have actually two reads today. One is maybe a relevant topic where the Internet Archive, maybe people saw this, was actually hacked where some 31 million accounts have been breached from the Internet Archive.

And now, that's all showing up on Have I Been Pwned, so please, if you were using the Internet Archive, please go to Have I Been Pwned and check your account, change your passwords, do all that stuff, but pretty insane.

Big DDoS. And then some of these like just JavaScript alert things popping up on the Internet Archive.

Brian: I didn't get a chance to read this before the show, but what was the vulnerability? How'd they get inside the archives?

John: I don't think they've disclosed any of that yet. From what has been getting tweeted out by BleepingComputer and some of the people in the know, I think it's still actively being DDoS'ed.

I think yesterday I couldn't even get on it. I think they had brought everything down kind of as a measure to preserve what they have left, but very unfortunate, kind of wild that such an amazing piece of internet resources and infrastructure is being attacked in this way.

Avi: Both a breach and a DDoS is brutal.

John: Yeah, all at once.

Avi: Hugs to the Internet Archive folks.

John: The other read that I have is a pretty deep dive, but I thought was very interesting. It's titled, "Can You Get Root With Only a Cigarette Lighter?" by David Buchanan.

And this was a fascinating read where this person did a bunch of hardware hacking, modifying some of the RAM pieces on their motherboard bus, and then clicking a lighter to cause a hardware fault and inject a bunch of shellcode to get root on some Linux laptop.

So not worth going into because you know, there's like 1,000 words about memory pages and how virtual memory works and all that stuff.

But a great read, and a good reminder that if your hardware disappears and you're unsure if it's still safe, you know, maybe there's a hardware exploit on it now.

Brian: You know, actually, I did see that come through the Hacker News feed for myself. But I didn't actually read it, so I'm going to read that shortly after.

Actually, I have a couple reads as well and I'm going to share. So The Pragmatic Engineer is a newsletter, it's also an individual, I guess, Gergely, I've probably butchered the name, apologies, I don't know if I've actually said that out loud before, their first name.

But I guess they have a newsletter but then they do a podcast as well that has these sort of fireside chats with founders and engineers, mostly engineers that have been in the industry for a long time. And they did one with the CEO of Sourcegraph, Quinn Slack, and it was a great conversation.

To be quite honest, I've never actually listened to podcasts. I always just read it. It's just a thing that I've just never thought, "Oh, let me listen to the podcast." Instead, I find that they make really good reads.

So I read the interview and the one thing I'm going to take away from this is Quinn had mentioned 200 was the point at which they stopped doing just flat salaries for everyone in the company and actually started indexing based on location.

And I come from, I worked at GitHub previously and we had index-based salary based on a region as well. So pretty much a norm in the world I came from.

But I thought it was interesting because in the same conversation we're having right now with the return to office, Twilio just recently, and this was my second read, had the CEO tweet and also post on LinkedIn that their return to office is remote only.

So like, still have an office, but like they're removing the restriction of like trying to get people back in San Francisco and other places.

So the exact opposite of the prior week, or two weeks ago, with Amazon's return to office five days a week. Apple famously is currently at four days a week now, but everyone's slowly incrementing.

And I'm actually curious, like for both of your feedback, John, do you have personal experience previously working at Amazon and this whole remote work region-based salaries?

John: Yeah, I had joined AWS originally, back in, oh gosh, 2022 sometime when it was all remote. You know, still the pandemic work style. And I think ultimately, these businesses are going to do what's best for these businesses.

Whether that means that they need to preserve some of the real estate investment that they've made by bringing people back into office, or by, yeah, ensuring that they can still hire competitively and effectively across different regions as their engineering orgs continue to scale.

I think that makes a ton of sense for them. I don't know if I always believe, I guess candidly like this is the best for work-life culture and this is the best for us to like continue to innovate.

I think the read between the lines is really that it's what's best for the business and I actually really appreciated that Quinn Slack said outright, like, "You know what, this is what was good for the business so we can continue to hire competitively and make sure that salaries are fair across the engineering org and that works for employees and works for the business."

Brian: Avi, you guys bringing everyone back to Scarf HQ or are you sticking remote?

Avi: No, we're sticking remote, but I am a big believer in the benefits of in-person work when you can do it as well.

And so, you know, like we just did a team offsite in Chicago a couple weeks ago, which was really great and was super effective for us and I wish that we could do it more.

And so, you know, I really do prefer when we can hire folks that are local to the Bay Area, but you know, some of the best engineers are not here and I don't want to not hire them just because they don't live here.

And so, for us, I really like the more hybrid approach where you can have a little bit of both, but I do think that every business is a little bit different here and like, you know, for us, the very first time we were really hiring was in 2020 like peak pandemic where we just had no idea if we were ever going to be able to do something in person in the office.

And so making huge culture shifts can be very painful and that hasn't really been the right thing for us. And so we're going to continue with a remote culture ourselves.

John: Yeah, even to play devil's advocate to my own point, like I've seen incredible work done in those in-person settings, like at OpenSauced, we had this kind of conference offsite thing we did in Miami last year and that was pretty, pretty wonderful and foundational for a lot of like the AI stuff we did end up building.

So yeah, I could definitely see the benefits. What about you, Brian? What's your hot take?

Brian: My hot take, I love being in the office, but I've been working remote for way too long that I don't know, it'd be culture shock at this point, but I do enjoy the offsites.

Like at GitHub when I worked there for almost five years, like we had these team offsites, so like, we call 'em mini summits. So very similar to what we just explained with Miami and Chicago.

You get the team together, it's like your immediate, we didn't use this term at GitHub at the time, but some people would call 'em tribes. So like the folks you interact with on a regular basis, you bring your tribe to the city that's the easiest for everyone to get to, sometimes San Francisco, and you spend a week together working.

And the benefit of that is like you can understand people's humor. Like async culture is great, but like you don't know if someone's telling a joke or not unless you actually meet 'em face to face.

So like you can work through all those sort of cultural road-- speed bumps rather by meeting face to face.

So I don't know, I get the whole Amazon thing as well, like they're going to get some kickbacks from the government by bringing people back to Seattle.

Apple's got a huge campus that people got to go to, and like they got to pay, not just like the employees, but folks who keep the grounds up together, like clean the bathrooms, like serve the lunch. Like, there's an entire economy that would just like crumble if we decided that everyone's working remote forever.

So I think there's a capitalism, like there's a benefit for remote, there's a benefit for in-person. And I think I always say this thing like, let water find its own level and I think that's what we're going to be witnessing for the next year, or maybe couple years, to be quite honest.

So I will mention as an aside, I was walking downtown San Francisco in SoMa, so 2nd Street down to South Park, and it's quite different, like, it's not like pre-pandemic, but I think some things are opening up.

I think there's a sort of push of getting AI companies taking up leases. It's a different vibe.

So if you haven't been downtown, I will just recommend folks, go get your lobster roll at Ed's Lobster or whatever that place is in the corner and hang out at the LinkedIn building 'cause that's where I work every now and then is bottom floor.

Well, that sort of rounds up our conversation. Appreciate you all bringing some reads and commentating on it, Avi, and listeners, stay ready.