Ep. #11, Frictionless Observability with Yechezkel Rabinovich of Groundcover
In episode 11 of How It's Tested, Eden Full Goh sits down with Yechezkel Rabinovich of Groundcover to delve into the evolving landscape of observability. They explore the high costs of early observability measures and how Groundcover aims to make these processes more accessible and affordable. Yechezkel shares insights on eBPF, the rise of Flora, and the impact of using an open-source stack. Discover how Groundcover's innovative testing methods and commitment to metrics are reshaping engineering practices and what the future holds for this pioneering platform.
Yechezkel Rabinovich is CTO and Co-Founder of Groundcover. He was previously Development Manager for The Office of the Prime Minister of Israel. He holds a degree in electrical engineering and physics from Tel Aviv University and a degree in biomedical engineering from the Technion – Israel Institute of Technology. Prior to founding Groundcover, Yechezkel was Chief Architect at the healthcare security company CyberMDX.
Transcript
Eden Full Goh: Hey, Chez, thank you so much for joining me on the How It's Tested podcast.
Yechezkel Rabinovich: Hey, thanks for having me. Pleasure.
Eden: So, really excited to dive into what you've built with Groundcover and with Flora today. I think we have a really exciting story to share based on what you and your team have been hard at work building.
But I think maybe something that is always helpful to orient our audience here is telling us a little bit about how you co-founded Groundcover and maybe your background before you joined or started Groundcover.
Yechezkel: Yeah, sure. So, I've been a developer for about 15 years now, mostly around the Linux kernel and embedded systems for the last 8 years. I've also worked in high-scale backend data processing.
I actually had one particular task. I had to analyze a data pipeline we had at my previous employer, and we just couldn't figure out what the problem was. So, we started with all the classic observability approaches. We added SDK instrumentation and Prometheus metrics and all that, and it took us months and we didn't solve the problem. It was really, really hard.
And eventually, the bill came from those observability platforms, and we had to remove all those instrumentations and all those metrics because it was too expensive. So, I talked to Shahar, who is my best friend, and we figured out something was definitely broken here. And we founded Groundcover, which is the first frictionless observability platform, frictionless in the sense of eBPF.
Not a lot of people know about eBPF. It's a kernel feature that allows you to instrument your application without changing your code. You can think about it as a sandbox in the kernel that allows you to inspect applications while they're running, so you can deploy it instantly on a 100-node cluster and it just works. And it's very, very safe from the kernel's perspective, and very fast. So, that was our first goal.
And the second goal was to reduce costs. So, we're basically using a lot of cool sampling algorithms that allow us to decouple the data volume from the pricing, which is very compelling to our customers, because they don't need to worry about log lines and their costs.
No one really knows how many log lines they're going to have at the end of the month. We don't care. It's the same price.
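Groundcover hasn't published the sampling algorithms Chez alludes to here, so as a rough, hypothetical illustration of how sampling can decouple cost from data volume, classic reservoir sampling keeps memory (and therefore storage) bounded no matter how many log lines arrive in a month:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of at most k items from a stream
    of unknown length, using O(k) memory regardless of volume."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing item with probability k/(i+1), which
            # keeps every item in the stream equally likely to survive.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# However many log lines arrive, retained data stays bounded at k.
logs = (f"log line {i}" for i in range(100_000))
sample = reservoir_sample(logs, k=1000)
```

Real observability pipelines use far more sophisticated, error- and tail-aware sampling than this uniform sketch, but the economic point is the same: the retained volume, and hence the cost, is fixed up front rather than growing with traffic.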
So, yeah, that's the story of Groundcover. Sure.
Eden: Yeah, it'll be exciting to see more engineering teams adopting Groundcover and it sounds like, yeah, you started to approach this problem to solve, 'cause you personally experienced the inconvenience of instrumenting all the logging, and then having to remove it, because of the price tag.
Do you feel like over the next few years there is going to be a change in perspective around either the way that these observability platforms are going to price, whether they're going to start thinking along the same lines that you and the Groundcover team are thinking or how do you think like the end user engineer is thinking about this?
And it sounds like, just to make sure I understand correctly, a lot of the pricing right now is around like number of lines and there's been no way to optimize or streamline that previously.
Yechezkel: Yeah.
I think the whole market is broken in a way, because you have huge public companies built around this pricing model. And basically, they treat themselves as a data lake. It's like a kind of BigQuery or data warehouse for observability. And the problem is that with that approach, you're not aligned with the customer, because the customer doesn't really want to query 20 billion log lines. The customer wants to answer: what's the problem, what happened, what changed?
And if vendors don't have any incentive to make this process more efficient, they're just going to encourage customers to send more and more data, garbage data if you like. And everybody wins besides the customer, besides us developers who actually want the data to be efficient and at a reasonable price. So, I think it's definitely going to change.
This is not a modern approach to how we want to consume a service. It's not correlated to business metrics. Log lines can be added accidentally: one developer added one log line in a hot path, didn't know how bad it was going to be. Boom, 20% increase in their observability bill. What about the company? Did the state of the business change? No, it didn't.
So, I think we're going to see more and more companies adopting this. We've already seen companies like Honeycomb trying to explain these kinds of changes for the last few years, and I think recent technology changes actually make it a lot more approachable and accessible. So, yes, I think we're going to see other vendors starting to change their pricing models.
Eden: So, you guys are very focused right now on eBPF, which is this up and coming technology that the IT industry is really starting to rally around. Are there any other alternatives or other options besides eBPF that the industry is considering or you feel like this is where the future is going?
Yechezkel: That's a good question. I think eBPF is one way to collect data. It's very easy, very efficient, but it's not the only way. You can always instrument your application. And if you instrument your application, it's going to be a lot harder, but you can capture a lot more nuance.
When you instrument your application, you can decide you want to add some specific attributes that you have in your code. Maybe you want to add a label about your customer or something that is not reflected in the API.
So, I think instrumentation and eBPF going hand in hand is the way to go. You probably want eBPF to cover your entire environment, and you probably want OTel instrumentation to pinpoint the very specific points that you really, really care about in terms of business metrics. The combination of the two is amazing, and all of our customers are actually adopting it. We can also see how much of their production is instrumented and help them get to the 20, 30, 40% that they actually need. But it's a slow process. Instrumenting your application can be a very slow process.
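The division of labor Chez describes, eBPF for blanket coverage and manual instrumentation for business context, can be sketched as follows. This is a hedged illustration, not Groundcover's or OTel's actual API; the function names and attributes are hypothetical:

```python
# eBPF-style capture sees only what's on the wire, but sees it for
# every request, with zero code changes in the application.
def ebpf_captured_span(method, path, status, duration_ms):
    return {"method": method, "path": path,
            "status": status, "duration_ms": duration_ms}

# In-process (OTel-style) instrumentation can add attributes that are
# invisible on the wire, such as the customer tier or a feature flag.
def enrich(span, **business_attrs):
    return {**span, **business_attrs}

span = ebpf_captured_span("GET", "/checkout", 200, 41.7)
span = enrich(span, customer_tier="enterprise", feature_flag="new_cart")
```

The point of the combination is that the first function needs no developer effort and covers everything, while the second is applied only at the handful of spots where business semantics matter.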
Eden: And so, I know that with the solution and your flagship platform, I think it's called Flora, right?
Yechezkel: Yeah.
Eden: How are you guys building that observability agent? Tell me a bit more about what Flora is, who's using it, and what's been the adoption so far?
Yechezkel: Yeah, so Flora is our sensor, our eBPF sensor. We deploy it as a DaemonSet on the Kubernetes cluster, and instantly you get all the data. We're talking about two minutes, and you see all the APIs going in and out of your cluster.
It was a very hard beginning, because we needed to build a framework that allows us to add support for more and more protocols and runtimes. The technology landscape today is very wide, and we're seeing 20 types of databases, seven messaging queues, different DNS protocols and versions, encryption libraries.
So, we had to build this framework to allow us to build fast and to grow with our customers. And that's basically what we're doing. We're trying to make sure that we maintain our SLA in terms of how much CPU and memory we're consuming, and still move fast and add more and more support for new protocols for bigger, newer customers.
Eden: That's awesome. And so, when did you guys launch Flora or are you in the process of launching it? Because I also know you co-founded this startup a couple of years ago, is that right?
Yechezkel: Yeah, so at the beginning, we experimented with open source tools. Why not take someone else's code that already does that? But when we tried that in production with our first customers, we saw that the performance hit we got was not good, and we couldn't make it production-ready. So, we figured out we had to rebuild the entire thing from scratch and redo it. We released Flora last April.
Yes, a year ago. And from then, we kept adding more and more features. We actually improved our performance a lot over the last year. But we started by benchmarking Flora against all the other sensors that were on the market at the time, just to make sure we were on the right track. We wanted to make sure our theory about how to make it faster and more efficient would hold in real-life scenarios.
So, since then, we're constantly optimizing, constantly improving hot paths in our code, because at the end of the day, we want to support huge clusters with huge data volumes. You can imagine that in some industries especially, like gaming and AdTech, we're seeing dozens of millions of API calls per second, and we need to ingest all that, parse it, correlate it with metrics, understand baselines, detect errors, send it to databases, all of that.
And all that happens at the node level on our customers' infrastructure, and we don't want to interfere with it. So, there's a lot of hard work on performance in there.
Eden: When you're engaging with a new engineering team or a new customer, because I know you had mentioned like you can basically go head to head with Datadog or OpenTelemetry or I think New Relic as well. I saw in the blog post that you guys wrote about like Flora's capabilities.
Do you actually engage with customers where they can be using both, for example, Datadog and Flora side by side and you're able to demonstrate very obvious efficiency and improvements compared to that? Or how do engineering teams get to know Flora assuming they're using an existing observability platform already?
Yechezkel: Oh, yeah, absolutely. It happens a lot. We're talking to an engineering team, they're using New Relic, and they say, all right, let's run Groundcover, it's easy. And two minutes later, they can see a head-to-head comparison between their New Relic agent or SDK and Flora, Groundcover's sensor.
What's also interesting is that one day they can remove the New Relic SDK or Datadog SDK, and they see another benefit: their latency decreases. So, they can also measure things after they remove the instrumentation. And most of the people we talk to already know that. They already made this kind of trade-off.
They knew it was going to happen when they added the instrumentation, but they didn't have any choice. So, they come to us saying, hey, we heard we can reduce this overhead with you guys. Can we check it out? So, that's pretty fun, right?
Eden: Yeah. And I can imagine also for this technical user, this technical buyer and decision maker, when they are making the decision to adopt a new tool, everyone is very quantitative and data-driven and to be able to do the side by side and just see a direct comparison and see how Flora is better, I can imagine it's a no-brainer in terms of a decision at that point.
Yechezkel: Yeah, definitely. Look, it's not an easy transition to move observability platforms, that's for sure. You have all your dashboards in it, you have your alerts. It's not fun, it's not easy. We know that. We do make it a little bit more fun. With eBPF, we remove all your instrumentation needs. So, that's gone.
Eden: Mm-hmm.
Yechezkel: Also, you probably have more data now than you did instrumenting your application. Seeing new things kind of gives you a bit more motivation to proceed with the transition.
And when we started Groundcover, we knew we were going to have a problem transitioning companies from different tools. So, we built the entire platform to support open source interfaces. We integrate easily with Grafana, and all our metrics are PromQL-compatible. We're using an open source stack so people can see what we're using and leverage it with their own tools and knowledge.
That also helps, because we're seeing a lot of companies using Datadog or New Relic, but they also use Grafana to consolidate all that data. So, those dashboards can be imported in seconds. It's not easy, but it's definitely worth it, not just in terms of pricing, but also the value you get.
Because you literally see everything. And the scariest thing with testing is what you don't know you don't know. Those APIs that you don't know about, where you don't know what happened in your cluster. Those are the scary ones. So, yeah, you don't have those with Groundcover. That's a huge factor.
Eden: Yeah. So, switching gears a little bit. The reason we originally got connected to do this podcast episode was because I heard that as you were developing Flora, you had to put together a pretty rigorous and robust test bench and test process. So, tell me a little bit about that. Was setting up that test bench something you decided as a team to do before the launch or after the launch? And what is the process for testing?
Yechezkel: Yeah, that's interesting, because most of the company comes from the security industry. And when you develop sensors for security, anti-malware, antivirus, or endpoint protection, you need to face a hundred variations of operating systems, different Windows versions, different kernel drivers, all that.
So, basically every security company has a lab that runs automation on hundreds of devices with different configurations. You probably know about that a little bit more than me. So, we've all built these kinds of labs in our history.
When we started Flora, we knew we were going to need that kind of lab, but with one crucial difference: we could now use Kubernetes. So, we basically created a virtual lab that spins up 20 clusters with variations of workloads and different kernel versions. You can spin it up in 10 minutes, run your deployment, run your benchmark.
We also run load tests with tools like k6, and obviously Groundcover itself, just to measure API latency and all that. Everything spins down after the test, and we have the results. And we did that for every PR we wanted to. It's a slow process, because it's a 20-minute session, but very, very efficient at finding our sweet spots. Is it a specific kernel, a specific library, a specific protocol?
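k6 reports latency percentiles for you, but as a minimal, self-contained sketch of the kind of check such a benchmark session reduces to (the sample latencies and the budget below are invented for illustration), a nearest-rank percentile with a pass/fail threshold might look like:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies collected during one benchmark session (illustrative).
latencies_ms = [12.0, 15.5, 11.2, 240.0, 13.1, 14.8, 12.9, 16.0, 13.7, 12.4]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Fail the PR's benchmark run if the tail regresses past a budget.
assert p99 <= 500, f"p99 latency {p99}ms exceeds budget"
```

Tail percentiles like p99 matter more than averages here: a single slow hot path (a specific kernel, library, or protocol) shows up in the tail long before it moves the mean.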
We had all those metrics, synthetic metrics, and we could always find another hot spot to solve. Having said that, we also knew that no matter what we did in the lab, out there it was going to be a bit wild. On top of that, we took a very unique approach, I think: Flora sends home a lot of telemetry in terms of metrics. A lot.
And I'm talking about probably 100 million metrics ingested into our stack every 30 seconds, to analyze very different workloads, very different cloud providers, different machine types. So, even after passing the lab benchmark, we validate constantly. With every release we do, and it's usually once a day, within three, four hours we have very addicted users who actually deploy Groundcover 20 seconds after a new release.
So, we already have metrics and benchmarks on that version. We built a report that shows us previous and current and where the hotspots are, and I think that's the main driver for taking those benchmarks from the lab to real life. You have to have both, because the lab is very repeatable, which is important when you try to optimize a specific use case. But real life is very important, because that's what really matters. So, we use those two and leverage both.
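The previous-versus-current report Chez mentions isn't public, but as a hedged sketch of the idea (the metric names, values, and tolerance here are invented), flagging release-over-release hotspots could be as simple as comparing each metric against the prior release with a tolerance:

```python
def find_regressions(previous, current, tolerance=0.10):
    """Return metrics whose value grew more than `tolerance` (as a
    fraction) from the previous release to the current one."""
    regressions = {}
    for name, prev_value in previous.items():
        curr_value = current.get(name)
        if curr_value is not None and curr_value > prev_value * (1 + tolerance):
            regressions[name] = (prev_value, curr_value)
    return regressions

# Resource and latency metrics for two releases (illustrative values).
previous = {"cpu_pct": 2.0, "mem_mb": 150.0, "p99_ms": 18.0}
current  = {"cpu_pct": 2.1, "mem_mb": 190.0, "p99_ms": 18.5}
hotspots = find_regressions(previous, current)
# mem_mb grew ~27%, past the 10% tolerance; the others stayed within it.
```

In practice such a report would compare percentile distributions per workload and machine type rather than single scalars, but the shape is the same: the lab gives repeatability, and the fleet telemetry gives the real-life numbers to diff.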
Eden: So, I know a part of your test setup is that you have to simulate thousands of requests per second, representing realistic usage of Flora. How did your original pre-launch assumptions about what you needed to test for and design for change post-deployment, post-launch?
Did you have to change the test setup? Any insights or anything that you learned after actually getting a lot more adoption and ramp up from users?
Yechezkel: Yes, sure. So, when we started, we took common websites that serve APIs, and we thought that would probably represent how people build their APIs internally. So, we scraped 20, 30 websites that serve APIs, learned the patterns, and recreated them. But you'd be surprised to see that people do very crazy things.
So, we saw Redis commands with 30K objects being sent as standard. We thought that's not how you're supposed to use Redis, but that's what happened. We saw 50-meg HTTP requests being uploaded frequently. So, we had to make our lab a bit crazier after we saw that. But that's when we realized that no matter what we see, no matter what we build, there's always going to be another use case.
So, at that moment, we went all in on the metrics from the sensor itself. So currently, you can imagine, we have 100,000 sensors sending us metrics and telemetry all the time, profiles and metrics that allow us to slice and dice CPU bottlenecks against business metrics: the amount of requests, what type of requests, and what sizes, in terms of percentiles.
So, it helps us and the product team understand what the weaknesses are and whether we care about them. Maybe it's fine that we're not great on an SQL insert of 20 gigs once an hour. Maybe it's fine. So, we understood that a lab is always going to be very limited compared to a product that captures production environments across a lot of different use cases.
Eden: Do you think that as engineering teams adopt Flora, or adopt more observability platforms like Flora, there's going to be a reckoning in terms of people cleaning up their code hygiene or engineering best practices? How do you see that evolving over time?
I've looked at the logos and the customers that you work with. Some of these are top-tier companies and firms, like Viya, I think I saw, and Geosite and other companies like that.
How have they changed engineering team behavior, product team behavior as a result of implementing Flora or Groundcover's technologies?
Yechezkel: I don't think we're out to make all the code in the world cleaner. I don't believe code should always be performant or always fast, because it depends on what you're trying to do in terms of business metrics. But what I do believe is that as engineers, we want data to make decisions, and I think before Flora, you could have a lot of assumptions.
You think, oh, those are the bottlenecks. It's a very big HTTP request. And people will say, hmm, that makes sense, but now you can prove it, now you can see it. And it's like, no, it doesn't correlate with payload size. Hmm, interesting. Maybe it's a specific user; let's filter by header. Hmm. Yeah. I can see very different types of customers having different latencies. Okay, that's interesting. And it lets you dig deeper.
I think that's what we're trying to bring. First of all, it's bringing all the data in a very economical way. And the second is letting you analyze the data in a very efficient and modern way. You don't want to just run SQL queries. It takes time.
You want context. You want to be able to slice and dice groups, separate them by values, and visualize that in different graphs, to get to the insight that you're really after. And that's something we definitely change when we deploy our platform for our customers.
Eden: Yeah, that makes a lot of sense. Switching gears a little bit, I'm curious, what is your philosophy when it comes to building and scaling your engineering team and the way that you resource the team to build Flora and to build other tooling around Flora?
We talked a lot about the underlying platform and there's also this simulated environment that the team is investing in.
Do the same engineers who are building the platform also write tests and design the testing protocols or are there different roles on the team for engineering? How do you think about building and scaling your team?
Yechezkel: First of all, I think to start something like Flora, to incubate this new technology or new concept, you need to be very conscious about the number of people you put on it. You need to protect those people in terms of focus and context.
So, I think three people is the right number. For a long time, you probably want to get to the first version with three people max working on it constantly. With more than that, you're going to have to separate the task into logical units, and that's not what you want to do at the beginning.
You want everyone to have the entire context, because everything is related. You don't want to get stuck in local maxima. So, it's the number of people, and obviously it's the people themselves. You need engineers who can think about a very wide spectrum of the problem, because the solution could be an algorithm, but it could also be another tool or a library or a tech stack that you choose.
You want experienced people, and you want to give them all the context they need. And if possible, even lock them in a room and don't rotate them for a long time. But on the other side, you want to expose the product as soon as you have something.
And to be honest, the first version of Flora, prior to version one, we only shared with our bravest customers. We said, look, this is not really ready. We need to test this. We are running it in production, but you probably shouldn't. But if you're up for it, let's do it. We can run it for a day or two, collect metrics, collect insights, collect bugs and feedback from you guys, and get back to you.
And they earn the ability to impact the project long-term, because they're the first users. We did that for, I think, three months, experimenting with beta users. By that time, we already had paying customers, so it was really hard to change the engine behind it, but we had full support from our customers to push it once they saw the improvement.
And I think that was the secret sauce. They saw that we are living up to the promise that we told them.
Eden: That's awesome. Yeah, I think in these early days, the early adopters, the customers that are fully bought into the vision, they're patient, they're willing to be design partners, they're willing to give feedback.
Similarly at my company, Mobot, we're also building tools for engineers and product folks and QA folks, and yeah, those folks have an understanding of what it takes to build a real solution that solves a problem. And they like being able to participate and be informed as part of that process.
Yechezkel: Yeah.
I really like working on products for engineers. It's something very magical to relate to the person that you're actually selling the product to. It's almost like you don't really sell it. It's like you celebrate it together.
Eden: Yeah.
Yechezkel: You both get excited about it.
Eden: Yeah, I feel that way as well. I guess in terms of the future of Groundcover, you have a number of customers and partners already and the platform is really taking off and there's been more adoption. Where are you guys going from here? What do you think Groundcover will look like in the next two, three, five years?
Yechezkel: Oh, wow. The first year, we focused on getting the data. And now, we're really focused on the user experience. So, how do you make sense of all that data in a way that will be very, very delightful for our users?
Next year will probably be a year where we expand the platform to ECS, EC2, basically any machine that you run in the cloud. Because our customers are usually heavily based on Kubernetes, but they have those EC2s running somewhere. They have some ECS tasks. We've already started doing that.
So, we're already ingesting data from serverless and CloudWatch and other integrations. But we want to double down on that. We want to be a one-stop shop for all your observability data. And also, we want to change the way monitoring and investigation work. Nowadays, monitoring is like, oh, just pick a number.
How many active time series should you have, or how many inserts to the database should you have? And to be honest, once you've onboarded another customer, that number is already obsolete. And we build dashboards and alerts the same way we did 10 years ago.
You build a dashboard every time you have a problem, and then you have a hundred dashboards and none of them fits the new problem you're trying to solve. So, we're going to innovate in that field as well, because once you have all that data, it makes sense to take it to the next step and offer a new investigation flow, or a more modern root cause analysis.
Eden: I'm really looking forward to continuing to follow the journey and following your blog, following just like all of the company updates in the upcoming quarters. I think there's a lot of need in the market for what you guys are offering and it totally makes sense to expand, I know, into EC2 and some of these other instances. I think that's going to be a great extension to the platform.
Yechezkel: Thank you. Amazing.
Eden: Yeah. Really appreciate you taking the time to join me on the podcast. We are always looking for a diverse range of different engineering teams, building different products, different platforms, different ideas, and it was really interesting to hear about your unique approach to building Flora and also the testing and the process of developing and launching it.
I think a lot of our audience is going to find that really interesting as well. Anything else that you'd like to share while we have you here on the podcast?
Yechezkel: No, I think if you haven't experienced eBPF yet, you should. It's magic, it's beautiful. You can check out our platform, you can check out other platforms or tutorials on how to get started, but it's a very, very impressive framework that I think every developer should experiment with.