Ep. #76, Managing 200-Armed Agents with Andrew Keller
In episode 76 of o11ycast, Jessica Kerr and Martin Thwaites speak with Andrew Keller, Principal Engineer at observIQ, about the evolving world of telemetry and observability. They explore the role of OpenTelemetry, the groundbreaking Open Agent Management Protocol (OpAMP), and how open standards are shaping the future of data pipelines. Learn how observIQ’s innovations help standardize telemetry collection and empower organizations with dynamic, scalable tools for managing their observability ecosystems.
Andrew Keller is a Principal Engineer at observIQ with over eight years of experience in telemetry and observability. He has been a driving force behind key innovations, including contributing observIQ’s Stanza log agent to OpenTelemetry and co-developing the Open Agent Management Protocol (OpAMP). Andrew is passionate about open standards and creating tools that empower organizations to manage their telemetry pipelines with ease and flexibility.
Transcript
Andrew Keller: We've been in the telemetry space for a long time, and historically, every vendor has kind of done their own agent solution responsible for collecting metrics and logs and traces, and you end up with really these individual silos.
And it turns out that customers want to send data to lots of different backends, and they want more control over the collection of their telemetry. They may change backends over time.
Jessica "Jess" Kerr: What do we mean by backend?
Andrew: Like the telemetry backend, like the place where you have the dashboards and the alerts and the actual telemetry solution.
Jess: So like the database where the telemetry is stored for access by pretty things.
Andrew: Where it's going, right.
So what OpenTelemetry allows you to do is standardize the collection of that data and then send it anywhere. It turns out vendors don't really love writing the agent and the collection part of it anyway.
There's always this challenge of trying to provide as many integrations as possible, connect to as many databases as possible, connect to, you know, message queues, and 200 seems to be about the right number that everybody's sort of always targeted as a number of integrations.
Jess: Wait, wait.
Andrew: And there's always some esoteric things that-
Jess: 200 what?
Andrew: Integrations with the different platforms you're collecting data from.
Jess: Oh, like on the input side.
Andrew: Right, right.
Jess: Okay, okay, so get data from 200 different places, send it to half a dozen data stores?
Martin Thwaites: I mean, that 200 is just the JavaScript frameworks from last week. There's another 200 this week, so.
Andrew: Indeed. Yeah, so OpenTelemetry provided an opportunity for the community to come together and standardize the collection of telemetry and still have their own...
You know, vendors all individually have their own telemetry backends, where, again, the customers can collect and observe and create alerts, et cetera, look at their telemetry data.
But the actual collection part of it is hard, and it's nice to have the community coming together and building one solution for the collector.
Jess: One solution with 200 arms.
Andrew: Right.
Martin: What would that be? 'Cause an octopus has eight arms. What is the term for 200 arms? I mean, is there...
Jess: There's got to be a Latin word for 200, but I could ask my kids.
Okay, so now that we have an agent with 200 arms, and the trick there is how do you control them?
Andrew: Right, and traditionally, you know, we would have like a YAML file for configuration, and you would figure out a way to automatically deploy that with your agent. The challenge is that if you want to make changes, you need to modify and redeploy that file.
We found there's a lot of good DevOps tools for deploying things en masse and at scale, but it really helps to have a direct connection to the collector, to the agent, so that you can change its configuration on the fly.
It helps with a couple of things. One, you want to just understand what all you're collecting. Somebody may have installed a collector on some test box somewhere, and it's sending a ton of data, and you're being charged for the consumption of that data. You may have forgotten about it. It's just running over here on this other box.
And so if that box is connected to an agent management platform via OpAMP or via really any agent management protocol, but obviously we're here to talk about OpAMP, that gives you visibility into what your agents are actually doing, where they're deployed, what kind of configuration they're running, and the ability to change that configuration dynamically.
Martin: So I think that would be a great place for you to tell us why you care about OpAMP. What is it? Who are you? And why is it you're interested in this kind of stuff?
Andrew: Yeah, so I'm Andy Keller. I'm a principal engineer at observIQ. I've been with observIQ for eight years. The company's been around longer than that, and we've always been in the telemetry space.
The history, there's the long version and the short version. I'll try to give you the medium version, which is, we had our own Java-based collector and implemented our own protocol for managing it.
We then built an open-source log agent called Stanza, which we subsequently contributed to OpenTelemetry. It's the log agent in OpenTelemetry.
It's changed quite a bit since we donated it, but it's where it began. And we had our own protocol for that agent. So we started out with an HTTP-based protocol, then a WebSocket-based protocol, then implemented another agent with a different WebSocket-based protocol.
And then we decided to standardize on OpenTelemetry as did many other vendors. But we still wanted some way to control and manage that agent remotely. And it turned out lots of other people did as well.
So OpAMP really started out as a collaboration with Tigran from Splunk, who put forward the OpAMP protocol spec. I reviewed it with him and eventually developed an OpAMP Go implementation of the OpAMP protocol.
The opamp-go repo contains both the client implementation and the server implementation of the OpAMP protocol. I'm a maintainer of that repo as part of OpenTelemetry.
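To make the shape of that exchange concrete, here is a minimal, hypothetical sketch in Go. It is not the opamp-go API; the struct and field names below are simplified stand-ins for the protocol's AgentToServer and ServerToAgent messages, whose real definitions live in the opamp-spec protobufs.

```go
package main

import "fmt"

// Simplified stand-ins for the OpAMP messages. The real AgentToServer and
// ServerToAgent definitions live in the opamp-spec repo; these hypothetical
// structs only illustrate the shape of the exchange.
type AgentToServer struct {
	InstanceUID     string            // stable identity for this agent
	Description     map[string]string // e.g. host name, OS, agent version
	EffectiveConfig string            // the config the agent is actually running
}

type ServerToAgent struct {
	RemoteConfig string // a new configuration, if the server has one to offer
}

// handleServerMessage sketches what an agent does when the management server
// pushes something down: apply the new config, then report back what it is
// now running.
func handleServerMessage(msg ServerToAgent, report func(AgentToServer)) {
	if msg.RemoteConfig != "" {
		// A real agent would validate and apply the config here, then report
		// the new effective config and a success/failure status.
		report(AgentToServer{
			InstanceUID:     "agent-1", // hypothetical ID
			EffectiveConfig: msg.RemoteConfig,
		})
	}
}

func main() {
	handleServerMessage(
		ServerToAgent{RemoteConfig: "receivers:\n  filelog: {}\n"},
		func(a AgentToServer) { fmt.Println("reporting effective config:\n", a.EffectiveConfig) },
	)
}
```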
Martin: I think it's worth mentioning as well, like, what does OpAMP stand for because-
Andrew: Good question.
Martin: I love OpenTelemetry. I absolutely love OpenTelemetry, but none of the abbreviations that they give really mean what you think they mean. So what does OpAMP stand for?
Andrew: Yeah, well, I think it means what... It's a good abbreviation. Here's the problem. If you Google OpAMP, go ahead and try it. You'll find it's actually a thing.
Martin: Operational amplifier?
Andrew: Yes, yes. So OpAMP stands for the Open Agent Management Protocol.
Jess: Open not operational, oh.
Martin: Exactly, like-
Jess: Oh, so the AMP is capitalized but the Op is not.
Andrew: Yeah, so the p is lowercase because it's part of "Open." You know, o, a... you needed another consonant in there, you know.
Martin: Somebody needs to fire the people who name things in OpenTelemetry, I'm sorry.
Jess: Naming is half the job of OpenTelemetry. And bless those people like Liudmila who sit there and argue about it for so many hours. OpAMP is totally fine.
It is hard to Google because it is an analog circuit block that takes a differential voltage input and produces a single-ended voltage output, obviously.
Andrew: So, you know, I would say my personal goal is to eventually be the first hit when you... If you're trying to find operational amplifiers, it's going to be really hard in Google because OpAMP is going to rule the world.
Jess: People are so excited about controlling their OpenTelemetry collectors.
Martin: I mean, we've got to have goals in life. I mean, top of Google for OpAMP is definitely up there.
Jess: All right, all right.
Martin: So that's what OpAMP means anyway.
Jess: You'll need to produce some TikToks for that.
Martin: Yeah. I'm here for it. I'm here for it.
Jess: Okay. Okay, so it sounds like you at observIQ have a ton of experience in controlling 200-armed agents even before OpenTelemetry existed.
Andrew: Yes.
Jess: And now, rather than trying to make your tentacle monster win, you've adopted OpenTelemetry and contributed to that instead.
Andrew: Exactly.
To your point, it's not about our agent winning or anybody's agent winning. We are in the business of helping you manage your telemetry pipeline. And the pipeline is really what people care about. They care about the data. They don't care about how the data is being collected.
You know, it's not about can I write a database query better than somebody else can write a... We're all writing the same queries at the end of the day.
We're all hooking into, say, Postgres, to grab metric information, or we're tailing log files. And there isn't a lot of secret sauce to tailing log files, and ideally-
Jess: There isn't?
Andrew: Well, to be fair, there is a lot of subtlety in the file log receiver. So maybe there's some secret sauce, but, like, let's share that secret sauce because that's not the part that people feel like they want to compete on, I guess I would say.
Jess: Okay, so there's totally sauce, but let's make it not secret.
Andrew: Right. I like that, yeah.
Martin: It's like the KFC secret sauce, you know. Eventually everybody knows what it is, you know?
Andrew: Yeah, and that sauce just keeps getting better, so.
Martin: Indeed. I mean, I think that if you go down under everything, I come from the Logstash age where everything was grok. Everything was basically a modified regex for parsing of files.
And the secret sauce was knowing what the regex was, and there were people who kept their regex to themselves, and there was no real goal to that. Everybody should be able to just take some regex and add it in 'cause that isn't the secret sauce. The secret sauce is how do we provide insights.
And I really appreciate like people going for open standards like the OpAMP protocol. And you know, the Stanza thing is really, really amazing in the file log receiver for all of those different operators.
Jess: So, hold on. Andy mentioned Stanza earlier as observIQ contributed Stanza to the collector. How does that fit with the file log receiver in the collector?
Andrew: Well, Stanza was the log agent that became essentially the file log receiver. I guess you could maybe put it that way, but it's probably a little bit more complicated than that. It's the whole log component of the collector, so.
Jess: So did Stanza have operators before it was part of OpenTelemetry?
Andrew: Yes.
Jess: Because speaking of OpenTelemetry naming, why the F does the file log receiver have operators that are completely different from the OpenTelemetry operator, which is a Kubernetes operator? Now I begin to understand.
Andrew: Yeah, because partly because they existed in Stanza before it was part of OpenTelemetry. And part of the reason those operators exist is that it's just efficient to process logs with those operators as quickly as possible before kind of getting them into the pipeline in OpenTelemetry.
Jess: So are they, like, processors? Are they plugins?
Andrew: There are parsers, processors. You can move attributes. You can rename things. You can, you know, basically take an unstructured log and turn it into a structured log.
Jess: Before it gets into the collector's pipeline?
Andrew: Right, Stanza allows you to do that. But an OpenTelemetry pipeline allows you to do that as well, so it's redundant in a way, but there are some advantages to doing some of those operations in the receiver.
Jess: Oh, I think we have a really good example of that right now because you can parse JSON if you have JSON in your log body. You can parse it in the file log receiver, is that right, Martin?
Martin: Yep.
Jess: And you said yesterday somewhere that that was more efficient than doing the parsing in a processor in the collector pipeline?
Martin: Yeah, so I mean, like I said, there's a lot of crossover, and this was something I've been doing quite recently, is looking at how we take those unstructured logs or even just structured logs with JSON and get them into the collector.
And trying to work out where we should use the JSON parser, which is part of the operators and the Stanza stuff, but where we should just take the whole data out as a string attribute into the transform processor, do a parse JSON, do a map, and then condense those things down.
Ultimately, it's more efficient to do that in the file log receiver 'cause we know all of those are going to be JSON rather than just running it on every single log.
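To make the two placements concrete, here is a rough sketch of both collector config fragments, held in Go string constants purely for illustration. The filelog receiver's json_parser operator and the transform processor's ParseJSON function exist in the collector-contrib distributions, but treat these fragments as approximate rather than copy-paste ready.

```go
package main

import "fmt"

// Option 1: parse in the filelog receiver with a Stanza operator, before the
// data enters the collector pipeline. Cheap when you know every line of this
// file is JSON. Approximate sketch; check the filelog receiver docs for
// exact syntax.
const filelogReceiverSketch = `
receivers:
  filelog:
    include: [/var/log/myapp/*.json]
    operators:
      - type: json_parser   # Stanza operator: the body becomes structured fields
`

// Option 2: parse later in the pipeline with the transform processor, which
// runs its statements against every log record routed through it.
const transformProcessorSketch = `
processors:
  transform:
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), "upsert")
`

func main() {
	fmt.Println(filelogReceiverSketch)
	fmt.Println(transformProcessorSketch)
}
```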
So I think there is probably a lot of crossover, and I think this is one of those things where, at scale, it's probably better to do it one way versus the other.
And I think obviously scale is a lot of where the OpAMP stuff is really, really useful, being able to then do all of this kind of, where do we do the parsing? What's the most efficient way to do all of this parsing?
Jess: Oh, 'cause then you can like test this stuff-
Andrew: Exactly.
Jess: Without redeploying your YAML and restarting your pod or whatever. You can be like, "You, collector, start doing this," no?
Andrew: Right.
Jess: Nice because it turns out that each arm has many elbows.
Martin: And the wrist bone's connected to the elbow.
Jess: And the elbow's connected to the next elbow or whatever you configured today. Okay, so how does OpAMP, which is open agent management protocol?
Martin: Yay.
Jess: So this is like a... Is it a WebSocket, no, that the collector's listening on?
Andrew: It supports both WebSockets and HTTP. So with HTTP, it would use long polling basically, and it'll reach out to the server. If the server doesn't have anything to send to the agent, like, say, a new configuration, for example, it will just leave that connection open.
And then eventually if it does have a new configuration, it can send it as a response to that HTTP request. Or if you have a WebSocket, and you have a persistent connection, it can just push down a new configuration whenever it has one.
Jess: Okay, so this collector is like super actively listening for, "What do you want me to do now?"
Andrew: Yeah, I mean, super actively is... It turns out it doesn't take much to listen on a port, so it's not consuming much, but yes, it is waiting for a response, either over HTTP or the WebSocket. And as soon as it gets one, then it can reconfigure itself and start doing something new, so.
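A minimal sketch of that long-polling shape, assuming a hypothetical endpoint and plain-byte payloads; the real OpAMP HTTP transport exchanges protobuf-encoded messages and adds details like retry handling that are omitted here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// pollForConfig sketches the HTTP transport shape: the agent makes a request,
// the server holds it open until it has something to say (for example a new
// configuration), and the agent immediately polls again. The endpoint and
// raw-byte payload here are hypothetical; real OpAMP exchanges protobuf
// messages over HTTP or a persistent WebSocket.
func pollForConfig(serverURL string, apply func(config []byte)) {
	client := &http.Client{Timeout: 5 * time.Minute} // long enough to hold the poll open
	for {
		resp, err := client.Post(serverURL, "application/octet-stream", nil)
		if err != nil {
			time.Sleep(10 * time.Second) // back off and retry on connection errors
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if len(body) > 0 {
			apply(body) // e.g. a new configuration pushed down by the server
		}
	}
}

func main() {
	go pollForConfig("https://opamp.example.com/v1/opamp", func(cfg []byte) {
		fmt.Printf("received %d bytes of new configuration\n", len(cfg))
	})
	select {} // keep the example process alive
}
```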
Martin: So is this just about configuration though? Is there anything else that OpAMP can do?
Andrew: It can do lots of things. So configuration I think is sort of primarily what people think about. But first of all, if we decide to kind of start at the very basics, it's connecting to the server.
So, you just know that this agent exists, and it has a connection. It's the agent identifying itself, so maybe it's the host name. Maybe it's the OS it's running on, the version.
Jess: Oh, so it's like this phone-home, here-I-am kind of thing. So your collector becomes observable.
Andrew: Right, exactly. You know, this is all about observing observability, right? It's the meta.
Martin: Well, who watches the watchers?
Andrew: Yeah. So there is a sort of a read-only use case for OpAMP, and that's implemented in the OpenTelemetry OpAMP extension.
And that's just knowing that the agent exists, some basic, we call it the agent description, so some attributes about the agent, like I said, OS, unique identifier for it, things like that.
It could report the configuration it's running. It could report information about its sort of own telemetry that it's collecting, so telemetry about the agent, how many logs it's processing. You know, it could be performance metrics, all sorts of things.
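The protocol models that "here I am" report as an agent description made up of identifying and non-identifying attributes. The struct below is a simplified, hypothetical rendering of that idea, not the actual protobuf message.

```go
package main

import "fmt"

// AgentDescription is a simplified, hypothetical rendering of the report an
// agent sends when it connects. The real OpAMP message uses key/value lists
// split into identifying and non-identifying attributes.
type AgentDescription struct {
	// Identifying attributes: enough to tell this agent apart from the others.
	InstanceUID string
	ServiceName string
	Version     string

	// Non-identifying attributes: useful context about where it runs.
	OSType   string
	Hostname string
}

func main() {
	desc := AgentDescription{
		InstanceUID: "agent-0042", // hypothetical unique ID
		ServiceName: "otelcol",
		Version:     "0.100.0", // hypothetical version
		OSType:      "linux",
		Hostname:    "test-box-17",
	}
	fmt.Printf("agent connected: %+v\n", desc)
}
```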
And then, if we step into sort of the write side of things: we can change the configuration on the fly, we can actually send new packages to the agent, including an entirely new agent, so we can support remote upgrading of the agent.
All of these things, by the way, are opt-in; that's the way the protocol's designed, with a bitmask. So the agent will report what it supports: I support accepting remote configuration, or I support accepting remote packages.
And if it doesn't, then it's up to the agent management server not to send those, because they're going to be ignored by the agent. I'm talking about these capabilities, but it isn't a case of, if you want to use any of it, you have to take all of it.
It could be used in a purely read-only fashion. It could support only configuration changes, or only package updates. You can configure the own telemetry that the collector is reporting.
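A sketch of that opt-in bitmask idea, with hypothetical constant names; the real capability flags are defined as a protobuf enum in the OpAMP spec.

```go
package main

import "fmt"

// Hypothetical capability flags illustrating the opt-in bitmask described
// above. The real values are a protobuf enum in the OpAMP spec; the names
// and bit positions here are illustrative only.
type Capabilities uint64

const (
	ReportsStatus          Capabilities = 1 << iota // the read-only baseline
	AcceptsRemoteConfig                             // server may push new configurations
	ReportsEffectiveConfig                          // agent reports the config it runs
	AcceptsPackages                                 // server may push new packages/binaries
	ReportsOwnTelemetry                             // agent reports metrics about itself
)

func main() {
	// A read-only agent: it reports what it is and what it runs, but the
	// server must not try to reconfigure or upgrade it.
	agent := ReportsStatus | ReportsEffectiveConfig | ReportsOwnTelemetry

	if agent&AcceptsRemoteConfig == 0 {
		fmt.Println("server: skip sending remote config to this agent")
	}
	if agent&AcceptsPackages == 0 {
		fmt.Println("server: skip sending packages to this agent")
	}
}
```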
And there's also something called custom messages, which I sort of reluctantly added about a year ago. And I say reluctantly because you want this to be an open protocol, and so as soon as you introduce something that's sort of undocumented, let's say, it could be abused by vendors to just kind of send everything as custom messages.
Jess: Oh, so they can be like just, "Hey, we have our message already we're going to send. We'll just call it a custom OpAMP message and not conform"?
Andrew: Well, then it just becomes harder for another agent management platform to understand messages your agent's sending or vice versa.
However, the reason we added it is that you already have an open connection to the agent, and there's lots of different ways you might want to use that connection.
And so to give you an example, there's a processor we wrote that allows us to grab a little snapshot of telemetry just so that you can kind of see what's actually flowing through here, what's going through this processor, and am I processing it how I expect to be processing it, for example.
And that's implemented with a custom message that gets routed to that processor, and the processor can respond. And that allowed us to sort of spike out this capability without needing to either change the protocol in an incompatible way that would...
You know, again, this is like a kind of a special feature that we were trying out. And we already have an open connection to the agent, so let's utilize it rather than... Our original spike of this was opening up another connection from the processor to the agent management server.
And it just seemed silly to have another WebSocket when we already have one. So custom messages allow you to do some interesting things with that connection. And I think the goal would be, as those things are sort of documented and become accepted standards, they could remain custom messages and just be documented as such, or they could move into the protocol.
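The envelope idea behind custom messages can be sketched like this: the agent advertises a custom capability, and any message carrying that capability gets routed to whichever component registered for it, such as the snapshot processor described above. The types and capability string below are hypothetical.

```go
package main

import "fmt"

// CustomMessage is a simplified stand-in for the OpAMP custom-message
// envelope: a capability string saying what family of messages this is,
// a type within that family, and opaque payload bytes.
type CustomMessage struct {
	Capability string
	Type       string
	Data       []byte
}

// Router is a hypothetical dispatcher that delivers custom messages to
// whichever component registered for that capability.
type Router struct {
	handlers map[string]func(CustomMessage)
}

func (r *Router) Register(capability string, h func(CustomMessage)) {
	if r.handlers == nil {
		r.handlers = map[string]func(CustomMessage){}
	}
	r.handlers[capability] = h
}

func (r *Router) Route(msg CustomMessage) {
	if h, ok := r.handlers[msg.Capability]; ok {
		h(msg)
		return
	}
	fmt.Println("no component registered for capability", msg.Capability)
}

func main() {
	var r Router
	// Hypothetical capability name for a "show me a snapshot of what is
	// flowing through this processor" request.
	r.Register("io.example.snapshot", func(m CustomMessage) {
		fmt.Println("snapshot requested:", m.Type)
	})
	r.Route(CustomMessage{Capability: "io.example.snapshot", Type: "request"})
}
```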
Jess: So the custom messages allow you to test out new extensions to the protocol before making them official before you even know whether they're right?
Andrew: Basically, yeah.
Jess: Excellent.
Andrew: Another interesting example would be discovery. So, there's always this sort of idealized story of discovery.
I don't know how well it's been implemented in reality, but it's like, you know, you've got the agent on a box, and it suddenly detects that, you know, we've got a Postgres server running here. We want to start monitoring it. We've got this other thing running. We want to monitor that other thing. And we're kind of discovering what's running on that box.
Jess: Oh, okay, so like the arms go out as feelers?
Andrew: Right, basically, yeah. So you could use custom messages for something like that where the server's able to say, "What do you see running on that box? I want to offer some configuration for those things."
Martin: So it becomes kind of like a probe at that point as well. So it's kind of not just the collector. It's also becoming like, it's got a little probe aspect of going, "I'm going to say what are the processes running. What could those processes be?"
And say, "Actually, look, it looks like something is running on here that I wasn't expecting, but let's get some metrics for that. Let's get some logs out the back of it."
Andrew: Exactly.
Jess: And then can it say, "Hey, I think I have a Postgres server running next to me. It feels like a Postgres server over here"?
And then the controller that's speaking OpAMP to the agent can say, "Oh, okay, here's what I want you to do with that. Take these log files, parse them like this, and send them over here."
Andrew: Exactly, and to be clear, there aren't custom messages that I know of that exist like this and, you know, this isn't necessarily a feature. It's certainly not a feature of OpAMP. You know, it's just a hypothetical example of some way you could use custom messages.
Jess: 'Cause it is a feature of some observability agents, right?
Andrew: It is, yes.
Martin: And I think this is the open standards thing. That's what I love about the OpenTelemetry stuff: what we're doing is providing a baseline framework that's extensible.
So everybody can opt in. Everybody can do something with it. And then vendors still have the ability to build on top of those open standards to do more.
Like you said, observIQ being able to add on some extra bits with custom messages allows that extensibility without saying, "I'm doing something completely different, I'm rebuilding." It's more, "I do a bit of OpAMP, but then I also do all my own things as well."
Allowing that escape hatch to be able to say, "I still do all the normal stuff, but I can also do a few extra bits," is really key to me what open standards are about.
Andrew: I agree, yep.
Martin: And I think that's really what's hopefully going to make adoption of things like OpAMP amazing, because it can be used not just for managing collectors. It's not the collector management protocol. That's not what it is. OpAMP is something different.
Andrew: Right, one thing I want to make clear is that while this is part of the OpenTelemetry project, it's not the goal for this to be the OpenTelemetry protocol for agent management.
That's why it's called the Open Agent Management Protocol. We'd like every agent that's out there to speak OpAMP and allow it to be managed in a consistent, you know, vendor-neutral way.
Jess: And by agent there, we mean like a process that runs somewhere out remotely in your network and does stuff?
Andrew: Exactly, you know, there's other telemetry agents in the market. You mentioned Logstash earlier. Elastic has their set of agents, and I know they're also working with OpenTelemetry. And I think, you know, there's a lot of security agents out there as well.
So there's really an opportunity for any agent to speak OpAMP and then to allow an agent management platform to observe and configure those agents consistently.
Jess: Agent management platform, that's also AMP.
Andrew: We should have consulted you on the naming.
Jess: It's okay, I can complain about anything.
Martin: I have seen some of the names that Jess has given things. No.
Jess: And my theory is if you don't have a good name, give it one that's clearly bad.
Martin: Don't go down the middle, you know?
Jess: Right, right, don't make them think, "Oh, this might be about voltage in circuits." No, no, name it like crunchy bacon or something so you know when you're wrong.
Martin: So I heard a rumor, and I don't know whether this is one of the plans for the OpAMP stuff, but because it's about configuration, that we should be able to use this to actually configure the SDKs inside of deployed applications.
So not just an external process, but also being able to use OpAMP to dynamically reconfigure applications like remote head sampling and potentially which processes to enable, what instrumentations to enable, for instance, that we can do that hopefully potentially on the fly inside of these applications using something like OpAMP.
Is that the sort of direction that you see as maybe an enhancement for this?
Andrew: I think definitely eventually. I don't think there's any effort there right now, but that's definitely on the roadmap, is to be able to remotely configure SDKs and applications.
Because you can kind of solve that right now if you have a collector deployed as a gateway that all of your applications are funneling their telemetry through. And then you can control what that gateway is sending along.
Jess: So for instance, you have a Java app running with the collector alongside it as a sidecar or whatever, and you want to change the sample rate. And you can tell the collector to drop half the traces or something.
Andrew: Exactly, you're still collecting them. The Java app's still collecting all of those and producing all of those, and that's not ideal.
But I think we also need to figure out the exact right architecture for the SDKs. They're, you know, going to need to have a WebSocket or HTTP connection. They're going to need to be OpAMP clients, effectively, for that remote configuration to happen.
Jess: Right, in Java it's even called an agent.
Andrew: And it's going to differ per language as to how that actual implementation happens, you know, exactly how that configuration changes on the fly, because a lot of that configuration is then controlling... Like in the Java case, you're controlling the log library, and so you've got to hot reload that or, you know, make sure that that's possible.
Jess: Right, and these things weren't built for a hot reload.
Andrew: Exactly, and the collector itself, by the way, isn't built for hot reload, and we can go into that a little bit. There's another-
Jess: Did we change it?
Andrew: Well, I don't think the solution will be to change it. So there's an effort in OpenTelemetry right now with the OpAMP supervisor. I don't know if you're familiar with this; it's a small process that is responsible for running the collector, and it speaks OpAMP.
And actually it connects to the collector, which speaks OpAMP. So the supervisor actually sort of acts as a proxy where it's connected to an agent management server. Then the supervisor receives messages.
And if it receives a message to change the configuration for the collector, what it actually does is shuts down the collector, writes a new configuration on disk, and then spawns a new collector.
And that's sort of how we handle the hot reload, if you will. What that assures is that you have a really clean shutdown of your telemetry pipeline. All the telemetry flows through: you stop the receivers,
everything flows through the processors and leaves the exporters. Get a nice clean shutdown. Write a new configuration to disk. Start up a new collector. Get your telemetry flowing again.
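A sketch of that supervisor loop, with hypothetical helpers and paths; the real OpAMP supervisor in OpenTelemetry does considerably more, including health checking, crash handling, and reporting status back over OpAMP.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// applyNewConfig sketches the supervisor's reconfigure path: stop the running
// collector cleanly, write the new configuration to disk, and start a fresh
// collector against it. The helper and paths here are hypothetical.
func applyNewConfig(running *exec.Cmd, newConfig []byte, configPath string) (*exec.Cmd, error) {
	// 1. Ask the collector to shut down so receivers stop, in-flight data
	//    drains through the processors, and exporters flush.
	if running != nil && running.Process != nil {
		_ = running.Process.Signal(os.Interrupt)
		_ = running.Wait()
	}

	// 2. Write the new configuration to disk.
	if err := os.WriteFile(configPath, newConfig, 0o644); err != nil {
		return nil, err
	}

	// 3. Spawn a new collector pointing at the new configuration.
	next := exec.Command("otelcol", "--config", configPath)
	next.Stdout, next.Stderr = os.Stdout, os.Stderr
	if err := next.Start(); err != nil {
		return nil, err
	}
	return next, nil
}

func main() {
	cmd, err := applyNewConfig(nil, []byte("receivers:\n  otlp:\n"), "/tmp/collector.yaml")
	if err != nil {
		fmt.Println("restart failed:", err)
		return
	}
	fmt.Println("collector restarted with pid", cmd.Process.Pid)
}
```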
Jess: So you're not running two collector pipelines at once. There's a clear boundary. In the meantime, is your Java app next door able to send to the collector at all?
Andrew: The collector will stop during that time. You know, it's very quick to restart, but it'll stop, write a new configuration, start the collector. That's how the supervisor works.
Jess: How long is very quick? I mean, it depends, but if you're going to stop it, and you've got to like finish your pipeline processing, then it depends how much stuff you've got in there, right?
Martin: And if you've got a tail sampler in there, and the tail sampler is waiting for 30 seconds, you know, I mean, this is the reason why I am so happy that there are some incredibly smart people working on OpenTelemetry 'cause this breaks my brain.
Like, I would just like to send some telemetry and ask questions about it. Like, I'm good at that bit. That bit I'm good at, but, you know, this is the really hard problems that you only really get at scale, really.
You know, if you've got 20, 30 spans going through a collector, who cares? Restart it. You lost three spans, boo-hoo. But when you've got like a bajillion spans going through something, and you've got to wait for those pipelines, and then, so which thing do we do?
Do we stop it? Do we start it? Do we let it flush? Like, I really don't want to care about that, which is why I'm so happy that some incredibly smart people are working on this.
Andrew: I think there's an opportunity with OpAMP to do some dynamic configuration of individual processors and, you know, any component, I should say: connectors, receivers.
Jess: Oh, okay, so you could make, say, the file log receiver speak OpAMP and be controlled. Although I kind of like the clean shutdown and startup because the first attribute I want to put in my new configuration is some version of the configuration that gets added to every record that goes through it.
Andrew: Yeah, I think something like sampling is a really good example of a processor or something you might want to just dynamically control.
And there's really no reason we couldn't just, you know, change a number if that's all we really want to do, change a sample rate. In a live telemetry pipeline that data's flowing through, obviously there's some locking involved, you know, not a big deal, but to be able to dynamically just flip some bits.
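For that finer-grained case, the change really can be as small as atomically swapping a number that the hot path reads for every span. A hypothetical sketch, not a real collector component:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// sampleRate holds the probability (0-100) that a span is kept. It is read
// on the hot path for every span and can be flipped at any time by a
// management message, with no pipeline restart.
var sampleRate atomic.Int64

func keepSpan() bool {
	return rand.Int63n(100) < sampleRate.Load()
}

// onRemoteConfig is what a hypothetical dynamic-config hook might do when
// the management server sends down a new rate.
func onRemoteConfig(newRate int64) {
	sampleRate.Store(newRate)
}

func main() {
	onRemoteConfig(50) // start by keeping roughly half the spans
	kept := 0
	for i := 0; i < 1000; i++ {
		if keepSpan() {
			kept++
		}
	}
	fmt.Printf("kept %d of 1000 spans at a 50%% sample rate\n", kept)

	onRemoteConfig(10) // flip the rate on the fly
	fmt.Println("new sample rate applied without restarting anything")
}
```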
Jess: So a finer grain of control is possible in theory.
Andrew: Right, exactly.
Martin: I mean, it's just like updating the attribute processor. To add a new attribute is not the end of the world. Like, you know, the next ones that come through will get the new attribute. You know, it's not the end of the world to be able to do that kind of stuff.
Jess: So there could be some specific changes.
Martin: Yeah, but this, you know, a load of... It's all of those edge cases that blow my brain.
Jess: Ah, those are optimizations.
Andrew: Yeah, there are definitely a lot of edge cases. The cleanest, like, foolproof way is with the supervisor to shut down, reconfigure, restart.
You're sure that everything's been properly initialized, and there's little risk in running into problems that might creep up from a long-running process being reconfigured dynamically over time.
Jess: Okay, this episode has been very happy. Can you tell me something that you don't like about OpAMP or maybe a story of something that went wrong in the development process?
Andrew: Honestly, I think OpAMP's really great.
I think probably custom messages are the piece that really kind of unlocked a lot of flexibility and the ability to experiment and to utilize that connection to do some really interesting things with remote management.
And so I think before that, it felt a little constrained. Like, we could only do this with OpAMP, and if we want to do anything else with the agent, we needed to open a new connection. We needed to create some-
Jess: Some sort of matrix-management operation where it's listening to multiple supervisors.
Andrew: Exactly.
Jess: Nobody wants that.
Andrew: Right? So it really opened the door, I think, to a lot of really interesting customizations and I think has created an environment where we can experiment, and then we can share those experiments, and we can introduce changes as needed.
Jess: Have there been any good failed experiments?
Andrew: Not that I can think of off the top of my head, but...
Jess: Okay, so call-out to listeners, tell us what not to do.
Andrew: Well, one thing that's tricky, so the server side of OpAMP, the OpAMP Go library gives you kind of a very thin API on top of the protocol.
The client for the agent is really robust and handles a lot of synchronization issues, and you can kind of just start using it. The server side, you know, you really need to implement a proper agent management platform.
And one of the things that's really challenging is just on the surface, let's say we've got like 500,000 agents connected, and somebody decides to change the configuration, and they're all using the same configuration.
The kind of naive approach would be like, let's just send out new configuration of 500,000 agents.
Martin: What could possibly go wrong?
Jess: 500,000 things.
Andrew: I mean, it's just WebSocket messages, no big deal. You know, the problem though is that, well, first of all, what if that configuration's not working? You know, you set it down, and then there's an error.
Jess: Yeah, you broke all of them.
Martin: And then you can't access them because now their config is broken. So now you have to go and, you know, go onto the data center and click the buttons, which in Kubernetes is a, you know...
Andrew: Yeah, and then you also have this challenge of, okay, so these agents all shut down, reconfigure, start back up. And now you have 500,000 inbound connections, and your server suddenly needs to, you know, handle all of these agents at once.
So you get these like massive spikes of activity, and at scale, that can be hard to handle.
Martin: Yeah, I mean, the stampeding herd problem is a big one, yeah.
Andrew: 'Cause, you know, most of the time, it's just a dormant WebSocket. You know, it maybe reconnects periodically based on your firewall settings or something, but it's mostly just sitting there dormant.
And then all of a sudden, you have this massive burst of configuration and then this massive burst of reconnection. And so there's some ways to ideally throttle that. We have some tools for starting out with a couple and then ramping that up exponentially over time.
And also if you label your agents certain ways, you can kind of start with maybe your test agents and then go to your stage agents or something like that and then roll it out to prod last.
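A sketch of that staged rollout idea with hypothetical types: order agents by label so test and staging go before production, and grow the batch size exponentially so neither a bad config nor the reconnection burst hits everything at once.

```go
package main

import (
	"fmt"
	"sort"
)

// Agent is a hypothetical record in an agent management platform: an ID plus
// a label that controls rollout ordering (test before stage before prod).
type Agent struct {
	ID    string
	Stage string // "test", "stage", or "prod"
}

// rolloutBatches orders agents by stage and slices them into exponentially
// growing waves, so a bad configuration hits a handful of agents first and
// the server never has to absorb every reconnection at once. Purely
// illustrative; a real platform would also wait for health reports between
// waves.
func rolloutBatches(agents []Agent) [][]Agent {
	order := map[string]int{"test": 0, "stage": 1, "prod": 2}
	sorted := make([]Agent, len(agents))
	copy(sorted, agents)
	sort.SliceStable(sorted, func(i, j int) bool {
		return order[sorted[i].Stage] < order[sorted[j].Stage]
	})

	var batches [][]Agent
	size := 2 // start tiny, double each wave
	for start := 0; start < len(sorted); {
		end := start + size
		if end > len(sorted) {
			end = len(sorted)
		}
		batches = append(batches, sorted[start:end])
		start = end
		size *= 2
	}
	return batches
}

func main() {
	agents := []Agent{
		{"a1", "prod"}, {"a2", "test"}, {"a3", "stage"}, {"a4", "prod"},
		{"a5", "prod"}, {"a6", "test"}, {"a7", "prod"}, {"a8", "stage"},
	}
	for i, wave := range rolloutBatches(agents) {
		fmt.Printf("wave %d: %d agents, starting with %s (%s)\n",
			i+1, len(wave), wave[0].ID, wave[0].Stage)
	}
}
```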
Jess: Nice.
Martin: Can you think of any recent examples globally where people pushed out an update that maybe caused millions of things to break at once and then- (Andy laughs)
Andrew: Never happened.
Jess: This episode was recorded not very long after the CrowdStrike debacle.
Andrew: That was not the best look for remote configuration, I would say.
Martin: Really? I mean...
Jess: Very true, okay, yeah. So, great, I was looking for like spiky pits to fall into, and that is definitely one. Be careful with writing your agent manager, not super easy.
Andrew: But the good news is that the protocol, you know, stands up.
Jess: Right.
Andrew: It hasn't been, you know, an area of concern or something that needs to be optimized.
Another direction I see it potentially going in is the ability to proxy the outbound connections similar to the way we proxy telemetry in an OpenTelemetry gateway. So with an OpenTelemetry gateway, you've got an OTLP receiver, an exporter.
Jess: Wait, does that mean you can have agent management platform-management platforms?
Andrew: Exactly. Well, if you think about a deployment where you might have all of these agents running on a pretty locked-down network, and we only want, you know, we've got a gateway running in a DMZ, and we let that talk to the rest of the world.
Ideally those agents that are running on a network don't need to reach out directly to the management server. They would also reach out to the gateway, and the gateway would basically be responsible for managing those agents.
So, I think of it more of like a shepherd-and-flock kind of analogy than another agent management server behind the-
Jess: Right, well, like with collectors, you-
Andrew: But the shepherds need to be managed too, right, so.
Jess: Like with collectors, you often run a sidecar right next to your app. Maybe you additionally run one per node, but each of those is sending to a bigger one with more context, which in turn might send to your backend.
Andrew: Exactly, so out of the box right now, each of those agents would then have their own individual OpAMP connection to the agent management platform. And it would be nice if we just sort of followed that path of the telemetry with the management component.
Jess: Nice. Okay, where can people go to find out more about OpAMP, observIQ, open agents in general, and you in particular?
Martin: We know we can't Google it, so you need to give us something really, really specific.
Andrew: Okay, so for OpAMP, opamp-spec in the OpenTelemetry GitHub org is the OpAMP specification repo, and that's where you can make suggestions or open issues or even PRs to change the actual spec.
And then opamp-go is the Go implementation of that spec and also in OpenTelemetry. You can learn more about observIQ at our website, observiq.com. Check out our blog.
And I'm on the CNCF Slack. I'm Andy Keller, also Andy K-E-L-L-R on Twitter. You know, it was like the cool drop-the-E thing, and also some other guy took it, so. I'm Andy K-E-L-L-R.
Jess: That dates you.
Andrew: Yeah, it seemed really cool at the time. So, you know, most places, that's where I am.
Jess: Great.
Martin: Awesome.