Ep. #32, Managing Hardware with Gianluca Arbezzano of Equinix Metal
In episode 32 of o11ycast, Liz and Shelby speak with Gianluca Arbezzano of Equinix Metal. They discuss diversifying the way observability is evangelized, the indicators that an organization has outgrown off-the-shelf tools, and managing hardware.
Gianluca Arbezzano is a Senior Staff Software Engineer at Equinix Metal, a bare metal server provider and part of Equinix. He was previously an SRE at InfluxData.
Transcript
Gianluca Arbezzano: I think it's really a strategic, a deep strategic choice.
When you start, it's easy to just install an open source software and try to hack around it.
But at some point, even more when you do monitoring, you have to monitor and take care of your monitoring systems.
So, you kind of go deep on a loop that is not easy to escape from.
It's a lot like the same question we deal with for hardware at Equinix Metal.
Like, when do you go to cloud? When do you buy your own hardware?
So it's a never-ending question. I think it really depends on how you want to approach the problem.
Is it strategic for you to solve it yourself, because you want to do more with the solution? Or are you just looking for the end solution, and you want it reliable, and you want somebody to blame when it doesn't work?
So, I think that's my idea of how I try to decide when I have to.
Liz Fong-Jones: So it sounds like you're alluding to almost a question of "How much do you need to customize it?" versus "How much does the off-the-shelf thing work for you?"
Gianluca: I think it's definitely one point.
If you want to invest, and if you think that it's a great idea to be able to have your end solution as you want it because it is good for your team, it's good for your product, you definitely need the ability to hack the code around the product.
But if you just want something that works, and you can take the trade off of making your workflow good enough to work with the solution you are buying, buying is good enough.
Liz: So, what are some of those kind of tipping points for you? What are kind of the signs that you're outgrowing something?
Gianluca: I think at some point, you definitely need to question why the requests you receive are the ones you get.
So if you have a problem with your monitoring system because it's growing too fast, and your team keeps asking you to add more storage or more hardware--
Is it a reasonable answer to just add more SSDs, or more object storage, or more capacity?
Or do you have to figure out a different solution for that? And maybe the solution is to stop and just look at the metrics that you are sending, and figure out why they come so heavily, because maybe it's a bug.
So sometimes, you have to stop, even if you buy or if you don't, you just have to stop and look at the context you are in, and try to figure out if it's reasonable for your scale, for your project, for your team or not.
Because if it gets too heavy, it becomes really hard to understand, because the noise grows so much that it becomes almost useless. You're always looking at signals that are not clear anymore, and you lose trust in your system.
And it will be very frustrating for a team.
And I've been there. I've been on a team where we had a single alert channel on Slack that was always on fire.
And at some point, I just muted all of them, and nobody cared anymore.
And still, sometimes I get new requirements.
Maybe from different stakeholders. If they're not developers, then maybe they just use Slack in a different way, or they use Jira in a different way, and they want to know what's going on from their perspective.
So they ask me, "OK, can you set up like a Slack channel for all the notifications?"
The answer is always like, "You're going to get so much noise that you won't look at them anymore."
But I learned that it really depends on the stakeholder sometimes, so you have to stop and just try to figure out who is asking for what, and if it's reasonable.
Shelby Spees: So, now would be a great time for you to introduce yourself.
Gianluca: Yeah, thank you. I'm Gianluca. I'm from Italy and I'm from Turin, so from the Alps, Graian Alps.
So I'm a bit sad that this year I won't be able to ski, at least in the beginning of the year.
But I'm a software engineer. So I started as a developer and I moved to ops when I learned how many things I was able to do with a few API requests.
And now that I've learned that, I've started a new journey at Equinix Metal and I work on hardware and data centers, because I was a bit sad about not knowing a lot of the technologies that were behind those API requests.
So now I am increasing my visibility on the stack that I know.
And yeah, I'm an open source enthusiast and contributor, so I've worked on projects like Kubernetes or Docker, or Testcontainers, mainly Go libraries or Go projects, adding features or fixing bugs and trying to keep up with the community.
For me, it's a good way to learn, when I have the possibility to contribute to a project. So, that's mainly me.
Shelby: That's so much good stuff.
And I like how you started as a developer, and you've just gone deeper and deeper down the stack.
So can you talk more about what you work on, on the hardware side at Equinix?
Gianluca: As I said, it's completely a new field.
For me, I started my career as a developer eight years ago, and Amazon was already big and already there.
So cloud providers, for me, are the way to go when it comes to provisioning software.
So, at some point, I had the opportunity to start to look at this problem in a different way.
And currently, I'm working on Tinkerbell, which is a project that we use internally to provision our data centers in a hands-off way, so without having to touch them all the time.
So, I have to say, I discovered that there are no people running around a data center with USB sticks to install the operating system.
I thought that was the case, but it's all automatic, so it's different.
Liz: Tinkerbell is open source, right?
I think I was on a live stream with Amy Tobey working on the Tinkerbell source code a couple of months ago.
Gianluca: Yeah, it's open source and we recently joined the CNCF as a Sandbox Project.
And I'm very excited for that, because to me it looks like the missing layer in the CNCF landscape.
We have a lot of tools that we can use to build monitoring solutions, or identity and access management, or orchestrators.
But managing hardware in the cloud landscape is something that wasn't that clear.
Liz: Right, exactly, because a lot of people have been using public cloud providers.
But for people that want to roll their own private cloud, there hasn't been a CNCF project to handle the hardware wrangling. That's really exciting.
Gianluca: That's what we do. And there is a lot for me to learn, so I share a lot on Twitter.
My journey is there, learning in public, let's say.
Liz: We'll be sure to put a link to your Twitter bio in the show notes.
Shelby: So, I'm curious how you got so interested in observability from a hardware perspective. Connect those dots for us.
Gianluca: I definitely left out a big chunk of my career in my introduction, because before joining Equinix Metal, I used to work for InfluxData.
That is a company that developed a time series database.
And that's where I learned way more about what it means to monitor a system, or what it means to treat a time series database as a first-class citizen in a stack.
So, that's where I learned all the stuff that we will probably talk about today.
But to tell you about the bridge between observability and hardware, I think something really good that we do at Equinix is the logistics: building data centers quickly and finding hardware.
And knowing what it is and where it's hooked up. All the switches and the cables, all that stuff in my mind pictures as a trace.
Because you have to slice and dice and see where the package is going, or all the interconnections between the servers, and a lot of that ends up as a trace.
So I keep that in mind, even if I'm not in the time series.
Liz: I remember the first time that I and other teammates at Google Cloud started looking at data center provisioning and data center turn-up and visualizing those as traces.
It was a very eye-opening moment of wait a second, "It's taking three months to do what?"
Gianluca: Yeah. The trace gets very long.
Shelby: And it was really cool to see this firsthand when, Liz, you were starting to roll out on Graviton2 for us, and Amazon actually ran out of capacity for the Graviton2 instances.
And so, I was like, "This is something I've never thought about before," Because this is the first time I've been on a team where we're using bleeding edge hardware.
And so that back and forth conversation with the Amazon support to just make sure we got the hardware we needed to scale up in time was really cool and something that maybe the average team doesn't have to think about.
But when hardware is a bottleneck for you, that's really important to be able to think about.
Liz: Yeah, it's all real, physical processes.
Shelby: And there's somebody at the other end plugging in wires and slinging servers on racks.
So it's very exciting to get visibility into that.
Gianluca: Yeah.
Liz: So you mentioned your earlier work on InfluxDB, how much performance engineering and tuning were you doing?
How were you measuring the results of your software engineering on InfluxDB?
Gianluca: I joined as an SRE when we started to consolidate the SaaS offering we had, so let's call it version one.
And after a few months, we started to plan version two. That is the one that is currently running today.
And we also had the opportunity to be heavy users of our own product.
That is not something obvious when you start a new SaaS.
Because you're scared about all the new unknowns that you will discover.
So the way we monitored the system was with InfluxDB itself.
So we deployed a bunch of InfluxDBs that were our monitoring system.
And we had agents running on all the servers that were mainly monitoring the amount of points that every customer was sending.
And the amount of reads. The data point was the unit of work, more than CPU and memory.
Liz: Right, exactly. Where it's about the critical business metrics and the critical user journeys.
That's what matters, and not necessarily, as you were saying, CPU utilization. That's really cool.
And also, who watches the watchers? Who observes your observability system?
Gianluca: That's definitely something we're learning the hard way.
At the beginning we were relying way more on logs.
I remember when I joined, we were using Papertrail.
It's a lot more traditional way of doing monitoring, even if we were the ones fighting for the new age of monitoring back then.
But yeah, as soon as we did this transition, with my background as a developer and not as an operations person, I saw my colleagues struggling to make the application speak in a language that was understandable, and it was a pleasure to see that transition happening in my teammates and myself.
Shelby: What you said just now, "Speaking language that's understandable," it reminds me of what our Honeycomb CEO Christine was talking about in a talk she just gave at GitHub Universe.
It's about observability for developers and teaching production the language of developers, right?
Like Liz said, it's not just speaking CPU utilization and memory pressure and things like that.
It's talking about where in your code this is being affected, and where you can find the answers to those weird questions.
And so I think there's a lot of value there where we're trying to do a better job of helping people understand.
Observability is for developers, and it helps everybody to teach production the language of your business logic.
Gianluca: I think it's crucial to share this responsibility with the developers.
Because at the end, as you said, what matters is that the end user is happy and they can use the product.
And the usage of CPU, memory, and disk, those are collateral requirements that make this journey enjoyable.
But the business is what matters most. And the developers are the ones that empower the application to run somehow.
So they are the people that know the tricky parts of the code and how it really speaks.
I've spoken at conferences, and the title of my talk was often "Teaching your application how to speak."
Because this is what I think a developer should do.
You have to think about whether this feature or this piece of code is observable.
Can I figure out from the outside what's going on? So code review here plays a big role.
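To make that checkpoint concrete, here is a minimal sketch, assuming a Go service instrumented with OpenTelemetry; the ChargeCustomer function and its attribute names are hypothetical examples for illustration, not code from the episode.

```go
package billing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("billing")

// ChargeCustomer wraps a piece of business logic in a span so that, from the
// outside, you can see which customer was charged, how much, and why it failed.
func ChargeCustomer(ctx context.Context, customerID string, cents int64) error {
	ctx, span := tracer.Start(ctx, "ChargeCustomer")
	defer span.End()

	span.SetAttributes(
		attribute.String("customer.id", customerID),
		attribute.Int64("charge.amount_cents", cents),
	)

	if err := charge(ctx, customerID, cents); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "charge failed")
		return err
	}
	return nil
}

// charge stands in for the hypothetical business logic being made observable.
func charge(ctx context.Context, customerID string, cents int64) error {
	return nil
}
```

A reviewer can then ask a concrete question during code review: if this fails in production, do these attributes let us see who was affected and why, from the outside?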
NAME: So when did you start getting introduced to this concept of observability?
As distinct from monitoring, because I think that we're getting there now.
Gianluca: I can't hide myself behind my fingers. It's coming from--
I met Charity Majors a couple of times at conferences, and I started to follow the journey of the Scuba paper back then, and the rise of Honeycomb definitely helped me to consolidate and put some good words around what I had in my mind, what I was seeing around myself.
Honeycomb played a good role in that. I think, as I always like to say, it's like when I read books about software architecture as a developer:
When you read good authors' writing, even if the topics are something that you may have seen around already, the way they describe it makes it more clear or more solid in your mind.
That was important.
Liz: Right, exactly. Something finally starts clicking, right?
Gianluca: Yeah, you are hands-on, deep in the problem.
And you just see that there is a pattern, but you can't clearly see it until somebody tells you in a good way that that's what you're doing.
So, it was a good time when I started reading about observability.
Shelby: Yeah, and that's exactly why I appreciate having people like you going in and being part of the observability community and sharing this stuff.
Because as much as Liz and Charity and I talk about this, the way we say things isn't going to click for everybody.
The more people who talk about observability in different ways and help promote this stuff, the better it is for our entire tech community.
And that's what we're in it for, we're just trying to make everyone's jobs a little bit easier and help us build better systems.
So I love when I see people go out and give talks and post on Twitter and write blog posts about this stuff because it helps everyone to have more voices in the mix.
Liz: And speaking of blog posts, we've really enjoyed your blog posts about instrumenting with various solutions, piping the data to various places, and seeing how it turned out.
Gianluca: I think it's very important for the ecosystem to be a real ecosystem.
We need people, because metrics come from everywhere in every form and they go everywhere in different forms.
Nobody can hold them forever, so it's important to build a system that can exchange the information in a common layer.
And the work you're doing with OpenTelemetry is very good.
It kind of took me a while to handle that, because I was coming from OpenTracing, before OpenCensus, and at the beginning the merge made the ecosystem struggle a bit.
Because transitioning is never easy, and even less so when the tools that you use have a life cycle of two years, or whatever.
But I can clearly see the end goal, and I can't wait to be there, when all the systems that we install, whether they come from open source or closed source, who cares?
They will speak the same language.
We'll be able to understand them, not as a black box anymore, if developers do a good job of instrumenting their tools.
Liz: Let's talk about that. What does it mean for developers to do a good job instrumenting?
Gianluca: That's a good question.
As I said, I think collaboration and code review are for sure something that has to be built into the culture of a team.
And if code review is already built in, make sure that looking at whether a feature in the code is written to be debuggable and observable is another checkpoint that you add to your list of important stuff to review during the code review session.
So that's definitely one. Another is reading logs. If you write logs, if you write traces, use them. It's not the time anymore to print random lines to standard output and forget about them.
They shouldn't just sit there without traction; they need to share context, to build context.
You need to be sure that those lines are in the right place in the right formats, sending the right stuff.
So you have to use them, not wait for an outage or for latency to spike to look at them.
Otherwise you won't be able to really figure out if it's you or the system.
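As a rough sketch of the difference between random print lines and logs that carry context, assuming Go's standard log/slog package; the field names and values are made up for illustration.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// The "random print lines" style gives you text you can't query later:
	//   fmt.Println("wrote points ok")

	// A structured log line carries the context you'll need before the outage.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.Info("points written",
		slog.String("customer.id", "cus_123"),       // hypothetical tenant ID
		slog.String("trace.id", "4bf92f3577b34da6"),  // link back to the trace
		slog.Int("points.count", 1500),
		slog.Duration("duration", 42*time.Millisecond),
	)
}
```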
Shelby: Like, being proactive and actually interacting with the telemetry that you're generating, making sure that the stuff you're sending is meaningful and accurate.
I think it aligns a lot with how we teach testing, like "You don't want to just write meaningless tests just to get the check mark on the test coverage report. You want to write tests that actually test what your code is doing and all the cases that you care about."
It's the same thing for instrumentation. "You don't want to just spew out junk that you're never going to look at."
I think for a lot of teams, that's exactly what they have: just the log stream, and whatever.
So this is where having tools and having data that's structured and meaningful encourages people to actually interact with it, and it makes a better experience.
Liz: Yeah, I've just said, "We have observability-driven development and this idea of not treating your logs as a write-only data source."
Gianluca: Yeah.
Liz: One question I had for you, Gianluca, InfluxDB had both a SaaS offering and a pure open source offering.
What were some of the major differences that you encountered when you were trying to adapt the open source offering into being a multi-tenant SaaS solution?
Gianluca: I think that the big challenge is the entropy that more users create on the system.
So the fact that you are not looking at a single bus anymore, one whose pattern you can figure out, but it's like a mass coming from all over and trying to step over each other.
So it's a completely different signal. And in the first version of the SaaS we developed, every customer had their own little cluster.
And that was very consuming in terms of resources, because you had your five, six, seven, ten EC2s for every customer, with load balancers or whatever.
So in terms of resources it was a huge waste, but obviously you treat every customer and all their signals as a single one, and you know where it's coming from and there is no noise, let's say. It's way easier to figure out what's going on, because it's one customer doing their own stuff.
If something changes drastically, it's either a bug or they changed their model.
And you have to figure out if it's reasonable or legit or not.
When you start to mix customers and patterns altogether, you need a different dimension.
So usually, you have to architect your code in a way that can represent that dimension.
So typically it's the user dimension or the API token dimension.
So the troublemaker is the API token, not even the user.
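One way to capture that dimension is to stamp it onto the active span, sketched here with OpenTelemetry in Go; the attribute keys and the token hashing are assumptions for the example, not InfluxData's actual schema.

```go
package api

import (
	"context"
	"crypto/sha256"
	"encoding/hex"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotateTenant adds the tenant dimensions to the current span, so a noisy
// neighbor can be isolated by user ID or by the specific API token.
func annotateTenant(ctx context.Context, userID, apiToken string) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("tenant.user_id", userID),
		// Hash the token: still high-cardinality, but no raw secret in telemetry.
		attribute.String("tenant.token_hash", hashToken(apiToken)),
	)
}

func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8])
}
```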
Liz: Right, which starts getting to the debate of high cardinality.
Gianluca: Yeah, that's where it plays.
And with tracing, as you pointed out, you get to even more of a cardinality issue.
Because you have the smallest dimension you can get.
Like the single request, which is what we all really want at the end.
And this was definitely challenging. I mean, the signal changed even if the product was the same. Interestingly, the workload was the same, people were still writing and reading points, but as soon as we started to mix them together in a multi-tenant fashion, the signals as we knew them didn't make much sense anymore.
So we had to figure out that new dimension, the user or the API token, yeah.
To get back to the previous question, because I think it was a nice one: what I think developers should also do is learn how to architect an application in a way that is, for example, traceable.
Like how can you write an application that is easy to trace? How do we do that?
So recently, well, not that recent anymore, I think it was two years ago, which is like ages in computing--
I learned about reactive planning and how you can do control loops, almost like the Kubernetes architecture but much simpler than the whole of Kubernetes.
And that pattern is also very easy to trace, because you have a central executor that takes planning steps and actions and executes them.
And having a single place for the execution of the logic is a very good strategy.
Because when you have to add tracing, you just go to the executor part and you add the code there.
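A minimal sketch of that shape, assuming Go and OpenTelemetry; the Action interface and the names are illustrative rather than Tinkerbell's or Influx's actual code. Because every step flows through one Execute function, the tracing lives in exactly one place.

```go
package planner

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("planner")

// Action is one step the planner produced: check BIOS, write the image, reboot...
type Action interface {
	Name() string
	Run(ctx context.Context) error
}

// Execute is the single central executor. Starting spans here traces every
// action of every plan without touching the actions themselves.
func Execute(ctx context.Context, plan []Action) error {
	ctx, span := tracer.Start(ctx, "plan.execute")
	defer span.End()

	for _, action := range plan {
		actCtx, actSpan := tracer.Start(ctx, action.Name())
		err := action.Run(actCtx)
		if err != nil {
			actSpan.RecordError(err)
			actSpan.End()
			return err
		}
		actSpan.End()
	}
	return nil
}
```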
And for me, it was when I first instrumented a cloud orchestrator we had at Influx when I joined. We saw all the requests that were happening, and Amazon has very good limits in terms of how many requests you can fire.
So we never really experienced problems. And also, when you develop an application like an orchestrator, it has to be safe and controllable, but being fast is not really the priority there.
You add maybe a new request over here, another one over there.
And over time you get to like 150 requests for each provisioning.
And that time, at some point, comes back and you need to figure it out.
And for us, it was very easy to figure out all of those because we had a single place for the execution of the code.
And we also used an SDK that, luckily for us, had good hooks for pre- and post-execution of requests.
So we just hooked the tracing in there and magically we had visibility into the whole system.
So having this kind of configuration in mind when you architect the code makes all the difference, because if it gets too hard, you won't do it.
Shelby: Totally.
I think it works in a similar way at both the design level of the individual repository and at the bigger architectural level: thinking about what makes things traceable, what makes things modular, what makes it easy to propagate context and send it out to your observability tool. It's really important, and it helps you write better code, it helps you build better systems.
And then also in the process, like this example, I love this example because there's no one thing that is a really obvious cause of latency or whatever.
It's just growing and growing over time, but when you have good tracing that answers the "Where" question.
Like, "Where do we start to optimize this?"
That's a really hard question to answer if you don't have that visibility.
So, all of these things and all of these goals tie into each other. And it's worth it, it's worth the effort.
Liz: And also good DevX and good ergonomics can really help, in terms of having one place to hook in traces or one place to automatically instrument your API calls. That really makes a difference for having effective observability.
Rather than saying, "We have these main signals or these main tools." Instead, talking about that developer experience.
Gianluca: I really think we should keep saying that it's a developer responsibility to have all those dots lined up.
Because definitely your operations would benefit from it, but it's really a developer effort.
I think it's also useful to build a reinforcing loop around tracing and stuff like that.
If cloud providers-- I keep saying Amazon because I know them, but I'm sure others do that as well.
You get a request ID back as a developer from the API, and if you tell that request ID to the support team, they will be able to look at why your specific request is failing.
Usually the problem will be yourself.
A request fault is always your fault, but it's a very good-- it's a reinforcing loop that you should have in your mind.
If you're doing tracing, share your trace ID with your customer.
And if your customers are not developers, find a way to sneak the trace ID in their UI, in their user experience, in their journey. Be in the loop for that.
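One hedged way to do that in a Go HTTP service is to echo the current trace ID back in a response header, so a customer or their UI can quote it to support; the header name here is a common convention, not a standard.

```go
package middleware

import (
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// WithTraceID copies the active trace ID into a response header, so the
// customer's report can be matched to the exact trace on the provider side.
func WithTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if sc := trace.SpanContextFromContext(r.Context()); sc.HasTraceID() {
			w.Header().Set("X-Trace-Id", sc.TraceID().String())
		}
		next.ServeHTTP(w, r)
	})
}
```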
We did a very interesting thing at Influx back then, when we shared a part of our internal metrics with our customers.
So, every customer had like a call monitor. It was a dashboard in their Chronograf.
And in this way we kind of shared the pain of operating their solution. And as a side effect, we had a lot fewer requests to support, because they were able to figure out the problem by themselves.
If they were delivering, sometimes it's very hard to figure out how much you are spamming with your application.
Because maybe you added a new for loop with a log in there, and from there stuff gets crazy.
So it's way easier to delegate to the destination of your monitoring system to know what's going on.
So we had customers that were delivering, maybe, version two of their application.
They weren't really sure about what was going on, and the amount of logs or the amount of noise that they were generating was the best metric they had at that time.
Liz: It's definitely really critical to have people share that same view between different teams or between customers and a service provider.
That's really how you get ahead of problems rather than have people complaining afterwards.
Shelby: It's something I'm really excited about, that we're starting to see AWS getting involved in OpenTelemetry, which Jaana Dogan talks about.
Liz, you and Jaana were at an event the other week and you were talking about just that, provider observability-- It's something that Charity's talked about a lot too.
Where when you're building a platform and people make a request and they don't realize it's the most inefficient way to write that query ever, so they're thrashing your system.
Just having a little bit more visibility into that, the provider of the platform offering that visibility helps both sides.
It helps the client debug their stuff, and it helps the provider help the client debug their stuff.
So even if we're on the buy-side more often than the build side, we can still be a shared team with the services that we buy and the people who run those.
It's something that we do at Honeycomb. When someone is really struggling, we've gone and helped them debug their instrumentation and stuff.
So I think that relationship is going to just continue being really important, and shared observability can help with that.
Gianluca: Yeah. You build trust, and that's important in a monitoring system.
Liz: As we reach the end of our time, Gianluca, I wanted to ask one question.
I don't think I've asked our other guests this, because most of our guests have been from the native English-speaking world.
What's it been like for you developing the SRE practices and developing the observability practices in the Italian community?
Gianluca: I have a fun story. A few years ago, maybe six or seven, I was working at a company in Italy and I had colleagues that were-- I use this example because it's very fun.
We were all together watching tailed logs from a couple of servers, and the speed of the logs in the tail was the argument for how good or bad an application was working.
If the logs were too fast, we were under pressure, so the application wasn't working well enough.
If there were not enough, maybe we weren't getting enough requests. "What was broken?"
That was our measuring system back then. And I mean, it worked.
It's definitely like the sharing experience we spoke about just a few minutes ago. It wasn't really great, because if the person who had the knowledge about the pattern of the logs left or was on holiday, that was a really painful situation.
So it's been a long journey since then, but I organize a meetup in my city, the CNCF one.
It's good to speak about those topics.
I think the people I know from the Italian community are really pragmatic.
They want to see the difference from their current approach. Even if they know it's not the best one, it works for them, so maybe it's good enough. So you really have to show them what you are speaking about, and how it changes your way of developing and understanding a system.
With tracing, for example, this is very easy.
You start by injecting the library, and you trace a bunch of controllers or API requests, and you can show the difference in the approach. This helps a lot.
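As a sketch of that kind of quick win, assuming a Go HTTP service and the OpenTelemetry otelhttp contrib package; the route and handler are placeholders.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func listUsers(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("[]")) // placeholder handler
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/users", listUsers)

	// Wrapping the mux gives a span per incoming request with almost no
	// changes to the existing controllers, which makes the before and after
	// easy to demonstrate.
	http.ListenAndServe(":8080", otelhttp.NewHandler(mux, "http.server"))
}
```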
For my experience, that was a great way to start speaking about observability.
That's probably because I'm a developer. For me, it's easier to open the code base and dig into that.
But having the opportunity to look at their issue in the code and tell them, "OK. Try to put a trace there, or try to change this log and make it more verbose, or verbose in the right way," let's say.
That small stuff had a huge impact for me.
Liz: I can definitely imagine that if this is the first time that you are seeing a case study or a demonstration of value from someone in your community who speaks your language, that it can be really powerful.
Gianluca: Yeah, that's true. We need KubeCon, we need SREcon, all of those, to learn how stuff should be done.
But at some point we need a few layers of translation to get to where the struggles sometimes are.
Shelby: Thank you for all your awesome advocacy work in the community.
You're a prolific blogger, and you're involved in all these projects and organizing a meetup.
It's really cool to see all of that. So, thank you for taking the time to join us today.
Gianluca: Thank you for having me.
It's easy for me to try stuff or share what I do, because it's a way to learn and double-check that I'm doing something that is useful.
But it's also good to hear that it's going well and that people are enjoying my work, or the outcome of my struggles. So, thank you for that.
Liz: Yeah, we appreciate it very much. Thank you.