O11ycast

Ep. #7, Observability at Asana and Honeycomb
about the episode

In episode 7 of O11ycast, Charity is joined by fellow Honeycomb team member Michael Wilde, along with Asana’s Cliff Chang and Phips Peter, to discuss how observability has shaped their organizations.

Cliff Chang is Engineering Lead–Growth at Asana, a productivity and collaboration tracking tool. Phips Peter is a software engineer and the tech lead for the Growth and Adoption pillar at Asana.

transcript

Charity Majors: What does observability mean to you?

Phips Peter: Observability means to me the ability to see what our customers are doing. The actual performance they see.

A lot of times you can test the feature on your sandbox, see how it loads, but it's not representative of our customers. It's not representative of our customers overseas, or our customers with Windows laptops or even our customers with Firefox. It's nice to be able to get those real metrics into our system and get a sense of what our customers are seeing and feeling.

Charity: How is this different from traditional monitoring?

Phips: I don't think that most front-end teams had monitoring or any real user metrics. A lot of the companies I've talked to, when trying to figure out and understand this better, did either lab experiments or other tests, or relied on their own internal reports of performance.

Charity: Sad, but true.

Phips: One of the worst things, and one of the reasons why I got into this, was when you have a Slack channel and someone in #staff or #general will be like, "Asana feels slow today." Even if it wasn't slow, we had no way to prove it wasn't. And everyone would be like, "Me too," "Me too," and that would be my entire day, just trying to figure that out.

Charity: God, I feel that pain.

Michael Wilde: Everyone starts reloading and it gets worse.

Phips: Especially with our old architecture, it would exacerbate it.

Charity: This feels like a great time for everyone to introduce themselves. I'm Charity. You're used to hearing my voice. We have a pinch hitter for Rachel, who hurt her foot.

Michael: Hey, everybody. It's Michael Wilde from Honeycomb.

Cliff Chang: I'm Cliff. I'm an engineering lead at Asana.

Phips: Hi, I'm Phips and I'm a tech lead at Asana.

Charity: Phips, you are the tech lead and original architect for much of Asana's conversion from a legacy JavaScript monolith to the current stack. What were the goals of the migration, and what is your current stack?

Phips: Our current stack is TypeScript and React on the front end, and then we also have a system called Luna DB, which is a GraphQL server where all the queries are live by default. Which means that if someone else changes a task name, you'll see it immediately in your browser.

For a lot of it, the change was about going from imperative to declarative. We already had this in our existing system, which we called Luna1, where everything worked on functional reactive programming. React was similar in that model, but it extended it to the rest of our code base.

We started using Bazel for our build system, which let us have declarative, correct builds by default, and Kubernetes for deployment infrastructure. Just moving in that direction helped people understand what was going on and reason about the system better.
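What "live by default" can look like, as a minimal TypeScript sketch: the LiveQueryClient class, its method names, and the query keys below are invented for illustration and are not the real Luna DB API.

```typescript
// Hypothetical sketch: LiveQueryClient and its methods are invented names,
// not the real Luna DB API.
type Task = { id: string; name: string };
type Listener = (rows: Task[]) => void;

class LiveQueryClient {
  private listeners = new Map<string, Listener[]>();
  private data = new Map<string, Task[]>();

  // Subscribe to a query; the callback fires now and again on every change.
  subscribe(queryKey: string, listener: Listener): () => void {
    this.listeners.set(queryKey, [...(this.listeners.get(queryKey) ?? []), listener]);
    listener(this.data.get(queryKey) ?? []);
    return () => {
      const rest = (this.listeners.get(queryKey) ?? []).filter((l) => l !== listener);
      this.listeners.set(queryKey, rest);
    };
  }

  // In the real system this push would arrive from the server over a web socket.
  applyServerUpdate(queryKey: string, rows: Task[]): void {
    this.data.set(queryKey, rows);
    for (const listener of this.listeners.get(queryKey) ?? []) listener(rows);
  }
}

// Declarative usage: the UI re-renders whenever someone else changes a task name.
const client = new LiveQueryClient();
client.subscribe("tasksInProject:123", (tasks) =>
  console.log("render task list:", tasks.map((t) => t.name)),
);
client.applyServerUpdate("tasksInProject:123", [{ id: "t1", name: "Ship the migration" }]);
```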

Beyond that, the biggest reason was obviously performance.

It's a huge challenge to build an in-house framework and spend years on it, and then decide that it's not working and you have to change it.

We needed to prove that the performance was going to be better.

Charity: I always feel like in order to rewrite a system you need to be goddamn sure that the performance is going to be at least 10x better than what you had before. It has to be 10 times better, or it's not worth the pain.

Phips: That was the biggest leap of faith for the company, that we didn't have any strong evidence that it would be. We had some intuition and it's pretty easy to reason from first principles that it would be true, but it was hard to--

Charity: That's what they all say. "Then it exploded."

Phips: It did turn out to be true. There's this one moment in 2016 where we finally had enough pieces that I could run the new stack by itself, and it was fast. It was on my birthday in 2016. I was like, "Yes!" After two years of effort it worked.

Charity: Happy birthday! That's amazing.

Cliff: The first time we had that side-by-side histogram, and that right side was 20 times smaller than the left side, I was like, "Yes!"

Charity: That is a giant leap of faith.

Michael: Happy birthday to you, right?

Phips: Yeah, I was like "Yes!"

Charity: I love how modern web stacks seem to be the revenge of the functional programmers where you least expected it.

Phips: I also think revenge of the compilers. Increasingly, the next trend for front end is going to be investing in compiler and AST tools. You see this already with TypeScript and Babel 7, but with WebAssembly and more, you're going to need to do an increasing amount of work at build time, so that you're shipping less to the client.

Charity: I've thought about this in the back end, but it never clicked for me that would be the same in the front end. But it makes sense.

What performance tuning did you need to do? How was it that you did the perspective shift of thinking about your users instead of thinking about the performance of the service?

Phips: The biggest thing that took a while to realize, and was the first of many shifts, was picking a single frame of reference. I was thinking about it from the relativity perspective. On the front end, you can either think about it from the perspective of the server or the perspective of the client, but the perspective of the server is not that meaningful.

Especially because it's controlled, and the way your clients behave is super different. The server can be helpful for understanding why certain client times are high. Picking that one frame of reference was helpful.

Charity: The path of the client request includes lots of things that aren't even on the server.

Phips: Or things that aren't even your service. A lot of browsers will either share memory or share performance concerns; iframes also share the same execution context. There's a lot of things that could be going on in browsers and stuff.

I would say the next major thing that we learned was to pick a target customer, not because we cared about them more, but because they had less variance. We were able to set cultural and product goals around them, which is really helpful.

Charity: Normalizing your monitoring.

Phips: Exactly. We found that people in the United States who paid us for a lot of seats would have pretty stable performance. It makes a lot of sense because if you're paying for Asana you already bought pretty nice laptops and pretty nice Wi-Fi. So we removed a lot of the variance right then and there. That was helpful so we could set goals around them.
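A rough TypeScript illustration of setting goals around that kind of low-variance cohort; the event fields and thresholds below are assumptions, not Asana's actual schema or definitions.

```typescript
// Illustrative only: field names and cohort thresholds are assumptions.
type PageLoadEvent = { country: string; paidSeats: number; pageLoadMs: number };

// A low-variance target cohort: paying US customers with a lot of seats.
const inTargetCohort = (e: PageLoadEvent) => e.country === "US" && e.paidSeats >= 50;

// Goal framing: "X% of target-cohort page loads finish in under thresholdMs."
function percentUnder(events: PageLoadEvent[], thresholdMs: number): number {
  const cohort = events.filter(inTargetCohort);
  if (cohort.length === 0) return 0;
  return (100 * cohort.filter((e) => e.pageLoadMs < thresholdMs).length) / cohort.length;
}

const sample: PageLoadEvent[] = [
  { country: "US", paidSeats: 200, pageLoadMs: 800 },
  { country: "US", paidSeats: 200, pageLoadMs: 1900 },
  { country: "DE", paidSeats: 3, pageLoadMs: 4200 }, // outside the cohort, excluded
];
console.log(`${percentUnder(sample, 1000)}% of target-cohort loads under 1s`);
```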

Charity: That's a good point, and better than any fake user that you could mock up.

Phips: For sure.

Michael: Did you do much instrumentation at all, as well as measurement, and how did you view all that?

Phips: We have a tool that helps with the observability, and the biggest thing that--

Charity: You built it in-house?

Phips: No, we had one in-house that wasn't working, and we switched in 2014 to another service. The two things that were helpful with that were the ability to set our own metrics; specifically, we stopped thinking in terms of percentiles and started thinking in terms of the percentage of users under a certain time.

They're the same metric, it's just a question of which one is the independent and which is the dependent variable. But it helped people click with them. I noticed that some people would sometimes think the distribution was uniform, or whatever, but with "50% under one second," you know what one second feels like for a page load and you're able to identify with that.

The other thing that was helpful was to start looking at distributions instead of percentiles. Once we started looking at distributions we were able to find all sorts of bugs. We found that if you had two peaks in your distribution, you had a bad case: get rid of it. If the distribution was fat or wide, that usually meant something was scaling with the amount of data we were loading, so we needed to put a limit on that query or a limit on that work. If it had a long tail, it usually meant something was running in serial, and if we could put those steps in parallel we'd speed it up.
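A small sketch of reading the whole distribution instead of a single percentile: bucket load times into a histogram and eyeball its shape. The bucket width and sample data are invented for illustration.

```typescript
// Bucket page load times into a crude text histogram.
function histogram(loadTimesMs: number[], bucketMs = 250, maxMs = 5000): number[] {
  const buckets = new Array(Math.ceil(maxMs / bucketMs)).fill(0);
  for (const t of loadTimesMs) {
    buckets[Math.min(Math.floor(t / bucketMs), buckets.length - 1)] += 1;
  }
  return buckets;
}

// Two peaks: a distinct bad case to hunt down. A fat, wide body: work scaling
// with data size, so add a limit. A long tail: serial work to parallelize.
const counts = histogram([300, 320, 340, 2100, 2150, 2200, 4900]);
counts.forEach((n, i) => {
  if (n > 0) console.log(`${i * 250}-${(i + 1) * 250}ms ${"#".repeat(n)}`);
});
```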

Charity: Sometimes you'll see a weird little spike somewhere and you'll realize that it's because you have a distributed system and some shard is down, and even though it's 2.5% of your traffic, 100% of those requests are failing, or 100% of those reads or those writes. Distributed systems, because of the way we've made them all resilient, are actually more fragile in certain ways, because small chunks of people can be completely hosed and it never shows up in your time series aggregates.

Phips: For sure. We'd have bad things from outages, but the one that was interesting was discovering massive differences between EC2 instances.

Charity: God, yes. Let me tell you about that. I come from databases and it took us a while to realize some of them have ghosts. Co-tenancy problems, whatever. We prefer to think of them as ghosts. You have to run a regression test on the performance when you spin one up if you care at all about performance. How long have you both been there?

Cliff: I've been at Asana for about six years.

Charity: You must really like it there. That's nice to hear. That's awesome to hear. I love it when you hear about someone who's been someplace happily for over two years.

All right, sidebar. No one in this industry should hate their job. There is too much opportunity and there are so many good companies out there. You can find one. Don't settle.

Cliff: I feel strongly about this sort of thing. It's a tangent for what we've been talking about, but engineers get so much better when they understand the long term consequences of their choices. When people are switching every 18 to 24 months you'll build something and then you have no idea how it turns out.

Charity: Yes.

Cliff: And then someone who never met you rewrites it and you have no idea why.

Charity: It makes it hard to ever develop into a senior engineer.

Phips: I'd definitely agree with that. I've been there for five years, and one of the biggest things was designing the new framework. That part was easy; it was a lot of fun and it was just a great time. But the hardest part was the tail end and finishing it: scaling for the number of developers, not the number of users, and trying to teach everyone how to do this, and--

Charity: I don't have kids, I don't want kids. But I think about this the way I hear parents talking about having kids, like "My younger days were great. They were wild and crazy, but now I've grown up and this is a much deeper joy. More fulfilling." Maybe. But I feel that way about my job.

In the beginning it's quick and easy and fun, startups especially. Everything's new, you build it from scratch. But there comes a point where it gets a little hollow and you want to see, experience, and you want to have an impact in the world. The things that they say make you happy in life are autonomy and impact and--

Cliff: Purpose.

Charity: Purpose.

Michael: Question for both of you. What cultural changes needed to happen for the migration to work out? You talked about the tail end of it, having to teach developers new stuff. Were there any bumps in the road, and how'd it work?

Phips: I can talk about a small slice of it. One of the best parts about working with Cliff is that this is something Cliff shines at, so I'm really excited to hear what he has to say. One of the things that we'd talk about is the engineering performance philosophy, because we hadn't been measuring page load performance and it was so bad. People thought about putting everything on page load to make subsequent actions in the app quick, and it turned out, once we started looking at it, that most of our users never did more than one navigation.

We were doing all this work on page load for things that users never did. Getting PMs and engineers to think about deferring work, and then the actual product experience around that. Having loading indicators or whatnot for these things that used to be instant, and having them be okay with it, was a huge cultural shift.

Charity: I think of this in terms of, it's like you want observability-driven development. I was at Facebook and I went through this whole massive trauma with Parse, reliability-wise, but after we got into a tool that you could drill down to the raw requests--

And by the way, if you can't drill down to the raw requests, it's not observability, in my opinion.

Because you don't have the ability to ask new questions that you didn't pre-aggregate by or index by, or something. You've got to get to the raw requests. But I found that once we had that tooling, the entire way that we thought about how we chose our work changed, and it was like, "I don't want to build this until I have a good sense that it's going to have impact."

So if we're rolling out a storage engine change, or considering building it, we would calculate, "How much would this buy us? Would it be evenly distributed, or would only a few people reap the rewards?" Then you have this muscle memory, this habit of checking yourself as you go along. "Did I build what I thought I'd built? Did I ship what I thought I shipped? Are my users using it the way I thought they would?" It makes the idea of going back to not having that feel like driving without your glasses.
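One way that loop can look in code is a single wide, structured event per request, sketched below; the field names and the emit destination are assumptions, not any particular vendor's API.

```typescript
// Emit one wide, structured event per request so new questions can be asked later.
type RequestEvent = {
  timestamp: string;
  route: string;
  userId: string;
  featureFlags: Record<string, boolean>;
  storageEngine: string;
  durationMs: number;
  error?: string;
};

function emit(event: RequestEvent): void {
  // In practice this goes to your event pipeline; printing JSON stands in for that.
  console.log(JSON.stringify(event));
}

emit({
  timestamp: new Date().toISOString(),
  route: "/api/projects/:id/tasks",
  userId: "user_42",
  featureFlags: { newStorageEngine: true },
  storageEngine: "engine-v2",
  durationMs: 183,
});
// Slicing by featureFlags.newStorageEngine later answers "did I ship what I
// thought I shipped, and who is actually reaping the rewards?"
```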

Cliff: We had a great series of those. We were working on performance in parallel with instrumenting our performance, where we had this end-to-end, "OK, we have a number." And people were like, "This number is big," or, "It's not a single number; it's a whole distribution."

That was another thing that we had to change. "How can we make these numbers go down?" We didn't know which parts were which. We had this, "OK, there are three parts. There's the server part, there's the client part, and there's over the wire." Then we started breaking those into finer and finer pieces, and we found that in the beginning, when we were working on performance, it was almost like guesswork. "Here's a thing that's bad."

Charity: "Can't be good."

Cliff: Then we made it way better, but it turns out that thing was happening in parallel with something that was taking longer. So even though we reduced that step we hadn't changed the performance.

Charity: The thing that people don't realize is that it looks big and daunting when you have all these performance problems, but in fact in the early days you get so many easy wins. There's just so much to do, and yet you're going to find that there is so much wrong with your stack that you had no idea about because it was covered up by these other things. But once you get instrumented and you start looking under the covers, you're just like, "Fuck."

Cliff: Wait a sec, we can fix that. Someone wrote in 2011, "This might be a bad idea, but leaving this here for now."

Charity: Yes. Exactly.

Phips: That did happen.

Cliff: Yeah, the Constantine thing.

Charity: It happens to everyone.

Phips: The one thing that we didn't know until we started looking at it was that we spent a lot of time on our new server making it fast, because our old server had been so slow. It turns out it was so fast that the client couldn't process the next frame of web socket messages, and that's still mostly true today. Renders and DOM paints and all those things take so long that the next web socket message is delayed by milliseconds.
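A browser-side sketch of how you might watch for this, using the standard long-task PerformanceObserver next to a web socket handler; the URL and thresholds are placeholders, not Asana's actual instrumentation.

```typescript
// Long main-thread tasks (renders, DOM paints) are what delay the next
// web socket message on a single-threaded client.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Anything over ~50ms blocks incoming messages for at least that long.
    console.warn(`long task: ${entry.duration.toFixed(1)}ms at ${entry.startTime.toFixed(1)}ms`);
  }
});
observer.observe({ entryTypes: ["longtask"] });

const socket = new WebSocket("wss://example.invalid/stream"); // placeholder URL
socket.onmessage = (msg: MessageEvent) => {
  // Stamp each message as it is finally handled so delays show up in the data.
  console.log(`handled socket message at ${performance.now().toFixed(1)}ms`, msg.data);
};
```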

Charity: This reminds me of the longest performance capture replay thing that I've ever done in my life was back at Linden Lab, upgrading from MySQL 4.0 to 5.0. We did it once, it crashed and burned, we had a day and a half of downtime and some lost data. It was traumatic.

I went off on this year long quest to develop software that would let us capture 24 hours worth of queries and replay them on hardware, tweaking one thing at a time, and I got it. I did all the stuff so that instead of 20% slower, it was 1.5 times faster. I was so proud of myself.

Then I quit and I checked in on them a few months later, I'm like "How's it going?" And they're like, "We just replaced them all with SSDs." I just went, "Holy crap. OK. I feel great about the last year of my life." Question, are your software engineers on call?

Cliff: Yes.

Phips: Yeah.

Charity: How do you feel about this?

Cliff: One of the nice side effects of bottling up half of Asana's product engineering to work on performance and rewriting for a year, or a year and a half--

Charity: You got your feet wet.

Cliff: Yeah. Now that we have a lot of engineers, we have an engineer on every team who cares about performance and knows how to measure performance and knows how to think about it, and can help teach the people around them.

Charity: Can you describe your on call rotation?

Cliff: Sure. We went from three on-call rotations down to two. We have one for stability and uptime, so they're the ones receiving the alarms from databases, AWS instances and that sort of thing. We had one that was just focused on release, so that was how we do build and deploy: making sure that each deploy goes smoothly, and helping the web on-call, which is the last one, understand the state of what's in production, doing cherry-picks, rollbacks, that sort of stuff.

Web on-call is often less timely, but they're responsible for generally triaging performance bugs as well as any other kind of bug. Something that happened at Asana is that people internalized that performance was the product and that performance is a feature on top of everything else, so performance bugs get triaged in the same queue as other stuff. We're less successful than we could be at having everyone believe that, especially as the dark times, when performance was the number one terrible thing about Asana, get further away in the rearview mirror.

Phips: I also think we've applied this observability thing in numerous cases--we improved our builds that way--but one area where I don't think we've done as well as we can yet is errors on the front end. We have our own in-house tool for it, and it's good, but it could get even better if we could do that level of observability and query to understand why specific things happen.

As we were saying earlier, it's a single-page app, everyone merges into one app, and for better or worse it's a tragedy of the commons. Complicated things happen, and I wish we had more tracing and more ability to see everything customers were doing. Not from a creepy perspective, but just so we can give them the best software experience possible.

Charity: Do you do distributed tracing?

Phips: What do you mean by that?

Charity: Have you set up Jaeger or OpenTracing, or anything like that?

Phips: I think we have some on the backend. I'm not super familiar with that, I haven't touched it in a while.

Charity: No worries, just curious.

Phips: It's not that distributed when you have a single threaded process on the client.

Charity: This is true, but it can still be very useful. It's a waterfall overlay over your events, so you can see if this hop is 50% slower than usual, or something like that. Especially if you're lining up lots of database requests, it's really valuable.

Phips: We do have that for the Luna DB system itself. We can do a waterfall and see how long each individual function takes to resolve, so Luna DB acts like a GraphQL server and a proxy for all of our backend services from the perspective of the front end. There you can observe what the time is, but for the most part it's not been the issue, so I don't think most people at the company know how to use it.
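A toy version of that waterfall in TypeScript: wrap each resolver-like function in a timing helper and print the spans afterwards. The resolver names and the wrapper are illustrative assumptions, not Luna DB internals.

```typescript
// Record a span per traced function, then print a crude text waterfall.
type Span = { name: string; startMs: number; durationMs: number };
const spans: Span[] = [];

async function traced<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    spans.push({ name, startMs: start, durationMs: Date.now() - start });
  }
}

async function main() {
  // Two pretend resolvers; in a real server these would call backend services.
  await traced("resolveProject", () => new Promise((r) => setTimeout(r, 30)));
  await traced("resolveTasks", () => new Promise((r) => setTimeout(r, 80)));

  const origin = spans[0].startMs;
  for (const s of spans) {
    const bar = "#".repeat(Math.max(1, Math.round(s.durationMs / 10)));
    console.log(`${s.name.padEnd(16)} +${s.startMs - origin}ms ${bar} ${s.durationMs}ms`);
  }
}

main();
```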

Charity: Another key thing that we're seeing when we talk to people, a refrain that we hear echoed is, "It's not fixing the code that's a problem or debugging the code. It's knowing where to look and what part of the code to debug."

It's triaging, slicing and dicing in real time and knowing where the errors are coming from. Especially if you're a platform, any error or any latency increase that you emanate can theoretically infect everyone and make the latency rise for everything. Then it's a matter of tracing it back to the original requests. Were your software engineers on call originally? Or was this part of the process?

Cliff: Since the beginning, people have been on call. I joined as the 12th or 13th engineer, so I don't know about the beginning-beginning, but originally it was just a single on call rotation where everyone responds to everything. We specialized over time.

Charity: Are there many complaints about that, or is it accepted? How bad is your on call?

Cliff: It can be exhausting. The web on call is less demanding, because when things are wrong it's easy to roll back. We have a sophisticated activation system for what releases people are on, so it often goes from "Emergency!" to, "Just roll back." Then you have time to debug it.

Charity: Debug at leisure.

Cliff: It is more stressful for stability on call, or infrastructure on call, because those fires, when they happen, need to be dealt with right away. We have a few people now across the world in terms of engineering, so we have much more coverage for the nighttime hours.

On-call engineers aren't expected to wake up at 3:00 in the morning to diagnose what's going on, and that's been a big help. But we do need to make sure that as Asana engineering gets larger and engineers get more specialized and product engineers are further away from the infrastructure, that product engineers are there to help infrastructure engineers. When the cause of the thing sometimes is that a database went down, but sometimes it's--

Charity: It's really important for management to commit to allowing enough time for the engineers to pay down the technical debt that they need. There's nothing more frustrating than knowing what needs to be done, but being like, "We have to ship these features, so you're just going to keep getting your sleep abused for the next 'n' weeks." That's terrible.

Cliff: That's one of the things we like about having all of our engineers do some form of on call once they've been onboarded: they see the negative effects. You have a lot of voices externally in the company saying, "We need this shipped by this deadline," but you have the engineers unified going, "No. We're all going to get woken up."

Charity: I agree, that's powerful. We've found that at Honeycomb there are two questions that predict above all others whether or not a customer will be successful with us. They are, "Are your engineers on call?" and, "Can you summon the engineering discipline to structure a fucking log line?" Without the swearing.

Michael: "Did you summon the engineering discipline to structure your log lines?"

Charity: "Are your logs structured?"

Michael: Exactly.

Cliff: Are our logs structured?

Phips: Yeah.
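What "structured" buys you, as a generic sketch not tied to any particular logging library: key-value data you can query instead of free text you have to regex.

```typescript
// The same fact, unstructured versus structured.
const unstructured = "user 42 loaded project 7 in 1834ms on firefox";

const structured = JSON.stringify({
  event: "page_load",
  user_id: 42,
  project_id: 7,
  duration_ms: 1834,
  browser: "firefox",
});

console.log(unstructured); // hard to slice by browser or duration
console.log(structured);   // trivially filterable: browser=firefox AND duration_ms>1000
```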

Charity: Of course they are. I didn't even need to ask. I could tell. What are the cultural aspects of Asana that lent themselves, or which of them did you feel like you were fighting or needed to be changed? And which of them were the wind at your back, helping you?

Cliff: There were definitely both pros and cons to the Asana culture. One thing that made it hard, especially in the beginning when the team was struggling to figure out exactly what was going on, was that Asana's culture has a big emphasis on transparency and having well-defined problems. It's often hard to define performance problems, because a lot of the work--

Charity: There's a problem.

Cliff: Right, "There's a problem. That's all I know." All the work is in defining the problem.

Charity: There might not even be a problem, it's possible that the reporter is unreliable. But we think there might be a problem.

Cliff: "We've had six people complain about it today, so that's our metric that we're using." It was hard for the rest of the company to understand why performance was so hard if they hadn't been working on it, because "Why aren't you just solving it the normal way?"

Charity: "Why don't you just make it faster?"

Cliff: Right. "Why don't you define the problem and then solve it?" "We're trying to." That's the hard part.

Charity: I've always struggled with this, being an operations engineer. It took me a long time to even understand how people could estimate the amount of work it would take them to ship things. I'm just like, "How do you know?" Because so many of the things that I have are just, "Figure out what's wrong." I'm like, "I can't attach a number of hours to that. You're crazy."

Cliff: As soon as I figure out how many hours it's going to take.

Charity: Could be 1, could be 10.

Cliff: Something that helped was that Asana has always been good at practicing long term thinking. It's not been an iterate-wildly-to-try-to-find-product-market-fit kind of thing; the founder is in product--

Charity: Not "fail fast, move fast and break things"?

Cliff: People understood that at some point this was going to take years. People were like, "OK. We can't imagine Asana five years from now without having done this, so I guess we need to do it at some point. We might as well be doing it now." There were obviously questions about the timing.

Michael: That's a keynote speech right there.

Charity: This would be an awesome-- I don't know if you guys have given talks about this transformation, but I would watch the hell out of that.

Cliff: Phips should think about that.

Charity: In fact, if you want help editing or crafting a blurb or whatever, I would be super down for that.

Phips: I would love that. That would be great, thank you.

Charity: Totally. What cultural changes did you need to help make, and enact?

Cliff: We've talked about some of them. For the team it's that timing thing, because most of the people working on this were product engineers, and they were used to having this, "We think this will take about four months," and this predictable cadence of happiness.

Where you're going to be working at the beginning and it's fun, and then it's getting closer to crunch time but we've got two weeks left. "We can do it." There's this whole rhythm that you're used to, and this performance investigation work, emotionally it's a lot more unpredictable.

Charity: That dopamine hit when you find it though, there's nothing like it.

Cliff: But it comes out of nowhere, because you're like, "We have no idea what's going on, we have no idea what's going on. That thing that we did, it worked."

Charity: That requires a different ownership model too. If you're used to being able to hand everything off at the end of the week when you come off call, maybe you can't do that with something that just stretches on. It's not worth handing it off.

Phips: I do think the next level, though, now that we understand it so well, is to try to internalize it and make tools for it. I was on performance for a long time; now I'm back on our product team and enjoying the regular four month cadence.

But it was funny, one of the page load performance problems I had found came back, and I had helped make an alarm on it. When it got worse, the alarm fired. I was like, "Yes. Victory." Getting more and more of that, so developers are able to do the right thing by default, is a thing that we focus a lot on in our code.

Charity: Think of it like crafting a path. It shouldn't be impossible to do things, but by default the right thing should happen. When I talk to a lot of companies who are making that shift from being 50 people to over 300, and they've passed that Dunbar's number, the successful companies tend to start out with a lot of chaos. With a lot of, "You do your own thing. Engineers are empowered to make their own decisions, pick what you want to pick and go." Then they reach a point where they have spaghetti, and nobody can get anything done because they're all fighting each other's tools.

The advice that I always give people was don't take away that engineering autonomy, but craft a golden path.

Bring your most senior technical people into the room, don't let them leave until they've agreed on, "Here are the tools that we are going to support as a company and that we're going to recommend that people use. We're going to use this for monitoring, we're going to use this language, we're going to use this to put-- You're free to choose something else, but you will be supporting it." That's the best path that I've seen.

Phips: We go through that occasionally, but people have stayed with the single stack, and the tooling efficiencies around it have made things so much faster.

Charity: God, it's amazing.

Phips: Language was pretty easy, everyone got that, but the biggest one was build system and tools. At first people--

Charity: What about databases?

Phips: That's an interesting one. We just had that come up again. For the most part we're sticking with a single database. A lot of this has to do with our actual database structure: through the Facebook DNA of Asana, we store everything in what we call an OKV store, which is traditionally entity, attribute, value.

So because everything is stored in this flexible way, you need the query adapters for the most part to do any product work. You have to go through that, and it's just not that interesting to scale up a new database.

Also, our product data model is so coupled together that, in order to provide that wonderful experience of creating tasks in one project and seeing them show up in another project, there hasn't been a lot of need for databases external to the product. The newest one is for making sure we understand how our customers are paying us and why, and that's been an interesting question of, "What level of functional programming and immutability do we want in our database?"
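A toy entity-attribute-value store in TypeScript, the general shape Phips is describing; the row layout and query helper are illustrative assumptions, not Asana's production "OKV" schema.

```typescript
// Everything is a flexible (entity, attribute, value) triple.
type Row = { entity: string; attribute: string; value: string };

const rows: Row[] = [
  { entity: "task:1", attribute: "name", value: "Ship migration" },
  { entity: "task:1", attribute: "project", value: "project:7" },
  { entity: "task:2", attribute: "name", value: "Write blog post" },
  { entity: "task:2", attribute: "project", value: "project:7" },
];

// Reads go through a query adapter that reassembles entities,
// rather than a per-feature table.
function getEntity(entity: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const row of rows) {
    if (row.entity === entity) result[row.attribute] = row.value;
  }
  return result;
}

console.log(getEntity("task:1")); // { name: "Ship migration", project: "project:7" }
```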

Cliff: For billing history you generally want that to be immutable. System versioning is a useful thing for that, because it guarantees that you understand where each transaction came from. Someone I manage was the tech lead for this project, and he wanted to use MariaDB, which is a fork of MySQL that has system versioning by default.

And there's this promise that "MariaDB is exactly like MySQL, except plus system versioning." But is that true? Historically it's taken them 12 to 18 months to catch up with new versions of MySQL. They're doing it, but it's getting a little bit slower.

How much faith is there in that, and how important is it for system versioning to be first class in the system? There was a lot of gnashing of teeth and debate about whether or not this was the right thing to do. We ended up being conservative and going with MySQL.

Charity: You did the right thing.

Cliff: Thank you. Folks we've talked to from other companies have helped validate that. It was one of those moments where it was like, "What is the golden path?" That was a major influence of it. We're fairly immature in terms of thinking about services, we haven't had to choose which database we're going to use that many times.

Charity: The closer you get to laying bits down on disks, the more risk averse you should be.

Cliff: Yes, thank you.

Phips: That's what we ended up doing, and for that reason too.

Charity: Nice. For each of you, we've talked about this a bit. But what were the biggest lessons that you learned? What advice would you give to folks who are contemplating doing something similar, and finally, what were your biggest mistakes?

Cliff: One of the things I learned, and I was learning at the same time as the organization, was how much to trust engineers in this sort of thing. In the beginning it was this high profile project, "We're going to fix performance. Rah, rah, rah." Then after the first month our numbers hadn't moved that much, and we were like "What's going on?"

Having more people care made it worse in a lot of ways. When we finally started having success we were like, "It's too complicated for us to be explaining things. Let's trust the team that's working on it and trust that they are motivated to solve the problem that we think is the problem." At that point they had a lot more freedom to pursue the spark of a random idea they had, even though it wasn't on the roadmap of the "six most promising things to try." "Something is weird here, I'm going to spend a day or two looking at it." That made a huge difference.

One thing that we learned from that is we've continued to maintain that team. Its mandate is now broader than performance; it's a client infrastructure team, and we don't question its staffing. We're not asking every year, "Is 12% of engineering the right amount of engineering to spend on meta projects?"

We're not going to try to compare the apples to oranges. We're going to reserve engineering time to make engineers' lives better, things more reliable, things faster. That's something I'm really proud of, as the legacy of this big performance effort.

Charity: You need to write a blog post, dude.

Cliff: In terms of mistakes, there's a lot. Letting it get to that place that I was talking about first, where it's like, "What is our weekly status update about performance?" It's just not the cadence of how performance work goes.

Charity: But it always goes this way. It can always be put off, and put off, and put off, until one day it can't. But all along, would you have achieved your goals as a company if you were regularly parceling that work out? Because you don't know which features are going to be popular until they're popular. It's more likely to be a sin to over-optimize in advance than to just wait, do your best, and then sweep it all up once you need to.

Cliff: Another mistake, because I don't want to leave out too many of my mistakes, is setting goals. We just did it in a vacuum. We were just like, "In the first six months we think we can move this number by this much--"

Charity: Just throw a dart at the board.

Cliff: Right, and it was totally random, and when that number wasn't hit everyone was incredibly disappointed. Like, "This has been a failure." But in those six months we made tremendous strides in observability and logging, and the next six months after that went amazingly. But by then people had no expectations, so it was like we didn't understand the order of how things happen.

Charity: Picking that first goal number is such a shot in the dark. That first interval is always about figuring out what the number you should even set is. I've got to imagine a lot of this was education, like internal education.

Cliff: All of us were figuring it out. When Phips inverted that graph to show not percentiles but what the bands are and how many people are in each band, everyone was like, "That's what it looks like. OK. Let's move that part."

Charity: I feel like we could do an entire podcast just talking about the user education. But, your turn.

Phips: The thing I learned early was to focus on the developers using your tools; when designing the new framework, I always thought about it as a product.

I had been at companies in the past that didn't treat their tooling that well, and didn't think of their other employees, their hostage users, as customers. Trying to do that well, you're never done. You've got to keep going and keep improving.

One of the big mistakes is that Asana had made a great framework called Luna and then stopped. Because we stopped, we had to reinvest and it took a lot of effort, and that's why I'm glad that we have the client infrastructure team going. I'd also say that doing it incrementally and continuing to ship product is important. You can't just stop the ball for the rewrite.

Related to the developer tooling: there's the advantage of treating your code like data, and as another product, and getting involved with code mods and getting rid of bad patterns. The tooling around code sets people up really well, so at this point we have the ability to enforce patterns and get rid of old patterns through code mods; we can delete bad types of code.

The big change from that learning was, "We need to convert the client." We did that with this incremental rewrite, and now we can fix our server code, because Asana traditionally ran the same code on the client and the server with a stable process. I can fix the server code with actual incremental changes to our framework, and code mods transform the code from one state to the next, hopefully without losing data and information along the way.
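A small codemod in the "treat your code like data" spirit, written against jscodeshift's documented transform signature; the specific rename below is an invented example, not an actual Asana codemod.

```typescript
// rename-fetch-task.ts: replace every use of an old identifier with a new one.
import type { API, FileInfo } from "jscodeshift";

export default function transform(file: FileInfo, api: API): string {
  const j = api.jscodeshift;
  return j(file.source)
    .find(j.Identifier, { name: "oldFetchTask" })  // every use of the old name
    .replaceWith(() => j.identifier("fetchTask"))  // swap in the new one
    .toSource();
}
// Run with: jscodeshift -t rename-fetch-task.ts src/
```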

Charity: I love that you say that because I was sitting here thinking, "I have a little bit of trepidation coming into this because I am so far away from being a front end engineer. Will I know what we're talking about, will I have anything to add? Will it feel completely awkward like I'm talking to them about knitting, or something?"

This felt just like a conversation that we would've had about instrumentation and observability on the backend, or the databases. Same thing, minus a couple of words of, "This framework versus that framework." People get, or I know I get a little mentally distracted by the idea that there's a browser. Like, "It must be completely different." It's just another client.

Somebody once said that the web was the original distributed system, that browsers are the original distributed system clients. I think that's fascinating. I can't think of how this would be a different conversation. This has been awesome though, thank you guys for coming.

Cliff: Thank you.

Phips: It's been a lot of fun.

Charity: Cheers.