Ep. #72, Mobile Observability with Hanson Ho of Embrace
Episode 72 of o11ycast explores the world of mobile observability with Hanson Ho, Android Architect at Embrace. Hanson unpacks how mobile's diverse device landscape impacts data collection and performance monitoring, the role of OpenTelemetry in simplifying these challenges, and the importance of tailoring observability strategies to real-world user experiences. Join hosts Charity, Jess, and Austin for this insightful examination of an often overlooked observability niche.
Hanson Ho is an Android engineer at Embrace with nearly 20 years of experience in software development. Prior to joining Embrace, Hanson worked in various technical roles at SAP, Salesforce, and Twitter, where he spent over seven years focused on Android performance and observability at a massive scale.
transcript
Hanson Ho: Oh, yeah, we're just talking about staying online and making sure the upload happens successfully before I close it. And on mobile sometimes you don't even know if there's telemetry to receive if you don't even send it properly on networks, especially in countries that have not super great infrastructure.
That happens all the time, so it's just one of those things that we deal with on mobile. So it was just kind of funny that we mentioned that right before.
Jessica Kerr: So there's telemetry that you never know you didn't get it.
Hanson: Yeah, the awesomeness of mobile is that these are disparate devices, millions and millions of them sending data, perhaps out of order, perhaps it's quite delayed to your backend and you want to kind of aggregate everything, and tell you how your app is doing.
And for the most part it works, but sometimes there's data loss and when you have some interesting edge cases, especially that are especially prevalent in certain regions, some of that data gets lost.
And one of many things that we have to kind of deal with on mobile that I'm sure you have to deal with it in the backend as well, but perhaps not as acute of a problem is the fact that recording data doesn't mean you're going to get it in the backend.
Charity Majors: I used to work with someone who's this cranky Israeli engineer and he was like, "Data, eh, it's like grains of sand. You pick it up, you put it down, you drop some. You always drop some data." I was like that is a great life philosophy. I'm glad you don't work on my database, but yes, yes, data is like grains of sand. You always drop some.
Austin Parker: It's a great attitude for someone who is maybe not a DBA, right?
Charity: Exactly.
Hanson: Well, I mean, if the data is perfectly randomly dropped, sure, I'll take that.
Jessica: Exactly, it's not sending us a sample rate though. I mean, with the sand, at least you feel it going through your fingers.
Hanson: Yeah, yeah, it's like okay, okay, 2G network, I'm going to send you 20% and it's going to be perfectly random, so you're going to be able to take the data and the aggregates according to proportion.
But no, unfortunately it doesn't work that way and it sucks, it sucks having to deal with that. But we move on because a lot of folks use mobile apps and not everybody has the fanciest phones. So not only then does the network have issues, sometimes your phone says, "Ah, I'm just going to throttle you. You don't need that much CPU. Oh, your heap size is fine. I'm just going to do a bunch of GC 'cause I need a bunch of RAM in the background."
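The idealized case Hanson describes, where a known fraction of data arrives and the loss is perfectly random, can be sketched in a few lines. This is only an illustration of why that case would be easy: the function name and the numbers are hypothetical, and the correction stops being valid as soon as loss correlates with network, region, or device, which is exactly the situation on mobile.

```python
def estimate_total(received_count: int, sample_rate: float) -> float:
    """Scale a received count up to an estimate of what was recorded.

    Only valid if every event had the same, known chance of arriving.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return received_count / sample_rate


# Hypothetical: 200 events arrived from devices that delivered a uniform 20%.
estimated = estimate_total(200, 0.20)
print(f"estimated events recorded on devices: {estimated:.0f}")
```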
So there are all these conditions on mobile that you have to deal with that spike up randomly. So not only do you have to kind of record the application and the state of the application, but also the state of the environment because well, why is my trace slow?
Perhaps your CPU's being throttled. Perhaps there is a video encoding running in the background because you are trying to upload your latest YouTube. I'm going to use words that make me sound like the age I am, influencers do to promote their brand or whatever, but basically devices are busy.
You don't have control of how much resources you have, so you can't provision. It's up to not only the user. It's up to the OS, it's up to the environment, and then frankly it could even be up to the OEM. Different Android versions run differently on different phones because settings are different.
So this makes having the ability to slice and dice even more crucial. When folks, traditionally when they do crash logging, they have these basic things they log. Oh, OS version, things like that, and it's great, those are good if you have some raw problems. Every single Android Seven crashes, you got that.
But performance, it's a lot more subtle, the way to figure out why something is slow and also if you capture everything, it's really expensive to bundle up and to send. So it's how do you find the balance on mobile? You need a lot of context. It's expensive to encode that context. It's even expensive to ask for the context 'cause I'm asking the OS to basically say, "Hey, tell me what your network status is."
And the OS says, "Huh, oh, no, wait. Package manager, connection manager, I got to wait. Lots of requests before you." So simply by not being able to get the data fast enough when you need the data, it's the worst of all worlds basically.
Jessica: That sounds so hard.
Hanson: It's hard, but it's fun, I mean.
Jessica: Yes.
Charity: Mobile is like a throwback to the early 2000s when resources were precious and scarce and you wanted to do all this stuff, but oh, my god, my x86 chip, better not overload it, and part of me really misses those days.
I think performance tuning is one of the finest arts in computing. It's way more fun in my opinion than building stupid features people are just going to complain about.
Everybody loves a good performance tuning session, right?
Hanson: You have an awesome trace. Even on Android, you have multiple processes, 50 threads, each doing different things. Google Threads gives very good tracing, so you have lots of rich data if you're able to kind of capture that, but it's a couple megabytes for every 10 seconds and it's not feasible to do that in production.
Charity: We should probably have you introduce yourself at some point here.
Hanson: My name's Hanson Ho. I work for Embrace. I do mobile observability. I focus on Android. Previously I did performance improvement stuff at Twitter for the Android app, focusing on initially, well, emerging markets, bad networks, slow phones.
This is how I got into this whole space is I had to work with these conditions that are suboptimal. It's like I can't believe people put up with this. You think app startup being 10 seconds is bad. That's not that bad, especially if you're on an older Android phone. People are used to that.
Charity: You know who used to put up with this? Us 10 years ago.
Hanson: Exactly, people look at it, it's like oh, yeah, nobody's actually going to wait 15 seconds for an app to start up. So clearly anything beyond a certain threshold, we just don't have to worry about.
Jessica: Easy to say when you live in San Francisco and you have cellular everywhere.
Hanson: Oh, yeah, and it's not even just the network as well, it's people's phones, they get passed on, especially Android. iOS is, Apple does a very good job of making certain devices obsolete and making you upgrade, and even old devices perform really well.
On Android, you're getting devices, they're midrange when they're initially released and by the time seven years rolls around, you still have a midrange. So yeah, the context of having to deal with things like that on Android informed my proclivity to kind of figure out these problems.
What are these outliers? Why are these things happening? And the deeper you go, the deeper you go on Android, and they exist. Just because your app data--
Jessica: Whatever it is, they exist.
Hanson: Well, I mean, your app data says no one is using your app below a certain version. Well, it's because they can't use it. They load it up, it takes 30 seconds, a minute to load up, and they uninstall it right away.
So you're basically self-selecting for the performance profiles of people who use your app. To raise the bar of improving performance is not making existing users happier. It's basically increasing the ability for older phones and older devices to actually use your app.
So the ability to track churn simply by knowing whether folks returned is the ultimate metric. It's great to know how long a certain workflow took, that's great, but ultimately it's about whether they will continue using your app.
Jessica: So you're expanding your user base by making your app perform at all.
Hanson: Exactly, so especially if you're a big box retailer or somebody whose audience may be older demographic with hand-me-down phones, they pull up your ABC app and they realize I want to go buy a hammer or something like that.
I can't log in, what's going on? Well, it doesn't work. Well, that's because maybe certain things about the device just makes it not work very well and you don't even know who you're losing. So when you're looking at p50, p75, it's like oh, yeah, it's fast. Oh, yeah, those are the people who stayed.
Jessica: Oh, yeah, p75 of people who can use it at all.
Hanson: Exactly, you can't assume a trace will actually complete because users abandon it all the time. So you basically have to understand that some amount of people are going to be abandoning your workflow, so your percentiles are just off. You have to know who quit.
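A small sketch of the effect described here, using made-up durations. It compares a p75 computed only over traces that completed with one that keeps abandoned traces in the population as "never finished". The nearest-rank percentile helper and all the numbers are assumptions for illustration, not how any particular product computes percentiles.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; values may include math.inf."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


completed = [1.2, 1.5, 1.8, 2.0, 2.4, 3.1]  # seconds, traces that finished
abandoned = [8.0, 12.0]                     # seconds waited before quitting

# Percentile over the people who stayed.
survivors_p75 = percentile(completed, 75)

# Keep the quitters in the population: their trace never completed,
# so their duration is treated as unbounded instead of being dropped.
everyone = completed + [math.inf] * len(abandoned)
everyone_p75 = percentile(everyone, 75)
everyone_p90 = percentile(everyone, 90)
abandon_rate = len(abandoned) / len(everyone)

print(survivors_p75, everyone_p75, everyone_p90, abandon_rate)
```

With the quitters included, the higher percentiles move up or become unbounded, which is one way of seeing that the survivors-only numbers describe a self-selected group.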
Jessica: So how do you deal with that? Do you send something at the beginning of the trace?
Hanson: So at Embrace we use OpenTelemetry and OpenTelemetry is about completed spans before you export, but sometimes we can't wait that long. So we have contextual clues that we can use, but we do what we call a span snapshot. Just basically take the span at the time of termination and send it over.
That's not OTel, it's not OTLP. It's bad, but it works for us because we have to kind of know the state of traces and spans when we need to know about it, not when it's done. When the user backgrounds the app, the OS can just kill it without telling you.
So it's not going to tell you nicely okay, Android app, I'm going to kill your app. It's just going to kill your app.
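The idea of capturing spans before they complete can be sketched roughly as follows. This is a hypothetical, minimal tracker, not the Embrace SDK and not an OpenTelemetry API: the class, method, and field names are all invented for illustration. The point is only that open spans are serialized with no end time at the moment the app is backgrounded, because the process may be killed before they ever end.

```python
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    start_ns: int
    end_ns: Optional[int] = None


class SpanTracker:
    """Hypothetical tracker that can snapshot spans still in flight."""

    def __init__(self):
        self._open = {}

    def start(self, name, now_ns=None):
        span = Span(name, now_ns if now_ns is not None else time.time_ns())
        self._open[name] = span
        return span

    def end(self, name, now_ns=None):
        span = self._open.pop(name)
        span.end_ns = now_ns if now_ns is not None else time.time_ns()
        return span

    def snapshot(self, now_ns=None):
        """Serialize every still-open span, marked as incomplete."""
        now = now_ns if now_ns is not None else time.time_ns()
        return json.dumps([
            {"name": s.name, "start_ns": s.start_ns, "end_ns": None,
             "snapshot_ns": now, "complete": False}
            for s in self._open.values()
        ])


tracker = SpanTracker()
tracker.start("checkout", now_ns=1000)
tracker.start("load_cart", now_ns=1500)
tracker.end("load_cart", now_ns=2000)

# Called from a (hypothetical) "app moved to background" hook.
snap = tracker.snapshot(now_ns=5000)
print(snap)
```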
Charity: Well, yeah, you're misbehaving. Why would it be all nice and polite when it could just smack you?
Hanson: It loves to recover resources.
Charity: Our god is an Old Testament god.
Austin: Yeah, I think it's interesting you bring up that whole thing of backgrounding and there's kind of two interesting parts there.
There's one I think OpenTelemetry included, a lot of these telemetry tools are really designed with a performance profile where you can control almost every aspect of how the application is running, right?
You not only control the code that's running, sure, but if I want to get way down the weeds and start doing different kernels, and different Linux distributions, and different racks even in a data center, and I want to micro optimize the environment that my backend service runs in, I have that option.
Whereas on the front end or for clients in general, there's so much that's completely outside your control, right? There's the OS, there's the physical hardware. You said in the Android case already, but there's the mobile network operator. There's physics play a much bigger role.
Jessica: Oh, my phone shut down on me yesterday because it got too hot. I can take a meeting in 95 degrees, but apparently not my Android phone.
Austin: Yeah, and so I think it is so interesting, how does that shape how you're thinking about the telemetry that you actually need beyond anything around making sure that telemetry exists and gets exported.
When there's so much that's out of your control as a developer, what are the things that are really important to focus on?
Hanson: I mean, the primary thing is that anything you record, you don't lose, so first is make sure that happens. And then basically you have to kind of model your world. What is important to your app? Is this a network connection? Is it whether the device is being thermal throttled? Because that will reduce the CPU allotted to you.
Ideally you can actually get what core you're running on because sometimes the process gets assigned, scheduled on a lower end core, and these things you can't tell unless you can dig deep. So it's almost like peel back the onion.
First, you figure out what you initially need and then you're like, "I still can't find," so it's a bit of a journey to go through. And when you're kind of starting on this journey, it becomes a little difficult to know where it ends up.
But I've been doing this for a number of years and you kind of just pick stuff up, but it's really about the context. It's about the execution context. You don't know what the app is going to be executing in in terms of the environment, hardware, software, anything.
So anything that potentially is relevant, you pull down and you make sure that oh, yeah, I could actually slice and dice based on that.
Jessica: So do you try to grab everything that might be relevant or do you start smaller because asking for stuff is expensive?
Hanson: So one thing that we do is we try to not tightly couple every signal that we send with the environment. Basically the environment is a bit separate from the process that's running, the app that's running, so we need to find a way to actually performantly capture both of these things in parallel and then link them together.
Currently there are not great ways of doing that, but we get by with what we can. And the end goal is to basically minimize the amount of time that the app actually spends recording the telemetry or rather blocking the app access to record telemetry.
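One way to picture capturing environment and app signals separately and linking them afterwards is a join on time. The sketch below assumes environment readings are sampled on their own schedule into a sorted list of timestamps, and that a span only carries its start and end; the function name and sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def env_samples_during(span_start, span_end, sample_times, sample_values):
    """Return environment readings whose timestamps fall inside the span."""
    lo = bisect_left(sample_times, span_start)
    hi = bisect_right(sample_times, span_end)
    return sample_values[lo:hi]


# Environment sampled independently of any span (timestamps in ms, sorted).
sample_times = [0, 1000, 2000, 3000, 4000]
sample_values = ["ok", "ok", "throttled", "throttled", "ok"]

# A span that ran from 1500 ms to 3500 ms overlaps the throttled readings.
during = env_samples_during(1500, 3500, sample_times, sample_values)
print(during)
```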
So never do it on the main thread. Always do it in the background thread if you can actually do it after the critical workflow is done and basically say I know what the times are, just go and record this in the background.
So it's really about balancing what we capture and then obviously what we send as well because sometimes you need to capture information, but you end up realizing it's not useful so you only send a subset of that information.
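The advice to note the times during the critical workflow and do the actual recording in the background might look something like this. It is a sketch under assumptions: a plain queue and worker thread stand in for whatever background executor a real mobile SDK would use, and `DeferredRecorder` is a made-up name.

```python
import queue
import threading
import time


class DeferredRecorder:
    """Hand timings to a background thread so the caller is never blocked."""

    def __init__(self):
        self._queue = queue.Queue()
        self.records = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, name, start_ns, end_ns):
        # Cheap on the caller's thread: just hand off three values.
        self._queue.put((name, start_ns, end_ns))

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the worker
                break
            name, start_ns, end_ns = item
            # Any costly work (encoding, disk writes) would happen here.
            self.records.append(
                {"name": name, "duration_ms": (end_ns - start_ns) / 1e6}
            )

    def close(self):
        self._queue.put(None)
        self._worker.join()


recorder = DeferredRecorder()
start = time.monotonic_ns()
sum(range(10_000))  # stand-in for the critical workflow
end = time.monotonic_ns()
recorder.record("render_feed", start, end)  # only the timestamps cross over
recorder.close()
print(recorder.records)
```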
So it's a lot of curation simply in the capturing of data. I mean, Embrace, we offer an SDK that does this, but it's very similar to the Android SDK/agent which basically turn it on and it'll just capture stuff for you. Network requests, app application, things like that.
So as a, I guess we're an instrumentation of the platform and in the app environment, we kind of have to make some decisions for folks. Hey, what do you find important?
And especially in mobile, what folks are used to is crashes, or logs, or Google might give you ANRs, but they kind of do it in a weird way that's incorrect. So if you use those numbers, you're going to be looking at the raw-
Jessica: Google gives you ANRs?
Hanson: ANR is Application Not Responding. So when you have a box that pops up on your app and says, "Do you want to kill your app because it's frozen for five seconds?"
Jessica: Sounds important.
Hanson: Yeah, and that's an ANR. So you click Okay, it kills the app and Google gives you information about that. But what that is saying or trying to say is that your UI thread's been busy for five seconds and at the five second mark I'm going to tell you.
And when you click okay and kill the app, it'll report the call stack at the five second mark which is not what you're doing in the previous 4.9999 seconds. But Google will still tell you oh, yeah, your ANR is caused by whatever the call stack is on the five second mark.
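To make the difference concrete, here is a toy comparison between the stack seen at the five second mark and the stack seen most often during the blocked window. The samples are invented, the frame names are hypothetical, and periodic sampling of the main thread is an assumption about how one could collect them, not a description of what Google's reporting does.

```python
from collections import Counter


def frame_at_threshold(samples):
    """What a report taken only at the ANR threshold would blame."""
    return samples[-1][1]


def dominant_frame(samples):
    """The frame seen most often across the whole blocked window."""
    counts = Counter(frame for _, frame in samples)
    return counts.most_common(1)[0][0]


# (elapsed_ms, top-of-stack frame), sampled every 500 ms while blocked.
samples = [(t, "decodeBitmap") for t in range(0, 4500, 500)]
samples += [(4500, "idleHandler"), (5000, "idleHandler")]

print(frame_at_threshold(samples))  # the frame at the five second mark
print(dominant_frame(samples))      # the frame that used most of the window
```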
So the tools that folks are used to on Android, and frankly iOS as well, is whatever the platforms give you and usually they don't even give you, well, at least on Android, they give you the Google Dashboard.
You can't get the data raw. They give you pre-aggregated data. They'll tell you, "Oh, yeah, your ANR rate or crash rate per user is whatever per day." Well, when does the day start? Well, trust us, per day.
You don't have access to this low-level telemetry that you can then join with your backend data to figure out oh, I see a spike in my backend. Is it caused by something in the front end that I'm not actually logging? And maybe it's content from Japan because a certain ad is slow. It's got some really big image that it's delivering.
So suddenly I see a spike in Japan. I don't know why, it's not the network. Oh, yeah, it's the content and of course, in the backend, you aren't going to log the payload that takes a long time to decode on the client or something like that.
So on the client we're able to kind of capture this and basically extend the reach of your backend data. So to explain unexplainable server spikes, why is the connection being held open for so long? The network seems fine. So things like that.
Charity: Hanson, I've heard you say a couple times one of my favorite phrases which is slice and dice, which is another way of saying high cardinality, attention here.
But I'm curious about your take on this. So I feel like I've been thinking a lot recently about the difference between sort of, let's call them observability 1.0 tools which are ones that are based on metrics, logs, and traces and three pillars. You've got your APM, and your RUM, and your front end, and your metrics and your-
Jessica: And your pre-aggregated dashboard that Google provides you.
Charity: All these things, right? That aren't really actually connected versus it's really 2.0 which, and I know we have a lot of the same philosophical things there about wide events and cardinality and everything, but one of the big differences I think is more sociotechnical between 1.0 and 2.0 is that 1.0, to me, is very much about errors, bugs, stack traces, crash dumps.
Something went wrong and we're retroactively coming to clean it up or understand it, versus observability 2.0 I think is increasingly very much about, it's not about how you operate your code, it's about how you develop it.
And so it's not just about errors and bugs and MTTR and MTTD. It's about just understanding for good, bad, or evil, right? Just understanding what are my users doing? This thing that I built, how are they interacting with it?
Are they doing kind of what I thought they would? How can I understand this? How can I see how the intersection of the environment, and the code, and the people, and the business, and just sort of proactively.
But mobile can make this really difficult in some ways because you can't get automatic, you can't do your CI/CD and deploy really quickly. But as mobile engineers, how do people try and close that gap and be very interactive and proactive with their telemetry without having it all be based around sort of errors and cleanup?
Hanson: So the focus on errors, and logs, and crashes, as you said, is primarily what folks focus on, but they do it without understanding the implications. So oh, yeah, my crash rate is a certain amount. Well, so what does that mean? Is your app just crashing in the background and users don't even notice? So yeah, your crash rate is high, but what does that actually mean?
What is useful is ultimately your own product KPIs. Are you having enough conversion when you land on a certain page? These things are almost the ultimate goal of performance and what you kind of want to do, at least in my opinion, is the ability to know what factors affect the ultimate conversion rate, the DAU rate, things like that.
So sometimes it's not obvious. Is login a problem? Is waiting for images to load a problem? We don't know, so you don't want to know the problem that you're trying to find before you measure. You almost have to model the app experience, the user app experience, how the user sees the app or perceives of the performance of the app, and then basically work backwards.
Hey, if I do more slicing and dicing, if I kind of divide these into cohorts, what changes? Oh, I suddenly see that if a particular trace is a lot slower, it has a disproportionate impact on conversion rate.
Oh, yeah, why is that? What trace is this? Oh, they're waiting for the cart to load their credit card information. Maybe they have to kind of call out the stripe and do something. So you don't actually know this until you kind of look at the data and let the data tell you.
So kind of close to p-hacking, so you kind of want to pull out of that a little bit and let the data try to tell you, and corroborate the data.
Charity: I love the emphasis on it being exploratory and open-ended, you don't know what you don't know, which ultimately is my beef with a lot of the v1 toolkit is that you have to predict in advance which questions you're going to ask, which metrics you're going to capture.
You can't slice and dice on metrics, right? You have to have rich context in order to zoom in, zoom out, explore, slice and dice. So when you're just starting out, yeah, everybody cares about request rate, error rate, but that only gives you just half an inch deep.
And Austin talks about this a lot, how all of the interesting questions in observability, you can't wall them off. Well, this is a business question. This is an app question. They're all a combination, right?
Which is why it's so important that the data be co-located, that your app data be combined with business data so that you can understand what's going on without having to just sort of make intuitive leaps shall we call them or fancy guesses. Well, this must be the same spike as that. It's the same shape.
Hanson: Exactly, so if I tell you app startup is seven seconds, p50, and p99 is 10.2, what does that mean? Nothing, if you're used to slowness, that's just fine.
So basically if you're targeting a demographic that's used to having a really slow, there's no objective number, it's user perception. Do they perceive this as slow? Does it frustrate them? If it frustrates them, they will do something that will make your business not happy, i.e., not use your app or not buy your product.
Charity: Would it be overstating it somewhat to say that you don't know what data was meaningful until you've made a change and seen that it's moved to the direction that you want it to?
Hanson: 100%, so I think I got really lucky. My entry into this is I got to work with Twitter data, so very, very large and we were able to do experiments. Turning on for even 1% gives you a big enough sample.
So we do stuff in the background, how many parallel HTTP requests do we do? What's the benefit of using HTTP/2 versus 1.1? We change up compression sizes like how much we ratchet up the Z standard? Well, I call "Zed" standard, but you all call it Z standard.
The compression rate, what is the cutoff? Do we just not compress if it's below a certain size? All of these have impact on user performance that is not visible and in the ideal case, you can A/B test properly.
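The size cutoff question can be sketched as a small encoding decision. This uses Python's standard `zlib` module as a stand-in for Zstandard, and the 256-byte threshold, the default level, and the function name are hypothetical values of the kind one would tune through the experiments described here.

```python
import zlib

MIN_COMPRESS_BYTES = 256  # hypothetical cutoff; tune by experiment


def encode_payload(payload: bytes, level: int = 6):
    """Return (encoding, body); skip compression when it cannot pay off."""
    if len(payload) < MIN_COMPRESS_BYTES:
        return "identity", payload
    compressed = zlib.compress(payload, level)
    if len(compressed) >= len(payload):
        # Already-compressed or high-entropy data: send it as is.
        return "identity", payload
    return "zlib", compressed


small = encode_payload(b'{"ok":true}')
large = encode_payload(b'{"event":"scroll"}' * 200)
print(small[0], len(small[1]))
print(large[0], len(large[1]))
```

The level parameter is the other knob mentioned: a higher level usually trades more CPU on the device for fewer bytes on the network.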
So you can see oh, yeah, after six months, the DAU for the two different cohorts are different because there are folks who have churned as a result of, extended that performance. It's also a two-step process.
Without the ability to experiment A/B tests properly, you kind of have to guess. You see these correlations and you're like they seem to be related, are they? And then you basically have to go and find out.
So even though you think you've discovered oh, look, the lines are moving at the same time, it may not be true. But it may be and if you actually find that golden metric of performance, that could increase or rather I call it you decrease the decrease of DAU, that is gold.
We were able to basically, through budget holdback experiments, improve or measure the improvement of app startup and the number of people that we retained was some very large number, and you couldn't really do that. How do you measure 30 different improvements without holdback?
So that would be great. You can do a holdback, that's fantastic. But if you can't, then you have to kind of guess.
Charity: What's a holdback?
Hanson: A holdback is basically when you make a bunch of improvements and you want to measure the efficacy of those improvements. So you say, hey, some percent of users, you don't get that. So while the team improved app performance a bunch of different ways-
Jessica: So this is like the opposite of a canary?
Hanson: It's the opposite of canary. You think you know. Through the initial experiments, you think this is actually an improvement and you put it out. But because the improvements are so small, they may not actually tell you a statistically significant number given the bucket sizes or if you pull out of 30 of these.
So you're going to do it anyway, right? You're going to make improvements. You have measurements that are tangible, it improves, but what are the cumulative effects of basically introducing all these together?
Folks that get it all and folks that don't get it at all, and a holdback experiment usually lasts a long time, half a year or a year. And at the end, you look at okay, what is the delta between the two? Not everybody can do it because it's-
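A rough sketch of how a holdback group could be assigned and compared. The hashing scheme, the experiment name, and the retention numbers are all hypothetical; the one property that matters is that a user lands in the same group for the entire experiment, so the group that gets none of the improvements stays stable for the half year or more that the comparison runs.

```python
import hashlib


def in_holdback(user_id: str, experiment: str, holdback_pct: float) -> bool:
    """Deterministically place a stable slice of users in the holdback."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < holdback_pct * 100


def retention_delta(treated_retained, treated_total, held_retained, held_total):
    """Difference in retention rate between treated users and the holdback."""
    return treated_retained / treated_total - held_retained / held_total


# Hypothetical: 5% of users get none of the startup improvements.
held = in_holdback("user-42", "startup-2024", holdback_pct=5.0)
delta = retention_delta(900, 1000, 85, 100)
print(held, round(delta, 3))
```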
Jessica: Okay, so yeah, when a particular website is just way slower for me than somebody else, now I know what to complain about.
Hanson: So if it's perceivable.
Jessica: That's probably not true, but I can feel better about it anyway. I can be like, "It's not me, it's them."
Hanson: Potentially, yeah.
Austin: I had a question. Something that I think is so interesting about the mobile and sort of client space in general is how relatively quickly those areas develop in terms of fundamental technology shifts.
If we look at sort of the backend world, there are so many people that are still using Java 8 and happy with it, and it works great. And there's companies that are doing billions of dollars a year that are built on Ruby on Rails.
Fundamentally we haven't actually, there's a lot of progress sort of at the edges with technologies like WebAssembly, various new ways to deploy and package things, but it hasn't really had this huge seismic shift.
In mobile over the past 10 years or so, what? We've gone from iOS Three to 16 give or take. Android has gone from being very different to what it is now. We've seen these, every couple years, massive shifts in both what mobile and client devices are capable of and also how people are interacting with them.
Can you talk a little bit about what is that pace of development like for those engineers? Where are they stuck at? What are the things that need to happen in the ecosystem to kind of push them to that next level of observability and early understanding of all this stuff?
Hanson: So the good and bad thing is that Google doesn't actually give us very much in terms of data. We have to go and collect it. So the versions that they ship, even if it's Android Five, Six, we're still going to use the same techniques to get some of this information.
Now on newer versions, there are newer APIs so we can get data better. But we started 10, 12 years ago with no API, so how do we find information? Well you install signal handlers at the C++ layer to intercept native crashes.
Ugh, kind of iffy, but that's how you do it 'cause you can't do it any other way. You read the process files to find when the time was created to determine whether it was started in the background.
There's all these hacks that we've had to do to make it kind of work and as the platforms improve, we can start removing those hacks and actually depend on actual APIs and things like that. But not only do the APIs differ as new versions ship, the actual devices themselves change and improve.
When I started this, it was very common for folks to sideload apps outside of North America. They all also clear their caches all the time because they have very fixed amount of disk space.
Jessica: Did you say sideload?
Hanson: Sideload, when you're basically just installing an Android app without going through an app store, or the Google Play store, or the Amazon Store.
Jessica: Oh, okay.
Hanson: What you would do is you just basically have it in your, download it in your folder.
Austin: I think you have a call full of iOS users. I think this is a podcast full of iOS users so.
Jessica: I have an Android phone now.
Austin: Oh, do you, okay.
Jessica: Which was super embarrassing at the last iOS conference I was speaking at.
Austin: Yeah, you may need to explain some of the Androidisms to us.
Hanson: Oh, I mean, iOS is, I shouldn't be so Android-centric because that's been my experience. But I think just in mobile, older phones with smaller disks, people clear caches all the time so you can't rely on the caches being there. Things like that have improved.
Disk space has gotten bigger so people do that less. Data used to be really expensive. It used to be metered by megabyte and now data or at least bandwidth is cheap. So if latency is bad, ah, whatever, at least the data isn't going to cost you.
So as the environment changes underneath us, certain assumptions that we make have to change as well. App startup for instance. It used to be, well, I'm going to be Android-focused again.
Google's had some great improvements on baseline profiling so you're able to kind of give the app more information to say how startup is going to work so we can do preloads and things like that to improve things. That didn't used to happen.
So you have to ship your SDK or if you ship an app, you have to kind of build additional information into your app in order to make things faster. So you kind of have to keep up, and look at the improvements, and then be open to fundamentally changing your previous assumptions so one day maybe when the world is completely connected by 12G and latency is non-existent, some of these concerns that we have will no longer be applicable.
Jessica: I think they'll just move to somewhere else.
Hanson: True, yeah, it's like oh, one millisecond, what?
Jessica: Yeah, our phones will be amazing, but our toenail polish will have really high latency when it's for reporting how the glitter is falling off.
Hanson: One thing folks used to do is assume screen sizes and pixel sizes based on form factors where screens don't change. Well, with foldable phones, your screen can change without your app doing anything so you have to handle that.
So I think that's one thing mobile developers are very cognizant of and have to actually handle. So that type of change on mobile is, I think people are used to it. I think what they're not used to is actually, for observability, is the idea of tracing.
Not everybody understands, especially mobile developers who haven't worked on anything other than mobile apps, they may not know what a trace is. So we try to explain okay, a trace is a set of spans. Well, what's a span?
Well, a span is, well, it's a root. So some of this terminology that we have for observability is quite challenging and frankly one of the things about OpenTelemetry, it's great as a platform, but you give the API to a mobile developer who may not know what a span is, it becomes really challenging to talk about how do you pass your context through a calling coroutine 'cause you don't even know what a thread is?
So there's a bunch of these things that I think, and not only is it a technology problem, it is a kind of, well, I guess that can be a technology problem. It's the user base is different, so the assumptions are different.
So to bring observability to mobile, a lot of it is just making it talk in a language that mobile developers understand.
Jessica: In your SDK, do you use different terms?
Hanson: No, we try not to. Google and Apple have put out existing terms and then we try to use that. When we went to OpenTelemetry, we basically took those terms and we want to be consistent with that.
But there is a mushy middle where it's like well, how do you tell a mobile user or a mobile app developer what a span is? You kind of have to stick with the words you know.
Charity: You are so passionate about data, and exploring it, and understanding it, and all this stuff and you also got to know that most mobile developers have a reputation.
I don't know how well known it is, but for really not wanting to look under the hood that much. Really wanting to just focus on the user interactions and the experience of the app and ugh, that's so backendy. Ugh, do I really have to?
How do you think about trying to bring this-- because you work for a company that is trying to help engineers get better at this, and be interested in it, and be curious. So when you're building your product, how do you approach this? How do you try to pique their interest and draw them in?
Hanson: You kind of have to meet them where they are. Why would they use a, well, crash monitoring product? Some people look at us and say, "Hey, you're a crash monitor. You're just Firebase, it's the same thing." And to some degree it's true because they do care about things like crashes.
So we take care of those problems for them and then they'll say, "Hey, how do we improve things?" And sometimes it's top down. There may be SLOs coming from high on top, saying hey, your mobile app can't take x amount of time because some consultants told us that that will make conversion low.
So they're like oh, okay. Well, how do I improve performance? And then they have to understand.
The problem with opening up a mobile crash dashboard is it's never going to be empty. There's always going to be issues, so it's an infinite list of things to handle. So what is important, what is not important? Performance kind of gets pushed in the background unless you intrinsically care or somebody tells you that you should care, and we clearly want to build up the intrinsic part.
It's like hey, your app performance should be good and that sometimes it just comes from up on top and say you got to make it fast. Why? I don't know. Accenture told me to make it fast.
Charity: I read it in "CIO Weekly." "Apps should be fast." No, it's interesting though 'cause I was thinking back to the early days of Honeycomb and the wandering journey that we went on trying to find product market fit.
And in my view, I think I would say, I mean, in some sense product market fit is a lifelong journey of exploration and better, blah, blah, blah. But for me, I felt like we kind of, it clicked around the end of 2018, start of 2019 when we did two things.
Number one, we wrote the Beelines, which would basically... we had gone for a long time being like, we're for people who are already at the edges with your Datadog, your New Relics, and they understand why high cardinality is something that's worth investing your time in.
A little extra manual instrumentation, but it's not that hard, blah, blah, blah. Yeah, that was stupid. We should have made it easy a long time ago.
But it was getting the data in as easy as possible and then instead of popping them into a query builder which if you're popping into an empty query builder, you're like what the fuck is... We dropped them into what we called APM Home which is just the same three fucking graphs.
It's latency, errors, requests per second. And so when dropping people into that, it was they felt oriented, right? They're like ah, I know what I'm looking at. Now I can go and start exploring.
Has there ever been any sort of analogous experience for you guys when it comes to just helping people feel like they know what they're looking at?
Hanson: Yes and no. I think one interesting thing that we tried to figure out recently is that people are rather getting comfortable with observability when you talk about the backend. Folks who use Grafana or Honeycomb, other people who-
Charity: Honeycomb obviously.
Hanson: Obviously.
Charity: Just kidding, just kidding.
Hanson: But SREs are very used to observability. They understand observability as a property and not something you kind of add logs to, and they'll want to get more data and they'll be like, "Hey, mobile team, there's all this data. How do you get it to us?"
And mobile team's like, "I don't know, can you get it from Google?" Well, you can't, so OpenTelemetry is actually a gateway because if we make it easy for OpenTelemetry to work on mobile apps and have that data basically speak the same language as your backend, all of a sudden stuff starts to connect.
You're going to have evangelizers from the SREs and say, "Hey, look." So we can't tell people, but their peers, their SRE peers, will be like, "Hey, you kind of want to know about your SLO. What are you actually doing in the app startup? What is making it slow?"
So I think the world is getting it. Hopefully we close our eyes for a year and everything will be better. Mobile folks will be like yeah, our SRE friends will say, "Hey, hey, come look at this stuff." And being in the OpenTelemetry sandbox allows us to speak the same language.
Charity: I think we can all agree that OpenTelemetry is the best gateway drug.
We're getting close to the end, but the one thing I think that I would just touch on that you mentioned is if there's one gift that I feel like SREs could give all the engineers in the world, it is comfort with the fact that everything's failing all the time and it's fine.
You mentioned opening up the crash report. There's always crashes, there's always problems, and it can feel daunting 'cause if you're used to looking at the shiny front end face of your code, you're like, "Ah, it's lovely." And then you pick up the rock, you're like, "Ah, it's full of worms and disfigurements."
But that's just the reality of complex systems, and it's fine. It's actually kind of miraculous just how much shit can fail and still keep humming along and giving our users a good enough experience, which is why I feel like engaging with this data, with telemetry data, and SLOs, and everything under the rock is such a life-affirming experience because you can always make it better.
There is no such thing as a system that you can't make better. And so if I was to put that on my SRE tombstone, I would be like, "Everything is broken all the time."
Jessica: This is fine.
Charity: And it's fine.
Hanson: These are the people who are using your app. Your p99, you may think it's bad, but they keep coming back. Now maybe it's because it's their bank and they can't pick another bank. But if it's a game or if it's an app, they find value.
Charity: But the beauty of having that information is you know what you can do. You don't have to guess. You can follow the trail of breadcrumbs. You have the power. It is in your hands. You know what to do to make people's lives better. Such a gift.
Hanson: It's almost like it's giving you work. It's telling you, "Come fix me. Look at these problems, come fix me." And you could measure that and you could measure the improvement.
That's what I like. It's not like oh, yeah, I fixed this crash. I made everybody's day better.
Charity: Did you, did you?
Hanson: I can actually see changes as metrics move and KPIs go up.
Austin: I mean, the computer gives me enough work already. I'm not sure I want it to give me even more.
Jessica: But you can see the progress.
Austin: That's true, I do like when the number goes up. I like when the line goes up and to the right, so that's a good point.
Hanson: I always want a reason. People tell me you have to do it, tell me why. Tell me to fix a performance problem and not tell me what that actually will improve? I mean, I'll do it if you're going to pay me, but I'll be intrinsically motivated if I know directly.
Hey, if I improve app startup by 200 milliseconds, I can get the AU to improve a certain way. That's what gets me out. With kids, it's really hard to actually get really deep into hobbies and have focus on interests. Work allows me to do that.
I could get so deep into this, unreasonably, but I have to pull myself up because I still got to ship, right? But it allows me to exercise this bone in my body where it's I want to find out, I want to find out.
How deep can I go? I don't want to ask anybody. I want to find out, and data lets me do that.
Charity: It's gratification, it's like gaming within the real world. Real impact on people's lives. I'm with you, I think it's really exciting.
Hanson: How can I improve the metrics? It's just a number, right?
Jessica: But there's people making that number.
Hanson: Well, if that number turns into dollars, that's even more meaningful and somebody will pay you to exercise your hobby of making numbers increase or decrease depending on what number it is.
Jessica: That is the virtuous cycle, yes. Hanson, how can people learn more about you and about Embrace?
Hanson: Well, both our Android and iOS SDKs are open source, so they're on GitHub. Feel free to take a look. Our website, and for me, hanson.wtf. I haven't blogged in nine months, but hey, I'm on various social medias. I'm still on Twitter. I'm on all of them, Mastodon.
But Hanson Ho is very Googleable. Just know that I'm not the architect from Singapore. I am the Android, indie music, observability fan in Vancouver, Canada.
Charity: There you go. Well, thank you so much. This was really fun.
Jessica: Yeah, thank you.
Hanson: Thanks a lot.
Austin: Thanks.