A Brief Primer on Observability

If you look up observability (aka o11y) like I've been doing, you'll find a lot of mentions of it being a new buzzword in the tech world. That may be true in some respects, but I think it's a great term for what it describes. "But I've got a bunch of logs set up in Datadog/Sumo/etc. and that'll let me 'observe' my system, right?"

Yes and no.

Monitoring is a great tool that folks should have set up, and it helps with a particular set of issues. Monitoring, like testing, is great for predictable failures. If you've got an API server that's decided to stop responding successfully, your monitoring might show you that you're returning a bunch of 5xx errors. Cool. Great! You know that things are broken, but do you know why?

Maybe?

Broken production is expensive

If you've seen this type of behavior before and have set things up to monitor for it, then you may be able to say yes. If what's going on happens to be something new, you may be out of luck and have to dig deep into your system to figure it out. When things break, though, you don't have the luxury of taking your time to root around for the cause. When production is broken, your priority is getting it unbroken. Because of that, you may have to do things that end up eating the breadcrumbs you'd need to effectively troubleshoot later. If you have logs, great! They might help. But if your system was set up to be observable, you might not be in this mess at all, and your troubleshooting could take minutes instead of hours.

But what is observability anyway?

Charity Majors described an observable system in her 2018 GopherCon EU talk: "You have an observable system when your team can quickly and reliably track down any new problem without anticipating it or shipping new code."

Any. New. Problem? Holy smokes! That sounds too good to be true, or like it requires some crazy machine learning/AI algorithms to pull off. Guess what?

It totally isn’t

There's no magic here; it's just data and being able to draw conclusions from it. The data needed for o11y hinges on a few sources: logging, metrics, and tracing.
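To make that concrete, here's a minimal Go sketch (Go seems fitting, given the GopherCon context) of what those three sources can look like for a single request. It's illustrative only: the handler, the field names, and the hand-rolled request ID are my assumptions, not a prescribed setup. It sticks to the standard library (log/slog and expvar).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"expvar"
	"log/slog"
	"net/http"
	"time"
)

// Metric: an aggregate counter of requests served.
// Importing expvar also exposes these counters at /debug/vars.
var requests = expvar.NewInt("requests_total")

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// Tracing (simplified): a request ID you'd pass along to downstream
	// calls so a single request can be followed across services.
	buf := make([]byte, 8)
	rand.Read(buf) // error ignored for brevity
	requestID := hex.EncodeToString(buf)

	w.Write([]byte("ok"))
	requests.Add(1)

	// Logging: a structured event carrying the context for *this* request.
	slog.Info("request handled",
		"request_id", requestID,
		"method", r.Method,
		"path", r.URL.Path,
		"duration_ms", time.Since(start).Milliseconds(),
	)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

The point isn't this exact code; it's that each signal answers a different kind of question, and you want all three flowing somewhere you can query them.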

You may have all of these types of data being gathered, but possibly not in a way that every member of your team can actually use, or not completely. Has your team had to cut down the amount of logging sent to a provider because you're "sending too much"? You're shooting yourself in the foot! If you think you can get around this by aggregating data, you're also shooting yourself in the foot. All of this data is important for being able to ask your system what's going on! Aggregation can be great for metrics, but if you ask your system why it's behaving a particular way and all you get back is ¯\_(ツ)_/¯, that's no good at all.
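Here's the difference in code form. This is a hypothetical error path, and every field name is a placeholder, but it shows what an aggregate can and can't tell you:

```go
package main

import (
	"errors"
	"expvar"
	"log/slog"
)

var errorsTotal = expvar.NewInt("errors_total")

func recordFailure(userID, endpoint, buildSHA, region string, err error) {
	// Pre-aggregated: after the fact, all you can ask is "how many?"
	errorsTotal.Add(1)

	// Raw, wide event: the same failure with its context attached, so later
	// you can slice by user, endpoint, build, region... whatever the question is.
	slog.Error("request failed",
		"user_id", userID,
		"endpoint", endpoint,
		"build", buildSHA,
		"region", region,
		"err", err,
	)
}

func main() {
	recordFailure("u_42", "/checkout", "abc123", "us-east-1", errors.New("upstream timeout"))
}
```

The counter tells you *that* something happened; the wide event lets you ask *why* without having anticipated the question.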

Just having the data doesn't help you answer these questions on its own, but thankfully there are folks out there offering tools that let you query your data from disparate systems in a single place (Honeycomb, IOPipe, Google).
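As one example, Honeycomb's Go client (github.com/honeycombio/libhoney-go) is built around sending exactly these kinds of wide events. The sketch below reflects my reading of that library; the write key, dataset name, and fields are all placeholders:

```go
package main

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Point the client at your account; both values here are placeholders.
	libhoney.Init(libhoney.Config{
		WriteKey: "YOUR_WRITE_KEY",
		Dataset:  "my-service",
	})
	defer libhoney.Close() // flush any pending events on shutdown

	// One event per unit of work, with as many fields as you can afford.
	ev := libhoney.NewEvent()
	ev.AddField("endpoint", "/checkout")
	ev.AddField("user_id", "u_42")
	ev.AddField("duration_ms", 153)
	ev.Send()
}
```

Once events like these land somewhere queryable, "why is this slow for this one customer?" becomes a query rather than an archaeology project.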

In closing, remember this:

Monitoring tells you whether the system works. Observability lets you ask why it’s not working. — Baron Schwartz (@xaprb) October 19, 2017

Thanks for reading!