Damon's Blog

Welcome to my blog! I write about things I find interesting, and I hope you do too.

11 December 2025

Observability Engineering

by Damon

Observability is such an interesting topic for me, it is something that I am very passionate about and something that I find a lot of people either get completely wrong, or develop a basic understanding of some concepts and use those concepts as the proverbial hammer to every observability nail.

I’m not saying that my approach is the “one true way”, but I’ve had a lot of success with my approach and I hope that by sharing it you can improve how you approach observability.

What is Observability Engineering?

I like to try and keep the definition simple, this may not be the canonical definition that the Datadog or Honeycomb folks prefer to use, but I think this definition is simple and accurate enough for our purposes.

Observability Engineering is the practice of building your systems such that you have the ability and tools necessary to deeply understand how your app is running in as close to real-time as possible and use that data to ask questions of your system and be able to get the answers.

What does that mean exactly? Well:

Obserability isn’t logs, traces and metrics.

While these are the most common types of data gathered from our system, these are simply just abstractions or patterns for viewing or working with the truly important concept in observability, which is events.

Every bit of data that goes into an observability system is an event.

In fact, if you really think about it, if you structure your logs properly, you could make your logs be traces or events simply by having the necessary data and aggregating those events.

About logs, the most over-abused type of event.

I blame the over-abuse of logs on the fact that for many developers, learning to debug a program comes from shoving print statements all over their code, running it and seeing what gets printed out. When they move into working on things like a web application running in a production cloud environment the equivalent of that is to write a log.

My least favourite type of log is the one which is just a string that concatenates variables into the string value, i.e.

"Customer with ID 12345 added Product 54321 to their shopping cart."

That may seem helpful at the time when you are working on your local dev machine or when it is in some kind of staging environment, but when you are deep in the middle of an incident and you can’t find the relevant log because the string is too hard to search for in your log viewer or there are simply too many logs coming in to find what you are looking for, suddenly this is a whole lot less helpful.

My first bit of advice is that your log messages should be more like events, if I were to rewrite the above example I would do it like so.

"Customer added Product to their shopping cart."

Or, if you’re feeling particularly “event-y”.

"cart.product.added"

Now of course I hear you saying “but how do I know who the customer is and what product they added to their cart!?” And the answer to that is structured logs. Structured logs are a simple concept of rather than just emitting a string, you emit some kind of structured data to your logger, most often JSON (but it doesn’t have to be, so long as it is consistently structured and supported by your log platform!). So lets structurise our event into a good log.

{
	"message": "cart.product.added",
	"timestamp": 1765413966891,
	"level": "INFO",
	"data": {
		"cart": {
			"id": 67890
		},
		"customer": { 
			"id": 12345 
		}, 
		"product": {
			"id": 54321 
		} 
	} 
}

So now if I want to see all the products that the customer added to their cart, in datadog my search would be:

@message:"cart.product.added" @data.customer.id:12345 

And then from there I could narrow it down by cart ID:

@message:"cart.product.added" @data.customer.id:12345 @data.cart.id: 67890

Or, I could even ask which customers have added the specific product.

@message:"cart.product.added" @data.product.id: 54321

What data do I include in my structured logs?

In short, highly unique data and lots of it.

Ok, that may be too short of an answer, so the long-form version is data that has high cardinality and high dimensionality. If you don’t know what those mean, that’s ok because while I know what they mean I can never remember the specific words for them (in fact, I had to google for the specific words while writing this!)

Cardinality: in databases refers to the “uniqueness” of a value. For example, a users UUID is a very high cardinality value as in the context of your data one UUID uniquely identifies an individual user. Conversely, a users full name is a low-cardinality value as it is possible for more than one person to have the exact same first and last name, or a combination thereof. Your structured data should prioritise high-cardinality data as that is the kind of data that makes it easier to drill down and find what you are looking for.

Dimensionality: in databases refers to how “wide” or how much data is contained within a single entry. An entry is considered to have high dimensionality the more data is contained in the entry. In observability we want high-dimensionality for our structured logs because the more data we have available the more questions we can ask our systems and get answers.

Unfortunately, platforms like Datadog make high-dimensionality hard because they place size limits on the total size of the data being sent to datadog and in general recommend that you have no more than 100 key/value pairs in your structured data.

What about structured data on traces and metrics?

Remember how we talked about how logs, traces and metrics are all events and how you could roll your own traces and metrics out of log events? Well you can add whatever structured data you want to any log, trace or metric you want.

tags: observability - telemetry - otel