This post falls into a common trap: conflating logging with metrics.
Log interesting things, where interesting is defined as context outside what the "happy path" execution performs.
Collect and expose system metrics, such as invocation counts, processing-time histograms, etc., to convey what the post instead uses log statements to disseminate.
Thanks for taking the time to reply! I'm relatively new to working on this type of system (large scale, event driven) and half posted because I know there are people on HN way better than me at this, and was curious about their opinions.
In the end, what's the difference between a log and a metric? Is one structured, and one unstructured? Is one a giant blob of text, and the other stored in a time series db? At the moment I guess I'm "logging my metrics" with structured logs going into Loki which can then unwrap and plot things.
You and the other commenters have given me the vocabulary to dig more into this area on the internet though. Thanks!
As a person who has worked in and around logging and big data processing for 16 years now, including almost a decade working as a senior in professional services (currently a global security architect) directly for one of the largest big data companies, here is my opinion on logs vs metrics:
A log entry should capture an event in time, for example: a person logging in, a failure, a record of a notable event occurring, etc. These should be written at the time they occur when possible, to minimise chance of loss and to minimise delay for any downstream systems that might consume the logs. Arguments for batching could easily be made for systems generating very high volumes of logs.
Conversely, a metric is a single-value, point-in-time capture of the size of something, measured in units or with a dimension. For example: current queue depth, number of records processed per second, data transfer rate in MB/s, CPU consumption percentage, etc. These can/should be written periodically, as mentioned in TFA.
> In the end, what's the difference between a log and a metric?
Essentially, a log entry is the emission of state known by an individual code execution path at the point the log entry can be produced, whereas a metric is a measurement of a specific runtime execution performed by the system.
For example, consider a log entry reporting elapsed processing time. It captures the processing state known when the statement is evaluated. What it does not do is separate this information (a time-based attribute in this case) from other log entries, such as "malformed event detected" or "database connection failed."
More importantly, putting metrics into log entries forces timing to include log I/O, requires metrics analysis systems to parse all log entries, and limits the type of metrics which can be reported to be those expressible in a message text field.
Maybe most important of all, however, is that metrics collection and reporting is orthogonal to logging. So in the example above, if the log level were set to "error", then there would be no log-based metric emitted.
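To make the orthogonality point concrete, here is a minimal sketch (Python stdlib logging; the names and the dict-based metric store are hypothetical, a real system would use a metrics client instead): with the log level set to ERROR, the log-based "metric" line is suppressed entirely, while the real counter still updates and costs no log I/O.

```python
import logging
import time

logging.basicConfig(level=logging.ERROR)  # only errors reach the log
log = logging.getLogger("worker")

# Hypothetical in-process metric store; a real system would use a
# metrics client (Prometheus, statsd, etc.) instead of a dict.
metrics = {"events_processed": 0, "processing_seconds": 0.0}

def process(batch):
    start = time.perf_counter()
    # ... real work would happen here ...
    elapsed = time.perf_counter() - start

    # Log-based "metric": silently dropped at level=ERROR.
    log.info("processed %d events in %.3fs", len(batch), elapsed)

    # Real metric: recorded regardless of log level, with no log I/O.
    metrics["events_processed"] += len(batch)
    metrics["processing_seconds"] += elapsed

process(["a", "b", "c"])
print(metrics["events_processed"])  # 3, even though nothing was logged
```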
This is a reasonable first pass answer, but there's more nuance to this...
> What it does not do is separate this information
Logging at scale should really be structured, which means that you can trivially differentiate between different types of log message. You also get more dimensions all represented in that structure.
> limits the type of metrics which can be reported to be those expressible in a message text field
This is another example: ideally, logging shouldn't be text-based. You might have a human-readable summary field, but metrics can easily be attributes on the log message.
The more I work in this area the more I'm realising that logs and metrics are pretty interchangeable. There are trade-offs for each absolutely, but you can convert logs into metrics easily (Datadog does this), and with a bit more effort you could turn a metric into logs if you wanted to (querying metrics as rows in a SQL database is handy!).
Metrics collection is also not necessarily orthogonal to logging, it depends on your system. From a server, you might have logs pushed to an external source and metrics pulled from the server by Prometheus, but that's just implementation details. You can also have logs pulled from log files, and metrics pushed to a statsd endpoint.
I've worked on mobile apps where metrics get aggregated locally and then pushed as log events to the server with one log event per metric and dimension set, only for the server to then typically turn them back into metrics.
It's good to understand the tradeoffs, the technology, whether you're using push or pull, where data is spooled or aggregated, data costs, etc. But this stuff is all pretty malleable and there's often no clearly right answer.
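As a sketch of the logs-to-metrics direction described above (hypothetical structured log lines, in the JSON-per-line shape that systems like Loki or Datadog ingest), deriving a counter is just a group-by over log attributes:

```python
import json
from collections import Counter

# Hypothetical structured log lines, one JSON object per line.
log_lines = [
    '{"event": "request_done", "status": 200, "duration_ms": 12}',
    '{"event": "request_done", "status": 500, "duration_ms": 48}',
    '{"event": "request_done", "status": 200, "duration_ms": 7}',
]

# A "metric" derived from logs: request count grouped by status code.
status_counts = Counter(json.loads(line)["status"] for line in log_lines)
print(status_counts)  # Counter({200: 2, 500: 1})
```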
The question is whether you want to do your aggregation by unit time at the application level, or at an observability layer. You're absolutely right that the end user of metrics wants to see things grouped by time - but what if they want to filter down to "events where attribute X had value Y, in 10 second increments" but you had decided to group your metrics by 15 second increments without regard to attribute X?
Various companies, both in-house for big tech and then making this more widely accessible, started to answer this question by saying "pump all your individual logs in structured form into a giant columnar database that can handle nearly arbitrary numbers of columns, and we'll handle letting you slice and dice metrics out of any combination of columns you want. And if you have an ID follow the session around between different microservices, and maybe even all the way to the browser session, you can track the entire distributed system."
Different people might say that Datadog, Honeycomb, or ClickHouse (and the various startups built on ClickHouse as a database) were the ones to make this pattern mainstream, and all of them pushed the boundaries in one way or another. Nowadays, there's a whole https://opentelemetry.io/ standard, and if you emit according to it, you can plug in various sinks made by various startups and choose the metrics UX that makes the most sense for your use case.
I'm a huge fan of Honeycomb - when I know a certain issue is happening, I can immediately see a chart showing latencies and frequencies, and click any hot spot to filter out the individual traces that exhibit the behavior and trace the end-to-end user journey, with all the different logs from all the systems touched by that request. And I can even begin this discovery from a single bug report by a single user whose ID I know. It's not just metrics - it's operational support. And if I'd pre-aggregated logs, I'd have none of this.
But of course, there are systems where this doesn't make sense! Large batch jobs, high-performance systems with orders of magnitudes more events than a standard web application... it's not one size fits all. That said, I think knowing about modern observability should be part of every developer's toolkit.
I love how open and non-defensive this comment is :)
There are a few ways to slice this, but one is that logs are human-readable print statements and are often per-task. E.g. if you have 100 machines, you don't want to co-mingle their logs because that will make it harder to debug a failure. Metrics are statistics and are often aggregated across tasks. But there are also per-task metrics like cpu usage, io usage etc.
They can both be structured to some extent. Often storage strategies might differ but not necessarily. I think at Google the evolution of structured logging was probably something like (1) printf some stuff, (2) build tooling to scrape and combine the logs, (3) we're good at searching, but searching would be easier if we just logged some protos.
I think logs are basically self-explanatory, since everything logs. To understand why you would want separate metrics, consider computing the average CPU utilization for your app across a fleet of machines. You don't want to do that by printf-ing the CPU usage and grep-ing all the logs. You could try to do it with structured logs, and I'm sure some structured-logging SaaS companies would advocate that.
If you're new to this space, I really liked the book Designing Data-Intensive Applications.

There's also tracing.
You might want to check out this very nice article on reservoir sampling, which discusses its application to logging: https://samwho.dev/reservoir-sampling/
I'm not sure I want to weigh in on "log" vs "metric"... but I did want to add some thoughts on logs in general.
If you need to "log" something to give users feedback as the system is running, it may be less of a log and more of a progress or status output.
Logs to me are things which happen and I want to be able to trace later, so summarizing or otherwise dropping logs that come in quickly in succession would be a problem. If I need to filter I pipe to grep, otherwise I can just save it all and read through it later.
Status messaging, which may be informative about your process, is useful, and if its goal is to be observed in real time, then yeah, a message or two a second seems like a good target for consistency.

These are just two very different use cases to me. And generally I find the former critical to get right, while the latter may be nice to have and may lead to discovery by nature of making it more accessible.
Metrics are way quicker to query due to aggregations and tend to be more stable as features change.
It's good to save metrics for things that remain true under arbitrary aggregation, e.g. sum, count, and max, and to avoid things that do not survive aggregation, such as percentiles.
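A quick illustration of why: maxima survive re-aggregation across hosts, while percentiles generally do not (hypothetical per-host latency samples):

```python
import statistics

# Hypothetical per-host latency samples (ms)
host_a = [1, 2, 3, 100]
host_b = [4, 5, 6, 7]
combined = host_a + host_b

# max survives aggregation: max of per-host maxes equals the overall max
assert max(max(host_a), max(host_b)) == max(combined)

# percentiles do not: the median of per-host medians is not the true median
median_of_medians = statistics.median([statistics.median(host_a),
                                       statistics.median(host_b)])
print(median_of_medians, statistics.median(combined))  # 4.0 vs 4.5
```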
Best advice I ever got on logging:

- log all major logical branches within code (if/for)
- if "request" span multiple machine in cloud infrastructure, include request ID in all so logs can be grouped
- if possible make log level dynamically controlled, so grug can turn on/off when need debug issue (many!)
- if possible make log level per user, so can debug specific user issue

- https://grugbrain.dev/

The only one I'll add is: if your logs are usually read in a log aggregator like Splunk or Grafana instead of in a console or text file, log as JSON objects instead of lines of text. It makes searches easier.

> log as JSON objects instead of lines of text

Or logfmt, which is easier for humans to read, has lower overhead, and is still structured and supported in at least Grafana/Loki for parsing and queries.
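The per-user log level idea can be sketched with a stdlib logging filter. The names and the module-level dict here are hypothetical; a real system would make `USER_LEVELS` mutable at runtime via an admin endpoint or config watcher.

```python
import logging

USER_LEVELS = {"user-42": logging.DEBUG}  # users being actively debugged
DEFAULT_LEVEL = logging.INFO

class PerUserLevelFilter(logging.Filter):
    """Drop records below the level configured for the record's user."""
    def filter(self, record):
        user = getattr(record, "user", None)
        return record.levelno >= USER_LEVELS.get(user, DEFAULT_LEVEL)

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)            # pass everything to the handler
handler = logging.StreamHandler()
handler.addFilter(PerUserLevelFilter())
logger.addHandler(handler)

logger.debug("cache miss", extra={"user": "user-42"})  # emitted
logger.debug("cache miss", extra={"user": "user-7"})   # suppressed
```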
One way to reframe this is: "as a user [of the logs], what might I want to know?"
In my experience, this post is often right (and the logs are often wrong). There's a tendency to either log too much or log too little - if only a few items are getting processed, it's fine and maybe even good to log all 7 of them.
But if many, many are getting processed, you'll experience semantic overload as a reader of the logs. What you want is a compressed form.
Logging per time interval can be a very handy approach. In my work, we've settled on a hybrid approach - calculate in real time how often things are happening and then log the number of things that have happened, but at a rate that is roughly one log every N seconds.
This takes some more engineering up front but is remarkably often what a log reader actually wants.
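A sketch of that hybrid (the class and its names are hypothetical; the clock is injectable for testability): count events as they happen and emit one summary line roughly every N seconds.

```python
import time

class IntervalSummary:
    """Count events and emit one summary roughly every `interval` seconds."""

    def __init__(self, interval=5.0, now=time.monotonic):
        self.interval = interval
        self.now = now
        self.count = 0
        self.last_emit = now()

    def record(self):
        """Call once per event; returns a summary string when due, else None."""
        self.count += 1
        t = self.now()
        if t - self.last_emit >= self.interval:
            msg = f"processed {self.count} events in last {t - self.last_emit:.1f}s"
            self.count = 0
            self.last_emit = t
            return msg
        return None
```

In the hot path you call `record()` per event and only write to the log when it hands back a summary, so log volume is bounded by wall-clock time rather than by event rate.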
Even better: log absolute total counts of received and finished events. You can easily extract the rates from that, you'll know if the process builds up a lot of simultaneous processing, you can more easily compute longer-term averages, you'll know if it is starved for work or resources, etc.

Aggregation by time and count together is a normal batching technique, and I have used it a lot to scale out multiple parts of many systems.

In this particular example, I agree with others: this is a case for metrics. "Log errors, metric successes[0]."

0: success events (a bit more than a log, typically) may be important, especially if tied to something you charge for.
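Extracting rates and in-flight counts from logged absolute totals is simple arithmetic over any two samples (the numbers here are hypothetical):

```python
# Hypothetical samples of logged cumulative totals:
# (timestamp_s, total_received, total_finished)
earlier = (10.0, 500, 480)
later = (20.0, 1200, 1190)

(t0, r0, _), (t1, r1, f1) = earlier, later
rate_received = (r1 - r0) / (t1 - t0)  # events/s over the window
in_flight = r1 - f1                    # currently being processed
print(rate_received, in_flight)        # 70.0 10
```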
The practical problem with logging by time is that it's not resource-constrained: holding N seconds of logs, even when each line is a bounded size, takes potentially unlimited memory. Logging by count uses a bounded amount of memory and is easy to implement with a fixed-size array.

If you can "log by time", then what you need is metrics, not logs.
I agree with this. Logging, as well as metrics and tracing, are such hard topics for me to wrap my head around though.
From the log consumer (person) perspective, you'd want logs to provide you with sufficient information when troubleshooting. But since trouble usually happens when things go wrong in unexpected ways, the logging likely won't be well aligned to emit the right info for you to figure out what's going wrong exactly. What then, are you supposed to log the entire application state and every change to it? But then that's way too expensive, and there's a decent chance you might just drown in the noise instead. So you're left with this half artform half science type deal.
One thing I'm grateful for is that over the years most everything now logs in JSON lines, at least. I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start, but I can immediately see ways that would be rendered a lot less useful early on (e.g. "this and that field can contain some other serialized JSON object, good luck!").
Everything is events. The problem is that, as you notice, you frequently encounter situations where there are too many events to handle. Metrics, logging, and tracing are just three different ways to handle that problem.
Metrics handle too many events by aggregating them: squashing many events into a smaller number that summarize the information.
Logging handles too many events by sampling them. If you have N times as many events as you can handle, take 1 in N of them or whatever other sampling model you want.
Tracing is logging, but where you have chains of correlated events. If you have a request started and a request ended event, it is pretty useless to get one without the other. So, you sample at the "chain of correlated events" level. You want 1 in N "chains of correlated events".
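A common way to implement the "sample whole chains" idea is to make the keep/drop decision a deterministic function of the trace ID, so every event in one chain lands on the same side. A hypothetical sketch:

```python
import hashlib

SAMPLE_ONE_IN = 10  # keep roughly 1 in 10 traces

def keep_trace(trace_id: str) -> bool:
    """Deterministic head-sampling: hash the trace ID so all events
    in the same chain get the same keep/drop decision."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % SAMPLE_ONE_IN == 0

# Every service that sees trace "req-123" makes the same decision.
print(keep_trace("req-123") == keep_trace("req-123"))  # True
```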
But, if you have enough throughput for all your events, just get yourself a big pile of events and throw it into a visualizer. Or better yet, enable time-travel-debugging tracing so you do not even need to figure out how the events map to your program state.
> I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start ...
While not an industry standard, a commonly used open source specification for JSON log entries is ECS[0]. There are others, but this one can serve a system well IMHO.

0 - https://www.elastic.co/docs/reference/ecs/ecs-guidelines
> What then, are you supposed to log the entire application state and every change to it?
For replayability/state reconstruction, usually it's enough to log the input data and the decisions made upon them i.e. which branches of the if/switch (and things morally equivalent to them e.g. virtual functions and short-circuiting Boolean operators) you've actually taken.
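A tiny sketch of that idea (the helper and names are hypothetical): wrap each branch condition so that the decisions taken are recorded, which is enough to replay the execution path later.

```python
decision_log = []  # (decision_name, branch_taken) pairs, in order

def decide(name, condition):
    """Record which way a branch went, then behave like the condition."""
    decision_log.append((name, bool(condition)))
    return condition

def route_order(order):
    if decide("is_priority", order["priority"] > 5):
        return "fast-lane"
    return "standard"

route_order({"priority": 7})
print(decision_log)  # [('is_priority', True)]
```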
> But then that's way too expensive,
Yes, it's usually still way too expensive. But when it's not, it does give you information about at what code point exactly the "wrong" decision was made, and from there you can at least start thinking about how the system could get into the state where it would start making "wrong" decisions at this precise point of code — and that usually cuts down the number of possible reasons tremendously.
My personal answer to this is logging very little during normal operation and then logging a lot during errors. Depending on the maturity of the system, "a lot" might mean the entire state, so I can debug afterwards.
I wonder how they log mission-critical things in general. For instance, how often does a flight data recorder (FDR) log the state of every mechanical component? Surely they can't wait for something "interesting" to happen, right?
> I wonder how they log mission-critical things in general. For instance, how often does a flight data recorder (FDR) log the state of every mechanical component? Surely they can't wait for something "interesting" to happen, right?
There are different types of logging.
What you describe could be defined as an audit log intrinsic to system operation, which is quite a different thing from what the article describes.

Oh, I see. My bad then. Could you expand a bit more?