If you’ve ever been in a situation where you only had limited resources and tried to decide what was most important, you know it can be difficult. In some cases, I have been told to build dashboards with only four metrics or panels. In this blog I will talk about the four golden signals of monitoring and why they are important. When it comes to cloud native infrastructure, keeping things simple will make your life easier as an SRE or DevOps engineer.
The main concerns for systems and site reliability engineering (SRE) are latency, traffic, errors, and saturation. These are the four golden signals of monitoring. If you can collect data on these four metrics alone, and understand how they correspond to the behavior of your applications, you’ll have a good foundation for understanding the health of your system. Let's dive deeper.
Latency is the time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
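To make the success-versus-failure split concrete, here is a minimal Python sketch. It is not taken from any particular monitoring tool: the function names, the sample data, and the choice to count only 5xx responses as failures are all assumptions made for illustration. It reports a latency percentile for each outcome separately, so fast 500s cannot hide inside a combined figure.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_outcome(requests, pct=99):
    """Split (status_code, latency_ms) pairs into success/error and return one percentile per group."""
    buckets = defaultdict(list)
    for status, latency_ms in requests:
        # Assumption: only 5xx counts as a failed request here.
        outcome = "error" if status >= 500 else "success"
        buckets[outcome].append(latency_ms)
    return {outcome: percentile(samples, pct) for outcome, samples in buckets.items()}


# Hypothetical sample: the two 500s return in a few milliseconds.
sample = [(200, 120), (200, 140), (200, 135), (500, 3), (500, 4), (200, 150)]
print(latency_by_outcome(sample, pct=50))
```

Keeping both numbers on the dashboard means a fast-failing backend and a slow-failing one each show up for what they are.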
I like to think of latency like yelling into the Grand Canyon: it's the time it takes to hear your own yell come back. Hello ………………. Hello. The longer your webpage takes to load, the more latency you have in your connection. One of the stranger latency issues I have seen in my career was with a serverless workload that was responsible for uploading connected area network telemetry data. The latency was minimal initially, something like 2–6ms, but grew to 25 to 30 seconds after a few weeks as more pipelines were added to the stack. Once the latency passed 30 seconds, the searches on these telemetry streams would fail because they recurred at a 2-minute interval. Latency problems can give you the biggest headache and can be the most difficult to solve.
Traffic is a measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
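As a rough illustration of the web service case, the Python sketch below turns a list of request events into requests per second, broken out by request type. The event format, the `static`/`dynamic` labels, and the function name are assumptions for the example; in practice your metrics system would normally do this bucketing for you.

```python
from collections import Counter


def requests_per_second(events, window_s=60):
    """Average requests per second for each (window_start, kind) pair.

    events: iterable of (timestamp_seconds, kind) tuples.
    """
    counts = Counter((int(ts // window_s) * window_s, kind) for ts, kind in events)
    return {key: n / window_s for key, n in counts.items()}


# Hypothetical events spanning two one-minute windows.
events = [
    (0.5, "static"), (1.2, "dynamic"), (30.0, "static"),
    (61.0, "dynamic"), (62.5, "dynamic"), (119.9, "static"),
]
rates = requests_per_second(events, window_s=60)
for (window_start, kind), rps in sorted(rates.items()):
    print(f"t={window_start:>4}s {kind:<8} {rps:.3f} req/s")
```

Collecting numbers like these over days or weeks is one way to learn what steady state looks like before alerting on deviations from it.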
This is my favorite signal to look at because I love packet captures and Wireshark. When I was helping build VoLTE (Voice over LTE), we were constantly looking at call flow packet captures, making sure our traffic patterns were what they should be. You need to know what normal traffic looks like on your systems. You have to be able to define steady state before you can alert on disrupted state. It's also fun pumping JMeter requests because your developer says his app is unbreakable. Carry on, sir. Carry on :D
The error signal is the rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
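The three failure categories above can be sketched in a few lines of Python. Everything here is hypothetical: the `Response` fields, the `body_ok` flag (standing in for whatever end-to-end content check you run), and the one-second default threshold are assumptions for illustration, not part of any real library.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request
    body_ok: bool      # result of a hypothetical content check


def classify(resp, slo_ms=1000):
    """Label a response as an explicit, implicit, or policy failure, or 'ok'."""
    if resp.status >= 500:
        return "explicit"   # the server said it failed
    if not resp.body_ok:
        return "implicit"   # success code, wrong content
    if resp.latency_ms > slo_ms:
        return "policy"     # correct, but slower than we committed to
    return "ok"


def error_rate(responses, slo_ms=1000):
    """Fraction of responses that failed for any of the three reasons."""
    if not responses:
        return 0.0
    failed = sum(1 for r in responses if classify(r, slo_ms) != "ok")
    return failed / len(responses)


responses = [
    Response(200, 120, True),    # ok
    Response(500, 5, False),     # explicit
    Response(200, 90, False),    # implicit
    Response(200, 1500, True),   # policy
]
print([classify(r) for r in responses], error_rate(responses))
```

Note that a load balancer can only ever see the first category; the other two need application-level knowledge, which is why they are passed in as data here.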
4xx messages are bad; 5xx messages are really bad. When layer 2 and 3 issues happen, your error count dashboards will light up like a Christmas tree. Thanks for listening to my TED talk.
Saturation is how “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
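Here is a minimal Python sketch of the "99th percentile over a small window" idea. The sample format, the nearest-rank percentile method, and the 150 ms threshold are assumptions chosen for the example; a real setup would more likely compute this from histograms in the monitoring backend.

```python
import math


def p99_per_window(samples, window_s=60):
    """Return {window_start: p99 latency} for (timestamp_s, latency_ms) samples."""
    windows = {}
    for ts, latency_ms in samples:
        start = int(ts // window_s) * window_s
        windows.setdefault(start, []).append(latency_ms)
    result = {}
    for start, values in sorted(windows.items()):
        values.sort()
        # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
        rank = max(1, math.ceil(99 * len(values) / 100))
        result[start] = values[rank - 1]
    return result


def windows_over_threshold(p99s, threshold_ms):
    """Window start times whose p99 exceeded the (assumed) threshold."""
    return [start for start, p99 in p99s.items() if p99 > threshold_ms]


# Hypothetical data: a healthy first minute, then a slow second minute.
samples = [(i * 0.5, float(i + 1)) for i in range(100)]   # latencies 1..100 ms
samples += [(60 + i, 200.0) for i in range(10)]           # all 200 ms
p99s = p99_per_window(samples)
print(p99s, windows_over_threshold(p99s, threshold_ms=150))
```

The short window is the point: a one-minute p99 moves well before an hourly average does.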
Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
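A prediction like that can be as simple as a straight-line extrapolation. The Python sketch below assumes disk usage grows roughly linearly between the first and last sample, which is a strong simplification; the function name and the sample numbers are made up for illustration.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours until the disk is full by linear extrapolation.

    samples: list of (hour, used_gb) ordered by time.
    Returns None when usage is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    growth_gb_per_hour = (used1 - used0) / (t1 - t0)
    if growth_gb_per_hour <= 0:
        return None
    return max(0.0, (capacity_gb - used1) / growth_gb_per_hour)


# Hypothetical 500 GB volume growing by 25 GB per hour.
usage = [(0, 400), (1, 425), (2, 450)]
print(hours_until_full(usage, capacity_gb=500))
```

With these numbers, 50 GB of headroom at 25 GB per hour leaves about two hours, which is the kind of lead time that turns an outage into a ticket.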
Saturation was the key signal we used to plan our DR in my datacenter days. We would build our systems (man, I miss those pizza box DL380s) based on saturation levels and what they could handle. If I lose the west region, can we run everything on the east? What does that look like, and how will we handle those DEFCON 1 situations? When saturation climbs, errors and latency are likely to follow. Understand your traffic patterns, understand the CPU/memory utilization patterns, and you will then understand the saturation of your workloads.
In short, if you follow the four golden signals and build around them, you’ll be alerted to the problems that matter most to your service in time for it to matter. These are not the only golden signals out there — a host of others exist that you should also monitor. But if you can’t monitor everything all the time (which happens all the time), these are some of the most valuable metrics to monitor first.