What are SLOs and Why They Are Useful
What are SLOs?
Service Level Objectives (SLOs) define target values for metrics that indicate whether a service is working successfully on behalf of the customer. Many people say SLA when they mean SLO, but an SLA is a service level agreement.
They are different from alerts, canary alarms, graphs of errors or latencies, etc.
SLOs are based on service level indicators (SLIs). An SLI is a metric that reflects the level of service provided.
An SLO is an objective for one or more SLIs. An SLO has to be measured for a window of time to be meaningful. For example, if you say that your SLO is 99.95% availability, then you must measure that over a window of time. Whether you meet or miss your SLO depends on both the definition and the time window you are considering.
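As a minimal sketch of why the window matters, the following Python computes an availability SLI from request samples and checks it against the 99.95% objective over two different windows. The `RequestSample` type, the function names, and the traffic data are all hypothetical and exist only for illustration; real SLIs are usually computed by a metrics system rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RequestSample:
    """One observed request; the fields are hypothetical, for illustration."""
    timestamp: datetime
    succeeded: bool


def availability_sli(samples, window_start, window_end):
    """Fraction of successful requests with window_start <= timestamp < window_end."""
    in_window = [s for s in samples if window_start <= s.timestamp < window_end]
    if not in_window:
        return None  # no traffic: the SLI is undefined for this window
    good = sum(1 for s in in_window if s.succeeded)
    return good / len(in_window)


def slo_met(sli, objective=0.9995):
    """True when the measured SLI meets the objective (99.95% by default).

    Treating "no data" as a miss is an assumption of this sketch.
    """
    return sli is not None and sli >= objective


# Hypothetical traffic: one request per minute, with the last 4 of 10,000 failing.
start = datetime(2022, 4, 1)
samples = [
    RequestSample(start + timedelta(minutes=i), succeeded=(i < 9996))
    for i in range(10_000)
]
end = start + timedelta(minutes=10_000)

whole_window = availability_sli(samples, start, end)
last_hour = availability_sli(samples, end - timedelta(hours=1), end)

print(slo_met(whole_window))  # objective met over the full window
print(slo_met(last_hour))     # objective missed over the final hour
```

The same four failures leave the SLO met over the full window and missed over the final hour, which is why an SLO statement needs its time window attached.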
Why are SLOs useful?
SLOs provide several things that are hard to get directly from the underlying metrics or the graphs that visualize them.
- Service SLOs allow you to convert aspirations for service reliability from good intentions to an actual mechanism.
- The ability to generate a report for a specific time window showing which services in a set met their SLOs and which did not.
  - This lets organizations focus on the services that show patterns of problems.
- The ability to define and use error budgets.
  - Error budgets let you adjust your change management and risk tolerance according to predefined strategies or tactics. When you have had problems recently, it is worthwhile to be more conservative to protect customer trust.
  - Conversely, when you have objective data showing that you have been doing well, you can take more risks and deliver features and innovations faster on behalf of the customer.
- SLO definition and measurement convert a variety of services with different behaviors and requirements into a common language of success (or not), so that you can measure concretely how things are going against your goals.
  - This objective Boolean output per service and time window keeps you from fooling yourself after you have set your objectives.
- SLOs let you ask questions about your monitoring, alerting, incident response, and postmortem process that you cannot answer from alerts, logs, and graphs of metrics alone, such as:
  - When should I have been alerted, but was not?
  - How strict should my alerting thresholds be based on my SLOs?
- SLOs give a direct, easy, accessible value for the customer impact of an outage, whether it triggered an alert or not, without consuming engineering time to do manual investigations.
- SLOs enable long-term pattern analysis for reliability that complements root cause analysis and examination of ticket load with a different type of data.
- You can define automatic alerts and alerting thresholds directly from a service catalog that links each service to its SLOs.
  - For a discussion of how this works, see https://sre.google/workbook/alerting-on-slos/
- Given your SLO and the SLOs of your current or potential dependencies in a design, you can make reasoned decisions about what you can take as a hard dependency for your service. Unless you have a soft failure mode, you will not be able to meet a tighter SLO than that of a critical dependency, at least not one that you call on every request your service handles.
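
Two of the points above, error budgets and hard dependencies, come down to simple arithmetic. The sketch below is illustrative only: the function names and the policy thresholds are hypothetical, and the dependency calculation assumes that every request needs every hard dependency and that failures are independent, which real systems only approximate.

```python
import math


def error_budget_remaining(objective, total_events, bad_events):
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - objective) * total_events
    if allowed_bad <= 0:
        raise ValueError("a 100% objective or zero traffic leaves no error budget")
    return 1.0 - bad_events / allowed_bad


def release_policy(budget_remaining):
    """Map remaining budget to a change-management stance.

    The thresholds are hypothetical; the point is to choose them in advance.
    """
    if budget_remaining <= 0.0:
        return "freeze risky changes"
    if budget_remaining < 0.25:
        return "slow down"
    return "normal pace"


def availability_ceiling(own_availability, hard_dependency_availabilities):
    """Best-case availability when each request needs every hard dependency.

    Assumes independent failures, so the availabilities multiply.
    """
    return own_availability * math.prod(hard_dependency_availabilities)


# 99.9% objective, 1,000,000 requests in the window, 400 of them failed.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(round(remaining, 3), release_policy(remaining))

# A service that is 99.99% available on its own, with two hard dependencies.
ceiling = availability_ceiling(0.9999, [0.999, 0.9995])
print(ceiling < 0.999)  # the ceiling falls below the weakest dependency
```

Under these assumptions, the ceiling is always at or below the availability of the weakest hard dependency, which matches the point that a tighter SLO needs a soft failure mode.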
Further Reading
- https://sre.google/workbook/alerting-on-slos/
- https://sre.google/workbook/implementing-slos/
- https://cloud.google.com/architecture/adopting-slos
- https://newrelic.com/blog/best-practices/best-practices-for-setting-slos-and-slis-for-modern-complex-systems
- https://thenewstack.io/how-to-correctly-frame-and-calculate-latency-slos/
- https://github.com/OpenSLO
- https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/ch01.html
- https://www.squadcast.com/blog/slos-for-aws-based-infrastructure
- https://aws.amazon.com/blogs/mt/slos-made-easier-with-nobl9-and-cloudwatch-metrics-insights/
By Richard Anton
Apr 20, 2022