The term Service Level Agreement (SLA) is a familiar one, particularly in the context of a cloud or managed service on the web. An SLA refers to the contractual obligations a service provider has to its customers and is the instrument defining permissible performance levels for the service. For example, a service agreement might specify a service level of 99.95% uptime, with penalties for falling below that level (more than about 4.4 hours of downtime in a year, or roughly 1.1 hours per quarter).
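The downtime allowance behind an uptime percentage is simple arithmetic. The following is a minimal sketch, assuming a 365-day year and evenly sized quarters; the function name `downtime_budget_hours` is illustrative and not taken from any SLA tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_budget_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime a given uptime target allows in a period."""
    return (1 - uptime_percent / 100) * period_hours


yearly = downtime_budget_hours(99.95)
quarterly = downtime_budget_hours(99.95, HOURS_PER_YEAR / 4)
print(f"{yearly:.2f} hours per year, {quarterly:.2f} hours per quarter")
```

Under these assumptions, a 99.95% target allows roughly 4.38 hours of downtime per year. Changing the percentage or the period shows how quickly the budget shrinks as the target approaches 100%.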
The term is so useful for describing both requirements and expectations around service uptime that it has been co-opted for other uses where a contractual agreement doesn't or can't exist. For example, a community SLA or free-tier SLA might describe a non-contractual situation with the desire or expectation of maintaining a certain service level.
The problem with this usage is a wonky but important one. In an SLA, "agreement" always means a contract, and that meaning does not carry over to settings where no contract exists. A relationship between two or more people is, by nature, non-contractual. That's why contracts were invented: to formalize an agreement and its terms so they hold beyond the moment the agreement was reached.
Misusing the term SLA creates specific problems in at least two areas:
- In cloud-native site/system reliability engineering (SRE), two of the tools central to the practice are the Service Level Objective (SLO), created to make sure user experiences stay within an acceptable range, and the Service Level Indicator (SLI), used to track the status and trends of the SLO. Both roll up to an SLA in a commercial situation, but there is no good equivalent to roll up to in a non-commercial situation.
- In some cases, managed cloud services are delivered to a user base without any contractual relationship. Examples include IT services in academic settings and services delivered as part of an open source project. These groups need a way to frame and discuss service levels without a contractual element.
This bit of word-wonkiness and nerdery is important to my work on the Operate First project, because part of our work is creating the first all open source SRE practice. This includes not only having SLOs/SLIs but also documenting how to write them. We do this because Operate First is an upstream open source project where the content will likely be adopted for use in a commercial context with an SLA.
As the community architect for the Operate First project, I am advocating for adopting the similar, well-used term Service Level Expectation (SLE) as the top-level object that we roll Service Level Objectives (SLOs) up to. This term reflects the nature of open source communities. An open source community does not produce its work due to a contractual agreement between community members. Rather, the community is held together by mutual interest and shared expectations around getting work done.
Put another way, if a team in an open source project does not finish a component that another team relies on, there is no SLA stating that Team A owes monetary compensation to Team B. The same is true for services operated by an open source project: No one expects an SLA-bound, commercial level of service. Community members and the wider user base expect teams to clearly articulate what they can and cannot do and generally stick to that.
I will share my proposal that a set of SLOs can be constructed to remain intact when moving from an SLE environment to an SLA environment. In other words, the carefully constructed SLIs that underlie the SLOs would remain intact going from a community cloud to a commercial cloud.
But first, some additional background about the origin and use of SLEs.
SLEs in the real world
Two common places where SLEs are implemented are in university/research environments and as part of a Kanban workflow. The concluding section below contains a list of example organizations using remarkably similar SLEs, including institutions like the University of Michigan, Washington University in St. Louis, and others. In a Kanban workflow, an SLE defines the expectations between teams when there are dependencies on each other's work. When one team needs another team to complete its work by a certain deadline or respond to a request within a specific time period, they can use an SLE that is added to the Kanban logic.
In these situations, there may be time and response information provided or understood from a related context. For example, staff sysadmins might be on duty in two shifts from 8 AM to 8 PM, five days a week. The published expectation would be 5x12 (five days a week, 12 hours a day) for non-critical issues, with some other expectation in place for critical, all-services-and-network-disrupted outages.
In an open source project, developers may be balancing time spent developing their product with time spent supporting the product's services. A team might offer to clear the issue and bug queue after lunch, Monday through Thursday. The SLE would then be 4x4 (four days a week, four hours a day) for non-critical situations.
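The "days x hours" shorthand can be made concrete with a small data structure. This is a minimal sketch, not an established format: the `SupportWindow` class is hypothetical, all times are treated as UTC, and the 13:00 start for "after lunch" is an assumption made only so the example has a definite window.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone


@dataclass(frozen=True)
class SupportWindow:
    """Hypothetical model of a 'days x hours' support window (times in UTC)."""
    weekdays: frozenset  # integers, 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def label(self) -> str:
        # Build the shorthand, e.g. "5x12" for five days of twelve hours.
        hours = self.end.hour - self.start.hour
        return f"{len(self.weekdays)}x{hours}"

    def covers(self, moment: datetime) -> bool:
        # Expects a timezone-aware datetime; it is converted to UTC first.
        moment = moment.astimezone(timezone.utc)
        return moment.weekday() in self.weekdays and self.start <= moment.time() < self.end


# Two shifts covering 8 AM to 8 PM, Monday through Friday.
sysadmins = SupportWindow(frozenset(range(5)), time(8), time(20))
# Monday through Thursday after lunch (13:00 start is an assumption).
community_team = SupportWindow(frozenset(range(4)), time(13), time(17))

friday_afternoon = datetime(2024, 1, 5, 14, 0, tzinfo=timezone.utc)
print(sysadmins.label(), sysadmins.covers(friday_afternoon))
print(community_team.label(), community_team.covers(friday_afternoon))
```

Writing the window down this way makes the boundary explicit: a request that arrives on Friday afternoon falls inside the sysadmins' window but outside the community team's.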
What are cold-swappable SLOs?
The core idea here is to design a set of SLOs that can be moved from under an SLE to an SLA without changing anything else.
An SLE centers on expectation, which can range from low-expectation to high-expectation environments. Writing an SLO/SLI combination to work in an SLE environment therefore helps document how to adjust the indicator's measurement range for a service depending on how it is used, set up, and so on. The process looks like this:
- Establish an SLE with details for different services (if they have different uptime goals) and clarify boundaries, such as, "Developer teams respond to outages during an established window of time during the work week."
- Developers and operators establish one to three SLOs for a service, for example, "Uptime with 5x5 response time for trouble tickets," where 5x5 means Monday through Friday from 12:00 to 17:00 UTC.
- SLIs are created to track the objective. When writing the spec for the SLI, write for the specific and the generic case as much as possible. The goal is to give the reader a high percentage of what they need to implement the pattern in their environment with this software.
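One way to picture the swap is as a data model in which the SLOs and SLIs sit unchanged beneath whichever top-level object applies. The following is a minimal sketch under stated assumptions: every class and field name is hypothetical, the 95% target and the penalty text are placeholders, and none of it reflects how Operate First actually stores its SLOs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SLI:
    """Indicator that tracks an objective; `measurement` is a placeholder."""
    name: str
    measurement: str


@dataclass(frozen=True)
class SLO:
    """Objective for a service, tied to the indicator that tracks it."""
    description: str
    sli: SLI
    target_percent: float

    def is_met(self, good_events: int, total_events: int) -> bool:
        # With no events recorded, nothing has violated the objective.
        if total_events == 0:
            return True
        return 100 * good_events / total_events >= self.target_percent


@dataclass(frozen=True)
class SLE:
    service: str
    boundaries: str
    slos: Tuple[SLO, ...]


@dataclass(frozen=True)
class SLA:
    service: str
    penalty_terms: str
    slos: Tuple[SLO, ...]


def promote_to_sla(sle: SLE, penalty_terms: str) -> SLA:
    """Place the same SLO set under an SLA; only the top-level object changes."""
    return SLA(service=sle.service, penalty_terms=penalty_terms, slos=sle.slos)


ticket_sli = SLI("ticket_response", "share of trouble tickets answered inside the 5x5 window")
ticket_slo = SLO("Uptime with 5x5 response time for trouble tickets", ticket_sli, 95.0)
community = SLE(
    service="example-service",
    boundaries="Developer teams respond to outages during an established window of time during the work week.",
    slos=(ticket_slo,),
)
commercial = promote_to_sla(community, penalty_terms="placeholder: defined by contract")
print(commercial.slos == community.slos)
```

The design choice worth noting is that the contractual terms live only on the `SLA` object. The objectives and indicators carry no reference to penalties, which is what lets them move between the two environments untouched.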
8 examples of SLEs
Although SLEs are not in universal use, I found many examples in academic and research settings, one open source community example (the Fedora and CentOS communities), and a very similar concept in Kanban, where an SLE sets the expectation for how long work should take from start to finish.
I'll conclude this article with a non-exhaustive list of the introductory content from each page:
University of Michigan ITS general SLEs:
The general campus Service Level Expectation (SLE) sets customer expectations for how one receives ITS services. The SLE reflects the way Information and Technology Services (ITS) does business today. This SLE describes response times for incidents and requests, prioritization of work, and the outage notification process.
Specific services may have additional levels of commitment and will be defined separately under a service-based SLE.
Washington University in St. Louis (2016) SLEs for basic IT services for all customers:
This document represents the Service Level Expectation (SLE) for the Washington University Information Technology (WashU IT) Basic Information Technology (BIT) Bundle Service.
The purpose of this agreement is to ensure that this service meets customer expectations and to define the roles/responsibilities of each party. The SLE outlines the following:
- Service Overview
- Service Features (included & excluded)
- Service Warranty
- Service Roles & Responsibilities
- Service Reporting & Metrics
- Service Review, Bundles & Pricing
Each section provides service and support details specific to the BIT Bundle Service as well as outlining WashU IT's general support model for all services and systems.
Rutgers (2019) SLE for virtual infrastructure hosting:
Thank you for partnering with us to help deliver IT services to the university community. This document is intended to set expectations about the service Enterprise Infrastructure Systems Engineering delivers as well as how to handle exceptions to that service.
Western Michigan University SLEs:
This Service Level Expectation document is intended to define the following:
- A high-level description of services provided by the Technology Help Desk.
- The responsibilities of the Technology Help Desk.
- When and how to contact the Technology Help Desk.
- The incident/work order process and guidelines.
The content of this document is subject to modifications in response to changes in technology services/support needs and will remain in effect until revised or terminated.
University of Waterloo SLEs for core services:
The purpose of this document is to define the services applicable, and provide other information, either directly, or as references to public web pages or other documents, as are required for the effective interpretation and implementation of these service level expectations.
University of Florida Research Computing SLEs:
This page describes the service level expectations that researchers should keep in mind when storing data and working on the HiPerGator system.
There are three categories of service to be considered. Please read these service descriptions carefully.
The Fedora and CentOS Community Platform Engineering (CPE) SLEs for community services:
The CPE team does not have any formal agreement or contract regarding the availability of its different services. However, we do try our best to keep services running, and as a result, you can have some expectations as to what we will do to this extent.
SLEs in Kanban workflows:
SLEs can be defined as forecasts of cycle time targets for when a given service should be delivered to a customer (internal or external)...
Service Level Expectations represent the maximum agreed time that your work items should spend in a given process. The idea is to track whether your team is meeting their SLEs and continuously improve based on analyzing past cycle time data.