As online services become the lifeblood of just about every business, the health of these interconnected services becomes critical. Google’s decades of operating some of the most advanced and scalable applications in the world set the standard for how to run reliable services at global scale, and over the past couple of years, Google has published and promoted its operational best practices under the name Site Reliability Engineering (SRE). Contrary to what some have suggested, SRE is not a replacement for modern DevOps practices; in fact, SRE lays the groundwork for what is required to make DevOps scale in most organizations.
Site Reliability Engineering Essentials
Multiple books have been written to explain what SRE is and how to implement it, but the fundamentals are enough to appreciate what sets SRE apart from most existing practices. First, the entire organization must rally around a few well-known reliability objectives. Second, the organization must automate operations via software, as this is the only way to meet these objectives while growing the business. Arguably, a third principle is the creation of a semi-independent group focused on the previous two fundamentals, but, as discussed later, several factors can affect how staffing actually works within an organization.
Regardless, the basics revolve around uniting the organization in deciding what to measure and then setting operational goals that balance business needs and customer expectations. Service level indicators (SLIs) are the key measurements that track customer satisfaction, while service level objectives (SLOs) define the healthy ranges for those indicators that the organization as a whole strives to maintain. While all or a subset of SLOs may be published to customers, they should not be confused with service level agreements (SLAs), which are the contractual agreements with customers that spell out the consequences of missed SLOs or other measurements of service health.
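To make the relationship between these three terms concrete, here is a minimal sketch in Python. It assumes a simple availability SLI (successful requests divided by total requests) and assumes that the contractual SLA threshold is set looser than the internal SLO, which is common but not universal. The names `ServiceTargets`, `availability_sli` and `evaluate`, and the example thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTargets:
    """Hypothetical targets for one service, expressed as fractions (0 to 1)."""
    slo: float  # internal objective the whole organization aims for
    sla: float  # contractual floor promised to customers (assumed looser)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_requests / total_requests


def evaluate(sli_value: float, targets: ServiceTargets) -> str:
    """Classify a measured SLI against the SLO and the SLA."""
    if sli_value >= targets.slo:
        return "healthy"
    if sli_value >= targets.sla:
        return "slo_missed"  # internal goal missed, contract still met
    return "sla_breached"  # contractual consequences may apply


# Example thresholds are illustrative, not recommendations.
targets = ServiceTargets(slo=0.999, sla=0.995)
print(evaluate(availability_sli(9_970, 10_000), targets))
```

In this sketch, the gap between the two thresholds gives the organization room to notice and correct a missed SLO before any contractual commitment is affected.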
One Size Does Not Fit All
Corporations the size of Google obviously have different needs than a small startup living on venture capital. Large organizations would be wise to take the leap toward a separate team composed of software and systems engineers, while smaller organizations may choose to define SRE roles across multiple departments or create a virtual team with rotating staff members from existing teams. In either case, the rallying effect of shared responsibility for a set of SLOs will improve reliability.
To establish buy-in, teams will be forced to make some difficult trade-offs to determine what’s really important to the customer and then determine how to measure the corresponding SLIs. Injecting SRE into a well-established culture will create typical transitional stress as boundaries shift, but for most organizations, SRE fundamentals will provide a blueprint for uniting groups and balancing the tension between functionality and reliability. While “site reliability engineer” has recently become a trendy title on job sites, DevOps teams should resist the temptation to retitle team members as SREs while keeping their job duties the same. This approach confuses accountability and undermines the benefits of shared SLOs and the automation necessary to make them achievable.
Making Site Reliability Engineering Work
For SRE to work effectively in any organization (or service) of notable size, teams must be able to monitor, troubleshoot and eventually automate around the key SLOs. Given that key services may span legacy data centers as well as multiple public cloud vendors, tooling that can communicate with and correlate information across these disparate landscapes is required. When uniting a myriad of stakeholders with diverse backgrounds, role-flexible dashboards are necessary, and even more important is the ability for dashboards to adjust dynamically as the underlying apps and services shift. This diverse data set requires a scalable platform that can elastically ingest data streams with sudden, large spikes, and in most cases, a cloud-based back end provides the only option to minimize the number of staff needed to manage the tool itself. Making sense of several semi-related data streams requires machine learning capabilities to stitch key relationships together and build models that detect and adjust to an ever-changing ecosystem. While changing organizational boundaries to build an SRE team will create some strife, leaders will tussle even more over consolidating many tools into a single tool (or a small set of tools) as departments vie to keep their ingrained workflows intact.
Whether this article serves as an introduction to SRE or reinforces what you already know, there’s little doubt that SRE is here to stay, at least until the next technological transformation takes shape. Many executives will struggle with the organizational shifts and the tooling needed to unite internal stakeholders around a common set of objectives, but ultimately, end users will benefit the most from a fully functioning SRE team representing their interests from behind the scenes.