What Is Site Reliability Engineering (SRE)?

How an organization operates is just as important as the business itself. The way teams are structured and the methods they employ in carrying out their work are vital to the end product. To improve efficacy, efficiency, and quality, software companies adopt approaches like DevOps and Site Reliability Engineering, two paradigms currently employed throughout the industry.

Today, we’re going to look at these paradigms. We’ll do a deep dive into Site Reliability Engineering (SRE), and identify some core differences and similarities with DevOps. Finally, we’ll look at some specific situations in which either is appropriate.

What is Site Reliability Engineering?

The term SRE was coined at Google in 2003 as part of a drive towards reliability. As Google asked its software engineers to prioritize reliability while they collectively worked towards efficiency and scalability goals, new approaches were needed to address the underlying weaknesses in traditional paradigms. Over time, these approaches coalesced into Site Reliability Engineering as a general practice, with a primary focus on automation, tools, and processes.

As SRE evolved, solutions such as on-call monitoring, automation for capacity planning and scaling, and disaster response planning were added to the SRE playbook. These, along with a general preference for automated resolution, became core facets of the SRE approach. In basic terms, SRE is about improving operational reliability and efficiency. Ben Treynor, VP of engineering at Google and founder of Google SRE, pinpointed the essence of the SRE role in an interview:

“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”

SREs should ideally automate away their own job – that is the SRE’s gold standard. SRE, as a general practice, also combines reliability with empathy – the team should be empathetic towards and aware of the other groups, working alongside them to achieve each team’s goals while improving the system as a whole.

Where Does SRE Lie in an Organization?

In older paradigms, developers were primarily focused on agility, and passed their code through a barrier to the operators, who focused on stability. Operations thus had a reduced understanding of the codebase, and Development had little knowledge of Operations, causing tension. DevOps, as a principle, was designed to break this cycle through a focus on collaboration, gradual changes, the development of tooling and automation offerings, and useful measurements. While this was helpful, it still focused on the facilitation of communication between two distinct and opposing forces.

SRE, on the other hand, is meant to be the next step of DevOps that overcomes this thought process. Google sees SRE as an implementation of DevOps – in this approach, Development and Operations are the same, and their co-functionality is more about alignment than facilitation. SRE teams use canaries to test changes on a smaller user base first, they employ automation, and they develop further methods and practices for more efficient measurement. Ultimately, SRE is a more advanced, thought-out form of DevOps that intends to solve traditional DevOps traps.
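To make the canary idea concrete, here is a minimal sketch in Python. It is not taken from Google’s tooling or from any specific library; the function names (in_canary, should_promote), the hash-bucket scheme, and the tolerance value are all assumptions made for illustration. The idea is simply to send a small, stable slice of users to the new release and compare its error rate with the existing one before rolling out further.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float) -> bool:
    """Decide whether a user is served by the canary release.

    Hypothetical helper: hashes the user ID into one of 10,000 buckets so
    the same user always lands on the same side of the split.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


def should_promote(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Promote only if the canary error rate is within `tolerance`
    of the baseline error rate. The default tolerance is an assumption."""
    if canary_requests == 0 or baseline_requests == 0:
        return False  # not enough data to decide
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    return canary_rate <= baseline_rate + tolerance


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(in_canary(u, 5) for u in users) / len(users)
    print(f"share of users on the canary: {share:.1%}")
    print("promote:", should_promote(5, 1_000, 40, 10_000))
```

In a real deployment, the split would usually happen in a load balancer or service mesh rather than in application code, and the promotion decision would draw on more signals than a single error rate.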

Because groups often implement SRE as a hybrid between Development and Operations, in most organizations, this is where it sits. However, in practice, SRE may exist in various locations. It can live as part of varying project bodies, as its own department, or even as a top-level management process. SRE may sit anywhere in the organization as long as this placement doesn’t interfere with the core duties of the role which it oversees.

We should note that there is an unspoken reality that comes with applying SRE to each aspect of the organization. While such broad-based applications can be found in knee-jerk adoptions, they end up placing Development, Operations, and SRE all at the same level. Depending on the implementation, this can cause the same problems SRE was meant to solve. If the effort is properly collaborative, this is less of a problem, but it’s still something to keep in mind when deciding on the practical allocation of SRE in the organization.

SRE may have varied focuses. Infrastructure SRE, for example, tends to focus almost entirely on improving infrastructure tooling and processes. Product teams are confined to a narrow business segment and company offering, while tooling teams are almost entirely focused on software development around reliability. The reality is that SRE should be viewed not as a plug-and-play team management option, but instead as an ethos and a paradigm.

Benefits of SRE

SRE delivers some major organizational and practical benefits. First and foremost, SRE is almost obsessively focused on reliability – it’s in the name. This focus on reliability across the implementation means that operational expenses are minimized, points of failure are eased and mitigated, and repeated functions that waste time and resources are automated. All of this together results in great economic savings.

Even greater gains in accuracy and efficiency can be made here, especially since the human factor in repeated tasks is removed and replaced with automated processing. A major gain from implementing SRE is the cultural shift in how failure is resolved. SRE is much more concerned with identifying the causes of failure early on and mitigating them holistically than with addressing symptoms.

Reliability is made king rather than functionality, and as such, the focus becomes much more on delivery than on product. Within SRE, issues can often be resolved long before they become real problems. The system must necessarily be understood before anything can be automated, and as such, problems are often found and mitigated before they occur. Even when they do occur, mitigation solutions have already been identified and readied, allowing for quick turnaround and highly reliable function.

Another big benefit of SRE is the ability to deliver ownership and distribute expertise effectively. One example of this can be found in the efforts undertaken by Poppulo. When scaling, Poppulo found that a growing range of products was being spread across a steady base of expertise:

“Up until now the work required to create and maintain our platform and our reliability responsibilities was spread throughout our teams. As we scaled we found that expertise is spread too thin across the department.”

When expertise is spread over a growing number of product offerings rather than focused on the underlying core concerns of reliability and efficiency, those attributes suffer. Worse, teams can’t really take ownership of their product or leverage their talents in an overall positive direction. When everyone has to wear so many hats across such a wide swathe of work, the whole is negatively impacted. Poppulo resolved this by adopting SRE organizational approaches:

“Our product teams will still be responsible for deploying their own services, monitoring and running them in production. This is the best place for this responsibility to live. However as we are getting bigger, concentrating our platform development and reliability expertise will allow us to more effectively develop both. Reliability and our platform are first class concerns and need to be treated with the respect they deserve.”

All of this adds up to significantly increased reliability. Increased reliability means less downtime, extra staff on hand, the ability to reduce after-hours calls (while still effectively supporting it through automation), and so forth. This not only has major economic benefits but also imparts better morale in the company as well as improved brand trust.
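The link between reliability and downtime can be put into numbers. The short Python sketch below converts an availability target into the downtime it permits over a period; the function name and the 30-day default are assumptions made for illustration, not figures from Google or Poppulo.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target.

    Hypothetical helper: assumes downtime is simply the fraction of the
    period during which the service is unavailable.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, moving from 99% to 99.9% availability cuts permitted downtime over 30 days from roughly 432 minutes to roughly 43 minutes, which is the kind of difference that shows up directly in after-hours calls.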

Drawbacks of SRE

There are some drawbacks to the SRE approach, however. Perhaps the largest one is that it’s still a relatively unproven concept. DevOps, by contrast, is a well-tested, battle-hardened option that is as common as it is understood. SRE, on the other hand, is still relatively recent and has a lower adoption rate. As such, it’s not as proven, and fixes to the multiple potential cracks may not be obvious.

SRE also has a weakness in its requirement for strong and directive management. Because SRE rides a very thin line in terms of business logic and implementation, it’s very easy for an SRE team to “fall off the track” so to speak. The only fix to this is a stronger management body, which can result in micromanagement and loss of efficiency.

There’s also a major concern in SRE being a skill sink position. Wanting everything in a single person or team means that the bar is set ludicrously high for those positions, and as such, it makes hiring that much more difficult. While this is less of an issue for a team moving from an established process to SRE, where the skill sets likely already exist and are simply waiting to be combined, new teams must meet this high goal from day one.

SRE vs DevOps – Choosing a Paradigm

All of this comes down to a simple question – which option is best for a given organization? SRE and DevOps are both valuable options, and there’s no real clear-cut rubric by which an organization can choose one over the other.

It should, of course, be considered that adopting SRE is going to mean more management, new processes, and generally divergent leadership. This has its own cost in terms of opportunity but may also necessitate new hiring. This can affect budgetary goals, and as such, any such adoption will need to be considered within such a context.

Much of the nature of SRE and DevOps requires an understanding of where the organization actually is in its day to day operations. Because SRE is a huge cultural shift, it is often best adopted by new organizations or organizations that haven’t sunk into their respective DevOps positions. Likewise, DevOps is a good choice for organizations that have yet to choose but are already gravitating towards a DevOps-like relationship.

Perhaps the easiest metric by which to judge the appropriateness of either solution is the desired outcome. SRE focuses largely on reliability, so long-term, active user interfacing solutions are going to benefit the most from SRE. Product-driven or single-purpose outcomes benefit most from DevOps.

Conclusion

Ultimately, the choice between SRE and DevOps is a choice of appropriateness for the end result. The outcome is going to drive which option you choose, though it should be kept in mind that there are associated costs – both in monetary terms and in terms of productivity and efficiency – that must be dealt with regardless of which option is chosen.

It should also be noted here that SRE is often considered a “fad” implementation. While it may be true that SRE is often adopted as a knee-jerk reaction to the paradigm du jour, the reality is that it has great value in the right situation, and can be leveraged to deliver that value to an organization poised to reap it. As such, organizations should be careful not to treat SRE as a fad, either negatively or positively, and to instead think of it as yet another tool in the great API toolbox. What do you think of SRE? Is its value overstated? Let us know in the comments below.