What is Site Reliability Engineering?
On the 10th anniversary of the Site Reliability Engineering team’s creation, Niall Murphy interviewed me about what SRE is, how and why it works so well, and the factors that differentiate SRE from operations teams in industry. The interview is being published on G+ in segments; a link to the full interview is included below.
Niall: So what is SRE?
Ben: Fundamentally, it's what happens when you ask a software engineer to design an operations function. When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers, and who were inclined to use software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the “everything can be treated as a software problem” approach and run with it.
So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are both inherently predisposed to substitute automation for human labor and able to do so.
On top of that, at Google, we have a bunch of rules of engagement and principles for how SRE teams interact with their environment -- not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work.
Niall: How is this reflected in the day-to-day work and responsibilities of an SRE team?
Ben: In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Many operations teams today have a similar role, sometimes without some of the bits that I’ve identified. But the way that SRE does it is quite different. That’s due to a couple of reasons.
Number one is hiring. We hire engineers with software development ability and proclivity. To SRE, software engineers are people who know enough about programming languages, data structures and algorithms, and performance to be able to write software that is effective. Crucially, while the software may accomplish a task at launch, it also has to be efficient at accomplishing that task even as the task grows.
During our hiring process, we examine people who are close to passing the Google SWE bar, and who also have a complementary set of skills that are useful to us. Network engineering and Unix system administration are two common areas that we look at; there are others. Someone with good software skills but perhaps little professional development experience, who is also an expert in network engineering or system administration -- we hire those people for SRE. Typically, we hire about a 50-50 mix of people who have more of a software background and people who have more of a systems engineering background. It seems to be a really good mix.
We’ve held that hiring bar constant through the years, even at times when it's been very hard to find people, and there’s been a lot of pressure to relax that bar in order to increase hiring volume. We've never changed our standards in this respect. That has, I think, been incredibly important for the group. Because what you end up with is a team of people who fundamentally will not accept doing things over and over by hand, but also a team that has a lot of the same academic and intellectual background as the rest of the development organization. This ensures that mutual respect and a mutual vocabulary pertain between SRE and SWE.
One of the things you normally see in operations roles as opposed to engineering roles is that there's a chasm not only with respect to duty, but also of background and of vocabulary, and eventually, of respect. This, to me, is a pathology.
Niall: Outside Google, we often observe that there isn't parity of esteem between the SWE and operations teams, which combines poorly with the fact that they often have different incentives. That’s how we end up with the model that exists in the industry today, where SWE teams write something and throw it over a wall to the operations teams, who then try to make it work, and can’t, and throw it back, and so on.
Ben: It’s interesting in this context to also look at the organizational differences that make SRE what it is, not just the individual work habits.
One of the key characteristics that SREs have is that they are free to transfer between SRE teams, and the group is large enough to have plenty of mobility. Additionally, SWEs are free to transfer out of SRE. But, in general, they do not.
The necessary condition for this freedom of movement is parity between SWEs in general and SREs who happen to be SWEs, and compensation parity between those and systems engineers in SRE. They're all groups that are held to the same standards of performance, the same standards of output, the same standards of expertise. And there's free transfer between the SWE and the SRE SWE teams. The key point about free and easy migration for anyone in the SRE group who finds that they are working on a project or a system that is “bad” is that it is an excellent threat, providing an incentive for development teams to not build systems that are horrible to run.
It's a threat I use all the time. I say, "Look, we're only hiring engineers into SRE. If you build a system that is an ops disaster, the SREs will leave. And I will let them." And as they leave and the group drops below critical mass, we will hand operational responsibility back to you, the development team.
At Google, we have institutionalized this response with things like the Production Readiness Review, which helps us avoid getting into this situation by examining both the system and its characteristics before taking it on. We also share operational responsibility between the SRE and DEV teams for a time after launches. Shared responsibility is the simplest and most effective way I know to remove any fantasy about what the system is like in the real world. It also provides a huge incentive for the DEV team to make a system that has low operational load.