So, What Exactly Is Site Reliability Engineering?

Nov 8, 2019

When I interview candidates and ask what they understand by Site Reliability Engineering, the most common answers I get are “a system administrator for the production environment” or “a kind of DevOps”. That is true in some respects, but it doesn’t accurately capture what Site Reliability Engineering (SRE) truly is.

So what exactly is Site Reliability Engineering? I’ll discuss it in terms of responsibilities, technical skills, expectations, salary and finally a bit of its history.

Philosophy

Site Reliability Engineering originated at Google, which found while running its massive production environment that following traditional IT practices often led to slow software delivery and “on-call hell”. The core tenet of SRE is “Embrace Risk”. That doesn’t mean any risk is acceptable; rather, SRE takes measured, calculated risks to achieve the desired results.

Responsibilities

By definition, Site Reliability Engineers are responsible for making sure the production service is available to all customers all the time, at least as far as customers can perceive. This includes staying within a tight error budget (the amount of unreliability the service is allowed over a given period), meeting availability requirements, and striving to improve the process at the same time.
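To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability target measured over a 30-day window. The function names and the 30-day default are my own choices for illustration, not something prescribed by SRE practice.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for an availability target."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how fast the budget shrinks.
    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} target -> {error_budget_minutes(slo):.1f} min per 30 days")
```

A 99.9% target over 30 days works out to roughly 43 minutes of allowed downtime, which is why a single bad deploy can eat most of a month's budget.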

A balance between development and operations work is key to building a successful SRE team and improving the reliability of the product and/or service in the long term.

Elimination of toil

Eliminating toil is one of the most important things Site Reliability Engineers do. Software engineers innately hate doing things twice, let alone multiple times, so automating operational tasks through a software engineering lens is the foundation of Site Reliability Engineering. However, not all automation is good: automation without structure and planning is a recipe for ending up with a pile of snowflake scripts.

Service Level Objective

Contrary to common sense, perfect reliability (100%) is not a good target to aim for. Beyond a certain availability threshold, users can no longer tell the difference, especially since the other factors between them and the service are normally less reliable, e.g. “the network is always reliable”. Ha, good joke 🙂 Chasing perfect reliability also hampers the speed at which new features can be deployed into production: if anything goes wrong, there goes the perfect stability.
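A quick back-of-the-envelope sketch shows why users cannot tell the difference. It assumes the service and the network fail independently and that a request succeeds only if both are up, so the availabilities multiply. The numbers are made up for illustration.

```python
def end_to_end_availability(*components: float) -> float:
    """Availability the user sees when every component in the chain must work.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in components:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("each availability must be in [0, 1]")
        result *= availability
    return result


if __name__ == "__main__":
    network = 0.999  # hypothetical availability of the user's network path
    for service in (0.9999, 0.99999, 1.0):
        seen = end_to_end_availability(service, network)
        print(f"service {service:.5f} behind network {network} -> user sees {seen:.5f}")
```

Under these assumptions, even a perfect service is capped at the network's 99.9%, so the last few nines of service availability are invisible to the user.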

Site Reliability Engineering cares about Service Level Indicators (SLIs) and Service Level Objectives (SLOs): an SLI is a carefully chosen metric that gives engineers an idea of how well the service is running in production, and an SLO is the target value for that metric. Request latency is a popular SLI to monitor closely for clues about how the product is responding to user requests.
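As a rough sketch of how a latency indicator might be turned into a number and checked against a target: the version below treats the indicator as the fraction of requests served within a threshold. The 300 ms threshold, the 95% target and the sample latencies are all hypothetical.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


if __name__ == "__main__":
    # Hypothetical sample of request latencies in milliseconds.
    sample = [120, 80, 95, 310, 150, 60, 90, 200, 110, 130]
    sli = latency_sli(sample, threshold_ms=300)
    print(f"SLI = {sli:.2f}, meets 95% target: {meets_slo(sli, 0.95)}")
```

In a real system the latencies would come from your metrics pipeline rather than a list, but the comparison at the end is the same.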

By contrast, the most thrown-around term, “Service Level Agreement” (SLA), doesn’t get as much attention. That is not because it’s unimportant, but because the SLO is set stricter than the SLA (SLO > SLA), so if we meet our self-defined SLO, we also meet the SLA with customers.

Monitoring

Monitor everything. If you don’t know how your application behaves in production, how can you improve it? The four golden signals are the first things to set up: latency, traffic, errors and saturation. Several monitoring methods are commonly used, including black-box, white-box, instrumentation and performance monitoring.
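Here is a minimal sketch of deriving the four golden signals from a window of request records. The `Request` and `GoldenSignals` types are hypothetical, and so are the choices to use median latency, count 5xx statuses as errors, and measure saturation as resources in use over capacity. A real setup would use a monitoring system rather than hand-rolled code.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int  # HTTP-style status code


@dataclass(frozen=True)
class GoldenSignals:
    latency_p50_ms: float  # latency: median time to serve a request
    traffic_rps: float     # traffic: requests per second
    error_rate: float      # errors: fraction of requests that failed
    saturation: float      # saturation: how full the constrained resource is


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   in_use: int, capacity: int) -> GoldenSignals:
    """Summarize one observation window into the four golden signals."""
    if not requests or window_seconds <= 0 or capacity <= 0:
        raise ValueError("need requests, a positive window and a positive capacity")
    failed = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return GoldenSignals(
        latency_p50_ms=median(r.latency_ms for r in requests),
        traffic_rps=len(requests) / window_seconds,
        error_rate=failed / len(requests),
        saturation=in_use / capacity,
    )


if __name__ == "__main__":
    window = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
    # in_use/capacity could be, say, busy worker threads out of a pool of 10.
    print(golden_signals(window, window_seconds=2.0, in_use=8, capacity=10))
```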

Skills

As you’ve probably guessed by this point, a Site Reliability Engineer is not a System Administrator. The role does require System Administrator skills, but it goes far beyond just understanding systems. There are different types of SREs out there, and the role varies across companies as well.

Google classifies Site Reliability Engineers into six types:
Kitchen Sink / Infrastructure / Tools / Product & Application / Embedded / Consulting. I’ll discuss the differences in a follow-up post.

Salary

Senior Software Engineer

Site Reliability Engineer

So, should you consider it?

YES, YES and YES! SDE is the most common path and by all means a good career, but that doesn’t mean there are no golden opportunities elsewhere. One piece of feedback I constantly hear from junior or entry-level engineers is that it’s boring if you’re not writing code. That is true to a certain degree, but to develop your skills further or move up to the next level, you need a solid system-level view. If you want to try something new, Site Reliability Engineering is a good place to start looking.

Resources:

If you want to hear what a Googler thinks: Site Reliability Engineers: “solving the most interesting problems”

How SRE teams are organized, and how to get started