An Introduction to Site Reliability Engineering (SRE)

If your business is competitive and growing, chances are you have a mission-critical software application running in your data center or in the cloud. Because that application is mission-critical, every outage costs you money and possibly clients. Site Reliability Engineering (SRE) is a discipline focused on maximizing the performance, reliability, and uptime of mission-critical applications. This post is the first in a series in which we’ll introduce SRE concepts, the history behind them, and the reasons they matter.

Why Site Reliability Engineering?

Traditionally, managed service providers will tell you whether your server is up and whether the network is clear or congested. However, those aren’t the most common reasons your application goes down or fails to perform. Your customers experience your services at the application layer, not at the network and hardware layers. Meeting service-level objectives (SLOs) therefore requires solving the problems that affect the application itself. These include failed code migrations, incorrectly sized hardware, full disks, changes to third-party APIs, misunderstood peak loads, poorly designed applications, security breaches, and more.

What is Site Reliability Engineering?

SRE solves problems not just in the computing infrastructure but all the way up to and including the application layer. It seeks to improve the reliability of existing software while minimizing the work involved in its upkeep. Unlike traditional operations support, SRE stresses the continual optimization and automation of common IT operations tasks. By automating as many tasks as possible, the SRE team can focus on more strategic, higher-level work, such as planning a new deployment or creating a pipeline for faster product feedback. At its core, the goal of SRE is to meet or exceed the business’s SLOs for the application and to remove manual work from repetitive tasks.
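To make the idea of an SLO concrete, here is a minimal sketch in Python of the kind of check an SRE team might automate: it compares measured availability against a target. The function name, the 99.9% target, and the request counts are all assumptions chosen for illustration, not figures from any real system.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float  # fraction of requests that succeeded
    target: float        # the SLO target, e.g. 0.999 for 99.9%
    met: bool            # True when availability is at or above the target


def check_availability_slo(total_requests: int,
                           failed_requests: int,
                           target: float = 0.999) -> SloReport:
    """Compare measured availability with a (hypothetical) SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    availability = (total_requests - failed_requests) / total_requests
    return SloReport(availability, target, availability >= target)


if __name__ == "__main__":
    # Illustrative numbers only: 800 failures out of 1,000,000 requests.
    report = check_availability_slo(1_000_000, 800)
    print(f"availability={report.availability:.4%} "
          f"target={report.target:.1%} met={report.met}")
```

In practice the request counts would come from your monitoring system, and a check like this would run on a schedule rather than by hand.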

Site Reliability Engineering Origins

Google is credited with the first implementation of SRE in 2003, when it tasked Benjamin Treynor Sloss, current vice president of engineering at Google, with leading a team of software engineers to keep Google’s websites running as reliably and serviceably as possible. Treynor had the team spend half of its time on operations tasks to gain a better understanding of software in production. For Treynor, SRE is what results when a software engineer is asked to design the operations function. The effect is something close to a NoOps environment: the concept that an IT environment can become so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage software in-house. Companies currently employing this method include Dropbox, Mozilla, Netflix, and LinkedIn.

The Bottom Line

When your application goes down, you’re losing money and customers. Your managed service provider, and even your internal IT staff, may stop short of the application layer. You need a Site Reliability Engineering process focused on meeting the service-level objectives of the application. In our next post, we’ll discuss the key components of SRE, including automation of the code deployment pipeline, application and infrastructure monitoring, performance analysis, root cause analysis, restoration, and reporting.

Learn more about how our site reliability service line can work for your application by contacting us or calling us at (913) 283-9343.
