What is SRE (Site Reliability Engineer)

Before diving deep into the SRE world, let’s talk about where SRE comes from. The concept of SRE was originated at Google in 2003 by Ben Treynor Sloss. In 2003, when the cloud wasn’t yet a thing, Google was one of the most prominent web companies, with a massive, distributed infrastructure. It faced several challenges simultaneously: keeping the trust and reputation of its services, providing a smooth user experience with minimal downtime and latency, managing dozens of sprawling data centers, and so on. It needed to rely heavily on automation and therefore formulated strategies that led to large-scale automation. Small companies at that time could bear the loss of a few hours of downtime, but giants like Google could not afford it, as they were at the frontier of user experience. Building a team that could help ensure the application’s availability and reliability was therefore an obvious outcome.

Today, users’ expectations are very high. The application should be fast, robust, and functional at all times. One minor issue or a few milliseconds of added latency can cost a fortune. Site Reliability Engineers ensure that the application’s availability and performance stay high. Traditional IT operations strategies can’t meet these demands.

SRE also manages and scales the application with an engineering mindset: rather than manually repeating tasks on a regular basis, SREs automate them using best practices, which saves time and cost. Automation also reduces human error. Human errors are inevitable; you can’t fix people, but you can fix the systems and processes for better performance.

Job of SRE

SRE often focuses on two jobs. The first is to manage incident response to meet the user’s or client’s expectations in terms of SLAs, SLOs, and SLIs. The second is to introduce solutions built on best practices that optimize performance and reduce overall cost.

To understand these responsibilities better, we need to understand the standards of measurement. These help to gauge whether the decisions made by SRE are practical and useful.

SLA (Service Level Agreement) – “What does the user expect”. In an SLA, the company explicitly agrees with the user or client on specific metrics it promises to deliver. For example, an organization promises its clients that its application’s availability or uptime will be 99%. Later, all changes at the design, functionality, and architecture level are made with the SLAs in mind.

SLO (Service Level Objective) – “When do we take action”. An SLO is a specific goal that a service must meet in order to comply with the SLA. Fundamentally, SLOs are the individual promises a company makes to its users or clients. SLOs are used to measure whether the SLA is being met.

SLI (Service Level Indicator) – “What do we measure”. An SLI is a key metric used to measure whether an SLO is met. For example, as discussed under SLA, if the company promises that the availability of the service will be 99%, then the command or metric used to check the service’s health can serve as the indicator. The measured SLI should be equal to or better than the target set by the SLO.
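The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is only an illustration, assuming availability is measured as the share of successful requests; the names `availability_sli`, `check_slo`, and `AvailabilityReport` are hypothetical, and the 99% target is borrowed from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityReport:
    sli: float      # measured availability, between 0.0 and 1.0
    slo: float      # target availability, between 0.0 and 1.0
    slo_met: bool   # True when the measurement is at or above the target


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def check_slo(successful_requests: int, total_requests: int,
              slo_target: float = 0.99) -> AvailabilityReport:
    """Compare the measured SLI against the SLO target (99% by assumption)."""
    sli = availability_sli(successful_requests, total_requests)
    return AvailabilityReport(sli=sli, slo=slo_target, slo_met=sli >= slo_target)


if __name__ == "__main__":
    report = check_slo(successful_requests=9950, total_requests=10000)
    print(f"SLI={report.sli:.2%} SLO={report.slo:.2%} met={report.slo_met}")
```

In practice the request counts would come from a monitoring system rather than hard-coded numbers; the comparison itself stays the same.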

Normally, SRE monitors the following common metrics:

Availability – Measures the uptime of the application. It is one of the most basic and essential metrics describing the health of the application.

Latency – Latency measures the performance of the application. In essence, this metric tells us the time taken by the application to serve a request. For a better user experience, latency should be as low as possible.

Errors – Errors measure the quality of the service. This means that whenever you send a request, the service should respond to it successfully.

Saturation – “Measure the resources of the service”. This metric is very helpful for understanding the current consumption of resources, which in turn helps us scale the resources to meet the demand of the application. It also helps to set up automation, such as scaling up for peak traffic and scaling down for low traffic.
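As a rough sketch of how the latency, error, and saturation metrics might be computed and used, consider the Python below. Everything in it is an assumption made for illustration: the `Request` record, the function names, and the 80% and 30% scaling thresholds are hypothetical, not values from any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # whether the response was successful


def average_latency_ms(requests: List[Request]) -> float:
    """Latency: mean time taken to serve a request."""
    return mean(r.latency_ms for r in requests)


def error_rate(requests: List[Request]) -> float:
    """Errors: fraction of requests that did not succeed."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def saturation(used: float, capacity: float) -> float:
    """Saturation: share of a resource (CPU, memory, disk) currently in use."""
    return used / capacity


def scaling_decision(saturation_ratio: float,
                     scale_up_at: float = 0.8,
                     scale_down_at: float = 0.3) -> str:
    """Hypothetical thresholds: scale up when busy, scale down when idle."""
    if saturation_ratio >= scale_up_at:
        return "scale_up"
    if saturation_ratio <= scale_down_at:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    sample = [Request(100.0, True), Request(200.0, True),
              Request(300.0, False), Request(400.0, True)]
    print(average_latency_ms(sample), error_rate(sample))
    print(scaling_decision(saturation(used=6, capacity=8)))
```

Real systems usually track latency percentiles rather than a plain average, since a few slow requests can hide behind a healthy mean; the average is used here only to keep the sketch short.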

Apart from observing the health of the application, Site Reliability Engineers also automate manual activities or tasks that would otherwise take up half of the day. Additionally, they build utilities, tools, and scripts from scratch to mitigate issues arising in the infrastructure and to support incident management.

In simple words, SREs spend half of their time on manual IT operations and system administration tasks such as analyzing logs, performance tuning of services, production testing, responding to incidents, and conducting postmortems. This work helps them find the root cause of unfortunate events. SRE is then responsible for applying an automated solution so that the problem does not occur again. Throughout this process, SRE plays the role of first-line defender when an incident occurs.

Incident Response (Incident Life Cycle)

As discussed above, SRE is also responsible for managing incident response to protect the reliability of the application. SRE implements strategies that increase the reliability and performance of the service through on-call escalations and process optimization. That is why they need to recognize critical issues, escalate them, and gather the concerned teams.

The following are the incident response steps that SRE normally follows.

Prevention is the first and last thing SRE does. They decide the desired state of the application and, using pre-production testing, automate the resolution of incidents that could occur after delivery. So, before deployment, SRE defines the criteria for the desired state and implements tools that maintain the availability of the application and notify the team if any event occurs.
Once an alert is triggered, decide which category the incident belongs to and identify its severity. Then find the potential causes of the incident, find the right channel or team, and inform them. This whole process should be automated to maximize incident response coverage and minimize the mean time to detect (MTTD). SRE is responsible for understanding the SLOs and for implementing the tools or building a system that automates the whole operation. Automation requires continuous monitoring of the system to maintain the desired state of the application.
The immediate response should bring the application back to its desired state. Some incidents are known incidents, for which SRE normally creates automation: for instance, auto-scaling, freeing up disk space, or cleaning up the system. But some incidents need human attention; in these cases, the first priority should be to stop the bleeding. After that, the right experts investigate to find the root cause. Once they find the root cause through the long, iterative process of reproducing the incident, investigating, and failed attempts, they provide a permanent solution. It is then SRE’s responsibility to implement it in the system.
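The triage and response steps above can be sketched as a small Python routine: classify the severity, run an automated fix for known incidents, and escalate everything else to a human. The severity thresholds, incident type names, and remediation functions are all hypothetical placeholders, not part of any real incident management tool.

```python
from typing import Callable, Dict


def classify_severity(error_rate: float) -> str:
    """Map an observed error rate to a severity level (assumed thresholds)."""
    if error_rate >= 0.5:
        return "critical"
    if error_rate >= 0.1:
        return "major"
    return "minor"


def free_disk_space() -> str:
    # Placeholder for a real cleanup job.
    return "freed disk space"


def scale_out() -> str:
    # Placeholder for a real auto-scaling call.
    return "added capacity"


# Known incidents that already have an automated remediation.
KNOWN_REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "disk_full": free_disk_space,
    "traffic_spike": scale_out,
}


def respond(incident_type: str, error_rate: float) -> dict:
    """Run the automated fix for a known incident, otherwise escalate."""
    severity = classify_severity(error_rate)
    remediation = KNOWN_REMEDIATIONS.get(incident_type)
    if remediation is not None:
        return {"severity": severity, "action": remediation(), "escalated": False}
    # Unknown incidents need human attention.
    return {"severity": severity, "action": "page on-call engineer", "escalated": True}


if __name__ == "__main__":
    print(respond("disk_full", error_rate=0.05))
    print(respond("unknown_failure", error_rate=0.6))
```

Keeping the known remediations in a lookup table means a permanent fix found after a postmortem can be added as one more entry, which mirrors how a manual fix gradually becomes automation.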

SRE vs DevOps

DevOps aims to automate the process of delivering the application at high speed. DevOps is a combination of tools and cultural philosophies that help the development and operations teams collaborate. It enables faster product improvement and better quality by enforcing continuous integration and continuous delivery. SRE, on the other hand, mainly works on the reliability, scalability, and availability of the application.
Focus – SRE’s main focus is to keep the application in its desired state, protecting the SLAs and enhancing the availability and reliability of the system. DevOps mainly focuses on the development and delivery of the application. DevOps is responsible for delivering new features from the product side, whereas SRE’s responsibility is to make sure that new features or changes don’t damage the system. In other words, DevOps mainly moves changes from development to production, while SRE looks at things from the production side and can suggest ways to limit failures or protect production from unpredictable events.
Automation – In terms of automation, DevOps automates the build and deployment of the application using tools like Jenkins and Spinnaker, or scripting. SRE automates repetitive manual tasks on the infrastructure side in order to prevent system failures.

Problems Solved by DevOps

By implementing continuous integration and continuous delivery, DevOps reduces the cost of development and maintenance. It also helps to shorten the release cycle.
The quality of the code and the application also improves through continuous testing as part of continuous integration.

Problems Solved by SRE

SRE is responsible for keeping the application running. If an unfortunate incident happens in production, they can quickly roll back to the previous version to reduce the mean time to recovery (MTTR).
They also reduce the mean time to detect (MTTD).
They normally automate everything on the operations side to reduce unwanted events.
They also manage on-call and incident management to keep track of each event and restore the desired state, which improves the reliability of the application.
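To make MTTD and MTTR concrete, here is a minimal Python sketch that computes both from a list of incident timestamps. It assumes MTTD is the average time from the start of an incident to its detection and MTTR is the average time from the start to the resolution; definitions vary between teams, and the `Incident` record and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the problem began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when the desired state was restored


def _mean(durations: List[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """MTTD: average of (detected - started), by the assumption above."""
    return _mean([i.detected - i.started for i in incidents])


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """MTTR: average of (resolved - started), by the assumption above."""
    return _mean([i.resolved - i.started for i in incidents])


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    history = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=5),
                 day.replace(hour=10, minute=35)),
        Incident(day.replace(hour=12), day.replace(hour=12, minute=15),
                 day.replace(hour=12, minute=45)),
    ]
    print("MTTD:", mean_time_to_detect(history))
    print("MTTR:", mean_time_to_recovery(history))
```

Tracking these two numbers over time is one way to check whether better alerting and quicker rollbacks are actually paying off.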

Conclusion

As discussed above, SRE plays a crucial part in the application life cycle and in IT strategy. In the beginning, SRE was popular only in big organizations, but today it is becoming widespread in smaller companies as well. Having an SRE reduces the cost and workload of operations, system administration, and the other tasks that keep the system healthy.

Blog Pundits: Bhupender Rawat and Sandeep Rawat
