Last updated on 10/12/22

Recruit an SRE to Manage DevOps Within Your Organization

The DevOps methodology created a whole new role in the early 2000s: the SRE, or Site Reliability Engineer. There are only a few organizations today that have this role, but it's always useful to understand what it is, even if it's just to be able to put a name to a real-life implementation of DevOps.

When you're Google and you’re managing multiple petabyte systems on several continents, you need to think of a whole new paradigm for your operations, based on an automated approach. It's simply impossible for a human operator to manage thousands of servers.

Since then, the SRE role has started to appear outside of Google’s offices, in companies such as Apple, Facebook and Microsoft.

It's certainly not a Google-specific role. The company Clever Cloud, a cloud computing service provider, has only five developers handling the production systems for tens of thousands of applications, keeping them running smoothly for several hundred clients. They do this by maximizing their use of automation, including using robots to quickly detect faults, software problems or incompatibility issues, and providing automatic fixes without any human intervention.

I’ll be running through the ins and outs of the SRE role later in this chapter. But first of all, we need to define a few key terms to help you understand the next part of the course.

Service Level Indicator

A Service Level Indicator (SLI) is a metric that measures a specific aspect of a service's behavior. This might be an application's processing time, the number of errors generated when uploading some data or the number of deployments required to fix a bug. Each team can define its own indicators that will be monitored on a daily basis. These indicators are role-specific and of great value to the team that defines them. They serve as a reference when defining the SLO.

Service Level Objective

A Service Level Objective (SLO) is a target value for a service level, measured by one or more SLIs and agreed between the service provider and its users. The objective is jointly owned to prevent any problems between the two parties. It needs to follow the S.M.A.R.T. guidelines (Specific, Measurable, Achievable, Realistic, Time-bound). Examples of SLOs include request response times, service availability or the number of calls answered in under 5 minutes. SLOs serve as a basis for SLAs.
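To make the SLI/SLO relationship concrete, here is a minimal sketch in Python: it computes an availability SLI (the fraction of requests served successfully) and checks it against a hypothetical 99.9% SLO. The request counts and the target are made-up sample values, not figures from any real service.

```python
# Compute an availability SLI and check it against an SLO target.
# All numbers below are illustrative, not real production data.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means no failures
    return successful_requests / total_requests

SLO_TARGET = 0.999  # hypothetical SLO: 99.9% of requests succeed

sli = availability_sli(successful_requests=99_950, total_requests=100_000)
print(f"SLI: {sli:.4%}")  # SLI: 99.9500%
print("SLO met" if sli >= SLO_TARGET else "SLO violated")
```

If this SLI were then written into a contract with customers, with penalties attached, it would become the basis of an SLA.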

Service Level Agreement

A Service Level Agreement (SLA) is a contract between a service provider and its customers that sets out the service levels to be delivered, typically with consequences if they are not met. This indicator is therefore extremely important, because a customer could decide to stop using the service if the agreement isn't adhered to precisely. It's also the indicator most relevant to the SRE when the team is set up.

The Role of the SRE

An SRE has two roles:

  • Half of their time is spent on Ops tasks, such as on-call duties, fixing production problems or manual intervention.

  • The other half of their time is spent on automating all or part of their duties and developing new functionality.

In my opinion, the time dedicated to automating repetitive tasks is where the true power of the SRE lies. It's what allows the SRE to help create scalable, "self-healing" applications.

SRE Pillars

The SRE role is based around the five pillars of the C.A.L.M.S. acronym, as we saw in chapter 2:

  • Accepting that faults are normal (Culture).

  • Favoring the use of automation tools (Automation).

  • Implementing changes gradually (Lean).

  • Measuring everything (Measurement).

  • Reducing organizational silos (Share).

The SRE implements these pillars as follows.

Reducing organizational silos:

  • The SRE shares ownership of the applications with developers, to create shared responsibilities.

  • The SRE uses the same tools as the developers and vice versa.

Accepting that faults are normal:

  • The SRE accepts the risk.

  • The SRE quantifies the faults and availability standards using SLIs and SLOs.

  • The SRE insists upon blameless postmortems.

Implementing changes gradually:

  • The SRE encourages developers and product owners to progress quickly while reducing costs related to faults.

Favoring the use of automation tools:

  • The SRE specifically focuses on automating manual, repetitive, and laborious tasks with no enduring value (known as “toil” in the industry).

Measuring everything:

  • The SRE defines standard methods to measure different aspects of systems.

  • The SRE has a fundamental belief that the functioning of systems is a software task, not a human task.

The Four Golden Signals

The SRE defines four signals that must always be monitored. These are known as the four golden signals: latency, traffic, errors, and saturation.

These are extremely important as they are essential in guaranteeing a high level of application availability. I'll run through these in detail and give you some tools to help you monitor them.

Latency

Latency is the time it takes to send a request and receive a response. Latency is generally measured on the server side, but can also be measured on the client side to take account of different network speeds. The Ops team has the best understanding of server-side latency, whereas client-side latency is more relevant to end users.

The target threshold you select might vary depending on the type of application. An automated system, such as an API or a server, might need a much shorter response time than a human using a cellphone. You must also monitor latency separately for successful requests and failed requests, because failed ones often fail quickly without any additional processing.
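The advice above about tracking successful and failed requests separately can be sketched in a few lines of Python: failures often return quickly, so mixing them in would artificially lower the latency figures. The sample latencies below are invented for illustration, and the nearest-rank percentile function is one simple choice among several.

```python
# Compute latency percentiles separately for successful and failed
# requests. Latencies (in ms) are hypothetical sample data.
samples = [
    (120, True), (95, True), (310, True), (88, True), (140, True),
    (12, False), (9, False), (450, True), (105, True), (15, False),
]

ok = sorted(ms for ms, success in samples if success)
failed = sorted(ms for ms, success in samples if not success)

def percentile(data, p):
    """Nearest-rank percentile of a sorted, non-empty list."""
    k = max(0, min(len(data) - 1, round(p / 100 * len(data)) - 1))
    return data[k]

print("success p50:", percentile(ok, 50), "ms")   # 120 ms
print("success p95:", percentile(ok, 95), "ms")   # 450 ms
print("failure p50:", percentile(failed, 50), "ms")  # 12 ms
```

Note how the failure median (12 ms) is an order of magnitude below the success median: averaged together, the failures would mask a real latency problem.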

Traffic

Traffic is a measurement of the number of requests in circulation within the network. It might relate to HTTP requests to your web server or API, or messages sent to a processing queue. Peak traffic periods can create additional pressure on your infrastructure, pushing it to its limits, and this can have effects on downstream systems.

This is a key signal, because it helps you distinguish a genuine increase in load from an inappropriate capacity configuration, which can cause problems even during periods when traffic is light. For distributed systems, it can also help you plan capacity in advance to respond to future needs.
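As a minimal sketch, a traffic signal can be derived from raw request timestamps and compared against the capacity the system was sized for. The timestamps, window and capacity figure below are all hypothetical.

```python
# Derive a requests-per-second traffic signal and flag when it
# approaches provisioned capacity. All values are illustrative.
request_timestamps = [0.1, 0.4, 0.5, 0.9, 1.2, 1.3, 1.8, 2.0, 2.1, 2.4]
window_seconds = 3.0
capacity_rps = 4.0  # hypothetical provisioned capacity

rps = len(request_timestamps) / window_seconds
print(f"traffic: {rps:.1f} req/s")  # traffic: 3.3 req/s
if rps > 0.8 * capacity_rps:
    print("warning: approaching provisioned capacity")
```

Alerting at 80% of capacity rather than at the limit itself gives the team time to scale before downstream systems feel the pressure.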

Errors

Errors can provide you with information about infrastructure configuration errors, bugs in the code or missing dependencies. For example, a spike in errors could indicate a database malfunction or a network issue.

After a code deployment, it can indicate bugs in the code that were not uncovered during testing or that were only revealed in the production environment. The error message will give you more precise information about the problem. Errors can also affect other metrics, as they might artificially reduce latency, or repeated attempts could end up saturating other distributed systems.
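One common way to surface the post-deployment bugs described above is to compare the current error rate against a recent baseline and alert on a spike. The window counts and the 5x threshold below are assumptions chosen for illustration.

```python
# Flag an error-rate spike by comparing the current window against a
# baseline of recent windows. Counts are hypothetical sample data.
baseline_windows = [(2, 1000), (3, 1100), (1, 950)]  # (errors, requests)
current_errors, current_requests = 48, 1020

baseline_rate = (sum(e for e, _ in baseline_windows)
                 / sum(r for _, r in baseline_windows))
current_rate = current_errors / current_requests

print(f"baseline: {baseline_rate:.2%}, current: {current_rate:.2%}")
if current_rate > 5 * baseline_rate:  # spike threshold is an assumption
    print("alert: error spike, possible bad deploy or dependency failure")
```

In practice the alert would also carry the dominant error message, since, as noted above, that is what points you to the actual problem.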

Saturation

Saturation defines the load placed on your network and server resources. Each resource has a limit beyond which performance will start to degrade or the resource might become unavailable. This applies to resources such as CPU usage, memory usage, disk capacity and the number of transactions per second. You need to understand how your distributed system has been designed and how it performs to know which parts of the service might be the first to become saturated. These measurements are early indicators, giving you time to adjust capacity before performance is affected.

Reaching the saturation limit can affect your service in different ways. For example, CPU working at full capacity can result in delayed responses. If storage is full, there will be failures when attempting to write to disk. And network saturation might lead to packet loss.
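The early-warning idea above can be sketched as a simple check of each resource's utilization against a threshold set below its hard limit. The utilization values and thresholds are illustrative assumptions, not recommendations.

```python
# Check resource utilization against early-warning thresholds, so
# capacity can be adjusted before performance degrades.
# All figures are hypothetical.
utilization = {"cpu": 0.72, "memory": 0.91, "disk": 0.55}
thresholds = {"cpu": 0.80, "memory": 0.85, "disk": 0.90}

for resource, used in utilization.items():
    limit = thresholds[resource]
    status = "SATURATING" if used >= limit else "ok"
    print(f"{resource}: {used:.0%} (limit {limit:.0%}) -> {status}")
```

Here only memory crosses its threshold, which is exactly the kind of early signal that lets you add capacity before writes start failing or responses slow down.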

A Final Few Words

And there you have it! We’re almost at the end of this course. I hope you now have a better understanding of the role of the SRE. Meet me in the course quiz, where you can test your knowledge about the DevOps approach and where it fits within an organization. It’s been a real pleasure sharing my passion for DevOps with you!
