Site Reliability Engineering

https://is1-ssl.mzstatic.com/image/thumb/Podcasts112/v4/a5/da/bf/a5dabfe1-6f6e-1e7a-0cfb-c188822eaba2/mza_137989529075670926.png/600x600bb.jpg

The DevOps Dojo

Johan Abildskov

7 episodes

7 months ago

The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!

Technology

Education

RSS

All content for The DevOps Dojo is the property of Johan Abildskov and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Technology

Education

Site Reliability Engineering

The DevOps Dojo

7 minutes 1 second

5 years ago

Site Reliability Engineering

Site Reliability Engineering or SRE, is the next big thing after DevOps. In this episode I cover what SRE is, some of its principles and some of the challenges along the way. Sources https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/ https://landing.google.com/sre/books/ Transcript Site Reliability Engineering - Is it just traditional siloed Ops in disguised or is it a functioning DevOps organizational structure that works. SRE comes from the trenches of Google production environments, so many organizations look to SRE after having established DevOps teams. Site Reliability Engineering has some powerful concepts and tools, but it all comes at a price. If an organization has the scale, and the willingness to invest, and change their ways of working, SRE can help you go planet-scale. But there are many obstacles on the road to Site Reliability Engineering paradise. This is the DevOpsDojo #5, I am Johan Abildskov, join me in the dojo to learn. Site Reliability engineering is the fabled world of planet scaled operations, with autonomous DevOps teams. This seems like an oxymoron but is a functional DevOps organizational structure. Sloss, who coined the term Site Reliability Engineering, states - "Site Reliability Engineering is what happens when you ask a software engineer to design an operations team". SRE is how Good is able to run their production environments in large scale. Both in terms of how many applications and services they run and how many engineers are required to develop and maintain those products, but also in terms of the global scale of their applications. Site Reliability Engineering is about scaling services to world-class availability, but it is also about scaling the engineering organization such that teams can continue to be productive as the business grows. For me the core tenants of Site Reliability Engineering are: Minimizing Toil, Shared Ownership and the ability to say no. Engineering is a key part of SRE, and toil is the opposite. Toil is the work that does not, in the long run, add value, or that does not require engineering. This means that if we do too much work on toil, we will not be able to scale superlinearly with the number of engineers. Examples of toil are manual deployment procedures and following complex processes. Google claims to be able to have their teams work with thirty to fifty per cent toil. With fifty per cent being a hard limit. A key point here is that if a team violates this, they will either shed responsibilities or get resources added. This requires a hard buy-in from management in supporting this. This will likely be impossible in a project funded organization. So this way of thinking of eliminating toil, and give the space in the teams to do real engineering work, is the foundation of Site Reliability Engineering. Shared Ownership is a weird construct when the point is to run the applications that someone else is building. But it is a very important pillar of SRE. This shared ownership comes from SRE teams providing valuable metrics from production environments to product teams. It also comes from a shared agreement that if the application itself is too unstable, developers will join the SREs on-call rotation until the service has been restored to the quality the SREs can operate. SREs also helps product teams with production readiness reviews and checklists. All in all SRE enables the product teams to run applications at planet-scale without having to maintain all the competencies that this necessarily requires inside the team. This again requires commitment from management in enforcing the requirements from the SRE teams, so we do not revert back to throwing applications over the wall of confusion. A powerful realisation is that we should consider downtime a resource, and spent it deliberately and respectfully. With this realisation comes the underlying acceptance that absolute 100% availability is not the right