What is Site Reliability Engineering (SRE)? Fundamentally, it’s what happens when you ask a software engineer to design an operations function. SRE is an engineering discipline focused on the reliability, availability, and performance of software systems, whether web applications or systems software. The term covers both a specialized role within an organization and a methodology for designing, building, and operating large distributed systems reliably.
Site Reliability Engineering originated at Google, where the first SRE team was formed in 2003, and the term describes the company's internal operations model. The goal of an SRE team is to create and maintain a platform that can be deployed and updated easily and frequently with minimal disruption to services or users. To achieve this goal, the SRE team usually works closely with other teams, such as developers and designers. In large organizations, the SRE team also maintains an organizational structure that allows it to move quickly and coordinate projects.
This post is a curated list of awesome Site Reliability and Production Engineering resources. These resources include books, articles, blogs, and newsletters covering topics such as culture, reliability, monitoring, planning, SLAs, and more.
Books
Culture
How Google Does Planet-Scale Engineering for Planet-Scale Infra
Site Reliability Engineers — Keeping Google up and running 24/7
Transactional System Administration Is Killing Us and Must be Stopped
PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
We are the Google Site Reliability Engineering team. Ask us Anything!
The Irreproducibility Of Bugs In Large-Scale Production Systems
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
The difference between Site Reliability Engineering, System Administration, and DevOps
Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate
Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
Making the most of an SRE service takeover – CRE life lessons
The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
Beyond Google SRE: What is Site Reliability Engineering like at Medium?
Intelligent Site Reliability Engineering – A Machine Learning Perspective
Understanding Site Reliability Engineering through Movies and Books
GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
The Makeup of Successful Geographically-Distributed SRE Teams
Practical Applications of the Dickerson Pyramid by Nat Welch
LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations
How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
Reliability Engineering – The Essential Discipline for Complex Systems
Transitioning a typical engineering ops team into an SRE powerhouse
Team
Education
From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
The Systems Engineering Side of Site Reliability Engineering
Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?
Do you have an SRE team yet? How to start and assess your journey
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
School of SRE: Curriculum for onboarding non-traditional hires and new grads
Hiring
Growing the Site Reliability Team at LinkedIn: Hiring is Hard
Engineering Manager – Site Reliability Engineering Interview Preparation
Reliability
The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
The infrastructure behind Twitter: efficiency and optimization
Using load shedding to survive a success disaster – CRE life lessons
How to avoid a self-inflicted DDoS Attack – CRE life lessons
How Google Backs Up The Internet Along With Exabytes Of Other Data
Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Designing reliable systems with cloud infrastructure (Google Cloud Next ’17)
Know thy enemy: how to prioritize and communicate risks – CRE life lessons
CRE life lessons: What is a dark launch, and what does it do for me?
Google: A Collection Of Best Practices For Production Services
Trust By Design: The Fusion of Operational Maturity and Risk Modeling