DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

CPU and DB were bored, yet every site timed out: a slow-read bot that starved Apache's workers

5 min read
I'm building a read-only context engine for Kubernetes and AI agents

6 min read
The Post-Mortem That Taught My System How to Fix Itself Using Hindsight

7 min read
I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.

4 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

5 min read
What is SRE? A Beginner's Guide to Site Reliability Engineering

5 min read
Ongrid: open-source ops AI agent for RCA and remediation from chat

1 min read
Incident Automation: What to Automate, What to Leave to Humans

2 min read
I built a small tool to answer a question I’ve asked too many times: is this production ready?

2 min read
DevOps Salaries & Hiring in India 2026: What 800+ Live Job Listings Reveal

2 min read
What Building Website Monitoring Taught Me About Silent Failures

5 min read
Infrastructure Drift: Detecting and Preventing It

2 min read
System Design - 20. Observability: The 3 Pillars, 4 Golden Signals, and How Netflix Debugs 100 Microservices

9 min read
Supercharging Kubernetes: How eBPF is Revolutionizing SRE and Platform Engineering

6 min read
The Engineer Who Owns Nothing: A Cautionary Tale

2 min read