DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Three AI providers went down on the same day. Here's the architecture that didn't care.

5 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

5 min read
Sliding-Window Spend Guard: the $47K Loop Per-Call Caps Miss

11 min read
Graceful Degradation: Circuit Breakers for External API Dependencies

5 min read
Building a Chaos Testing Harness for Multi-Region Video API Endpoints

10 min read
Error budgets when downtime costs money: reliability engineering for payment-critical systems

10 min read
Distributed Tracing 101: The Mental Model, the Standards, and Your First Pipeline

5 min read
Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization

17 min read
Monitoring and Logging: How They Work Together and When You Need Both

8 min read
AI SRE: What an Autonomous Agent Doing On-Call Actually Looks Like

6 min read
MCP Server Monitoring: How to Keep AI Agent Infrastructure Reliable

6 min read
Deploying Production Systems on Raspberry Pi: Lessons from the Field

7 min read
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures

5 min read
Model Selection for Weibull Series Systems: When Simpler Models Suffice

3 min read
The Economics of Reliability: When to Invest, When to Accept Risk

2 min read