Reliability

Incident Management
A coordinated process for detecting, prioritizing, fixing, and learning from service outages or other unplanned problems so that systems return to normal quickly.
Service Level Agreement (SLA)
A contract that defines expected service uptime, performance, and support response times between a provider and a customer.
SRE (Site Reliability Engineering)
A discipline that applies software engineering practices to operations work in order to keep production services reliable, available, and fast.
Circuit Breaker
A pattern that stops sending calls to a failing service for a cooling-off period and fails fast instead, which reduces cascading failures during outages.
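The pattern can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not any particular library's API: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical single-threaded breaker: closed -> open -> trial call."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooling-off period in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling off: fail fast without touching the service.
                raise CircuitOpenError("circuit open; failing fast")
            # Cooling-off period elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the remote service, and treat `CircuitOpenError` as a signal to use a fallback.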
Dead Letter Queue (DLQ)
A queue that holds messages a consumer could not process, isolating them from the main flow for later inspection or retry.
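As a rough illustration, the sketch below uses in-memory Python structures in place of a real message broker. The function name `process_with_dlq`, the `max_attempts` parameter, and the shape of the dead-letter record are assumptions made for this example; real brokers configure dead-lettering in their own ways.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Run handler on each message; park repeated failures in a dead letter list."""
    main_queue = deque((msg, 1) for msg in messages)  # (message, attempt number)
    dead_letters = []
    while main_queue:
        msg, attempt = main_queue.popleft()
        try:
            handler(msg)
        except Exception as exc:
            if attempt >= max_attempts:
                # Give up: isolate the message with enough context to debug it.
                dead_letters.append(
                    {"message": msg, "error": repr(exc), "attempts": attempt}
                )
            else:
                # Requeue at the back so healthy messages are not blocked.
                main_queue.append((msg, attempt + 1))
    return dead_letters
```

The returned list plays the role of the DLQ: an operator or a separate job can inspect each record, fix the underlying cause, and resubmit the message.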
Error Budget
The amount of downtime or errors a service can accumulate within a time window before it breaches its service level objective (SLO). For a 99.9% SLO, the error budget is the remaining 0.1%.
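The arithmetic is simple enough to show directly. The helper below is an illustrative sketch for a request-based SLO; the function name `error_budget` and its parameters are hypothetical, not taken from any tool.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for a request-based SLO.

    slo_target is a fraction such as 0.999 for a 99.9% success objective.
    """
    allowed = (1.0 - slo_target) * total_requests  # failures the SLO tolerates
    remaining = allowed - failed_requests          # negative means SLO breached
    return allowed, remaining
```

With a 99.9% objective and 1,000,000 requests in the window, the budget works out to roughly 1,000 failed requests; after 250 failures about 750 remain.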
gRPC Deadline
A per-RPC time limit, set by the client and propagated to the server, after which the call is abandoned and fails with a DEADLINE_EXCEEDED status.
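The idea of one time limit shared by every step of a request can be sketched in plain Python. This is deliberately not the gRPC API: the `Deadline` class, `DeadlineExceeded` exception, and `handle_request` function are hypothetical names for this illustration. In the gRPC Python library itself, a client typically sets the limit with the `timeout=` argument on a stub call, and a server can read what is left through `context.time_remaining()`.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when work is attempted after the deadline has passed."""


class Deadline:
    """Hypothetical absolute deadline shared by every step of one request."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s  # fixed point in time, not a duration

    def remaining(self):
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self.remaining() <= 0.0:
            raise DeadlineExceeded("deadline exceeded")


def handle_request(deadline, steps):
    """Run steps in order, passing each the time still left on the deadline."""
    results = []
    for step in steps:
        deadline.check()  # stop early instead of doing work nobody is waiting for
        results.append(step(deadline.remaining()))
    return results
```

Because the deadline is an absolute point in time, each downstream step sees a shrinking remainder instead of a fresh full timeout, which is what keeps a slow chain of calls from outliving the original caller's patience.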
Service Level Objective (SLO)
A clear, measurable target for how reliable or fast a service must be over a set time window.
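A latency objective such as "90% of requests complete within 300 ms" can be checked with a short calculation. The sketch below assumes latencies are available as a plain list of milliseconds; the function name `meets_latency_slo` and its parameters are hypothetical.

```python
def meets_latency_slo(latencies_ms, threshold_ms, target_fraction):
    """Return (measured_fraction, slo_met) for a latency objective.

    measured_fraction is the share of requests at or under threshold_ms.
    """
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    measured = good / len(latencies_ms)
    return measured, measured >= target_fraction
```

The measured fraction is the service level indicator; the objective is met when it is at or above the target for the chosen window.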
Uptime
The percentage of time a system or service is up, running, and available to users.
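Uptime is usually computed as available time divided by total time. The helpers below are an illustrative sketch with hypothetical names; the 30-day window is an assumption chosen for the example.

```python
def uptime_percent(total_seconds, downtime_seconds):
    """Percentage of the period during which the service was available."""
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


def allowed_downtime_minutes(target_percent, window_days=30):
    """Downtime in minutes that still satisfies target_percent over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0
```

For instance, a 99.9% target over 30 days leaves about 43.2 minutes of downtime, while 99.99% leaves about 4.3 minutes.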
Chaos Engineering
Deliberately and safely breaking parts of a system to see what happens, then fixing weak spots so it stays reliable under stress.
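One small-scale form of this is fault injection: wrapping a dependency so that a controlled fraction of calls fail on purpose. The sketch below is a toy illustration, not a chaos engineering tool; `with_fault_injection`, `InjectedFault`, and `failure_rate` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call when a fault is injected."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    rng = rng or random.Random()  # pass a seeded Random for repeatable experiments

    def wrapper(*args, **kwargs):
        # random() returns a value in [0.0, 1.0), so 0.0 never fails
        # and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("fault injected for experiment")
        return func(*args, **kwargs)

    return wrapper
```

Running the wrapped dependency in a test or staging environment with a small `failure_rate` shows whether callers retry, fall back, or crash, which is the kind of weak spot the practice is meant to expose.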