Incident Management
A coordinated process for detecting, prioritizing, resolving, and learning from service outages and other unplanned problems so that systems return to normal quickly.
Reliability
DevOps glossary terms in Reliability.
A Service Level Agreement (SLA) is a contract between a provider and a customer that defines expected service uptime, performance, and support response times.
Reliability
Site Reliability Engineering (SRE) is the practice of applying software engineering to operations so that production services stay reliable, available, and fast.
Reliability
A Circuit Breaker is a pattern that temporarily stops calls to a failing service, then lets them through again once it recovers, to limit cascading failures during outages.
Reliability
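The circuit breaker entry above can be illustrated with a minimal sketch. It assumes a simple policy that the glossary does not specify: the circuit opens after a fixed number of consecutive failures and allows one trial call after a timeout. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (assumed policy)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast without touching the failing service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)`, and treat `CircuitOpenError` as a signal to fall back or return a cached value. Production breakers usually add thread safety and failure-rate windows, which this sketch omits.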
A Dead Letter Queue (DLQ) is a queue that holds messages that could not be processed, isolating them from the main flow for later inspection or retry.
Reliability
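As a rough illustration of the dead letter queue idea, the sketch below uses an in-memory `deque` in place of a real message broker. The function name `process_with_dlq`, the `max_attempts` parameter, and the shape of the dead-letter record are all assumptions made for this example.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Run handler on each message; park messages that keep failing in a DLQ."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    dead_letters = deque()
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(message)
                break  # processed successfully, move on to the next message
            except Exception as exc:
                last_error = exc
        else:
            # Every attempt failed: isolate the message instead of blocking the rest.
            dead_letters.append(
                {"message": message, "error": repr(last_error), "attempts": max_attempts}
            )
    return dead_letters
```

Keeping the error and attempt count alongside the message is one design choice that makes later inspection easier. Managed brokers typically offer their own DLQ configuration, which this sketch does not model.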
An Error Budget is the amount of downtime or errors a service can accumulate over a time window before it breaches its reliability goal (SLO).
Reliability
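The arithmetic behind an error budget can be sketched as follows, assuming an availability-style SLO expressed as a fraction (for example 0.999) over a fixed window. The helper names `error_budget` and `budget_remaining` are hypothetical.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` for an availability SLO such as 0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # The budget is whatever fraction of the window the SLO does not cover.
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after `downtime`; a negative result means the SLO is breached."""
    return error_budget(slo_target, window) - downtime


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(error_budget(0.999, timedelta(days=30)))
```

Teams that measure reliability by request success rate rather than time would apply the same idea to request counts: the budget is the total number of requests multiplied by (1 - SLO).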
A gRPC deadline is a per-RPC time limit, set by the client, after which the call stops waiting and fails with a DEADLINE_EXCEEDED status.
Reliability
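The sketch below illustrates the deadline concept in plain Python and deliberately does not use the gRPC API. It models a deadline as an absolute expiry time that is checked before each unit of work, so later steps see only the remaining budget. `Deadline`, `DeadlineExceeded`, and `handle_request` are hypothetical names. In the Python gRPC library itself, a deadline is supplied as the `timeout` argument on a stub call, and expiry surfaces as `grpc.StatusCode.DEADLINE_EXCEEDED`.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when work is attempted after the deadline has passed."""


class Deadline:
    """An absolute expiry time derived from a timeout (illustrative only)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def remaining(self):
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self.remaining() <= 0.0:
            raise DeadlineExceeded("deadline exceeded")


def handle_request(deadline, steps):
    """Run steps in order, passing each the time still available."""
    results = []
    for step in steps:
        deadline.check()  # stop early instead of doing work nobody is waiting for
        results.append(step(deadline.remaining()))
    return results
```

Passing the remaining time to each step mirrors how a deadline shrinks as a request travels through a chain of services.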
A Service Level Objective (SLO) is a clear, measurable target for how reliable or fast a service must be over a set time window.
Reliability
Uptime is the percentage of time a system or service is up, running, and available to users.
Reliability
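The percentage described above can be computed with a small helper. This is a sketch that assumes downtime is measured in seconds over a known observation period; `uptime_percent` is a hypothetical name.

```python
def uptime_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Percentage of the observation period during which the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


if __name__ == "__main__":
    # One hour of downtime in a 30-day month.
    month = 30 * 24 * 3600
    print(f"{uptime_percent(month, 3600):.3f}%")
```

How downtime is counted (planned maintenance, partial outages, degraded performance) varies by organization and is not defined by this sketch.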
Chaos Engineering is the practice of deliberately and safely breaking parts of a system to see how it behaves, then fixing the weak spots so it stays reliable under stress.
Reliability