observability/
Observability
The foundational observability primitives — incident response, structured logging, metrics, SLIs/SLOs, and distributed tracing — that any production system depends on, separate from AI-specific monitoring.
5 topics in this section
observability/incident-response/
Incident Response
The operational discipline that turns service degradation into a resolved problem, recognising that detection, severity routing, response coordination, communication, and learning are distinct activities that succeed or fail independently, and that the architecture supporting each determines whether incidents resolve in minutes or hours.
observability/logs/
Logging
The data architecture of system events — recognising that logs are an asset only when their structure, cardinality, retention tiers, sampling strategy, and security posture are designed for the questions you'll need to ask of them long after the events that produced them.
observability/metrics/
Metrics
Aggregated numerical signals about system behaviour, recognising that the choice of what to measure, how to label it, whether to push or pull it, what shape (counter, gauge, histogram) to give it, and how long to retain it at what fidelity is an architectural commitment that determines whether the metrics system answers the operational questions you actually have.
observability/sli-slo/
SLIs & SLOs
The discipline of converting reliability from a vague aspiration into a measured engineering target with a tradable budget — recognising that the SLI is a measurement, the SLO is a commitment, and the error budget is the architectural construct that connects the two to engineering decisions about velocity, risk, and prioritisation.
observability/traces/
Distributed Tracing
The cross-service observability primitive that makes a request's full journey through a distributed system inspectable — recognising that without trace context propagation, logs and metrics are local signals that can't be assembled into the system-level story they imply, and that the trace itself is what turns "the checkout is slow somewhere" into "the checkout is slow because service C's database call is taking 1.4s in the cache-miss path."