Engineering
Playbooks
Step-by-step playbooks for common architecture challenges.
3 topics in this section
playbooks/api-lifecycle/
API Lifecycle
A strategic guide to managing APIs across their full lifecycle. Five things determine the outcome: design discipline before first release, a versioning strategy that lets multiple versions coexist, a deprecation policy with announced timelines and migration support, consumer migration tooling that makes upgrading tractable, and retirement discipline that actually removes old versions. Together they decide whether an API becomes long-lived infrastructure that consumers depend on confidently, or a perpetual compatibility burden whose old versions never die because nobody can afford to break the consumers who never migrated.
playbooks/migration/
Migration Playbook
A strategic guide to executing migrations from one architecture to another. Five things determine the outcome: the choice between strangler-fig incrementalism and big-bang cutover, an abstraction layer that lets old and new coexist during the transition, data migration discipline that keeps both systems consistent during the overlap, a cutover design that makes the traffic shift safe and reversible, and decommissioning that actually removes the legacy system. Together they decide whether a migration finishes as planned or stretches on indefinitely, with two systems running in parallel because nobody can afford the risk of fully cutting over to one or fully removing the other.
playbooks/resilience/
Resilience Playbook
A strategic guide to designing systems that absorb and recover from failure. Five things determine the outcome: a failure-mode inventory rather than uptime targets, layered resilience patterns applied per tier (timeout, retry, circuit breaker, bulkhead, fallback), explicit RPO and RTO targets per service rather than blanket assumptions, a chaos engineering practice that surfaces unknown failure modes before they surface themselves, and a recovery rehearsal discipline that proves the system can actually fail back. Together they decide whether a system is meaningfully resilient or merely lacks evidence of being fragile until the next unlikely event reveals it.
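As a rough illustration of how some of the patterns named above might layer, here is a minimal Python sketch that combines retry with exponential backoff, a circuit breaker, and a fallback. It is not taken from the playbook itself: the names `CircuitBreaker`, `with_retry`, and `resilient_call`, along with every threshold and delay, are hypothetical. The timeout is assumed to be enforced inside the wrapped call (for example, a client-side socket timeout), and the bulkhead is omitted.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then
    allows one trial call once the cool-down has elapsed (half-open)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit immediately.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def with_retry(fn: Callable[[], T], attempts: int = 3, base_delay_s: float = 0.1,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with exponential backoff; never retry an open circuit."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying would only hammer a dependency known to be down
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


def resilient_call(fn: Callable[[], T], breaker: CircuitBreaker,
                   fallback: Callable[[], T], attempts: int = 3,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Layering: retry wraps the breaker, and the fallback catches the rest."""
    try:
        return with_retry(lambda: breaker.call(fn), attempts=attempts, sleep=sleep)
    except Exception:
        return fallback()
```

In this sketch the retry sits outside the breaker so that every attempt counts toward the failure threshold, and an open circuit short-circuits straight to the fallback instead of being retried. Other orderings are possible, and the right one depends on the tier.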