**DevOps & SRE Excellence — streamlining delivery via automation, IaC, and observability.** DevOps is the operational backbone of modern delivery: the difference between an engineering team that ships weekly and a team that doesn't. This page describes how we approach pipelines, infrastructure as code, GitOps, observability, and the supply-chain security that has become non-negotiable.
DevOps started as a cultural movement against the dev-vs-ops wall and matured into an engineering discipline with specific patterns. The discipline has four pillars: continuous delivery (the pipeline), infrastructure as code (the platform), observability (the visibility), and supply-chain security (the trust). Each pillar has its own tooling, its own maturity progression, and its own way of failing when neglected. A team that does CI/CD well but treats infrastructure as a manual process pays for the imbalance in unreliable environments; a team that has perfect infrastructure but no observability ships fast and is blind in production.
Doing DevOps well means investing in all four pillars, with the discipline to keep them current as the team and the system grow. The tools change every two years; the principles do not.
Six principles
1. Pipelines are code; treat their reliability with the same rigor
A CI/CD pipeline is the gate between every code change and production. When the pipeline is flaky, every team is slowed; when it is unreliable, releases happen with bypassed gates that come back as production incidents. Treating the pipeline as a first-class engineering artefact — versioned, reviewed, tested, and operated — is the difference between fast delivery and delivery theatre. Pipelines defined in YAML in the repo, with tests for the pipeline itself, with explicit ownership, with monitoring that flags flake rates, are what separate teams that ship reliably from teams that ship occasionally.
Architectural implications
Pipelines are defined in the repository (GitHub Actions workflows, GitLab CI, Argo Workflows) and versioned alongside the code they build.
Pipeline failures are diagnosed and fixed, not retried until they pass; flake rates are tracked as a quality metric.
Pipeline performance (build time, test time) is measured; regressions are addressed because slow pipelines silently encourage workflow shortcuts.
Reference
GitHub Actions Documentation — the reference for pipeline-as-code patterns now widely adopted across the industry.
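Tracking flake rate as a quality metric can be sketched in a few lines. In this Python sketch (the run-history format and function name are hypothetical), a commit whose pipeline both failed and passed with no code change is counted as flaky — the failure was the pipeline's fault, not the code's:

```python
from collections import defaultdict

def flake_rate(runs):
    """Estimate pipeline flake rate from CI run history.

    `runs` is a list of (commit_sha, passed) tuples in execution order.
    A commit that has both a failing and a passing run is counted as a
    flake, since the code itself did not change between runs.
    """
    outcomes = defaultdict(set)
    for sha, passed in runs:
        outcomes[sha].add(passed)
    flaky = sum(1 for results in outcomes.values() if results == {True, False})
    return flaky / len(outcomes) if outcomes else 0.0

runs = [
    ("abc123", False), ("abc123", True),   # failed, then passed on retry: flake
    ("def456", True),                      # clean pass
    ("789fff", False), ("789fff", False),  # consistent failure: real break
]
print(f"flake rate: {flake_rate(runs):.0%}")  # 1 of 3 commits flaky
```

A real implementation would pull run history from the CI provider's API and alert when the rate crosses a threshold; the point is that flake is measurable, not folklore.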
2. Infrastructure as code or no infrastructure
Infrastructure that exists outside of code drifts. The team that "documented the production setup" two years ago has documentation that no longer matches reality, and the only way to know what's actually running is to look. Infrastructure as code (Terraform, Pulumi, AWS CDK, Bicep) makes the desired state explicit and auditable; the running state is verified to match the code; changes are reviewed as code changes are reviewed. The discipline scales — a team of three can hold the running state in their heads; a team of thirty cannot, and the team that didn't adopt IaC at five engineers regrets it at fifty.
Architectural implications
All infrastructure (network, compute, identity, data services) is defined in Terraform/Pulumi/CDK; manual changes are treated as incidents to be reverted.
Drift detection runs continuously; deviations between code and reality are surfaced and reconciled.
Modules are reusable across environments — dev, staging, production differ in size and configuration, not in shape.
Reference
Terraform — the de facto standard for cloud-agnostic infrastructure as code, with a rich provider ecosystem and well-developed patterns for managing state, modules, and drift.
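The drift-detection idea can be made concrete with a small Python sketch. The resource names and attribute shapes below are invented for illustration; real tooling (e.g. `terraform plan`) does this comparison against live provider APIs:

```python
def detect_drift(desired, actual):
    """Compare desired state (from IaC) with actual state (from the cloud API).

    Both are dicts of resource name -> attribute dict. Returns resources that
    are missing, unmanaged (created outside IaC), or changed out-of-band.
    """
    drift = {"missing": [], "unmanaged": [], "changed": {}}
    for name, want in desired.items():
        if name not in actual:
            drift["missing"].append(name)
        elif actual[name] != want:
            drift["changed"][name] = {
                k: (want.get(k), actual[name].get(k))
                for k in want.keys() | actual[name].keys()
                if want.get(k) != actual[name].get(k)
            }
    drift["unmanaged"] = [n for n in actual if n not in desired]
    return drift

desired = {"web-sg": {"port": 443, "cidr": "10.0.0.0/8"}}
actual  = {"web-sg": {"port": 443, "cidr": "0.0.0.0/0"},   # opened by hand
           "debug-vm": {"size": "large"}}                  # created outside IaC
report = detect_drift(desired, actual)
```

Running this continuously, rather than at apply time only, is what turns "manual changes are incidents" from a policy statement into an enforced one.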
3. GitOps makes the desired state visible
The natural extension of IaC is GitOps: the git repository is not just the source of code, it is the source of truth for the desired state of the running system. A controller (Argo CD, Flux) continuously reconciles the running state to the repository state. Deployments happen by merging to a branch; rollbacks happen by reverting commits; the audit trail of what's running is the git log of the deployment branch. The mental model is declarative — describe what should be running, let the controller make it so — instead of imperative — run a deployment script and hope.
Architectural implications
Production state lives in a git repository; controllers continuously reconcile running clusters to that state.
Deployments are pull requests; rollbacks are revert commits; the change log of production is the git log.
Drift detection is structural — the controller flags any difference between desired and actual state, not just changes initiated through the controller.
Reference
OpenGitOps Principles — the formal articulation of GitOps as a discipline (declarative, versioned, automatically pulled, continuously reconciled).
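A toy reconciliation loop makes the declarative model concrete. This Python sketch (all names hypothetical, far simpler than Argo CD or Flux) converges a cluster on the desired state from git and prunes anything git no longer declares:

```python
def reconcile(desired, cluster):
    """One reconciliation pass: make `cluster` match `desired`.

    `desired` maps workload name -> spec (what git says should run);
    `cluster` is mutated in place to converge on it. Returns the actions
    taken, which a real controller would log and emit as events.
    """
    actions = []
    for name, spec in desired.items():
        if cluster.get(name) != spec:
            verb = "update" if name in cluster else "create"
            cluster[name] = dict(spec)
            actions.append((verb, name))
    for name in [n for n in cluster if n not in desired]:
        del cluster[name]                  # pruning: git is the source of truth
        actions.append(("delete", name))
    return actions

desired = {"api": {"image": "api:v2", "replicas": 3}}
cluster = {"api": {"image": "api:v1", "replicas": 3},
           "old-job": {"image": "job:v1"}}
actions = reconcile(desired, cluster)
```

A rollback in this model is just reverting `desired` to the previous commit's content and letting the same loop run again — no separate rollback machinery.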
4. Observability is one question, not three pillars
The "three pillars of observability" — logs, metrics, traces — is a useful taxonomy and a misleading framing. The pillars suggest three separate systems with three separate tools; the actual question observability answers is one: what is happening in this system right now, and why? Modern observability platforms (OpenTelemetry-instrumented stacks, Honeycomb, Grafana Tempo, Datadog) treat the three signals as facets of a single underlying telemetry stream. The discipline is not "we have logs, metrics, and traces" — it is "we can ask any question about production behaviour and get an answer in seconds."
Architectural implications
Telemetry is collected with OpenTelemetry, the vendor-neutral standard, allowing the backend to be swapped without re-instrumenting.
High-cardinality dimensions (per-customer, per-route, per-version) are first-class — observability without high cardinality answers only the easy questions.
The team's instinct in an incident is to query observability data, not to grep logs. The infrastructure that makes that possible is what makes the difference.
Reference
Charity Majors et al. — Observability Engineering — the canonical treatment of observability as a discipline distinct from monitoring, with high-cardinality, high-dimensionality telemetry as the differentiator.
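The "any question, answered in seconds" claim rests on wide, high-cardinality events. A minimal Python sketch (event fields invented for illustration) shows why: slicing slow requests by customer is the same query as slicing by route or by version, because every dimension is first-class:

```python
from collections import Counter

def slowest_by(events, dimension, threshold_ms=500):
    """Group slow requests by an arbitrary high-cardinality dimension.

    `events` are wide structured events, one dict per request. Because every
    field is queryable, "which customer hits the slow path?" is the same
    one-liner as "which route?" or "which build version?".
    """
    return Counter(e[dimension] for e in events
                   if e.get("duration_ms", 0) > threshold_ms)

events = [
    {"route": "/checkout", "customer": "acme", "version": "v42", "duration_ms": 900},
    {"route": "/checkout", "customer": "acme", "version": "v42", "duration_ms": 850},
    {"route": "/health",   "customer": "beta", "version": "v42", "duration_ms": 3},
]
print(slowest_by(events, "customer"))  # Counter({'acme': 2})
```

Pre-aggregated metrics cannot answer this after the fact: if "customer" was not a label at collection time, the question is unanswerable. Wide events keep the dimension available for questions nobody predicted.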
"Shift left" — finding security issues earlier in the lifecycle — has become a phrase that hides the work it entails. The actual discipline includes static analysis (SAST) integrated into CI, software composition analysis (SCA) scanning for vulnerable dependencies, secrets scanning preventing credentials from entering the repo, dynamic analysis (DAST) of running applications, infrastructure scanning before apply, and supply-chain attestation (SLSA) verifying that what's deployed is what was built. Compliance — SOC 2, ISO 27001, HIPAA, PCI — used to be a once-a-year audit; modern compliance is continuous, with controls that are evidenced automatically rather than reconstructed annually.
Architectural implications
SAST, SCA, and secrets scanning run on every PR; security findings are addressed before merge, not deferred.
Container images are scanned for vulnerabilities; signed at build; verified at deploy; the chain of custody is auditable.
SLSA provenance is generated for build artefacts; the deployment system verifies provenance before running production workloads.
Compliance controls are encoded as policy-as-code (OPA, Sentinel) and evaluated continuously, not assembled for audit.
Reference
SLSA Framework — the industry standard for supply-chain integrity, originating at Google and adopted across the cloud-native ecosystem.
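Deploy-time verification reduces to digest-plus-builder checks. This Python example is a deliberately simplified stand-in for SLSA provenance verification (field names and the trusted-builder list are hypothetical; real verification also checks signatures over the attestation):

```python
import hashlib

TRUSTED_BUILDERS = {"ci.example.com/release-pipeline"}  # hypothetical builder ID

def verify_artifact(artifact_bytes, attestation):
    """Check a deploy-time artifact against its build-time attestation.

    The attestation records the digest the build produced and the builder
    that produced it; deployment refuses anything whose digest does not
    match, so a tampered artifact never runs.
    """
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    if digest != attestation["subject_digest"]:
        return False, "digest mismatch: artifact is not what was built"
    if attestation["builder"] not in TRUSTED_BUILDERS:
        return False, f"untrusted builder: {attestation['builder']}"
    return True, "ok"

image = b"\x7fELF...binary contents..."
attestation = {
    "subject_digest": hashlib.sha256(image).hexdigest(),
    "builder": "ci.example.com/release-pipeline",
}
ok, reason = verify_artifact(image, attestation)
```

In practice this check runs as an admission policy in the deployment system, not as application code — but the logic is the same: no verified provenance, no production workload.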
6. Continuous deployment is a product capability, not an engineering nicety
Small batches reduce risk. The empirical evidence is overwhelming — teams that deploy daily have lower change-failure rates and shorter mean-time-to-restore than teams that deploy quarterly. This isn't because daily deployment is magic; it's because daily deployment forces every engineering practice the prior principles describe: reliable pipelines, IaC, observability, and security automation. Treating continuous deployment as a target — an explicit product capability, with the engineering investment to support it — is what aligns the rest of the practices around a measurable goal. Without that target, "we should improve our pipelines" is a wish; with it, the gap between current state and target is concrete and can be closed.
Architectural implications
Deploy frequency, change-failure rate, lead time, and mean-time-to-restore (the DORA metrics) are measured and visible.
The team has documented goals for each metric and reviews progress quarterly; gaps drive engineering investment.
Feature flags decouple deploy from release — code can be deployed inert and activated separately, removing the pressure to perfect every deploy.
Reference
DORA — Accelerate State of DevOps — the multi-year, peer-reviewed research showing that continuous-delivery practices correlate with both engineering quality and organisational performance.
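The DORA metrics are straightforward to compute once deploys are recorded. A sketch in Python (the record schema is an assumption; real definitions vary, e.g. lead time is often measured from first commit rather than per deploy):

```python
from datetime import datetime, timedelta

def dora_metrics(deploys, period_days=30):
    """Compute DORA metrics from a list of deploy records.

    Each record: {"at": datetime, "failed": bool, "restored_at": datetime|None,
    "lead_time": timedelta}. "restored_at" is set once a failed change was
    remediated (rollback, fix-forward, or flag toggle).
    """
    n = len(deploys)
    failures = [d for d in deploys if d["failed"]]
    mttr_values = [d["restored_at"] - d["at"] for d in failures if d["restored_at"]]
    return {
        "deploy_frequency_per_day": n / period_days,
        "change_failure_rate": len(failures) / n if n else 0.0,
        "mean_lead_time": sum((d["lead_time"] for d in deploys), timedelta()) / n
                          if n else timedelta(),
        "mttr": sum(mttr_values, timedelta()) / len(mttr_values)
                if mttr_values else None,
    }

deploys = [
    {"at": datetime(2025, 1, 10), "failed": False, "restored_at": None,
     "lead_time": timedelta(hours=2)},
    {"at": datetime(2025, 1, 20), "failed": True,
     "restored_at": datetime(2025, 1, 20, 1), "lead_time": timedelta(hours=4)},
]
m = dora_metrics(deploys)
```

The value is less in the arithmetic than in the recording: a team that cannot produce this list of deploy records cannot honestly claim to know its own delivery performance.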
Flowchart
```mermaid
flowchart LR
%% ─── Source ───
GIT([Git Repository code · IaC · pipelines])
%% ─── CI ───
subgraph CI [CI — build, test, attest]
direction TB
BUILD[Build]
TEST[Test · SAST · SCA]
SIGN[Sign · SLSA Provenance]
end
%% ─── Artifact ───
REG[(Artifact Registry signed images)]
%% ─── GitOps controller ───
GITOPS[GitOps Controller Argo CD · Flux]
%% ─── Environments ───
subgraph ENVS [Environments — reconciled from git]
direction TB
DEV[Dev]
STG[Staging]
PROD[Production]
end
%% ─── IaC ───
IAC[Terraform / Pulumi infrastructure as code]
%% ─── Observability ───
OBS[OpenTelemetry logs · metrics · traces]
%% ─── Policy ───
POL[Policy as Code OPA · Sentinel]
%% ─── Flow ───
GIT --> BUILD
BUILD --> TEST
TEST --> SIGN
SIGN --> REG
GIT --> GITOPS
REG --> GITOPS
GITOPS --> DEV
GITOPS --> STG
GITOPS --> PROD
IAC -.provisions.-> ENVS
POL -.gates.-> GITOPS
POL -.gates.-> CI
ENVS -.telemetry.-> OBS
%% ─── NODE STYLES ───
style GIT fill:#4A90E2,stroke:#2C5F9E,stroke-width:2px,color:#FFFFFF
style BUILD fill:#DCEAFC,stroke:#2C5F9E,stroke-width:1.6px,color:#1A2840
style TEST fill:#DCEAFC,stroke:#2C5F9E,stroke-width:1.6px,color:#1A2840
style SIGN fill:#FBEDE0,stroke:#C96330,stroke-width:1.8px,color:#3A2410
style REG fill:#FEF3C7,stroke:#92722C,stroke-width:1.8px,color:#5C4316
style GITOPS fill:#FBEDE0,stroke:#C96330,stroke-width:2px,color:#3A2410
style DEV fill:#FEF8E5,stroke:#A8956A,stroke-width:1.4px,color:#5C4316
style STG fill:#FEF3C7,stroke:#92722C,stroke-width:1.5px,color:#5C4316
style PROD fill:#FEF3C7,stroke:#C96330,stroke-width:2px,color:#5C4316
style IAC fill:#E5ECF3,stroke:#5A7A9A,stroke-width:1.5px,color:#2C4258
style OBS fill:#E8E8E2,stroke:#5C5C5C,stroke-width:1.4px,color:#2C2C2C
style POL fill:#FBEDE0,stroke:#C96330,stroke-width:1.5px,color:#3A2410
%% ─── CLUSTER STYLES ───
style CI fill:#EDF4FC,stroke:#A8C0DD,stroke-width:1.2px,color:#1A2840
style ENVS fill:#FEF8E5,stroke:#D4C599,stroke-width:1.2px,color:#5C4316
```
The diagram above shows a canonical DevOps architecture: a CI pipeline triggered by code changes, building artefacts and generating provenance; a CD pipeline (or GitOps controller) deploying to environments; IaC managing the underlying infrastructure; observability collecting telemetry across the stack; policy-as-code enforcing compliance continuously.
Common pitfalls
Pipeline as a one-off script
The CI/CD pipeline started as a build script someone wrote, with no tests, no review, no monitoring. When it breaks, no one knows why; when it's slow, no one investigates; flake is normalised. The pipeline is the most-run code in the system and the least-engineered.
What to do instead
The pipeline is engineering work with engineering discipline — versioned in the repo, reviewed in PRs, monitored for flake and duration, owned by a named team.
IaC for new things, manual for old
Greenfield projects use Terraform; legacy infrastructure is "too risky" to migrate, so it stays manual. The two halves diverge in operational characteristics, the manual half consumes disproportionate time, and the team gets the worst of both worlds.
What to do instead
A migration plan that brings legacy infrastructure into IaC over quarters. Drift detection covers the entire estate; the manual exception is documented and time-bounded.
Observability as logs
A logging system is treated as the entire observability story. Metrics are bolted on later for SLOs; tracing is "for when we have time." Investigating production issues means grepping through logs across services with correlation IDs that may or may not exist.
What to do instead
OpenTelemetry from day one — logs, metrics, traces as a single instrumented stream. The cost of doing this from the start is small; the cost of retrofitting is large.
Security as a gate, not a stream
Security is the audit at the end of the cycle. Findings arrive late; fixing them disrupts release plans; the team's instinct is to negotiate severity rather than fix.
What to do instead
Security in the pipeline — SAST, SCA, secrets scanning, supply-chain attestation. Findings appear at PR time, when the cost of fixing is lowest. The audit becomes a verification of what's already done.
Continuous deployment without continuous testing
The team adopts continuous deployment because it sounds modern, without the test coverage, observability, or feature-flagging that make it safe. Every deployment is a roll of the dice; production incidents trace back to deployments; the team retreats to release windows.
What to do instead
Continuous deployment is the result of the prior practices, not a starting point. Build the safety nets — tests, observability, feature flags, fast rollback — and continuous deployment becomes the natural mode rather than a bet.
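The deploy-vs-release decoupling can be sketched with a minimal flag store (all names here are hypothetical; real systems use a flag service such as LaunchDarkly or the OpenFeature SDK): the new code path ships dark and is activated, or rolled back, at runtime:

```python
class FlagStore:
    """In-memory stand-in for a feature-flag service.

    Code ships with the new path dark; turning it on (or off again) is a
    runtime flag change, not a redeploy, so rollback becomes a toggle.
    """
    def __init__(self):
        self._flags = {}

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        return self._flags.get(name, False)  # unknown flags default to dark

flags = FlagStore()

def price_v1(cart):
    return sum(cart)

def price_v2(cart):
    return round(sum(cart) * 0.95, 2)  # hypothetical new pricing logic

def checkout(cart):
    # Both code paths are deployed; the flag decides which one runs.
    if flags.is_enabled("new-pricing-engine"):
        return price_v2(cart)
    return price_v1(cart)
```

The same mechanism supports percentage rollouts and per-customer targeting, which is what turns "every deployment is a roll of the dice" into a controlled, reversible exposure.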
Adoption checklist
1. ☐ Pipelines are versioned, reviewed, and operated as engineering artefacts
Pipeline-as-code in the repo; tests for the pipeline itself; flake rates and duration tracked; pipeline incidents diagnosed, not retried until green.
2. ☐ All infrastructure is defined in IaC (Terraform / Pulumi / CDK)
Manual infrastructure changes are exceptions, documented and reverted; drift detection runs continuously; environments differ in size and config, not shape.
3. ☐ Production state is reconciled from git via a GitOps controller
Argo CD or Flux watches the deployment branch; merges trigger reconciliation; the git log is the audit trail of what's running.
4. ☐ OpenTelemetry instrumentation is the standard, not an opt-in
Logs, metrics, and traces are correlated by trace ID; high-cardinality dimensions are first-class; the team queries telemetry to answer questions.
5. ☐ DORA metrics (deploy frequency, lead time, change-failure rate, MTTR) are measured and visible
Numbers are real, not aspirational; goals are documented; quarterly reviews drive engineering investment.
6. ☐ SAST, SCA, and secrets scanning run on every PR
Findings are addressed before merge; the security team and engineering team see the same data; the conversation is "fix", not "negotiate severity".
7. ☐ Container images are signed at build, scanned for vulnerabilities, and verified at deploy
The chain of custody from source to running container is auditable; SLSA provenance is generated and verified.
8. ☐ Compliance controls are encoded as policy-as-code (OPA / Sentinel)
Continuous evaluation rather than annual reconstruction; drift between intended controls and effective controls is detected.
9. ☐ Feature flags decouple deploy from release for risky changes
Code can deploy inert and activate separately; rollback is a flag toggle, not a redeploy; the deploy-vs-release distinction is operationally real.
10. ☐ Incident response is exercised, not just documented
Runbooks are current; on-call rotations are sustainable; postmortems happen for every meaningful incident and the actions are tracked to closure.