Data Patterns
Architecture for systems where data outlives the applications that created it — the patterns for ownership, evolution, consistency, and lineage that keep data debuggable, evolvable, and trusted across years and stores.
A data-as-app-state approach treats persistence as something the application does. The app owns the database schema, the app team handles migrations, and "data architecture" means whatever ORM the team picked. When a second application needs the same data, the simplest path is granting it database access — and the architecture's first major mistake has been made.
A data-architecture approach treats data as a first-class concern with its own architecture, separate from the applications that read or write it. Each persistent fact has a named owner. Each consumer accesses through a defined contract, not through a shared database connection. The data has shape (schema), consistency rules (transactional or eventual), classification (master, reference, transactional, analytical), and history (lineage). All of these are architectural decisions that survive across applications, refactors, and platform changes.
The architectural shift is not "we use a fancier database." It is: data outlives the applications that created it; design accordingly.
A data fact without a named owner is a data fact that will eventually be inconsistent. Multiple writers create irreconcilable state. When two systems both believe they are authoritative for a customer's email address, eventually one of them is wrong, and there is no architectural authority to resolve the conflict. The first data architecture decision is naming the source of truth for every persistent concept — who writes it, who propagates it, who arbitrates when they disagree.
For your most-debated piece of data — the one where two systems regularly disagree — name the system that is supposed to be authoritative. If the answer is "we sync them periodically," there is no source of truth, only two opinions and a sync job that loses every race.
The principle predates the relational database. It is documented across Patterns of Enterprise Application Architecture (Fowler) and Designing Data-Intensive Applications (Kleppmann), and forms the basis of every master-data-management discipline.
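The idea of a named source of truth can be made concrete as a small registry that maps every persistent concept to its one authoritative writer. This is an illustrative sketch, not a real library; the service names and concept keys are invented for the example:

```python
# Hypothetical source-of-truth registry: every persistent concept names
# exactly one authoritative writer. Everything else is a replica or a
# projection, never a second opinion.
SOURCE_OF_TRUTH: dict[str, str] = {
    "customer.email": "identity-service",
    "order.status": "order-service",
    "price.current": "catalog-service",
}

def authoritative_system(concept: str) -> str:
    """Return the one system allowed to write this concept.

    A missing entry is itself the finding: data with no named owner
    will eventually drift into inconsistency.
    """
    try:
        return SOURCE_OF_TRUTH[concept]
    except KeyError:
        raise KeyError(f"no named owner for {concept!r}: this data will drift")
```

Even this much forces the useful argument: adding an entry means naming the winner of every future conflict, before the conflict happens.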
Domain-Driven Design's most operationally consequential idea is that transactions span exactly one aggregate. Within the aggregate, invariants hold strongly: an order's total equals the sum of its line items, a balance never goes below zero, an entity's ID is unique. Across aggregates, consistency is eventual — and it is designed for, not assumed. Most database performance crises and most distributed-systems consistency disasters trace back to teams not making this choice deliberately.
Pick a database transaction in your system that involves more than one table. Do those tables belong to the same aggregate? If they belong to different aggregates, you have either an aggregate boundary in the wrong place or a transaction that will eventually break under contention.
Eric Evans's Domain-Driven Design introduced the aggregate concept. Vaughn Vernon's Implementing Domain-Driven Design samples show its operational consequences in working code.
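The "invariants hold inside the aggregate" rule can be sketched as a small order aggregate, in the spirit of the DDD examples cited above. The class and field names here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int

    @property
    def subtotal_cents(self) -> int:
        return self.quantity * self.unit_price_cents

class Order:
    """Aggregate root: invariants are checked here, and only here.

    Anything inside this boundary changes in one transaction; anything
    outside it (inventory, billing) is eventually consistent by design.
    """

    def __init__(self, order_id: str):
        self.order_id = order_id
        self._items: list[LineItem] = []

    def add_item(self, item: LineItem) -> None:
        # Invariant enforced at the aggregate boundary, not by callers
        if item.quantity <= 0:
            raise ValueError("quantity must be positive")
        self._items.append(item)

    @property
    def total_cents(self) -> int:
        # Derived, never stored separately: it cannot drift from the line items
        return sum(i.subtotal_cents for i in self._items)
```

The point of the sketch is the boundary: `total_cents` can never disagree with the line items, because both live inside the same aggregate and change together.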
The shape that's good for writes — normalised, transactionally consistent, easily validated — is rarely the shape that's good for reads — denormalised, fast to scan, pre-joined for the access patterns that matter. A single model serving both compromises both: queries get slow, writes get complex, every optimisation creates a new tax somewhere else. CQRS (Command Query Responsibility Segregation) is the named pattern; the underlying principle is older. Splitting read and write models isn't always required, but the architecture must allow it when it becomes required.
Is your most complex query running on the same schema that handles your highest-volume writes? If yes, every query optimisation compromises every write, and vice versa. The teams owning writes and the teams owning reads are paying each other's tax with every change.
Martin Fowler, CQRS. Event Sourcing is a related but distinct pattern that often appears alongside CQRS — the event log becomes the write model, and projections become the read models.
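The write/read split can be sketched in a few lines: the write side validates commands and appends events; the read side is a projection rebuilt from those events into whatever shape a query needs. This is a toy in-memory sketch, not a framework; the names are invented:

```python
from collections import defaultdict

class AccountWriteModel:
    """Write side: commands validate invariants, then emit events."""

    def __init__(self) -> None:
        self._balances: dict[str, int] = defaultdict(int)
        # Each event: (event_type, account, amount_cents)
        self.events: list[tuple[str, str, int]] = []

    def deposit(self, account: str, cents: int) -> None:
        self._balances[account] += cents
        self.events.append(("deposited", account, cents))

    def withdraw(self, account: str, cents: int) -> None:
        if self._balances[account] < cents:  # invariant: no overdraft
            raise ValueError("insufficient funds")
        self._balances[account] -= cents
        self.events.append(("withdrew", account, cents))

def project_activity_counts(events: list[tuple[str, str, int]]) -> dict[str, int]:
    """Read side: a projection shaped for one query (activity per account),
    rebuilt from the event stream whenever the query's needs change."""
    counts: dict[str, int] = defaultdict(int)
    for _event_type, account, _amount in events:
        counts[account] += 1
    return dict(counts)
```

The design point: the projection can be reshaped, reindexed, or thrown away and rebuilt without touching the write path, because the events — not the projection — are the record.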
Schemas change. Pretending otherwise is the most common, most expensive mistake in data architecture. Migrations of running systems are expensive, risky, and frequently require coordinated downtime. Additive change — adding optional fields, new tables, new event types — is cheap when the system is designed for it. The cost of designing for evolution is paid once; the cost of not designing for it is paid every quarter, in migration projects nobody enjoys.
Add a new optional field to the most central entity in your system. How many places must change before the system runs again? If the answer is "every consumer," the architecture has not been designed for schema evolution — every change will hurt, and many will be deferred until they hurt more.
Martin Kleppmann, Designing Data-Intensive Applications, Chapter 4 ("Encoding and Evolution"), is the modern reference. Apache Avro and Protocol Buffers codify the practice in their schema systems; their compatibility rules are worth reading even if you use neither.
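The additive-change discipline shows up concretely in the tolerant-reader shape that Avro and Protocol Buffers codify: new fields are optional and carry defaults, so records written before the field existed still parse. A minimal sketch, with invented field names and version history:

```python
def read_customer(record: dict) -> dict:
    """Tolerant reader: required fields are the ones every version has
    ever had; anything added later is optional with a default, so old
    records and new readers stay compatible."""
    return {
        "id": record["id"],                    # required since v1
        "email": record["email"],              # required since v1
        "locale": record.get("locale", "en"),  # added in v2: optional, defaulted
    }
```

The compatibility rule this encodes is the one Avro and protobuf enforce mechanically: adding an optional, defaulted field is safe in both directions; adding a required field breaks every record already written.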
One database type serving everything is one database type compromising everything. Relational stores excel at transactional consistency and complex joins; columnar stores excel at analytics; document stores excel at flexible-schema reads; search engines excel at text and faceted queries; graph stores excel at multi-hop relationships; key-value stores excel at simple, high-throughput access. Polyglot persistence is choosing each based on its access patterns, paying the cost of operational complexity deliberately rather than accidentally.
List the data stores in your production system. For each one beyond the first, name the access pattern that justifies its presence. If the honest answer for any of them is "we wanted to try it," that store has no business case — and probably no engineer who can debug it under pressure.
Martin Fowler, Polyglot Persistence. Designing Data-Intensive Applications provides the operational depth on each store family and how to reason about the choice.
Where did this fact come from? What did it look like yesterday? Who changed it? When? Without answers to these, every incident becomes forensic guesswork — and every regulatory inquiry becomes a panic. Data lineage and provenance are not an audit feature added before a compliance review; they are properties the architecture either has or doesn't, designed in from the start or absent thereafter.
Pick a metric on your most important dashboard. Trace it back to source records — through every transformation, every join, every filter — without asking another engineer. How long does it take? If the answer is "we'd have to read code," lineage isn't part of the architecture; it lives in the heads of three people who are too busy to write it down.
OpenLineage is the emerging open standard for cross-platform lineage. The discipline pre-dates it: Pat Helland's work on data architecture has shaped the conversation for decades and remains essential reading.
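What "designed in from the start" means mechanically: every derived record carries its provenance — which sources produced it, through which transform, when. A hand-rolled illustration of the idea (real systems would emit OpenLineage events rather than inline metadata; the field names here are invented):

```python
import hashlib
import time

def with_lineage(output_row: dict, sources: list[dict], transform: str) -> dict:
    """Wrap a derived row with provenance: which source records produced
    it, through which named transform, and when. Answering "where did
    this fact come from?" becomes a lookup, not an investigation."""
    source_ids = sorted(s["id"] for s in sources)
    return {
        "data": output_row,
        "lineage": {
            "sources": source_ids,
            "transform": transform,
            "produced_at": time.time(),
            # Fingerprint of inputs makes "did the sources change since
            # this was computed?" answerable later
            "input_hash": hashlib.sha256("|".join(source_ids).encode()).hexdigest()[:12],
        },
    }
```

Per-pipeline, this is a few lines at each boundary crossing; retrofitted across a running system, it is an excavation.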
The diagram below shows the canonical write-side / read-side data architecture: an aggregate root holds the source of truth in a transactional store; events propagate every change through a stream; multiple read-side projections (search, analytical, cache) serve different access patterns; lineage captures every flow that crosses a boundary.
Multiple services writing to the same tables, on the theory that "all the data should live in one database." This is the fastest path to a distributed monolith — every schema change couples every service that touches it, and "we'll just coordinate" doesn't survive five teams. The cure looks like more work; the disease is more work, paid in slow incidents.
Each service owns its tables; cross-service access is through APIs, events, or read-only replicas — never through shared write access. Treat the database as an internal implementation detail of the owning service, not as a public integration surface.
Choosing eventual consistency reflexively because someone said "scale," then accepting all the resulting complexity in the application layer — order ID lookups that sometimes fail, balances that briefly disagree, customers seeing yesterday's price. Eventual consistency is a real architectural choice with real costs; accepting it for capabilities that don't need it imports those costs for no benefit.
Classify each capability by its consistency needs. Strong consistency where invariants must hold (charges, inventory commits, identity assignments). Eventual where it's acceptable (analytics, recommendations, search indexes). Hybrid where it makes sense (write-strong, read-eventual). The choice is per-capability, not per-system.
Big-bang schema changes that require coordinated downtime — usually because the schema was designed for the past and never evolved. Each migration becomes its own project, with stakeholders, project managers, and a date that slips. Three or four of these and the team will avoid changing the schema at all, which means the schema lies about reality forever.
Online schema evolution as a routine practice — add fields without breaking consumers, backfill in batches, rename through deprecation cycles. Tooling (gh-ost, pt-online-schema-change, online migration libraries) makes this routine; additive-change discipline keeps it safe.
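The backfill half of that practice has a simple shape: after adding the new nullable column, populate it in small batches so each transaction stays short and never blocks the write path. A sketch of the batching driver only — the actual per-batch statement (e.g. an `UPDATE ... WHERE id = ANY(...)`) is passed in and omitted here:

```python
from typing import Callable, Iterable

def backfill_in_batches(
    row_ids: Iterable[int],
    apply_batch: Callable[[list[int]], None],
    batch_size: int = 1000,
) -> None:
    """Backfill a newly added column in small batches.

    Each call to apply_batch runs one short transaction (e.g. an UPDATE
    over ~batch_size rows), so locks are held briefly and the live write
    path is never blocked for long. Pausing and resuming is trivial:
    just restart from the last completed id.
    """
    batch: list[int] = []
    for row_id in row_ids:
        batch.append(row_id)
        if len(batch) >= batch_size:
            apply_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        apply_batch(batch)
```

Tools like gh-ost and pt-online-schema-change do the same thing with more care (throttling on replica lag, cut-over handling), but the batching discipline is the core of why they are safe.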
Treating data lineage as a feature to add when regulators ask, rather than a property the architecture should have from day one. The cost of retrofitting lineage across an existing system is roughly equivalent to rewriting it; the cost of designing it in from the start is closer to free.
Capture lineage at every cross-boundary data flow from the start. The cost is small per pipeline; the cost of having it later when an incident or audit demands it is enormous. Treat lineage like observability — engineered in, not bolted on.
Adding new data store types because they're interesting, not because the access pattern requires them. The result is operational complexity nobody owns: a Cassandra cluster nobody patches, an Elasticsearch deployment three engineers know how to debug, a graph database whose schema lives in one person's head and disappears when they leave.
Each new store added must answer "what access pattern requires this that the existing stores cannot serve?" — and have an explicit owner accepting operational responsibility before deployment. Resume-driven persistence is a real failure mode; name it in design review when you see it.
Without an owner, every schema change is a multi-team negotiation. With one, change is routine. Look at your data catalogue (or wiki, or Slack channel where data questions land) — for each major data class, a single team should be the unambiguous answer to "who owns this?"
Aggregate boundaries are not implied by table names. They define what transactions are atomic versus eventually consistent. If aggregates are not documented explicitly, any new schema change will accidentally violate one and nobody will notice until an incident exposes the inconsistency.
Migrations as one-off projects with dedicated downtime windows are the surface symptom of brittle data architecture. Online migration tools and additive-change discipline turn migrations into low-risk routine changes the team runs every Tuesday without ceremony.
A single shared model serving both reads and writes forces a series of compromises, both in query performance and in write integrity. Separating them when needed must be an option the architecture preserves, not a refactor attempted under pressure during a performance incident.
The handling rules differ by classification — transactional needs ACID, reference data can be cached, analytical can be batch, master data needs cross-system propagation. Without classification, the same rules get applied to all, and nothing fits.
"We use Postgres for everything" might be the right answer, but it should be a documented choice, not a default. ADRs that name the alternatives considered (and why rejected) prevent the choice from being relitigated every six months when a new senior engineer joins.
When data flows across stores, the consistency model must be chosen: strong, eventual, causal. Eventual is the default in distributed systems, but "default" doesn't mean "free" — application code has to handle it. Choosing forces the team to think about failure modes; not choosing means they will think about them during an incident instead.
Without lineage, debugging cross-system data issues is forensic archaeology. With it, the team can answer "where did this fact come from" in seconds. The cost of retrofitting lineage across an existing system is large; the cost of designing it in is small.
Policy alone doesn't prevent regulated data leaking into logs, analytics tables, or developer environments. Schema-level tagging plus automated enforcement (linters, masking-on-export, policy-as-code) makes the right thing the easy thing — and the wrong thing harder than the right one.
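"Schema-level tagging plus automated enforcement" can be as simple as a classification map checked in CI: each field carries a tag, and any export path that would emit a regulated field without masking fails the build. An illustrative sketch with invented table and tag names:

```python
# Hypothetical schema-level classification: each field carries a tag
# that tooling, not policy documents, enforces.
SCHEMA: dict[str, dict[str, str]] = {
    "customer": {
        "id": "internal",
        "email": "pii",
        "signup_date": "internal",
        "card_last4": "pii",
    },
}

def check_export(table: str, columns: list[str]) -> list[str]:
    """Return the columns an unmasked export would leak.

    A CI linter (or a policy-as-code gate on the export pipeline) fails
    when this list is non-empty, making the wrong thing harder than the
    right one.
    """
    tags = SCHEMA[table]
    return [c for c in columns if tags.get(c) == "pii"]
```

The same map can drive masking-on-export and environment scrubbing, so the classification is written once and enforced everywhere.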
Untested DR is theatre. The team must have actually backfilled, rolled back, and recovered, on a system close enough to production, recently enough that the procedure still applies. If DR has not been exercised this quarter, assume it will fail when it matters most.