Good Architecture: Data Model vs. Data Flow

Most architectural mistakes I’ve seen in software stem from a mistake either in the domain model or the data flow. Understanding what each of those two things is, how to do them both well, and how to balance the tensions between them is an essential skill every developer should invest in.

Let’s use an example to talk to expand on this.

Financial Transactions

Let’s imagine we’re building a personal finance product. A user has a set of financial transactions (Transaction). Each transaction has a dollar amount, happens on a date, in a financial account (Account) and is labeled with a category (Category).

Further, we know a few other things:

  • The balance of an account at any point in time is always the sum of all transactions up and until that time.
  • Users may want to add, remove or edit transactions at any point.
  • Users will want to see the balance of their accounts at any point in time, and how the balances change over time.
  • Users will want to slice and dice their cash flow, too. They will want to see the sum of their transaction amounts between certain dates, for certain categories, and for certain accounts, and they may want to group that data too (for instance, a user might want to see how much they’ve spent by category, each month over the past 12 months).

Sounds pretty straightforward so far. But let’s dig in.

Domain-Driven Design

When it comes to modeling your domain, the seminal idea is Domain-Driven Design (DDD). The fundamental idea behind DDD is to map entities in your software to entities in your “business domain”. Parts of this process are pretty natural. For instance, we’ve already started doing that above (entities for a Transaction, an Account, and a Category all naturally fell out of just describing what users want to do).

But domain-driven design doesn’t stop there. It requires technical experts and “domain experts” to constantly iterate on that model, refining their shared model and then updating the software representation of that code. This can happen naturally as you evolve your product and use-cases, but often, it’s a good idea to trigger it up front through in-depth discussion and questioning of how the model could accommodate future use-cases.

For example, here are some questions that might help us refine our model, and some possible answers.

For starters, here’s one: what if an account has a starting balance? How do we represent that? Does that violate our initial assumption that an account’s balance is the sum of all its transactions? The answer depends on how you model your domain.

For some products, it might make sense to add a starting_balance field to your Account entity. A more “pure” approach might be to keep the initial invariant (that an accounts balance is the sum of all transactions), but refine things so that starting balances are actually a special type of Transaction (with some invariants around that—for instance, an Account can only have one starting balance Transaction, and it must be on the date the Account is opened). But this is good, we’re domain-modeling now! We’re rethinking some of our assumptions, and that’s pushing us to think more deeply about our understanding of the model.

Here’s a trickier one: what if a transaction occurs between two accounts? In our current model, we’d actually have two transactions (one leaving the first account, and one entering the second one). That might be fine in many applications, but if you’re an accounting product, you might realize that this model can introduce some inconsistencies. What if one transaction is missing? In the real world, money flows from some place to another. Maybe every transaction requires two accounts (from_account and to_account). A domain expert on your team would now point out that you’re brushing up against double-entry accounting. We don’t need to go down that route, but you can see how a question prompted us to revisit our understanding of the model.

This is just an overview of domain-driven design. You can read a lot more about it on Wikipedia, or by reading Eric Evans’ classic book, but at a high level, in domain-driven design you create a “bounded context” for your domain model, iterate on your understanding of the domain model, come up with a “ubiquitous language” to describe that model, and constantly keep your software entities in sync with that domain model and language.

Data Flow Design

Data flow design takes a bit of a different approach. Instead of focusing on the entities, you focus on the “data”. Now, you might argue that data and entities are the same, or should be the same, and in an ideal world they would be, but software has real-world limitations set by the technology that enables it. Things like locality, speed, and consistency start to rear their heads.

Let’s apply that to our example above. Again, we had already naturally started doing some data flow design in defining our original problem: all of the “users will want to…” statements are about data flow. For example, let’s consider the balances question: “users will want to see the balance of their accounts at any point in time, and how the balances change over time.”

Our model dictates that balances are derived from transactions. How do we respond to a query like “what was the balance every day over the past year for a user’s account?” The simplest way could be to always derive, on-the-fly, the balances of an account by walking through all its transactions. That way, if anything in the underlying transactions change, the balances are always consistent. But this is where technical limitations start to hit us. Can we do that calculation fast enough when we get the query? What if the query is something like “out of the 10 million accounts in the system, show me all accounts for which the balance exceeded $10,000 on any day in the past 5 years”?

You probably already have solutions simmering in your head. Caching for faster queries. Updating balances whenever transactions change. Some additional data store that makes it easy/fast to index and execute queries like that. But you’re no longer just thinking about the domain model. You’re thinking about the data.

To do data flow design well, you need to think through a few dimensions. The first is read vs. write data paths. Clearly, when transactions are changed, balances need to change to reflect that. Should that happen on write, when a transaction is updated? Or should it happen on read? (should we lazily only do the work when we know we need it). Or should we do it asynchronously in between so that we can have fast reads and fast writes, while sacrificing some consistency.

Next, you need to think through read vs. write patterns. How frequent are writes? How frequent are reads? Are they varied or skewed? Depending on the answer, you might be OK doing more work on write, or you might be OK doing more work on read. Or you might introduce something like caching if a lot of reads are similar. Or, you might go full on Command Query Responsibility Segregation.

You’ll also need to think through your consistency requirements. We’ve already hinted at that above, but maybe you can offload some work if you’re OK with data you read being a little out of sync with the data you write. You can use asynchronous or batching models.

Finally, there’s a question around where invariants should live. In modeling the domain, you usually end up with some “invariant”: things that should always be true. These invariants work like constraints, giving you assumptions you can trust throughout the life cycle of any entity or the data representing it (like, the balance of an account is the sum of all its transactions, or an account can only have one starting balance transaction). But when thinking about data flow, you need to worry about how to check and enforce those constraints. Should that happen in the application layer? In the data storage layer?

A full exploration of what this means in practice is beyond our scope here, but the main point is that in addition to our nice, clean domain model, we also have all this extra logic that is not part of our domain. It’s just a function of technological limitations. That’s the tension.

(The best resource I’ve found on thinking about data flow, especially at scale, is Martin Kleppmann’s Designing Data-Driven Applications)

The Balance

I’ve found that most software engineers start their careers with a bias either towards data model or data flow. As two extremes, consider:

  • The data model purist: Spends an exorbitant amount of time thinking through and modeling the domain before writing a line of code. Draws a lot of diagrams, possibly of database schemas. Gets really frustrated at implementation time because the data flow reality sets in and they realize they will need to “corrupt” their model.
  • The data pragmatist: Thinks through the end-to-end data flow really well, quickly writes code and spins up multiple data services. Was big on “polyglot persistence” when that was a word. Has figured out how to parallelize / partition things before figuring out what those things are.

Many people start off as one of those two, overlooking the other side of the equation, then learn through experience that you have to think about both from the get go.

I find that to strike a good balance, it’s best to do design in an iterative fashion. First, of course, you need a really solid understanding of the underlying problem you’re trying to solve and why it needs to be solved. Then, you take turns thinking through the domain model, and the data flow.

  • Write or sketch out a quick data model.
  • Map it to the problem space: does it represent the domain well? Does it support what the product needs to do now and do later? Fiddle the requirements a little bit. Does the model hold up?
  • Now map the data flow. Look at the UI and what data needs to be shown. Think about the interactions that need to happen and what data needs to be changed. Now think about how that would work at a much larger scale.

Rinse and repeat. Pull in some colleagues, get feedback, and continue repeating again. And even when you start writing code, you still keep iterating.

You should start with a slight bias towards getting the data model right, and worry more about data flow as you gain confidence in your data model, and as you start to hit the performance problems that only show up once you have enough scale and once your product is complex enough. But you always keep both concepts (the data model, and the data flow design) top of mind as you’re working.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: