Distributed Systems — Key Concepts & Patterns

Waleed Ashraf
Klarna Engineering
Published in
8 min readJan 17, 2023

--

If you are working in a modern tech company, you’ve likely come across the term “distributed systems”. In this blog post, I’ll go through it in two sections, the first section is about the key concepts and timeline and the second section is about some of the patterns being used in the industry. The purpose of this blog post is to share enough basic information and resources that you can dig deep on your own if interested.

Key Concepts & Timeline

Two Generals’ Problem — 1975

Two army generals are trying to coordinate an attack. They should decide a time at which both attack together and only if they attack together will they win. The path between them is covered by the opposition army. One General sends a soldier with the time he’ll attack and the other should respond with Ok. The problem is,

  • What if the soldier dies in the middle before reaching the other side? OR
  • What if the soldier dies on the way back?

This problem is proven to be impossible to solve because of the unreliable communication channel. The problem explains why TCP can’t guarantee state consistency between its endpoints. Instead, to keep waiting for acknowledgment for long periods of time, we use a timeout and try again approach. This term was first mentioned around 1975. It is also often discussed along with Byzantine Fault.

2 Phase Commit (2PC) — 1985

2 Phase Commit is one of the earliest solutions in Distributed Databases but it never got enough popularity because of its drawbacks. In a multi-node database, one node will act as a coordinator and the others will act as participants. The transaction is carried out in two phases (so as the name),

  • Prepare phase: asks each node if it’s able to promise to carry out the update
  • Commit phase: actually carries it out.

The significant drawbacks are that

  • It can go into a blocked state if the coordinator dies after the first phase
  • It’s slow

Martin Fowler has explained 2 Phase Commit in a really nice article.

Consensus & Paxos — 1989

“Paxos guarantees that if one peer believes some value has been agreed upon by a majority, the majority will never agree on a different value.”

Paxos as a consensus protocol was a huge step forward in Distributed Systems. In easy words, Paxos moved towards the quorum majority approach vs agreement from all nodes in 2PC. Paxos or its variations (Raft) are still being used in modern-day Distributed Systems, for example, Chubby, Zookeeper, Consul, etc.
Here is a really good visual representation of the Raft protocol.

Fun Fact: Original paper which published Paxos was so complex for everyone to understand Leslie Lamport published a new paper Paxos Made Simple in 2001.

CAP Theorem — 2000

CAP Theorem explains the three main components of Distributed Systems. It states,

“In a distributed system, you can only have one of the two guarantees, Consistency or Availability when there is a Network Partitioning.”

People are often misguided that it’s to pick any 2 out of 3. That’s not true, as Network Partitioning will always happen in a distributed system. So, we either have to pick Consistency or Availability.

Read further: A plain English introduction to CAP Theorem

Cloud Spanner — 2012

Spanner is a distributed SQL database management and storage service developed by Google and introduced in 2012. Google claims Spanner to be CA (both Consistent and Available) system, thus beating the CAP theorem.

It’s interesting to know that Google’s claim is based on terms that Google uses its own network infrastructure. Thus, minimizing the Network Partitioning to almost zero. This would have not been possible if Google used generally available public networks.

Google’s controlled network gives it the ability to use its TrueTime API for clock synchronization and global consistency.

Exactly Once Delivery

Exactly Once Delivery means that if a message is sent from one system to another, it will reach the other system exactly once, not less, not more.

Martin Kleppmann in his book “Designing Data-Intensive Applications” writes that this term should be rephrased to Effectively Once Processing.

If the system sending the message goes down before getting ACK, it will produce the same message again. To solve this problem, the broker in the middle needs deduplication. Also, if the consumer (receiver) of the message dies before sending an ACK for the message it received, it will receive the same message again. That means it needs to be idempotent. When you look at the whole system, the consumer may receive the same messages more than once, but it’ll make sure processing is only done once and messages received later will not have side effects.

There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery

@mathiasverraes

Exactly-once Support in Apache Kafka by Jay Kreps

Patterns

Event Sourcing

In Event Sourcing, immutable events are stored in an append-only persistent store. The event store can push the raw events to services OR services can fetch the latest state by calling the store. The system is eventually consistent and requires consumers to be idempotent.

For a simple e-commerce site, the Event Sourcing pattern may look like the diagram below. The presentation layer (front-end) will send events in order to a persistent event store. From there, raw events can be published to other services, or, some services can request the latest state.

Pros

  • Events are immutable
  • The event store is an append-only log
  • The event store provides an audit trail
  • Re-play anytime to get the latest state

Cons

  • Event granularity is never clear
  • Schema updates are hard (require versioning)
  • No, Undo / Updates
  • Require snapshots
  • Require a clear understanding of the business domain

Examples

  • Version Control Systems

Chris Richardson wrote a really nice article about Event Sourcing. You can also read the one by Martin Fowler.

CQRS

Command and Query Responsibility Segregation (CQRS) is a pattern to separate read (Query) and update (Command) operations of the data store. It’s not something directly related to Distributed Systems but is often used with event-based systems, like Event Sourcing.

For the above example, the CQRS model would be like the diagram below. The events from the persistent store will be published to another store (database) in a specific format. The presentation layer can query the database as needed.

Pros

  • Fits well with event-based systems.
  • Performance benefit. Separates the load between Commands and Queries.
  • Models can be totally isolated from each other.

Cons

  • Having two separate models needs logic to keep them consistent. Mostly it’s eventually consistent.
  • The Query model would have a different schema than the Command model.
  • The basic idea of CQRS is simple. But it can lead to a more complex application design, especially if they include the Event Sourcing pattern.

Here’s a good explanation of CQRS by Martin Fowler.

SAGA

The Saga design pattern is a way to manage data consistency across microservices in distributed transaction scenarios. A saga is a sequence of transactions that updates each service and publishes a message or event to trigger the next transaction step. If a step fails, the saga executes compensating transactions that counteract the preceding transactions.

Orchestration is a way to coordinate sagas where a centralized controller tells the saga participants what local transactions to execute. The saga orchestrator handles all the transactions and tells the participants which operation to perform based on events.

Here is an example of a SAGA system. A request comes to one service (Service A), and it starts the transaction. The orchestrator sits in the middle to trigger the next transaction in other services. Once the transactions are finished by other services, the orchestrator will notify the first service (Service A) that all required actions have been performed by other services.

Pros

  • Good for complex workflows involving many participants or new participants added over time.
  • Participants are loosely coupled
  • Can handle long-running transactions

Cons

  • Services need to have compensating transactions
  • Additional design complexity
  • This may cause cyclic dependency between services
  • For a complex flow, end-to-end testing can be tricky
  • This can lead to Dumb Services, Smart Orchestrator

Microsoft Azure Architectures has a detailed explanation of SAGA.

CDC / Outbox

The Change Data Capture (CDC) and the Outbox patterns are quite similar. The main concept is that events are derived from the database’s internal logs (i.e WAL). When an action is performed, the service only needs to commit changes in the database and doesn’t need to worry about publishing events within the transaction. The database acts as the source of truth.

In the diagram below we have a simple example of the `users` table in a service. On every change in the `users` table, a relevant message is added to the `users_outbox` table. This should be done within one transaction. After that, through WAL or a trigger, we can send each entry in the `users_outbox` table as one message.

Pros

  • Easy to implement in the existing system
  • Standalone CDC connector

Cons

  • Triggers overhead (define for each table)
  • Only works with a log-based database
  • No reliable open-source tool
  • Need to maintain schema mapping between database logs and events

Examples

  • Debezium
  • DataStax (Cassandra)

Debezium writes about Event-Sourcing vs CDC vs Outbox.

Other resources

Summary

Distributed Systems which seems like a single term is actually based on dozens of basic key concepts in computing, algorithms, and design patterns. With the hype of microservices architecture and async communication, these concepts have become more important than ever before.

I couldn’t cover everything in detail in this blog, but as said before, the purpose is to share basic information and resources for further understanding. If you liked it, please share it in your circle.

Did you enjoy this post? Follow Klarna Engineering on Medium and LinkedIn to stay tuned for more articles like these.

--

--