Introduction to Distributed Systems

June 20, 2023

#Updated: June 22 2023 - Add diagrams and example after peer review

As part of my Software Engineering self-study curriculum I'm (re)learning key concepts in our fast-changing world of tech.

I'm currently reading through the book "Understanding Distributed Systems" by Roberto Vitillo.

But they say the best way to learn is to teach. So, here's an Introduction to Distributed Systems🖥️⚡:

Did you know that Distributed Systems are really the backbone of AI (like ChatGPT), and Blockchain (like Bitcoin)?

A Distributed System is basically a bunch of computers that work together for a common goal.

The goal might be to send a webpage to millions of people across the web (like this message).

It might be to generate interesting ideas from a very large amount of data (AI).

It might be to submit a financial transaction on a blockchain (like Bitcoin).

So distributed systems exist to make all these important use-cases possible.

This sounds simple, so why are distributed systems so complicated? Because lots of things can go wrong!

In fact, most of the effort in distributed systems is focused on making sure things don't go wrong.

Here are some examples of things that can go horribly wrong:

Data getting lost
Data getting corrupted
Computers not being able to reach each other
Computers disagreeing with each other
Computers in the network dying
Computers being very slow
Hackers breaking into the network

With so many important systems relying on being distributed, some smart people decided to put on their mask and cape (like Batman does), and came up with many ways to make distributed systems more reliable, thus saving the day!

These are some of the common solutions to problems in distributed systems:

How do we make it so computers can die without the whole network dying? - Fault Tolerance 💔

Fault Tolerant

What if some computers are temporarily down? Should we allow that, and for how long can they be down? - Availability 🟢

Computers that work together on a network will occasionally lose connection to each other. How can we ensure this isn't fatal? - Partition Tolerance 🥂

How can we make computers agree with each other at all times? - Consensus Algorithms 🤝🏾

Waiting for all these computers to agree with each other takes a long time, and time is money! Can they just win a majority vote? - Quorum ✋

If computers really don't agree, how can we handle this? - Conflict Resolution Algorithms ☮️

If a computer dies, do we have backups of its data? - Replication ✌🏻

Time isn't what you think it is! How do two computers on opposite sides of the world decide the order in which events actually happened? - Logical Clocks ⏰

How can a bunch of unmanaged computers work on a task... We need a manager! Who's the manager? - Leader Election 🗳️

When storing new data, how important is it that people immediately see the new data, rather than the old data? - Consistency Models 🔋

These are just a few key features of distributed systems. But let's put them together and create an example system.

For this example, I'll build a simple Content Delivery Network (CDN). The aim of a CDN is to make sure users like us can access data on the internet as fast as possible, whether that's images on Google Images, movies on Netflix, or songs on Spotify. Making sure that data is quick and easy to access is essential!

What are the key components of a Content Delivery Network?

A user (with a computer)
A data store with the data we need
A server to get the data from the data store and send it to a user

Small CDN

Using these 3 components (user, server, and data store), a user can get data from a server. But the problem here is that if we have millions of users trying to get data, the system will get overwhelmed and break. It's not Fault Tolerant and it doesn't scale!
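To make the three components concrete, here is a minimal sketch in Python. The class names, the `capacity` limit, and the `cat.jpg` key are all made up for illustration; a real server doesn't fail at a fixed request count, but the effect of too many users on one machine is similar.

```python
class DataStore:
    """Holds the content we want to serve (hypothetical in-memory store)."""

    def __init__(self):
        self._items = {"cat.jpg": b"<image bytes>"}

    def get(self, key):
        return self._items.get(key)


class Server:
    """Fetches data from the data store on behalf of a user."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity  # assumed max number of requests it can take
        self.handled = 0

    def handle(self, key):
        if self.handled >= self.capacity:
            raise RuntimeError("server overwhelmed")
        self.handled += 1
        return self.store.get(key)


# One server, one data store, five users asking for the same image.
server = Server(DataStore(), capacity=3)
results = []
for user in range(5):
    try:
        server.handle("cat.jpg")
        results.append("ok")
    except RuntimeError:
        results.append("failed")

print(results)  # ['ok', 'ok', 'ok', 'failed', 'failed']
```

With a single server, every user past its capacity is simply turned away, which is the problem the rest of the example tries to solve.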

So, to make this simple CDN scale to an increased number of users, we also need to increase the number of servers and data stores.

This means data would need to be Replicated from a single data store to multiple data stores.
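As a rough sketch of what Replication buys us, the snippet below copies everything from one data store to two others and then reads after the original has died. The names (`replicate`, `read`, `alive`) are hypothetical, and the copy is a naive full copy rather than how a production system would do it.

```python
class DataStore:
    def __init__(self, name):
        self.name = name
        self.items = {}
        self.alive = True


def replicate(primary, replicas):
    """Copy everything on the primary to every replica (naive full copy)."""
    for replica in replicas:
        replica.items = dict(primary.items)


def read(key, stores):
    """Return the value from the first store that is still alive."""
    for store in stores:
        if store.alive:
            return store.items.get(key)
    raise RuntimeError("no data store available")


primary = DataStore("primary")
replicas = [DataStore("replica-1"), DataStore("replica-2")]

primary.items["song.mp3"] = "audio bytes"
replicate(primary, replicas)

primary.alive = False  # the original data store dies
value = read("song.mp3", [primary] + replicas)
print(value)  # audio bytes
```

Because the data lives in more than one place, losing the first data store no longer means losing the song.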

And since we have lots of servers, there needs to be some sort of coordination. Here it depends on the Consistency Model. If we don't care whether people read data that isn't up-to-date, we can go for a weaker consistency model called Eventual Consistency.

Eventual Consistency can use a relaxed form of Quorum, whereby only a subset of nodes in a network has to acknowledge a write before it is accepted, and the remaining nodes catch up later. This means that if you fetch data from a group of nodes that hasn't caught up yet, you'll get outdated information back.
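Here is a small sketch of how reading from a subset of nodes can return stale data. The names `N`, `W`, and `R` (total replicas, replicas written to, replicas read from) are my own labels, borrowed from how Dynamo-style databases usually describe this, and the functions deliberately pick the worst case: the read asks exactly the replicas the write has not reached yet.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def write(replicas, value, version, w):
    """Store the value on the first `w` replicas only; the rest stay stale."""
    for replica in replicas[:w]:
        replica.version, replica.value = version, value


def read(replicas, r):
    """Ask the last `r` replicas and return the newest value among them."""
    newest = max(replicas[-r:], key=lambda rep: rep.version)
    return newest.value


N = 3
replicas = [Replica() for _ in range(N)]

write(replicas, "v1", version=1, w=N)  # every replica has v1
write(replicas, "v2", version=2, w=1)  # only one replica has v2 so far

stale = read(replicas, r=1)  # R + W = 2 <= N: can miss the newest write
fresh = read(replicas, r=3)  # R + W = 4 >  N: read set overlaps the write set

print(stale, fresh)  # v1 v2
```

The rule of thumb is that whenever `R + W > N`, the nodes you read from must include at least one node that saw the latest write.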

If Eventual Consistency isn't acceptable, and we want to ensure that people only see the most up-to-date version of whatever data we're storing, we'd go for a Strongly Consistent model.

To make sure our data is Strongly Consistent, it isn't enough for an arbitrary subset of nodes to agree: every node must apply the same updates in the same order. So we'd instead use a Consensus Algorithm like Raft, which guarantees that nodes coordinate well and agree with each other. Raft still only needs a majority of nodes to confirm each update, but all updates go through a Leader that has been elected via Leader Election to reign supreme and keep all nodes on track to complete their tasks.
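As a rough illustration of Leader Election: in Raft, a candidate becomes Leader once a strict majority of the nodes in the cluster votes for it. The sketch below shows only that majority check. It is not Raft itself (it leaves out terms, logs, and timeouts), and the node names and the `elect_leader` helper are made up.

```python
def elect_leader(candidate, votes, cluster_size):
    """Return True if `candidate` got votes from a strict majority of nodes."""
    granted = sum(1 for choice in votes.values() if choice == candidate)
    return granted > cluster_size // 2


# Five nodes, each voting for one candidate.
votes = {
    "node-a": "node-a",
    "node-b": "node-a",
    "node-c": "node-a",
    "node-d": "node-e",
    "node-e": "node-e",
}
a_wins = elect_leader("node-a", votes, cluster_size=5)  # 3 of 5 votes
e_wins = elect_leader("node-e", votes, cluster_size=5)  # 2 of 5 votes

# During a network partition only two nodes can reach each other, so even
# a unanimous vote among them is not a majority of the whole cluster.
partition_votes = {"node-d": "node-e", "node-e": "node-e"}
minority_wins = elect_leader("node-e", partition_votes, cluster_size=5)

print(a_wins, e_wins, minority_wins)  # True False False
```

Requiring a majority of the whole cluster is what prevents two separated groups of nodes from each electing their own Leader.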

The trade-off between choosing a Strongly Consistent or a Weakly Consistent model is that of Availability/Performance vs. Consistency. The stronger the consistency, the poorer the performance and the lower the Availability, due to the overhead of keeping all nodes in coordination.
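To put some numbers on the performance side of this trade-off, here is a tiny calculation. The response times are invented for illustration, and the sketch assumes we contact every node at the same time and proceed as soon as enough of them have replied.

```python
def time_to_ack(latencies_ms, acks_needed):
    """Time until `acks_needed` nodes have replied, if all are asked at once."""
    return sorted(latencies_ms)[acks_needed - 1]


# Made-up response times for five nodes; one of them is having a bad day.
latencies_ms = [10, 12, 15, 40, 300]

wait_for_one = time_to_ack(latencies_ms, 1)
wait_for_majority = time_to_ack(latencies_ms, 3)
wait_for_all = time_to_ack(latencies_ms, 5)

print(wait_for_one, wait_for_majority, wait_for_all)  # 10 15 300
```

The more nodes that have to reply before we can continue, the more a single slow node drags everything down.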

As for Partition Tolerance, in the world of distributed systems, partitions are inevitable, so all systems must be partition tolerant, regardless of the consistency model.

On a final note, I'd like to mention how Conflict Resolution and Logical Clocks can apply in our CDN.

If two people in separate parts of the world try to write data to the same value in a system, this may produce a conflict, because the system may not know who wrote the value first or last. One way to determine this is by using a special implementation of time in the Distributed Systems world known as logical time, as determined by Logical Clocks. Nodes attach special timestamps to the messages they send each other, which lets them preserve the causal order of events across the system. Using this way of ordering things, the system can handle Conflict Resolution by simply keeping the write with the most recent timestamp as the stored value.
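One well-known kind of Logical Clock is the Lamport clock, and "keep the most recent write" is often called last-writer-wins. The sketch below combines the two. The city names, the dictionary layout of a write, and the tie-break on node name are assumptions made for this example, not something a particular system prescribes.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event, such as a write."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Merge in the timestamp carried by an incoming message."""
        self.time = max(self.time, sender_time) + 1
        return self.time


def resolve(write_a, write_b):
    """Last-writer-wins: keep the write with the larger (timestamp, node)."""
    return max(write_a, write_b, key=lambda w: (w["ts"], w["node"]))


london, tokyo = LamportClock(), LamportClock()

# London writes first and tells Tokyo about it.
write_1 = {"node": "london", "ts": london.tick(), "value": "blue"}  # ts = 1
tokyo.receive(write_1["ts"])                                         # tokyo's clock is now 2

# Tokyo writes afterwards, so its timestamp is guaranteed to be larger.
write_2 = {"node": "tokyo", "ts": tokyo.tick(), "value": "green"}   # ts = 3

winner = resolve(write_1, write_2)
print(winner["value"])  # green
```

Notice that no wall-clock time is involved: Tokyo's write wins because it happened after Tokyo heard about London's write, not because of what any real clock said.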

Below is our finished Content Delivery Network (CDN)!

Big CDN

There is plenty more to distributed systems, but hopefully this gives enough to whet your appetite to learn more.

In future learn-and-teach posts we'll cover more topics and dive deeper into some of the above 📚

Happy learning!
