Distributed Systems Fundamentals Part 1 or: How I Learned To Start Worrying and Confront the Possibility of Message Loss

At TheoremOne, we approach problems by reasoning about them from first principles. To truly understand a problem and be confident about your solution, you must have a working understanding of the fundamental constructs and motivating logic behind what you’re doing. In this spirit, we’re releasing a a blog series about the fundamentals of distributed systems. This post is the first of that series.

Whether we know it or not, every solution we build is a distributed system, and sound analysis of the distributed nature of a solution is critical to its reliability. Reading this blog series and thinking critically about how it applies to your work will put you well on your way to internalizing these concepts and leveraging them to build rock-solid solutions that will stand the test of time.

A Tale of Two Generals

When you start learning about distributed systems, one of the first topics covered is usually the Two Generals’ Problem - a thought experiment meant to demonstrate the implications of unreliable communication in a simple and intuitive way.

In the scenario, two generals need to communicate via messengers to come to a consensus about a plan of attack, while a hostile army lies between them with some non-zero probability of capturing the messengers. The generals cannot detect when the messengers are captured, because they only know about the messages they receive and because the absence of evidence is not the evidence of absence. As a result, even if some of the messengers get past the enemy forces without being captured, it is impossible to construct a messaging protocol that will guarantee the two generals to become mutually certain of their agreement in any finite period of time.

an illustration of The Two Generals' Problem

Why is this fundamental concept illustrated in terms of warfare? Because the first lesson of distributed systems is to treat every move you make as if it is in hostile territory. Much like the study of security, the study of distributed systems is based on a profound paranoia. Much like the philosophical skepticism of John Locke, there is nothing in this world that you can be certain about other than what you have directly observed. When you learn to question everything you hold dear, you’ll be off to a great start in distributed systems.

What kinds of things do we hold dear?

that the user’s request reached our application
that our application wrote the change to the database
that the database wrote the change to disk
that the changes were replicated to the distributed database replicas
etc…

The Two Generals’ Problem teaches us that each of these occurrences is called into question - they may fail to occur due to message loss. If we use a response message to confirm success, that too can be lost. With each action we take, we must confront this uncertainty because the messages that cause the actions (along with the execution of the actions themselves) are transmitted across an unreliable communication medium.

What counts as an unreliable communication medium?

It's important to remember that the unreliable medium in question isn't just the space between servers (the network), but also the space between a server and its disk (the kernel, the disk buffer, etc), and even the space between two consecutive CPU instructions in a single program (the possibility of the program's execution being suddenly interrupted). That is, just because the figure below is shown with only three "hops" between entities doesn't mean there aren't more opportunities for failure. Indeed, if we understand the system as Zeno would, we see that it could be divided into an infinite number of hops with correspondingly infinite opportunities for failure!

In other words, every medium is unreliable, and every system is a distributed system - we make distinctions by degree and not by kind.

Dealing with Uncertainty

We’ve seen that message loss is always possible, and success cannot be guaranteed, so how do we move forward and build practical systems? The second lesson of distributed systems is that systems be made robust by understanding and confronting failure. So let’s start by taking a look at some of the failure modes that can arise.

an illustration of the implications of a lost message

As shown in Figure 2, in the event of message loss, we (represented by the requesting entity on the left) cannot distinguish whether the final action was triggered (implying it was the response that was lost) or not (because it was the request that was lost).

Already, even with this elementary thought experiment, we are confronted with one of the perennial dilemmas of distributed systems: “at least once” delivery versus “at most once” delivery.

If we assume the worst (that a missing confirmation implies the action didn’t finish), then we want to retry until we get a confirmation. But in the event of message loss, we risk delivering the message more than once. Alternatively, we can avoid retrying entirely to avoid the risk of delivering more than once, but we in turn run the risk of delivering the message less than once (that is, not at all).

an illustration of 'at least once' and 'at most once' messaging

Those of you who have been around the block may have heard these terms before, and if so you have also likely heard of “exactly once” message delivery; you may have heard that it’s impossible to guarantee, or you may have heard software vendors claiming to do just that. We’re not going to jump into that fray just yet because there are some other points to talk about first, but we’ll deal with it in the next post in this series. For now, keep it in mind as a sort of ideal that we often strive for when designing systems, even if we aren’t able to fully guarantee it.

A Brief History of Time

Let’s rewind for a moment. In the discussion above, some astute readers may have been wondering, “How does the requester know when to retry the request?” The word “when” here is the operative part of the question that insists we bring the concept of time into our model.

Given that we want to perform some action that affects something in the world, and assuming an upper limit on the speed of information, we can deduce that our action must necessarily unfold over some non-zero period of time. We can call this our third lesson of distributed systems: nothing is instantaneous.

So, we have a series of steps leading to a final action, with each step in the process taking time to complete, and each step in the process being prone to message loss. To continue considering time, we have to expand further and say that each step in the process is not only prone to message loss but also prone to message delay.

an illustration of message delay

In other words, the time taken in each step can vary - sometimes dramatically. What if a message takes days to reach the next step of the process? Well, the answer is context-sensitive. In many processes, a days-old message in flight has become totally worthless; in others, it may be well within the expected norms. As system designers, we have to understand our problem domain and develop a decision model for how the system will handle varying processing latencies.

Timing Out

One approach is to look at the field of possible latencies as a probability distribution, consult our service-level objectives, and make a judgment call about how long is too long to wait for a response. This choice becomes our “timeout” interval.

an illustration of bell curve latencies

In essence, a timeout interval is about creating a stark dichotomy between success and failure on the otherwise ambiguous gradient of time, treating message delay beyond a certain threshold as being equivalent to message loss. It’s about cutting off the “long tail” of latency. In a model where the time a request could take is theoretically unbounded, we have to cut the tail off somewhere.

When we choose a timeout interval, we are navigating the tradeoff between reducing the upper bound of time that a successful request will take and reducing the percentage of failed requests that will occur. That’s why there is no universally-optimal timeout value - it is dependent on your prioritization of those two competing concerns.

Is it worth waiting minutes for a request that was expected to take only a few seconds just to preserve the possibility of success for a tiny fraction of requests? These are the kinds of questions we need to answer for ourselves in our problem domain.

Late to the Party

One more point to be aware of now that we are treating slow messages as if they were lost messages is that we have to deal with the consequences of what happens when they finally arrive, breathing heavily and apologizing profusely for their tardiness. What do we do with a message that we have already written off as being lost to us forever?

Well, the best we can do at this point is to ignore it. We’ve already sent a timeout failure message in its place, so there is no place for this late message anymore. It wasn’t lost before, but it will truly be lost now - left to fade into oblivion with no purpose or meaning remaining in its fate.

an illustration of phantom responses

Note that this distinction between an expected and unexpected response carries a notable implication for our implementation because the entity tasked with recognizing this distinction must be capable of retaining state. The state is needed so as to allow the entity to know whether a given response message was expected (having an unfulfilled outbound request associated with it) or unexpected (a response was already received, or the timeout interval has passed, or such a request was seemingly never sent in the first place).

There are also other ways to gracefully handle this situation, including designing the system to act with only idempotent effects. We will cover this important concept in the next post in our series. For now, just remember that phantom responses from timed-out requests are another factor that creates message duplication in an “at least once” messaging mode and that the system must be designed to account for them in some way.

Summary

In this post, we’ve discussed basic concepts of message loss, message delay, and message duplication, we’ve introduced some of the classic trade-offs in distributed system design, and we’ve set the stage for future posts in this blog series to dig into how our systems can grapple and engage with these ideas in a pragmatic way.

When you’re ready, you can continue on the the next post in the series.