The Byzantine Generals Problem is an analogy in computer science used to describe the challenge of establishing and maintaining security on a distributed network. To solve this problem, honest nodes (e.g. computers or other physical devices) need to be able to reach consensus despite the presence of dishonest nodes. This means a majority of nodes must establish a set of rules and come to agreement on how to enforce those rules on the network.
In this article, we’ll explain what the Byzantine Generals Problem is in more detail and why it's so important to solve. We’ll look at some of the best solutions that have been proposed and implemented over the years. Finally, we’ll discuss how various blockchain networks have used consensus protocols to overcome the Byzantine Generals Problem and facilitate secure transactions.
What Is The Byzantine Generals Problem?
The Byzantine Generals Problem is a common challenge that decentralized computer systems must overcome. Let’s look at this analogy and how it relates to modern data security.
The Byzantine Generals Problem Research Paper
In 1982, Leslie Lamport, Robert Shostak, and Marshall Pease released a research paper titled, “The Byzantine Generals Problem”. The importance of this concept is clear from the first page, which states that their research was funded by the National Aeronautics and Space Administration (NASA), the Ballistic Missile Defense Systems Command, and the Army Research Office. Although the Byzantine Generals Problem had existed in computer science well before 1982, this was one of the first attempts to put this issue into a relatable analogy and propose solutions for solving it.
The analogy used for the Byzantine Generals Problem basically goes like this: Several divisions of the Byzantine army are stationed just outside of an enemy city and are preparing for battle. Various generals can only communicate with each other via a messenger. They must agree upon a common course of action. However, we must assume that some generals are traitors who wish to prevent loyal generals from agreeing upon a common course of action. An algorithm is needed to ensure that a small group of traitors can’t disrupt communications. To solve the Byzantine Generals problem, loyal generals need a secure way to come to agreement on a plan (known as consensus) and carry out their chosen plan (known as coordination).
While actually solving the Byzantine Generals Problem is quite complex, we now understand the fundamental challenge. It’s crucial to recognize that the concept can be applied strictly to military communications, as the analogy does. However, this problem actually applies to all sorts of computer systems not involving military applications. Whenever a distributed group of nodes (e.g. computers or other physical devices) need to achieve reliable communications, the network needs to solve the Byzantine Generals Problem.
Byzantine Failures
There are several possible reasons why a distributed computer system might fall. These are commonly known as Byzantine failures (also known as Byzantine faults). Looking at the analogy above, Byzantine failures are essentially the traitors who attempt to disrupt communications between the loyal generals.
Applying this concept to real-world computer systems, this could be a software bug, hardware malfunction, and/or a malicious attack. In other words, Byzantine failures don’t necessarily have to be an orchestrated attack from a bad actor. They can simply be issues that prevent nodes from coming to agreement on decisions for a distributed network.
Byzantine Fault Tolerance
Byzantine failures are practically inevitable within any distributed computer system. Let’s say there is a power outage and nodes suddenly go offline. Is the network still able to function as normal and continue supporting reliable communications? Or does the entire system suddenly stop working or become vulnerable to attacks?
In a moderately secure network, something as simple as a few nodes going offline doesn’t have any noticeable impact on the network. The ability to defend against these scenarios is known as Byzantine Fault Tolerance. Networks that are able to handle more Byzantine failures are considered to have a higher tolerance, which means they are more secure than ones that can’t handle Byzantine failures.
Achieving Byzantine Fault Tolerance
Achieving Byzantine Fault Tolerance has historically been a difficult task. That’s because there are a wide range of security challenges for computer scientists to consider. Additionally, there’s a need to create practical solutions that address seemingly abstract, theoretical threats. Here are a few solutions that have been developed over the years for various distributed system applications.
Early Solutions
Research on Byzantine Fault Tolerance began in the 1950s and mainly revolved around the aviation industry. Several Byzantine Fault Tolerance solutions for aircraft and spacecraft were developed in the late 1970s and early 1980s.
In 1978, researchers at Draper Laboratory published a technical report on the Fault-Tolerant Multiprocessor (FTMP) — a multiprocessor computer that eliminates single-fault vulnerability for aircraft modules. This research continued throughout the 1980s.
In the late 1970s, Honeywell Labs developed the Micro-processor Flight Control System (MMFCS), which enabled precise detection of Byzantine failures and the ability to differentiate between other types of failures. In 1981, SRI International, the same organization that coined the term “Byzantine Generals Problem”, published a technical paper for an aircraft control computer called Software Implemented Fault Tolerance (SIFT).
Paxos Protocol And The Part-Time Parliament
In 1989, Leslie Lamport first published the Paxos protocol. In 1998, Lamport released a journal article titled, “The Part-Time Parliament” to describe this solution for the Byzantine Generals Problem. Once again, a simple analogy can be used to understand a complex problem.
In ancient times, the Aegean island of Paxos was run by a part-time parliament. Because trade was prioritized over governance, no one in Paxos was willing to become a full-time member of the island’s parliament. This meant that the government had to find a way to continuously function even as its officials stepped in and out of their roles. Since many of the details about how the parliament ran were lost to history, Lamport made some assumptions about how the governance system worked. Lamport found that this system was similar to the challenges that modern distributed systems were required to resolve.
Based on this analogy, Lamport and other computer scientists developed the Paxos protocol, which is actually a family of consensus protocols that formed the foundation of the state machine replication approach. Whereas previous systems had required nodes to be online at the same time and communicate in a synchronous manner to reach consensus, the Paxos protocol was asynchronous. Essentially, state machine replication introduced an effective way for nodes in a distributed system to perform independent verification and communicate in an asynchronous or semi-synchronous manner.
This involves four main steps.
- A device stores the state of the network (known as a state machine) and determines the initial state.
- The server is replicated multiple times.
- Server replicas receive the same data inputs in the same order, enabling them to generate the same outputs.
- Server replicas vote on replica outputs to verify data is correct and to achieve fault tolerance.
The Paxos protocol became a significant development in the field of computer science because it introduced a way to guarantee data consistency across a distributed system and enabled networks to become more secure against possible Byzantine failures.
Practical Byzantine Fault Tolerance (pBFT)
In 1999, Miguel Castro and Barbara Liskov published a research paper titled, “Practical Byzantine Fault Tolerance” that introduced a new algorithm for achieving Byzantine Fault Tolerance. The word “practical” is used because Castro and Liskov found that previously-developed algorithms either assumed the network was synchronous or didn’t provide a practical means of reaching consensus for asynchronous systems.
This algorithm, typically abbreviated as pBFT, allowed asynchronous networks to process thousands of requests per second with only a sub-millisecond increase in latency. Although it became a major breakthrough for distributed systems, pBFT faced two main issues that limited adoption. First, pBFT becomes more costly to use as the number of nodes increases. Second, pBFT is susceptible to Sybil attacks— a common security threat in which an attacker uses multiple pseudonymous identities to control a majority of nodes on a peer-to-peer network. More protocols were later introduced to solve some of pBFT’s limitations. For example, Q/U, HQ, Zyzzyva, and ABsTRACTs focused on solving performance and cost issues. Aardvark and RBFT focused on solving robustness issues.
Bitcoin Network
In October 2008, Satoshi Nakamoto published the original Bitcoin whitepaper. Although the term “Byzantine Generals Problem” is never used in this document, Nakamoto effectively proposed a solution that would be implemented with the launch of the Bitcoin network in January 2009. Bitcoin became the world’s first blockchain, which is one variety of distributed ledger technology (DLT).
The network introduced the ability for users to securely send and receive a digital currency called Bitcoin (BTC). Other distributed systems for digital payments were proposed before Bitcoin, but they were unsuccessful largely due to their inability to prevent Byzantine failures. Because those solutions didn’t solve the Byzantine Generals Problem, they were prone to a security threat known as the double spending problem. In other words, users would be able to spend funds that didn’t actually exist. With Bitcoin, the double spending problem is solved because the network’s design provides a very, very high level of Byzantine Fault Tolerance.
So how does the Bitcoin network accomplish this? It’s important to understand that Bitcoin builds upon previous solutions for the Byzantine Generals Problem. For example, the network enables asynchronous communication between nodes and is essentially a replicated state machine. The network’s security also relies upon a combination of concepts such as asymmetric encryption, peer-to-peer networking technology, and Proof of Work (PoW). Just like the Paxos protocol or pBFT, Proof of Work is a consensus protocol. Although PoW was first proposed in 1992, Bitcoin became the first network to introduce a competitive aspect of PoW data validation known as mining. More PoW-based networks soon began to adopt Bitcoin’s solution for the Byzantine Generals Problem. Other varieties of blockchain consensus protocols would soon follow.
Byzantine Fault Tolerance Solutions For Blockchain Networks
There are currently three main types of consensus protocols used by blockchain networks. Quite a few variations exist in the numerous implementations of these protocols; however, most networks share the same general mechanics to achieve Byzantine Fault Tolerance.
Proof of Work (PoW)
As mentioned above, Bitcoin became the first network to implement a process known as cryptocurrency mining for Proof of Work (PoW) consensus. In this system, the Byzantine Generals Problem is solved by rewarding honest users (known as miners) who solve difficult math problems with powerful computers. Once the correct solution is verified, the miner receives a reward in the blockchain network’s native coin. Miners on the Bitcoin network, for example, receive 6.25 BTC per block (as of the time of this writing). The block is then mined, and the transaction is completed.
While the process actually requires a few more steps, the main idea to understand is that large PoW-based blockchain networks like Bitcoin are highly secure. That’s because it would take the millions of dollars to even attempt to attack the network and create a Byzantine failure. To successfully carry out a double spending attack, dishonest miners would need to gain control of 51% of the network’s computing power, which is known as a 51% attack. Although this is practically impossible on the Bitcoin network, it’s actually a major problem for smaller networks with fewer miners since the computational resources and costs are much lower.
Proof of Stake (PoS)
First implemented in 2012, Proof of Stake (PoS) is another blockchain consensus protocol that aims to solve the Byzantine Generals Problem. Unlike PoW-based networks, PoS-based networks don’t rely upon cryptocurrency mining. Instead, a process known as staking is used. In this system, users (known as validators) stake funds. Validators that stake more of a blockchain’s coins are able to validate more blocks and earn more rewards. Additionally, users who try to validate invalid transactions may lose their staked funds.
Users can stake coins with basic home computers rather than needing to have specialized computers required for mining in a PoW-based network. A few PoS-based networks have implemented solutions that prevent double spending attacks and other possible security issues that might occur as a result of Byzantine failures. For example, Ethereum 2.0 (Serenity) will feature a PoS algorithm called Casper, which requires a two-thirds majority for nodes to reach consensus before blocks can be created.
Delegated Proof of Stake (DPoS)
First implemented in 2014, Delegated Proof of Stake (DPoS) is a blockchain consensus protocol that works similarly to Proof of Stake. Both require users to stake funds. However, with DPoS-based networks, only a few users (known as delegates) are able to validate transactions and create blocks. Generally, all users can stake a blockchain’s coins as a means of casting a vote in favor of a delegate candidate. Elected nodes generally distribute block rewards with their voters proportional to the amount of funds staked in delegate elections.
With DPoS, nodes are able to reach consensus much faster than PoW or PoS. This means transactions can be processed much quicker at scale. The tradeoff is that maintaining a high level of Byzantine Fault Tolerance with DPoS may become difficult in some situations. Fewer nodes are responsible for keeping the network secure, meaning it’s theoretically easier for nodes to collude against the best interest of the majority. However, DPoS-based networks aim to avoid this scenario by holding delegate elections frequently as a means of ensuring delegates remain accountable for their decisions.
📧Komodo Newsletter
If you'd like to learn more about blockchain technology and keep up with Komodo's progress, subscribe to our newsletter. Begin your blockchain journey with Komodo today.