Notes on ‘The Byzantine Generals Problem’
2017-07-15
Paper by Leslie Lamport, Robert Shostak, and Marshall Pease, published in 1982. Foundational reading for distributed systems such as Bitcoin. First, cursory reading; notes taken with free recall.
The paper
The paper deals with guarantees for systems where subcomponents fail. Given some assumptions, what conditions must be met in order to guarantee a reliable system?
The problem tackled in this paper is a set of generals trying to coordinate whether or not to attack a city. They are spread around the city, so they have to communicate by sending messages. Most generals are loyal, but there might be some traitors, or messages that get lost, etc. How can we guarantee that the loyal generals take the same coordinated action? By generals we of course mean systems/processes (or even communication lines), and by loyal we mean functioning.
It turns out that, in the standard case, if you have m traitors you need at least 3m + 1 generals. Let’s look at why 3 generals with 1 traitor don’t work. We have a commander (A) and two lieutenants (B and C). If A is a traitor, he might send an “attack” order to B and “retreat” to C. B relays his order (“attack”) to C, and C relays his order (“retreat”) to B. Now B and C both have one attack order and one retreat order, but neither knows which one is genuine.
Now assume A is loyal but C is a traitor. A sends “attack” to both B and C, but C relays “retreat” to B. Now B is in the same situation as before and doesn’t know whom to trust.
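To make the symmetry concrete, here is a tiny sketch (my own illustration, not from the paper) of B’s view in the two scenarios: the information B holds is identical, so B cannot tell whether A or C is the traitor.

```python
# Sketch (my illustration, not from the paper): lieutenant B's view in the
# two scenarios above. B sees one order from the commander A and one
# relayed order from C; the views are identical, so B cannot tell which
# general is lying.

def lieutenant_b_view(order_from_a, order_relayed_by_c):
    return {"from A": order_from_a, "relayed by C": order_relayed_by_c}

# Scenario 1: A is the traitor and sends conflicting orders.
view_1 = lieutenant_b_view("attack", "retreat")

# Scenario 2: A is loyal ("attack" to everyone), but C lies when relaying.
view_2 = lieutenant_b_view("attack", "retreat")

assert view_1 == view_2  # B's information is the same either way
```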
Lamport warns that this type of informal reasoning is very treacherous in this domain. There is a set of formal mathematical proofs (by induction and by contradiction) that I won’t elaborate on here.
There are some key assumptions here. Let’s see if I can recall them:
A1. Every message is readable, i.e. either ‘attack’ or ‘retreat’
A2. All generals must be able to communicate with each other
A3. We must be able to detect the absence of a message
A3 is simply timeouts. Assuming I got the assumptions right, it’d be interesting to see what breaks when each one fails. Without timeouts, we could end up waiting forever. If some message isn’t readable, well, isn’t that the same as a timeout? So that one seems off. If not all generals can communicate with each other, they don’t all get the same info? Not very rigorous, though. So I’m unclear on A1-A2.
Question: What assumptions must be true for the ‘standard case’?
Answer, while referencing the paper:
A1) ‘every message that is sent is delivered correctly’, i.e. not corrupted.
A2) ‘the receiver of a message knows who sent it’.
A3) like above (the absence of a message can be detected).
It is more robust to know how to break something than to try to remember the assumptions directly; that way the assumptions reveal themselves. I believe a violation of A1 might have been behind some system-wide failures in NASA’s circuits (digital vs analog).
Question: What happens if each one of them breaks down?
Answer, referencing the paper: If A1 breaks down, a traitor can interfere with messages in transit. If there’s a direct line between loyal generals, it isn’t clear to me how this case can happen. If a message gets corrupted, isn’t that just the same as it being sent by a corrupt general? Indeed: later on, in section 6, Lamport makes this point. It is just a matter of distinguishing between faulty generals and faulty communication lines, but we can treat them as the same ‘failure’ bucket. This probably matters for the proofs based on graph connectedness, though, since we are conflating nodes and edges here. If A2 doesn’t hold, traitors can introduce spurious messages, e.g. flood the system with messages while impersonating others, I imagine. If A3 doesn’t hold, a traitor can simply not send a message and delay the decision forever.
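On that last point, a minimal sketch (my own, not from the paper) of A3 in practice: a receive with a timeout that falls back to a pre-agreed default order, so a silent traitor can’t stall the decision forever.

```python
import queue

# Pre-agreed default used when no message arrives in time (the paper
# suggests a fixed default order along these lines, e.g. "retreat").
DEFAULT_ORDER = "retreat"

def receive_order(channel: queue.Queue, timeout_s: float = 10.0) -> str:
    """Wait for an order; if none arrives in time, assume the default.

    This is assumption A3 in practice: the absence of a message is
    detectable (via the timeout) and replaced by a default, so a traitor
    who stays silent cannot delay the decision forever.
    """
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        return DEFAULT_ORDER

# Hypothetical usage: a traitor never sends anything on this channel.
silent_channel = queue.Queue()
print(receive_order(silent_channel, timeout_s=0.1))  # -> "retreat"
```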
Another case is when all generals (or processes / communication links) get the same inputs and we can guarantee that a message really comes from a certain general. This is a fuzzy explanation, but the paper makes it precise with an additional assumption, A4: signed messages whose signatures cannot be forged. It is my impression this is rarely the case in practice, though. The idea is that we can pass along where the information comes from and guarantee it hasn’t been tampered with; thus all nodes end up with the same info and we can do a majority vote.
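A minimal sketch of what such signed messages could look like, using per-general HMAC keys as a stand-in for real unforgeable signatures; the generals’ names and keys here are hypothetical, the paper’s A4 only requires that signatures can’t be forged and that anyone can verify them.

```python
import hashlib
import hmac

# Sketch of assumption A4 (signed messages), my own illustration.
# Each general signs the order it sends; a receiver can check both who
# signed it and that it wasn't altered in transit.
KEYS = {"A": b"key-A", "B": b"key-B", "C": b"key-C"}  # hypothetical setup

def sign(general: str, message: str) -> str:
    return hmac.new(KEYS[general], message.encode(), hashlib.sha256).hexdigest()

def verify(general: str, message: str, signature: str) -> bool:
    return hmac.compare_digest(sign(general, message), signature)

# A signs "attack"; B forwards it to C unchanged: verification passes.
sig_a = sign("A", "attack")
assert verify("A", "attack", sig_a)

# A traitorous relay cannot change the order to "retreat" without
# breaking A's signature.
assert not verify("A", "retreat", sig_a)
```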
Question: In what real-world scenarios can we get away with majority voting? (More of a follow-up question than something the paper can answer, I think.)
Example. 3 generals like before: A, B, C. A says attack to B (message 0: attack) and retreat to C (message 0: retreat). B forwards 1: attack to C, and C forwards 2: retreat to B. (What property is necessary here? Seems like uniqueness, not ordering.) Now B has the following state: (0: attack, 2: retreat), and C: (0: retreat, 1: attack). I don’t think the ID matters, as long as the information is passed along and is guaranteed to be correct. If they both have access to that info, we can have a function choose(attack, retreat) that applies some ordering and logic to those options (pre-programmed) and ensures both take the same final action: a form of majority voting. If we have a 10m timeout, the input to all nodes is (or can be seen as) the same.
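A sketch of that choose() idea, under the assumption that signed relaying leaves B and C holding the same set of orders provably issued by A:

```python
# Sketch (my illustration) of the choose() idea above. With signed
# messages, B and C each end up holding every order A provably issued:
# B relays A's signed "attack" to C, C relays A's signed "retreat" to B,
# so both hold the same set {"attack", "retreat"}. Any deterministic,
# pre-agreed function of that set then gives them the same final action.

def choose(orders: frozenset) -> str:
    if len(orders) == 1:
        return next(iter(orders))
    # A demonstrably sent conflicting orders: fall back to a fixed rule.
    return "retreat"

b_orders = frozenset({"attack", "retreat"})  # B's view after relaying
c_orders = frozenset({"attack", "retreat"})  # C's view after relaying

assert choose(b_orders) == choose(c_orders)  # same coordinated action
```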
Thoughts
I’ve heard about this before, but usually in the form of the ‘Two Generals Problem’, where the generals keep asking each other for confirmation that the other one heard back. How do the two relate? I think it is a matter of A1 not holding: if the communication lines are unreliable, it is impossible for either general to know whether they will take the same action.
Regardless, interesting and foundational paper. Didn’t dive deep into the formal reasoning, but just knowing what guarantees you can make under what assumptions (and what happens if they break!) is useful, I think.
How does this apply to practical distributed systems? Bitcoin, MongoDB, Torrent, microservices with req/rep and message queues, etc.
Further reading:
Direction: Lamport
Direction: Other foundational distributed systems papers
Direction: Jepsen
NASA real-world system failure deck
Direction: Byzantine generals problem in the real world