Lesson 19-20:分布式原子性与一致性


Lesson 19-20 分布式原子性与一致性

这一章有很多算法,还是很感慨于人类的创造力…..能在混乱的分布式环境下想出这么多出路。哈哈当然还有最经典的拜占庭将军问题,但是拜占庭的问题变种有点多,所以这里就只在最后简单概括,需要完整阅读分析请留言~

从单机走向分布式,是计算规模扩大的必然结果。然而,这不仅仅是数量的堆叠,更是系统模型本质的改变。

在单机系统中,我们拥有上帝视角:共享的内存、统一的时钟、一旦断电就一起死掉的组件。而在分布式系统中,我们失去了这些确定性。我们面临的是一个并发的、没有全局时钟的、组件会独立失效的混乱世界。

如何在这样的混沌中建立秩序?这就是分布式原子性与一致性要解决的问题。

1. 分布式系统的挑战与理论边界

1.1 核心差异

与单机相比,分布式系统的本质困难在于:

  • 通信不可靠:消息可能丢失、乱序或无限延迟。
  • 部分失效:A 机器挂了,B 机器还活着,但 B 不知道 A 是挂了还是只是网络卡了。
  • 无全局时钟:你不能用“现在几点”来同步两个物理距离遥远的节点。

1.2 CAP 定理

这是分布式系统的“物理定律”。在一个异步网络环境中,以下三个属性无法同时满足:

  • Consistency (一致性):所有节点在同一时刻看到相同的数据(强一致性)。
  • Availability (可用性):系统在任何时候都能处理请求,哪怕返回的是旧数据。
  • Partition Tolerance (分区容错性):当网络因为故障被切断成两半(分区)时,系统仍能继续运行。

结论:在现实世界的分布式系统中,网络分区(P)是必然发生的。因此我们只能在 CP(为了数据正确暂时停服务,如银行系统)和 AP(为了服务可用容忍短暂数据不一致,如社交网络动态)之间做权衡。

1.3 FLP 不可能定理

这是一个比 CAP 更悲观的理论:在异步系统中,哪怕只有一个节点可能发生崩溃,也不存在一个能保证在有限时间内绝对终止(Liveness)的共识算法。
这并不意味着我们无法达成共识,而是说明了共识算法在极端情况下可能永远死循环(活锁),我们需要用超时等机制来打破僵局。

这里需要注意区分“中止”和“终止”的区别。

2. 基础设施:快照与死锁检测

在没有全局时钟的情况下,如何获得系统的“全局状态”?

2.1 Chandy-Lamport 快照算法

我们要拍一张“全家福”,但大家都在动。如果你先拍 A,再跑去拍 B,可能 A 发给 B 的钱在拍照间隙“消失”了。
Chandy-Lamport 算法的核心思想是利用 Marker 标记消息

  1. 发起节点记录自己状态,并向所有通道发送 Marker。
  2. 其他节点收到 Marker 后,记录自己状态,并继续转发 Marker。
  3. 一致性保证:它捕获的状态不一定是物理时间上的“此时此刻”,但一定是逻辑因果上一致的(不会出现“收到消息但没发出消息”的灵异现象)。

2.2 分布式死锁检测

死锁在分布式中表现为等待图 中的环。

  • 资源死锁:$P \to R \to Q$(P 等 Q 占有的资源)。
  • 通信死锁:$P \to Q$(P 等 Q 的回复)。
    通过快照算法构建全局等待图,如果发现环,则说明存在死锁。

3. 分布式原子性:全员提交 (2PC)

场景:一个跨银行转账事务,涉及节点 A 和节点 B。要么都执行,要么都回滚。这就需要原子性提交协议

3.1 两阶段提交 (2PC - Two Phase Commit)

引入一个协调者 来指挥所有参与者

  • Phase 1: 投票。协调者问:“都能提交吗?”参与者执行事务但不提交,锁定资源,写好 Undo/Redo 日志,然后回“Yes”或“No”。
  • Phase 2: 决断
    • 如果全员 Yes $\to$ 协调者发 Commit。
    • 只要有一个 No $\to$ 协调者发 Abort。

3.2 2PC 的致命缺陷:阻塞 (Blocking)

2PC 最大的问题是它是阻塞协议

  • 如果协调者在发出 Prepare 后挂了,参与者们就尴尬了:它们锁定了资源,手里拿着“Yes”,但不知道该提交还是回滚。它们只能无限等待协调者复活。这是 AC 的反面教材。

3PC 试图通过增加 Pre-Commit 阶段和超时机制来解决阻塞,但在网络分区时仍可能导致数据不一致,因此工业界更倾向于使用基于共识的算法。

4. 分布式共识:Paxos

2PC 要求 100% 节点同意,这太脆弱了。我们希望只要大多数 节点活着,系统就能工作。这就是共识 问题。

4.1 Paxos 算法

Paxos 是分布式一致性的基石(Google Chubby, Zookeeper 核心思想源于此)。

  • 目标:在一堆可能崩溃的节点中,就某个(Value)达成一致。
  • 核心机制
    • 多数派写入:不需要所有人同意,只要超过半数 ($N/2 + 1$) 同意即可。
    • 提案编号 (Proposal Number):为了防止旧的消息干扰,每个提案都有全局递增的编号。编号大的说了算

4.2 Paxos 流程简化

  1. Prepare 阶段:Proposer 抢占“话语权”。发一个编号 $N$,问大家“我能提议吗?”Acceptor 如果没见过比 $N$ 更大的,就承诺“好的,我不再理会小于 $N$ 的提案”。
  2. Accept 阶段:Proposer 获得多数派承诺后,正式发出提案 ${N, Value}$。Acceptor 如果没变心(没收到更大的 Prepare),就接受它。
  3. Learn 阶段:一旦多数派接受了,这个值就被选定 了。

5. 拜占庭容错与区块链

前面的算法(2PC, Paxos)都假设节点虽然会崩溃,但不会撒谎。如果节点被黑客控制,发送恶意数据呢?这就是拜占庭将军问题

5.1 拜占庭容错 (BFT)

在有恶意节点的情况下,需要 $3f+1$ 个节点才能容忍 $f$ 个叛徒。因为叛徒可以给一半人说 0,给另一半人说 1。PBFT 算法通过三轮复杂的投票解决了这个问题。

5.2 区块链与工作量证明 (PoW)

比特币通过一种概率性的方式解决了拜占庭共识:

  • 去中心化账本:所有人都有账本副本。
  • Sybil 攻击:为了防止坏人模拟出一百万个节点来投票,比特币要求投票必须付出代价(算力)。
  • PoW (Proof of Work)
    • 利用 HashCash 机制(计算 SHA-256 哈希满足特定个前导零)。
    • 这就把“算力”转换成了“投票权”。
    • 最长链原则:大家总是信任工作量最大的那条链。只要好人的算力超过 51%,诚实链就会比攻击链长。

6. 拜占庭容错

前面的 Paxos/Raft 解决的是节点“死机”或“断网”的问题(非拜占庭错误),但在更开放的网络(如区块链)中,节点可能会被黑客控制,进而撒谎、伪造消息。这就是著名的拜占庭将军问题

6.1 问题核心与目标

在一群将军(节点)中,存在叛徒(恶意节点)。我们要达成两个目标:

  • IC1 (一致性):所有忠诚的副将必须达成一致的行动(即使司令是叛徒)。
  • IC2 (正确性):如果司令是忠诚的,那么所有忠诚的副将必须执行他的命令。

6.2 两种解决模型

解决拜占庭问题取决于通信机制的能力(假设基础通信可靠 A1~A3):

A. 口头消息 (Oral Messages, OM)

  • 假设:消息可以被传递,但可以被伪造或篡改
  • 结论:要容忍 $m$ 个叛徒,总人数 $N$ 必须满足 $N \ge 3m + 1$
    • 直观理解:如果只有 3 个人(1 司令 2 副将),其中 1 个是叛徒。忠诚的副将无法分辨是司令在乱发命令,还是另一个副将在撒谎,因此无法达成一致。
  • 算法:通过递归的多轮消息传递和多数派表决。

B. 签名消息 (Signed Messages, SM)

  • 假设:引入了数字签名 (A4)。签名不可伪造,且任何改动都能被检测。
  • 结论:难度大幅降低。只要 $N \ge m + 2$ 即可容忍 $m$ 个叛徒。
  • 原理:叛徒无法篡改司令的命令,只能选择“转发”或“不转发”,无法制造混乱的假象。

6.3 故障分类体系

在分布式系统中,我们要分清“故障”的级别,不同级别的故障需要不同的协议来处理:

故障类型 表现行为 解决协议
Crash (宕机) 彻底停止,不响应。 Paxos, Raft
Timeout (超时) 响应太慢,可能阻塞。 Paxos, Raft (加超时机制)
Omission (不回应) 收到消息但不回复(丢包)。 Paxos, Raft (重发机制)
Byzantine (拜占庭) 撒谎、伪造、发送矛盾信息 SM(m), PBFT, PoW

现实应用
即使在硬件层面,单一传感器也可能出现“拜占庭故障”(读数乱跳)。工程上的解法是 输入冗余 (Input Redundancy):安装多个传感器,并进行共识过滤,而不是盲目相信单一输入源。

7. 对照

Lesson 19-20 Distributed Atomicity and Consistency

This chapter contains many algorithms, and I am quite impressed by human creativity… coming up with so many solutions in a chaotic distributed environment. Haha, of course, there is the classic Byzantine Generals Problem, but there are many variations of Byzantine problems, so I will only summarize it briefly at the end. Please leave a comment if you need a complete reading analysis~

Moving from single-machine to distributed systems is an inevitable result of expanding computational scale. However, this is not just a stacking of numbers, but a fundamental change in the system model.

In a single-machine system, we possess a God’s eye view: shared memory, a unified clock, and components that die together once the power is cut. In a distributed system, we lose these certainties. We face a chaotic world that is concurrent, lacks a global clock, and has components that fail independently.

How do we establish order in such chaos? This is the problem that distributed atomicity and consistency aim to solve.

1. Challenges and Theoretical Boundaries

1.1 Core Differences

Compared to single machines, the essential difficulties of distributed systems lie in:

  • Unreliable Communication: Messages may be lost, disordered, or infinitely delayed.
  • Partial Failure: Machine A crashes, Machine B is still alive, but B does not know if A has crashed or if the network is just lagging.
  • No Global Clock: You cannot use “what time is it now” to synchronize two physically distant nodes.
1.2 CAP Theorem

This is the “Physical Law” of distributed systems. In an asynchronous network environment, the following three attributes cannot be satisfied simultaneously:

  • Consistency: All nodes see the same data at the same time (Strong Consistency).
  • Availability: The system can process requests at any time, even if it returns old data.
  • Partition Tolerance: The system can continue to operate when the network is cut into two halves (partitioned) due to failure.

Conclusion: In real-world distributed systems, network partitioning (P) is inevitable. Therefore, we can only trade off between CP (pausing service for data correctness, like banking systems) and AP (tolerating temporary data inconsistency for service availability, like social network feeds).

1.3 FLP Impossibility

This is a theory more pessimistic than CAP: In an asynchronous system, if even one node might crash, there is no consensus algorithm that guarantees absolute termination (Liveness) within a finite time.
This does not mean we cannot reach consensus, but it indicates that consensus algorithms might enter an infinite loop (livelock) in extreme cases, and we need mechanisms like timeouts to break the deadlock.

Note the distinction between “Abort” (中止) and “Terminate” (终止).

2. Infrastructure: Snapshots and Deadlock Detection

In the absence of a global clock, how do we obtain the “Global State” of the system?

2.1 Chandy-Lamport Snapshot Algorithm

We want to take a “family photo,” but everyone is moving. If you take a picture of A first, then run to take a picture of B, the money A sent to B might “disappear” in the gap between the photos.
The core idea of the Chandy-Lamport algorithm is to use Marker messages:

  1. The initiating node records its own state and sends a Marker to all channels.
  2. When other nodes receive the Marker, they record their own state and continue to forward the Marker.
  3. Consistency Guarantee: The state it captures is not necessarily “this exact moment” in physical time, but it is certainly logically and causally consistent (there will be no paranormal phenomena like “receiving a message but not sending it”).
2.2 Distributed Deadlock Detection

Deadlocks in distributed systems manifest as cycles in the Wait-for Graph.

  • Resource Deadlock: $P \to R \to Q$ (P is waiting for a resource held by Q).
  • Communication Deadlock: $P \to Q$ (P is waiting for a reply from Q).
    By constructing a global wait-for graph using the snapshot algorithm, if a cycle is found, a deadlock exists.

3. Distributed Atomicity: Atomic Commit (2PC)

Scenario: A cross-bank transfer transaction involving Node A and Node B. Either both execute, or both rollback. This requires an Atomic Commit Protocol.

3.1 Two-Phase Commit (2PC)

Introduce a Coordinator to direct all Participants.

  • Phase 1: Voting. The coordinator asks: “Can everyone commit?” Participants execute the transaction but do not commit, lock resources, write Undo/Redo logs, and then reply “Yes” or “No”.
  • Phase 2: Decision.
    • If everyone says Yes $\to$ Coordinator sends Commit.
    • As long as there is one No $\to$ Coordinator sends Abort.
3.2 The Fatal Flaw of 2PC: Blocking

The biggest problem with 2PC is that it is a Blocking Protocol.

  • If the coordinator hangs after sending Prepare, the participants are in an awkward position: they have locked resources, holding “Yes” in their hands, but don’t know whether to commit or rollback. They can only wait indefinitely for the coordinator to resurrect. This is a negative example of Availability.

3PC attempts to solve blocking by adding a Pre-Commit phase and timeout mechanisms, but it can still lead to data inconsistency during network partitions, so the industry prefers consensus-based algorithms.

4. Distributed Consensus: Paxos

2PC requires 100% agreement from nodes, which is too fragile. We hope the system can work as long as a Majority of nodes are alive. This is the Consensus problem.

4.1 Paxos Algorithm

Paxos is the cornerstone of distributed consistency (Google Chubby, Zookeeper’s core ideas originate from here).

  • Goal: Reach agreement on a certain Value among a group of nodes that may crash.
  • Core Mechanism:
    • Majority Write: Does not require everyone to agree, as long as more than half ($N/2 + 1$) agree.
    • Proposal Number: To prevent interference from old messages, each proposal has a globally increasing number. Higher numbers rule.
4.2 Simplified Paxos Flow
  1. Prepare Phase: The Proposer seizes the “right to speak”. It sends a number $N$ and asks everyone, “Can I propose?” If an Acceptor has not seen a number greater than $N$, it promises, “Okay, I will ignore any proposals smaller than $N$”.
  2. Accept Phase: After receiving promises from a majority, the Proposer formally sends the proposal ${N, Value}$. If the Acceptor hasn’t changed its mind (hasn’t received a larger Prepare), it accepts it.
  3. Learn Phase: Once a majority has accepted, the value is Chosen.

5. Byzantine Fault Tolerance and Blockchain

The previous algorithms (2PC, Paxos) assume that while nodes may crash, they do not lie. What if a node is controlled by a hacker and sends malicious data? This is the Byzantine Generals Problem.

5.1 Byzantine Fault Tolerance (BFT)

In the presence of malicious nodes, $3f+1$ nodes are required to tolerate $f$ traitors. This is because a traitor can say 0 to half the people and 1 to the other half. The PBFT algorithm solves this problem through three rounds of complex voting.

5.2 Blockchain and Proof of Work (PoW)

Bitcoin solves Byzantine consensus in a probabilistic way:

  • Decentralized Ledger: Everyone has a copy of the ledger.
  • Sybil Attack: To prevent bad actors from simulating a million nodes to vote, Bitcoin requires that voting must pay a cost (computational power).
  • PoW (Proof of Work):
    • Utilizes the HashCash mechanism (calculating a SHA-256 hash that meets a specific number of leading zeros).
    • This converts “computational power” into “voting rights”.
    • Longest Chain Rule: Everyone always trusts the chain with the most accumulated work. As long as the computational power of honest nodes exceeds 51%, the honest chain will be longer than the attack chain.

6. Byzantine Fault Tolerance

The previous Paxos/Raft solve the problem of nodes “crashing” or “disconnecting” (non-Byzantine faults), but in more open networks (like Blockchain), nodes may be controlled by hackers and thus lie or forge messages. This is the famous Byzantine Generals Problem.

6.1 Core Problem and Goal

In a group of generals (nodes), there are traitors (malicious nodes). We need to achieve two goals:

  • IC1 (Consistency): All loyal lieutenants must reach a consistent action (even if the commanding general is a traitor).
  • IC2 (Correctness): If the commanding general is loyal, then all loyal lieutenants must execute his order.
6.2 Two Solution Models

Solving the Byzantine problem depends on the capabilities of the communication mechanism (assuming basic communication reliability A1~A3):

A. Oral Messages (OM)

  • Assumption: Messages can be delivered, but can be forged or tampered with.
  • Conclusion: To tolerate $m$ traitors, the total number $N$ must satisfy $N \ge 3m + 1$.
    • Intuitive Understanding: If there are only 3 people (1 commander, 2 lieutenants), and 1 is a traitor. A loyal lieutenant cannot distinguish whether the commander is sending random orders or if the other lieutenant is lying, so consensus cannot be reached.
  • Algorithm: Through recursive multi-round message passing and majority voting.

B. Signed Messages (SM)

  • Assumption: Introduces Digital Signatures (A4). Signatures cannot be forged, and any alteration can be detected.
  • Conclusion: The difficulty is drastically reduced. As long as $N \ge m + 2$, $m$ traitors can be tolerated.
  • Principle: Traitors cannot tamper with the commander’s orders; they can only choose to “forward” or “not forward”, preventing them from creating a chaotic illusion.
6.3 Fault Classification System

In distributed systems, we must distinguish the level of “faults”, as different levels require different protocols to handle:

Fault Type Behavior Resolution Protocol
Crash Completely stops, does not respond. Paxos, Raft
Timeout Responds too slowly, may block. Paxos, Raft (with timeout mechanisms)
Omission Receives messages but does not reply (packet loss). Paxos, Raft (retransmission mechanisms)
Byzantine Lying, forging, sending contradictory messages. SM(m), PBFT, PoW

Real-world Application:
Even at the hardware level, a single sensor may exhibit “Byzantine faults” (erratic readings). The engineering solution is Input Redundancy: install multiple sensors and perform consensus filtering, rather than blindly trusting a single input source.


Author: linda1729
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source linda1729 !
评论
  TOC