计算机系统设计

Computer System Engineering


Publish Date: 2026-01-14


Lesson 16 容错与可靠性

如果说性能决定了系统能跑多快，那么可靠性则决定了系统能跑多远。

1. 核心概念

在讨论如何“容错”之前，必须厘清三个经常被混淆的概念，它们代表了问题在不同层级的表现：

故障 (Fault)：
- 定义：底层的缺陷。
- 例子：代码中的一个 Bug，或者硬盘扇区的一个物理坏点。
- 特性：它是静态的，可能潜伏很久都不被触发。
错误 (Error)：
- 定义：由于故障被触发，导致系统处于了“不正确”的状态。
- 例子：变量的值变成了负数（违反了断言），或者内存数据与磁盘不一致。
- 特性：它是内部可见的，但用户未必感知。
失效 (Failure)：
- 定义：错误传播到了系统边界，导致服务行为偏离了预期。
- 例子：程序崩溃（Crash）、网页返回 503、ATM 吐钱错误。
- 特性：它是用户可见的事故。

关系：容错的本质，就是在“错误”演变成“失效”之前，将其截获并处理。

2. 容错的三阶段防御

构建容错系统就像构筑防线，通常分为三个阶段：

2.1 检错

如果你不知道出错了，就无法处理。

思路：利用冗余信息进行比对。
手段：心跳检测、校验码、超时机制。

2.2 错误遏制

一旦检测到错误，第一要务是止损，防止它像病毒一样扩散到整个系统。

思路：模块化与隔离。
手段：
- 防火墙机制：进程间隔离（Crash 了一个进程不影响另一个）。
- Fail-stop：在分布式系统中，如果节点出错，最好的行为是立即停止响应，而不是发送错误的数据迷惑其他节点。

2.3 错误屏蔽

这是最高级的容错——让用户感觉不到发生了错误。

思路：利用多重冗余进行纠正。
手段：RAID 镜像、重试机制、三模冗余表决（投票，少数服从多数）。

3. 关键技术：冗余

3.1 信息冗余 (编码)

在数字系统中，通过增加少量的额外比特来检错或纠错。

汉明码：
- 原理：利用汉明距离（两个位串不同字符的个数）。
- 公式：若码的最小汉明距离为 $d$，则最多可检出 $d-1$ 位错误，最多可纠正 $\lfloor (d-1)/2 \rfloor$ 位错误。
- 应用：ECC 内存（可纠正 1 bit 翻转）。
CRC (循环冗余校验)：
- 原理：将数据看作一个长多项式，除以一个生成多项式 $G(x)$，余数即为校验码。
- 优势：极适合硬件实现，检错能力强（能检测突发错误）。
- 注意：它只能检错，不能纠错。广泛用于磁盘存储和网络包校验。
  
  这个老师大概率会考小题计算，挺简单的，可以看看视频：https://www.bilibili.com/video/BV12e411F7fd/

3.2 物理冗余 (复制)

当编码无法解决问题时（如整个磁盘坏了），就需要物理副本。

多模块冗余 (NMR)：N 个完全一样的组件并行运行，通过表决器决定结果。
主从复制：数据库的标准做法。
- 挑战：一致性。如果主节点写完了，从节点还没写，主节点挂了怎么办？（这涉及分布式共识，如 Paxos/Raft）。

4. 软件容错设计模式

硬件坏了可以换，软件坏了通常是因为逻辑错误。

4.1 响应策略

Fail-fast (快速失败)：发现错误（如参数不对），立即抛出异常或崩溃。
- 适用：开发阶段或内部逻辑，不要试图隐藏 Bug，越早暴露越好。
Fail-safe (失败安全)：即使出错了，也要维持最低限度的安全功能。
- 适用：红绿灯控制系统（出错时全部红灯闪烁），核电站控制。

4.2 状态分离

软件容错的一个核心思想是：进程是易碎的，数据是永恒的。

易失性状态（内存）：设计为可丢弃的。
非易失性状态（磁盘/日志）：设计为原子的、持久的。
做法：Write-Ahead Logging (WAL)。在修改数据前，先把操作记入日志。如果系统崩溃，重启后通过回放日志（Replay）来恢复状态。

5. 量化指标

MTTF (平均失效时间)：衡量组件有多“耐用”。
MTTR (平均修复时间)：衡量系统“复活”有多快。
可用性 (Availability)：
$$ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} $$
洞察：对于现代互联网服务，降低 MTTR（快速恢复）比提高 MTTF（不坏）更重要。既然故障不可避免，那就让它在用户眨眼间恢复。

6. 对照

Lesson 16 Fault Tolerance and Reliability

If performance determines how fast a system can run, then reliability determines how far it can run.

1. Core Concepts

Before discussing how to achieve “Fault Tolerance,” we must clarify three often-confused concepts, which represent the problem at different levels:

Fault:
- Definition: An underlying defect.
- Example: A bug in the code, or a physical bad sector on a hard drive.
- Characteristic: It is static and may lie dormant for a long time without being triggered.
Error:
- Definition: The system enters an “incorrect” state because a fault was triggered.
- Example: A variable becomes negative (violating an assertion), or memory data becomes inconsistent with the disk.
- Characteristic: It is internally visible, but the user may not perceive it.
Failure:
- Definition: The error propagates to the system boundary, causing service behavior to deviate from expectations.
- Example: A program crash, a web page returning HTTP 503, or an ATM dispensing the wrong amount of cash.
- Characteristic: It is a user-visible incident.

Relationship: The essence of fault tolerance is intercepting and handling the “Error” before it evolves into a “Failure”.
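
The chain from fault to error to failure can be illustrated with a small sketch. The functions `average` and `render_report` are hypothetical names invented for this illustration; the point is only to show where each of the three concepts appears and where the error is intercepted.

```python
def average(values):
    # Fault: a latent defect. Nothing guards against an empty list, but the
    # defect stays dormant for every non-empty input.
    return sum(values) / len(values)


def render_report(values):
    try:
        return f"average = {average(values):.2f}"
    except ZeroDivisionError:
        # Error: the fault was triggered and the computation has no valid
        # result. Handling it here keeps it from reaching the user as a
        # failure (an unhandled crash).
        return "average = n/a (no samples)"


print(render_report([2, 4, 6]))  # the fault stays dormant
print(render_report([]))         # the error is intercepted before it becomes a failure
```

Without the `except` branch, the same empty input would propagate to the system boundary as a crash, which is exactly the failure described above.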

2. The Three Stages of Defense

Building a fault-tolerant system is like constructing a defense line, usually divided into three stages:

2.1 Detection

If you don’t know something is wrong, you can’t fix it.

Idea: Compare against redundant information.
Methods: Heartbeat checks, checksums, timeouts.
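
As a rough sketch of heartbeat-plus-timeout detection, the class below suspects any node whose last heartbeat is older than a timeout. `HeartbeatMonitor`, the node names, and the 3-second timeout are assumptions made for this example. Timestamps are passed in explicitly so that the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        # A node is suspected once its silence exceeds the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=2.5)  # node-b stays silent
print(monitor.suspected(now=4.0))            # only node-b exceeds the timeout
```

A timeout cannot distinguish a crashed node from a slow one, so a detector like this only produces a suspicion, not a proof.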

2.2 Containment

Once an error is detected, the first priority is damage control: preventing it from spreading like a virus throughout the system.

Idea: Modularity & Isolation.
Methods:
- Firewalls: Process isolation (if one process crashes, it doesn’t affect others).
- Fail-stop: In distributed systems, if a node errors out, the best behavior is to stop responding immediately, rather than sending bad data that confuses other nodes.
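
A minimal sketch of fail-stop behavior, assuming a node that guards its data with a CRC-32 checksum. `FailStopNode` and `NodeHalted` are hypothetical names; `zlib.crc32` is the standard-library checksum function. Once the node detects corruption it refuses every later request instead of serving bad data.

```python
import zlib


class NodeHalted(Exception):
    """Raised for every request once the node has stopped itself."""


class FailStopNode:
    def __init__(self, payload: bytes):
        self.payload = payload
        self.checksum = zlib.crc32(payload)  # redundant information used for detection
        self.halted = False

    def read(self) -> bytes:
        if self.halted:
            raise NodeHalted("node is stopped")
        if zlib.crc32(self.payload) != self.checksum:
            # Containment: stop for good rather than hand out corrupted data.
            self.halted = True
            raise NodeHalted("corruption detected, node stopped")
        return self.payload


node = FailStopNode(b"balance=100")
print(node.read())
node.payload = b"balance=900"  # simulate silent corruption of the stored data
try:
    node.read()
except NodeHalted as exc:
    print("peer sees:", exc)
```

Peers only ever observe correct data or an explicit refusal, which is much easier to handle than a node that answers with wrong values.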

2.3 Masking

This is the highest level of fault tolerance: the user never notices that an error occurred.

Idea: Use multiple redundancy for correction.
Methods: RAID mirroring, retry mechanisms, triple modular redundancy (TMR) voting, where the majority result wins.
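
Majority voting can be sketched in a few lines. The functions `good` and `faulty` below are stand-ins invented for this example; a real TMR system would run three independent copies of the same module.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: the error cannot be masked")
    return value


def tmr(replicas, x):
    # Run every replica on the same input and let the voter decide.
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # simulated faulty module


print(tmr([good, faulty, good], 7))  # the single wrong answer is outvoted
```

With three replicas the voter masks one faulty module. If two replicas fail in the same way, the wrong value wins the vote, so the scheme relies on failures being independent.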

3. Key Technology: Redundancy

3.1 Information Redundancy (Coding)

In digital systems, a small number of extra bits is added so that errors can be detected or corrected.

Hamming Code:
- Principle: Uses Hamming distance (the number of positions at which two equal-length bit strings differ).
- Formula: If the minimum Hamming distance between any two codewords is $d$, the code can detect up to $d-1$ bit errors and correct up to $\lfloor (d-1)/2 \rfloor$ bit errors.
- Application: ECC memory (can correct single-bit flips).
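
The distance formula can be checked on a toy code. The sketch below uses the 3-bit repetition code {000, 111} as an assumed example; it is not the Hamming code used in ECC memory. The helper names are made up for illustration.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def capabilities(codewords):
    # d is the minimum distance over all pairs of distinct codewords.
    d = min(hamming_distance(a, b) for a, b in combinations(codewords, 2))
    return {"d": d, "detect": d - 1, "correct": (d - 1) // 2}


def nearest_codeword(received, codewords):
    # Correction: decode to the codeword closest to what was received.
    return min(codewords, key=lambda c: hamming_distance(received, c))


repetition = ["000", "111"]
print(capabilities(repetition))             # d = 3: detect 2 errors, correct 1
print(nearest_codeword("010", repetition))  # a single flipped bit is corrected back to 000
```

A single parity bit, by contrast, gives $d = 2$: one error is detected, but none can be corrected.
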
CRC (Cyclic Redundancy Check):
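- Principle: Treats the data as a polynomial with binary coefficients, appends $r$ zero bits (where $r$ is the degree of the generator polynomial $G(x)$), and divides by $G(x)$ using modulo-2 arithmetic; the $r$-bit remainder is the checksum.
- Advantage: Extremely hardware-friendly (shift registers and XOR gates) with strong detection capability: provided $G(x)$ has a nonzero constant term, it detects every burst error no longer than $r$ bits.
- Note: It is used to detect errors, not to correct them. Widely used in disk storage and network packet validation.
The professor is very likely to set a short CRC calculation problem on the exam. It is simple; this video walks through it: https://www.bilibili.com/video/BV12e411F7fd/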
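
For practicing the hand calculation, here is a small sketch of modulo-2 long division on bit strings. The data `1101011011` and generator `10011` (that is, $x^4 + x + 1$) are a common textbook example chosen for this sketch, not values taken from the course. The sender appends $r$ zero bits, where $r$ is the degree of the generator, and divides. The receiver divides the whole codeword and expects a zero remainder.

```python
def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of modulo-2 (XOR) long division of `bits` by `generator`."""
    r = len(generator) - 1
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - r):
        if work[i] == 1:  # XOR the generator in whenever the leading bit is 1
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-r:])


def crc_checksum(data: str, generator: str) -> str:
    r = len(generator) - 1
    return mod2_remainder(data + "0" * r, generator)  # append r zero bits first


def crc_ok(codeword: str, generator: str) -> bool:
    return set(mod2_remainder(codeword, generator)) == {"0"}


data, gen = "1101011011", "10011"
checksum = crc_checksum(data, gen)
codeword = data + checksum
print(checksum)                             # 1110
print(crc_ok(codeword, gen))                # True: transmitted without errors
print(crc_ok("0" + codeword[1:], gen))      # False: one flipped bit is detected
```

The receiver learns only that something is wrong, not which bit, which is why CRC is paired with retransmission rather than correction.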

3.2 Physical Redundancy (Replication)

When coding isn’t enough (e.g., the entire disk fails), physical copies are needed.

N-Modular Redundancy (NMR): N identical components running in parallel, with a Voter deciding the result.
Master-Slave Replication: Standard practice for databases.
- Challenge: Consistency. If the Master finishes writing but the Slave hasn’t, and the Master dies, what happens? (This involves distributed consensus, like Paxos/Raft).
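
The consistency problem can be made concrete with a toy model. The `Master` and `Slave` classes below are hypothetical, with no networking or real crash handling. The model only contrasts acknowledging a write before it is replicated (asynchronous) with acknowledging it after (synchronous).

```python
class Slave:
    def __init__(self):
        self.data = {}


class Master:
    def __init__(self, slave, synchronous):
        self.data = {}
        self.slave = slave
        self.synchronous = synchronous
        self.pending = []  # writes acknowledged but not yet shipped to the slave

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            self.slave.data[key] = value       # replicate before acknowledging
        else:
            self.pending.append((key, value))  # replicate "later"
        return "ack"


def write_then_crash(synchronous):
    slave = Slave()
    master = Master(slave, synchronous)
    master.write("x", 1)  # the client receives "ack"
    # The master crashes here: its memory and pending list are gone,
    # and the slave is promoted with whatever it has.
    return slave.data


print(write_then_crash(synchronous=False))  # {} -> the acknowledged write is lost
print(write_then_crash(synchronous=True))   # {'x': 1}
```

Synchronous replication closes this gap at the cost of latency and availability whenever the slave is slow or unreachable. Consensus protocols such as Paxos and Raft address that trade-off more systematically.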

4. Software Fault Tolerance Design Patterns

Hardware can be replaced; software faults are usually logic errors.

4.1 Response Strategy

Fail-fast: Throw an exception or crash immediately upon discovering an error (e.g., invalid parameters).
- Applicability: Development phase or internal logic. Do not try to hide Bugs; the sooner they are exposed, the better.
Fail-safe: Maintain minimum safety functionality even if an error occurs.
- Applicability: Traffic light control systems (flash red on error), nuclear power plant controls.
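
The two strategies might look like this in code. `set_green_duration`, `next_signal`, and the sensor functions are invented for this sketch, loosely following the traffic light example above.

```python
def set_green_duration(seconds):
    # Fail-fast: reject an invalid argument immediately instead of limping on.
    if not isinstance(seconds, (int, float)) or seconds <= 0:
        raise ValueError(f"invalid green duration: {seconds!r}")
    return seconds


def next_signal(read_sensor):
    # Fail-safe: whatever goes wrong, fall back to the safe state.
    try:
        return "GREEN" if read_sensor() == "clear" else "RED"
    except Exception:
        return "FLASHING_RED"


def broken_sensor():
    raise IOError("sensor unreachable")


print(next_signal(lambda: "clear"))  # GREEN
print(next_signal(broken_sensor))    # FLASHING_RED
try:
    set_green_duration(-5)
except ValueError as exc:
    print("rejected:", exc)
```

The broad `except Exception` is deliberate in the fail-safe path, because the safe state matters more than the diagnosis there. The same pattern in ordinary internal logic would hide bugs that fail-fast is meant to expose.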

4.2 State Separation

A core idea of software fault tolerance is: Processes are fragile; Data is eternal.

Volatile State (Memory): Designed to be disposable.
Non-volatile State (Disk/Log): Designed to be atomic and durable.
Method: Write-Ahead Logging (WAL). Before modifying the data, append a record of the operation to the log and make sure it has reached durable storage. If the system crashes, the state is restored by replaying the log upon restart.
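
A minimal write-ahead logging sketch, assuming a key-value store whose log is a file of JSON lines. `KVStore` and the record layout are made up for this example. Real implementations also handle torn writes, checkpoints, and log truncation, all of which are omitted here.

```python
import json
import os
import tempfile


class KVStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}  # volatile state: lost on every crash
        self._replay()

    def _replay(self):
        # Recovery: rebuild the volatile state from the durable log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the record to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then modify the in-memory state.
        self.state[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    store = KVStore(path)
    store.put("balance", 100)
    store.put("balance", 70)
    del store                  # simulate a crash: the volatile state is gone
    recovered = KVStore(path)  # restart: replay the log
    print(recovered.state)     # {'balance': 70}
```

The ordering inside `put` is the whole point. If the process dies between the two steps, the log already contains the operation, so replay reproduces it.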

5. Quantitative Metrics

MTTF (Mean Time To Failure): Measures how “durable” a component is.
MTTR (Mean Time To Repair): Measures how fast the system can “resurrect.”
Availability:
$$ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} $$
Insight: For modern internet services, lowering MTTR (fast recovery) is often a more practical lever than raising MTTF (never breaking). Since failure is inevitable, the goal is to recover in the blink of an eye.
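
To get a feel for the formula, the sketch below evaluates it for a few MTTF/MTTR pairs. The numbers are made up for illustration and are not measurements of any real system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours, mttr_hours):
    return (1 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


for mttf, mttr in [(1000, 1.0), (1000, 0.1), (10000, 1.0)]:
    print(
        f"MTTF={mttf}h MTTR={mttr}h "
        f"availability={availability(mttf, mttr):.5f} "
        f"downtime={downtime_hours_per_year(mttf, mttr):.2f} h/year"
    )
```

Only the ratio MTTR/MTTF enters the formula, so cutting MTTR by a factor of ten yields the same availability as making the component last ten times longer. The case for focusing on MTTR is that automated restart and failover are usually far cheaper to build than components that fail ten times less often.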

linda1729

https://linda1729-blog.netlify.app/posts/2026-01-14-system-engineering-fault-tolerance/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit linda1729.
