doc: read the redis cluster documentation

2025-10-05 16:23:43 +08:00
parent dc90ad42c2
commit 987620a2d6
1 changed files with 89 additions and 0 deletions
--- a/中间件/redis/redis.md
+++ b/中间件/redis/redis.md
@@ -228,6 +228,12 @@
      - [Scaling reads using replica nodes](#scaling-reads-using-replica-nodes)
    - [Fault Tolerance](#fault-tolerance)
      - [Heartbeat and gossip message](#heartbeat-and-gossip-message)
+      - [Heartbeat packet content](#heartbeat-packet-content)
+        - [common header](#common-header)
+        - [gossip section](#gossip-section)
+      - [Failure Detection](#failure-detection)
+        - [`PFAIL` flag](#pfail-flag)
+        - [`FAIL` flag](#fail-flag)


 # redis
@@ -3616,3 +3622,86 @@ redis cluster nodes continuously exchange ping and pong packets. ping packet and pong pa

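 For example, in a 100-node cluster with `node timeout` set to 60s, every node tries to send 99 pings every 30s, so each node sends `3.3` pings per second on average and the whole cluster sends `330` pings per second on average.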

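+> In the example above, 330 ping packets are sent per second inside the cluster, but those 330 requests are spread evenly over the 100 nodes of the cluster, so the traffic is acceptable for the cluster.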
+
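The arithmetic in the example can be reproduced with a minimal sketch. The function name `ping_rate` and its parameters are hypothetical; the only assumption taken from the text is that each node pings every other node once per half `node timeout`.

```python
def ping_rate(cluster_size: int, node_timeout_s: float) -> tuple[float, float]:
    """Return (pings per second per node, pings per second for the cluster).

    Assumption from the example: within node_timeout / 2 seconds every node
    tries to ping each of the other cluster_size - 1 nodes once.
    """
    peers = cluster_size - 1
    window_s = node_timeout_s / 2
    per_node = peers / window_s
    return per_node, per_node * cluster_size


per_node, total = ping_rate(cluster_size=100, node_timeout_s=60)
print(f"{per_node:.1f} pings/s per node, {total:.0f} pings/s for the cluster")
```

Doubling the `node timeout` halves both numbers, which is why a larger timeout reduces heartbeat traffic at the cost of slower failure detection.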
+#### Heartbeat packet content
+A ping/pong packet is made up of the following parts:
+- `a header that is common to all types of packets`
+- `a special gossip section that is specific to ping and pong packets`
+
+##### common header
+The common header contains the following information:
+- `Node ID`: a pseudo-random 160-bit string that is assigned to the node when it is first created and stays the same for the whole lifetime of the redis cluster node
+- `currentEpoch and configEpoch fields of the sending node`: these fields support the distributed algorithms used by redis cluster.
+  - If the node is a replica, configEpoch is the `last known configEpoch of its master`
+- `node flags`: the node flags mainly carry the following information:
+  - whether the node is a replica
+  - whether the node is a master
+  - other single bit information
+- `a bitmap of hash slots served by the sending node, or if the node is a replica, a bitmap of the slots served by its master`
+- `sender TCP base port`: the port redis uses to accept client commands
+- `cluster port`: the port used for `node-to-node communication`
+- the cluster state (`ok` or `down`) from the point of view of the sender (the sender of the ping packet)
+- if the sending node is a replica, the header also contains the `master node ID of the sending node`
+
+##### gossip section
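+ping/pong packets also contain a gossip section. The gossip section gives the receiver of the packet `a view of what the sender node thinks about other nodes in the cluster`. It only contains information about `a random subset of the nodes known to the sender`.
+
+> The number of nodes mentioned in a gossip section is proportional to the cluster size.
+
+For every node included in the gossip section, the following information is reported: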
+- node ID
+- IP and port of the node
+- node flags
+
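+The gossip section lets the receiver of a packet learn the state of other nodes from the sender's point of view. This is useful both for `failure detection` and for `discovering other nodes in the cluster`.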
+
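The packet layout described above can be pictured with a small data-structure sketch. This is not the Redis wire format: the class names, field types, the `pick_gossip` helper and its `fraction` parameter are all hypothetical, chosen only to mirror the fields listed in the notes.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class GossipEntry:
    """What the sender reports about one other node (fields from the notes)."""
    node_id: str
    ip: str
    port: int
    flags: frozenset = frozenset()


@dataclass
class PacketHeader:
    """Common header fields; types are illustrative, not the real encoding."""
    node_id: str                     # 160-bit ID, shown here as a string
    current_epoch: int
    config_epoch: int                # for a replica: last known epoch of its master
    flags: frozenset                 # e.g. {"master"} or {"replica"}
    slots: frozenset                 # stands in for the hash slot bitmap
    base_port: int                   # port that accepts client commands
    cluster_port: int                # port for node-to-node communication
    cluster_state: str               # "ok" or "down", as seen by the sender
    master_id: Optional[str] = None  # only set when the sender is a replica


def pick_gossip(known, sender_id, fraction=0.1, rng=random):
    """Pick a random subset of known nodes, sized in proportion to the cluster.

    `fraction` is an assumed knob for this sketch, not a documented value.
    """
    candidates = [n for n in known if n.node_id != sender_id]
    wanted = min(len(candidates), max(1, int(len(candidates) * fraction)))
    return rng.sample(candidates, wanted)
```

Because every packet carries a different random subset, a receiver that hears from many senders gradually collects a view of most of the cluster.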
+#### Failure Detection
+redis cluster failure detection recognizes the case where `a master or a replica is no longer reachable by the majority of nodes` and responds by `promoting a replica to master`.
+> When `replica promotion` is not possible, the cluster is put in an `error` state, in which it stops accepting queries from clients.
+
+As already mentioned, every node keeps `a list of flags associated with known nodes`. Two of these flags are used for `failure detection`: `PFAIL` and `FAIL`.
+- `PFAIL` means `Possible failure`; it is a `non-acknowledged` failure type
+- `FAIL` means the node is failing and that this condition was confirmed by a `majority of masters` `within a fixed amount of time`
+
+##### `PFAIL` flag
+`a node flags another node with PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time`。
+> Both masters and replicas can flag another node as `PFAIL`, regardless of its type.
+
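+For a redis cluster node, `non-reachability` means `we have an active ping (a ping packet that was sent but has not been answered yet) pending for longer than NODE_TIMEOUT`.
+> For this mechanism to work, NODE_TIMEOUT must be large compared to the network round trip time.
+
+To add reliability during normal operation, a node tries to reconnect to another node once `NODE_TIMEOUT / 2` has elapsed without a reply from it. This mechanism ensures that broken connections do not result in a `false failure`.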
+
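The two thresholds above can be sketched as a small decision function. `check_peer` and its return values are hypothetical names for this sketch; the only rules taken from the notes are "reconnect after `NODE_TIMEOUT / 2`" and "flag `PFAIL` after `NODE_TIMEOUT`".

```python
from typing import Optional


def check_peer(now: float, ping_sent: Optional[float], node_timeout: float) -> str:
    """Classify a peer from the age of its pending (unanswered) ping.

    Returns "ok", "reconnect" or "pfail". `ping_sent` is None when no ping
    is currently waiting for a reply.
    """
    if ping_sent is None:
        return "ok"
    waited = now - ping_sent
    if waited > node_timeout:
        return "pfail"      # unreachable for longer than NODE_TIMEOUT
    if waited > node_timeout / 2:
        return "reconnect"  # the link may be broken, try a fresh connection
    return "ok"


# With NODE_TIMEOUT = 15s: a ping pending for 8s triggers a reconnect,
# a ping pending for 16s marks the peer as PFAIL.
print(check_peer(now=8, ping_sent=0, node_timeout=15))
print(check_peer(now=16, ping_sent=0, node_timeout=15))
```

The reconnect step gives the peer a second chance within the same timeout window, so a dead TCP connection alone is not enough to produce a `PFAIL`.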
+##### `FAIL` flag
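+The `PFAIL` flag is only local information that a node holds about other nodes, and it is not sufficient to trigger a replica promotion. For a node to be considered `down`, the `PFAIL` condition has to be escalated to `FAIL`.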
+
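+As described above, every node includes `the state of a few random known nodes` in the gossip messages it sends to other nodes. Eventually, every node receives a set of `flags` for every other node. `This way every node has a mechanism to signal to other nodes the failures it has detected`.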
+
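+`PFAIL` is escalated to `FAIL` when the following conditions are met:
+- Some node, call it A, has flagged node B as `PFAIL`
+- node A has collected, via `gossip sections`, `the state of node B from the point of view of other nodes`
+- a `majority of masters` reported `PFAIL` or `FAIL` within `NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT` (the validity factor is set to 2 in the current implementation, so the window is twice NODE_TIMEOUT)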
+
+If all the conditions above are true, node A will:
+- flag node B as FAIL
+- send a `FAIL message` to all reachable nodes
+
+A `FAIL message` forces every node that receives it to flag node B as `FAIL`, whether or not the receiver had already flagged node B as PFAIL.
+
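The escalation conditions above can be sketched as a report table with an expiry window. The class and function names are hypothetical, and counting node A's own view when A is a master is an assumption of this sketch rather than something stated in the notes.

```python
class FailureReports:
    """Failure reports that node A collected about one node B via gossip."""

    def __init__(self, node_timeout: float, validity_mult: int = 2):
        # Reports older than NODE_TIMEOUT * validity factor are ignored.
        self.window = node_timeout * validity_mult
        self.reports = {}  # reporting master ID -> time of its latest report

    def add(self, reporter_id: str, now: float) -> None:
        self.reports[reporter_id] = now

    def valid_count(self, now: float) -> int:
        return sum(1 for t in self.reports.values() if now - t <= self.window)


def should_mark_fail(reports, now, total_masters, a_is_master, a_sees_pfail):
    """Return True when node A may escalate B from PFAIL to FAIL."""
    if not a_sees_pfail:
        return False
    count = reports.valid_count(now)
    if a_is_master:
        count += 1  # assumption: A's own PFAIL view counts as one vote
    return count >= total_masters // 2 + 1


reports = FailureReports(node_timeout=15)
reports.add("master-1", now=0)
reports.add("master-2", now=20)
# 5 masters, so 3 votes are needed. At t=25 both reports are inside the 30s window.
print(should_mark_fail(reports, 25, total_masters=5, a_is_master=True, a_sees_pfail=True))
# At t=40 the report from master-1 has expired, so the majority is lost.
print(should_mark_fail(reports, 40, total_masters=5, a_is_master=True, a_sees_pfail=True))
```

The expiry window is what makes stale reports harmless: a master that saw B as failing long ago does not keep contributing to the majority.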
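+Once a node is flagged as `FAIL`, the flag can only be cleared in the following situations:
+- the node is a replica and is reachable again: the FAIL flag can be cleared, and no failover needs to be triggered
+- the node is a `master not serving any slot` and is reachable again: in this case the `FAIL` flag can be cleared.
+  - Since the master does not serve any slot, it is not really part of the cluster, so its flag can simply be cleared while it waits to be configured and to join the cluster
+- the node is a master and is reachable again, but a long time (`N * NODE_TIMEOUT`) has passed without any detectable replica promotion. In this case it is better for the node to rejoin the cluster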
+
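The clearing rules above can be condensed into one predicate. `can_clear_fail` and all of its parameters are hypothetical; in particular the multiplier `n` is left as a parameter because the notes only give it as `N`.

```python
def can_clear_fail(is_master: bool, serves_slots: bool, reachable: bool,
                   seconds_in_fail: float, node_timeout: float, n: int) -> bool:
    """Return True when the FAIL flag of a node may be cleared.

    `n` stands for the unspecified multiplier N in N * NODE_TIMEOUT.
    """
    if not reachable:
        return False  # every rule requires the node to be reachable again
    if not is_master:
        return True   # replicas are not failed over
    if not serves_slots:
        return True   # a master without slots is not really part of the cluster
    # Master with slots: clear only if no promotion was seen for a long time.
    return seconds_in_fail > n * node_timeout


# A reachable replica can be cleared right away; a slot-serving master has to wait.
print(can_clear_fail(False, False, True, seconds_in_fail=1, node_timeout=15, n=2))
print(can_clear_fail(True, True, True, seconds_in_fail=10, node_timeout=15, n=2))
```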
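+The `PFAIL` to `FAIL` transition uses a form of agreement, but the agreement is weak:
+- nodes collect the views of other nodes over a fixed time window, and they receive this information from different nodes at different times.
+- when a node detects the `FAIL` condition, there is no guarantee that every node receives the `FAIL message`; some nodes may be unreachable because of a `network partition`.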
+
+redis failure detection has the following requirement:
+- `eventually, all nodes should agree about the state of a given node`
+
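+There are two cases:
+- if a `majority of masters` flagged the node as `FAIL`, failure detection and the chain effect it generates will eventually lead every other node to flag the node as `FAIL`
+- when only a `minority of masters` flagged the node as `FAIL`, replica promotion does not happen, and every node clears the FAIL state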