doc: read the Redis t-digest documentation

2025-09-23 14:27:27 +08:00
parent 1a01da4c34
commit 3ed69c87a0
1 changed files with 114 additions and 1 deletions
--- a/中间件/redis/redis.md
+++ b/中间件/redis/redis.md
@@ -146,6 +146,15 @@
        - [Example](#example-2)
        - [Cuckoo vs Bloom Filter](#cuckoo-vs-bloom-filter)
        - [Sizing Cuckoo filters](#sizing-cuckoo-filters)
+      - [t-digest](#t-digest)
+        - [Use Cases](#use-cases)
+        - [Examples](#examples-1)
+        - [Estimating fractions or ranks by values](#estimating-fractions-or-ranks-by-values)
+        - [Estimating values by fractions or ranks](#estimating-values-by-fractions-or-ranks)
+        - [trimmed mean](#trimmed-mean)
+        - [TDIGEST.MERGE](#tdigestmerge)
+        - [Retrieving sketch information](#retrieving-sketch-information)
+        - [Resetting a sketch](#resetting-a-sketch)


 # redis
@@ -2490,4 +2499,108 @@ CF.RESERVE {key} {capacity} [BUCKETSIZE bucketSize] [MAXITERATIONS maxIterations
  - The default value is 1
 - `MAXITERATIONS`:
  - `MAXITERATIONS` is `the number of attempts to find a slot for incoming fingerprint`. Once the filter is full, the larger `MAXITERATIONS` is, the slower insertions become.
-  - The default value is 20
+  - The default value is 20
+
+#### t-digest
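+t-digest is a `probabilistic data structure` that estimates the `percentile point` of a data set without actually storing and sorting every value in the set. It can therefore answer questions such as `What's the average latency for 99% of my database operations`
+- Without `t-digest`, getting this metric means storing the average latency for every user, sorting those averages, dropping the last 1% of the data, and averaging what remains. This process is very time-consuming
+- t-digest addresses this problem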
+
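+t-digest can also answer other percentile-related questions, such as `trimmed means`:
+- `A trimmed mean is the mean value from the sketch, excluding observation values outside the low and high cutoff percentiles.` For example, a `0.1 trimmed mean` is the mean computed after excluding the lowest 10% and the highest 10% of the values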
+
+##### Use Cases
+- `Hardware/Software monitoring`：
+  - When measuring online server response latency, you may want to observe the following metrics:
+    - What are the 50th, 90th, and 99th percentiles of the measured latencies
+    - Which fraction of the measured latencies are less than 25 milliseconds
+    - What is the mean latency, ignoring outliers? or What is the mean latency between the 10th and the 90th percentile?
+- `Online Gaming`: 
+  - When an online gaming platform serves millions of users, you may want to observe the following metrics:
+    - Your score is better than x percent of the game sessions played.
+    - There were about y game sessions where people scored larger than you.
+    - To have a better score than 90% of the games played, your score should be z
+- `Network traffic`:
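+  - When monitoring IP packets in network traffic, for example to detect DDoS attacks, you may want to observe the following metrics:
+    - Does the number of packets in the last second exceed 99% of all previously observed packet counts?
+    - How many packets should I expect to see under normal network conditions?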
+##### Examples
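+The following example creates a t-digest with a `compression` of 100 and adds items to it. In the t-digest data structure, the `compression` parameter trades memory consumption against accuracy. `The default compression is 100`; a larger compression value gives higher accuracy.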
+
+```redis-cli
+> TDIGEST.CREATE bikes:sales COMPRESSION 100
+OK
+> TDIGEST.ADD bikes:sales 21
+OK
+> TDIGEST.ADD bikes:sales 150 95 75 34
+OK
+```
+
+##### Estimating fractions or ranks by values 
+A useful t-digest feature is `CDF` (definition of rank), which gives the fraction of observations smaller than or equal to a given value. It answers questions such as `What's the percentage of observations with a value lower or equal to X`.
+> More precisely, `TDIGEST.CDF will return the estimated fraction of observations in the sketch that are smaller than X plus half the number of observations that are equal to X.`
+
+> The `TDIGEST.RANK` command can be used as well; instead of a fraction, it returns the rank of the value.
+
+```redis-cli
+> TDIGEST.CREATE racer_ages
+OK
+> TDIGEST.ADD racer_ages 45.88 44.2 58.03 19.76 39.84 69.28 50.97 25.41 19.27 85.71 42.63
+OK
+> TDIGEST.CDF racer_ages 50
+1) "0.63636363636363635"
+> TDIGEST.RANK racer_ages 50
+1) (integer) 7
+> TDIGEST.RANK racer_ages 50 40
+1) (integer) 7
+2) (integer) 4
+```
+Similarly, t-digest supports the `TDIGEST.REVRANK` command, which returns `the number of (observations larger than a given value + half the observations equal to the given value)`.
+
+##### Estimating values by fractions or ranks
+Given a fraction, the `TDIGEST.QUANTILE key fraction ...` command returns `an estimation of the value (floating point) that is smaller than the given fraction of observations`.
+
+Given a rank, the `TDIGEST.BYRANK key rank ...` command returns `an estimation of the value (floating point) with that rank`.
+
+Usage examples:
+```redis-cli
+> TDIGEST.QUANTILE racer_ages .5
+1) "44.200000000000003"
+> TDIGEST.BYRANK racer_ages 4
+1) "42.630000000000003"
+```
+The `TDIGEST.BYREVRANK` command returns the value for a given `reverse rank`.
+
+##### trimmed mean
+To compute the `trimmed mean` of a t-digest, use `TDIGEST.TRIMMED_MEAN {key} low_fraction high_fraction`.
+
+##### TDIGEST.MERGE
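+Sketches can be merged with the `TDIGEST.MERGE` command.
+
+Suppose latency was measured on 3 servers, and you need to combine the measurements and get the 90th, 95th, and 99th percentile latency of the combined data. `TDIGEST.MERGE` covers this case.
+
+The `TDIGEST.MERGE` command has the following format: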
+```redis-cli
+TDIGEST.MERGE destKey numKeys sourceKey... [COMPRESSION compression] [OVERRIDE]
+```
+
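+When using this command:
+- `If destKey does not exist yet`: destKey is created automatically and set to the merged result
+- `If destKey already exists`: the old value of destKey is merged together with the `values of source keys`. To overwrite the existing contents of `destKey` instead, specify the `OVERRIDE` option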
+
+##### Retrieving sketch information
+The `TDIGEST.MIN` and `TDIGEST.MAX` commands return the minimum and maximum values in the sketch, for example:
+```redis-cli
+> TDIGEST.MIN racer_ages
+"19.27"
+> TDIGEST.MAX racer_ages
+"85.709999999999994"
+```
+`If the sketch is empty, both TDIGEST.MIN and TDIGEST.MAX return nan`.
+
+##### Resetting a sketch
+The `TDIGEST.RESET` command resets a sketch, for example:
+```redis-cli
+> TDIGEST.RESET racer_ages
+OK
+```
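
The CDF, rank, and reverse-rank definitions quoted in this change can be checked against an exact computation on the raw observations. The sketch below is plain Python and does not talk to Redis; the helper names `cdf`, `rank`, `revrank`, and `by_rank` are hypothetical and chosen only for illustration. A real t-digest estimates these quantities from a compressed sketch, so its replies may differ slightly on larger data sets.

```python
# Exact reference versions of the rank-style queries described in the notes.
# These operate on the full list of observations; a t-digest only approximates them.

def cdf(values, x):
    """Fraction of observations smaller than x, plus half of those equal to x."""
    smaller = sum(1 for v in values if v < x)
    equal = sum(1 for v in values if v == x)
    return (smaller + equal / 2) / len(values)


def rank(values, x):
    """Number of observations smaller than x, plus half of those equal to x."""
    smaller = sum(1 for v in values if v < x)
    equal = sum(1 for v in values if v == x)
    return smaller + equal / 2


def revrank(values, x):
    """Number of observations larger than x, plus half of those equal to x."""
    larger = sum(1 for v in values if v > x)
    equal = sum(1 for v in values if v == x)
    return larger + equal / 2


def by_rank(values, r):
    """Value at zero-based rank r in ascending order."""
    return sorted(values)[r]


# Same observations as the racer_ages example in the diff.
racer_ages = [45.88, 44.2, 58.03, 19.76, 39.84, 69.28,
              50.97, 25.41, 19.27, 85.71, 42.63]

print(cdf(racer_ages, 50))
print(rank(racer_ages, 50), rank(racer_ages, 40))
print(revrank(racer_ages, 50))
print(by_rank(racer_ages, 4))
```

On this small data set the exact values agree with the replies shown in the examples: the CDF of 50 is 7/11, the ranks of 50 and 40 are 7 and 4, and the value at rank 4 is 42.63.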