Description
Bug Report
What did you do?
When RU consumption is unevenly distributed across TiDB instances, the current RU distribution strategy can give the TiDB instance with the highest RU consumption rate the lowest RU fill rate. This causes high RU wait durations on that instance, even when the total RU consumption rate is far below the group fill rate.
pd/pkg/mcs/resourcemanager/server/token_buckets.go
Lines 296 to 307 in aa75cbb

    // In order to have fewer tokens available to clients that are currently consuming more.
    // We have the following formula:
    // client1: (1 - a/N + 1/N) * 1/N
    // client2: (1 - b/N + 1/N) * 1/N
    // ...
    // clientN: (1 - n/N + 1/N) * 1/N
    // Sum is:
    // (N - (a+b+...+n)/N +1) * 1/N => (N - 1 + 1) * 1/N => 1
    ratio := (1 - slot.requireTokensSum/gtb.clientConsumptionTokensSum + evenRatio) * evenRatio
    assignToken := elapseTokens * ratio
    fillRate, burstLimit := gtb.calcRateAndBurstLimit(ratio)

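To see why the heaviest consumer ends up with the smallest share, the formula from the comment can be written out explicitly. This is only a sketch: it assumes `evenRatio` is 1/N for N client slots and that `gtb.clientConsumptionTokensSum` equals the sum of every slot's `requireTokensSum`. The symbols r_i (one slot's consumption), R (total consumption) and s_i (that slot's share of consumption) are introduced here for illustration and do not appear in the PD code.

```latex
s_i = \frac{r_i}{R}, \qquad R = \sum_{j=1}^{N} r_j, \qquad
\mathrm{ratio}_i = \left(1 - s_i + \frac{1}{N}\right)\cdot\frac{1}{N}

\sum_{i=1}^{N} \mathrm{ratio}_i
  = \frac{1}{N}\left(N - \sum_{i=1}^{N} s_i + N\cdot\frac{1}{N}\right)
  = \frac{1}{N}\,(N - 1 + 1) = 1

0 \le s_i \le 1 \;\Longrightarrow\;
\frac{1}{N^2} \le \mathrm{ratio}_i \le \frac{N+1}{N^2}
```

Under these assumptions the ratio decreases linearly in s_i, so the client with the largest consumption share always gets the smallest fill rate. As s_i approaches 1, its share approaches 1/N^2 of the group fill rate (about 11% for N = 3), no matter how much of the group quota is unused.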
The current distribution strategy, which is based only on historical consumption rates, works very poorly in this uneven scenario.
Here is my test:
Create a benchmark cluster with 3 TiDB instances and prepare some sysbench oltp_read_write data.
CREATE RESOURCE GROUP rg1 ru_per_sec = 100000;
- First, run the sysbench oltp_read_write workload with 10 threads on tidb1. The RU consumption rate is about 15k.
- Then, run the same workload with 1 thread on tidb2 for some time. According to the debug metrics, the slot token fill rates for tidb1 and tidb2 are about 28k and 72k.
- Then, run the same workload with 1 thread on tidb3 for some time. According to the debug metrics, the slot token fill rates for tidb1, tidb2, and tidb3 are about 16k, 42k, and 42k. At this point, because the 16k limit is close to the 15k consumption rate, some requests on tidb1 start waiting for RUs and the latency increases.
Although the total RU consumption rate is under 20k and the group RU limit is 100k, one TiDB instance ran out of RUs and its tail latency was high. This is obviously an unexpected result.
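The observed fill rates can be roughly reproduced by applying the quoted ratio expression to the test scenario. The following is a minimal standalone sketch, not the PD implementation. The function names `slotRatio` and `fillRates` are hypothetical, `evenRatio` is assumed to be 1/N, and the 1.5k consumption rate for the single-thread workloads on tidb2 and tidb3 is an assumption extrapolated from the 15k measured with 10 threads on tidb1.

```go
package main

import "fmt"

// slotRatio applies the ratio expression quoted from token_buckets.go to one
// client. It assumes evenRatio = 1/n and that total is the sum of all
// clients' consumption (hypothetical helper, not part of PD).
func slotRatio(consumption, total float64, n int) float64 {
	evenRatio := 1 / float64(n)
	return (1 - consumption/total + evenRatio) * evenRatio
}

// fillRates splits groupFillRate across clients according to slotRatio
// (hypothetical helper, not part of PD).
func fillRates(groupFillRate float64, consumption []float64) []float64 {
	var total float64
	for _, c := range consumption {
		total += c
	}
	rates := make([]float64, len(consumption))
	for i, c := range consumption {
		rates[i] = groupFillRate * slotRatio(c, total, len(consumption))
	}
	return rates
}

func main() {
	const groupFillRate = 100000 // ru_per_sec of rg1

	// tidb1 consumes about 15k RU/s (from the test above); the 1.5k RU/s for
	// the single-thread clients is an assumed value.
	two := fillRates(groupFillRate, []float64{15000, 1500})
	three := fillRates(groupFillRate, []float64{15000, 1500, 1500})

	fmt.Printf("2 clients (tidb1, tidb2): %.0f\n", two)
	fmt.Printf("3 clients (tidb1, tidb2, tidb3): %.0f\n", three)
}
```

With these assumed inputs the sketch gives roughly 29.5k and 70.5k for two clients, and roughly 16.7k, 41.7k, and 41.7k for three clients, which is close to the 28k/72k and 16k/42k/42k seen in the debug metrics. Each additional lightly loaded client pushes tidb1's fill rate further down toward its own consumption rate, even though most of the 100k group quota is unused.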