
resourcemanager: slot token limit can be too low when the ru request rate is not even #9605

@glorv

Description

Bug Report

What did you do?

When the RU consumption rates of different TiDB instances are not evenly distributed, the current RU distributing strategy can leave the instance with the highest RU consumption rate with the minimum slot fill rate, even when the total RU consumption rate is far below the group fill rate. This causes a high RU wait duration on that instance.

// In order to have fewer tokens available to clients that are currently consuming more.
// We have the following formula:
// client1: (1 - a/N + 1/N) * 1/N
// client2: (1 - b/N + 1/N) * 1/N
// ...
// clientN: (1 - n/N + 1/N) * 1/N
// Sum is:
// (N - (a+b+...+n)/N +1) * 1/N => (N - 1 + 1) * 1/N => 1
ratio := (1 - slot.requireTokensSum/gtb.clientConsumptionTokensSum + evenRatio) * evenRatio
assignToken := elapseTokens * ratio
fillRate, burstLimit := gtb.calcRateAndBurstLimit(ratio)

This distributing strategy, which is based only on the historical consumption rate, works very poorly in the uneven scenario.
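To make the skew concrete, the per-slot ratio can be isolated into a standalone helper. This is a minimal sketch that only assumes the values appearing in the snippet above (a client's required-token sum, the group-wide consumption sum, and the slot count N); slotRatio is an illustrative name, not part of the PD code base.

// slotRatio reproduces the quoted formula for a single slot:
//   ratio = (1 - clientSum/groupSum + 1/N) * (1/N)
// The larger a client's share of the recent consumption, the smaller the
// share of the group's fill rate it is assigned.
func slotRatio(clientSum, groupSum float64, n int) float64 {
	evenRatio := 1.0 / float64(n)
	return (1 - clientSum/groupSum + evenRatio) * evenRatio
}

For example, a client that generates 80% of the group's recent consumption among 3 slots gets (1 - 0.8 + 1/3) * 1/3 ≈ 18% of the fill rate, even though it is the only slot that actually needs the tokens.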

Here is my test:

Create a benchmark cluster with 3 TiDB instances and prepare some sysbench oltp_read_write data, then create a resource group:

CREATE RESOURCE GROUP rg1 RU_PER_SEC = 100000;

  • First, run the sysbench oltp_read_write workload with 10 threads on tidb1. Its RU consumption rate is about 15k.
  • Then, additionally run the same workload with 1 thread on tidb2 for some time. From the debug metrics, the slot token fill rates for tidb1 and tidb2 are about 28k and 72k respectively.
  • Then, additionally run the same workload with 1 thread on tidb3 for some time. From the debug metrics, the slot token fill rates for tidb1, tidb2, and tidb3 are about 16k, 42k, and 42k (the sketch after this list reproduces these numbers from the formula above). At this point, since the 16k limit is close to the 15k consumption rate, some requests on tidb1 started to wait for RUs and the latency increased.
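As a rough cross-check, plugging approximate consumption rates into the formula reproduces the observed split. This is a minimal sketch assuming about 15k RU/s on tidb1 and about 2k RU/s on each of tidb2/tidb3; slotFillRates and the numbers are illustrative, not actual PD code or exact metrics.

package main

import (
	"fmt"
	"math"
)

// slotRatio repeats the sketch above so this block compiles on its own.
func slotRatio(clientSum, groupSum float64, n int) float64 {
	evenRatio := 1.0 / float64(n)
	return (1 - clientSum/groupSum + evenRatio) * evenRatio
}

// slotFillRates applies the ratio to every client and scales it by the
// group fill rate (RU_PER_SEC).
func slotFillRates(groupFillRate float64, consumption map[string]float64) map[string]float64 {
	total := 0.0
	for _, c := range consumption {
		total += c
	}
	out := make(map[string]float64, len(consumption))
	for name, c := range consumption {
		out[name] = math.Round(slotRatio(c, total, len(consumption)) * groupFillRate)
	}
	return out
}

func main() {
	const groupFillRate = 100000 // RU_PER_SEC of rg1

	// Step 2: tidb1 consumes about 15k RU/s (10 threads), tidb2 about 2k RU/s (1 thread).
	fmt.Println(slotFillRates(groupFillRate, map[string]float64{"tidb1": 15000, "tidb2": 2000}))
	// map[tidb1:30882 tidb2:69118], close to the observed 28k / 72k.

	// Step 3: tidb3 joins with about 2k RU/s as well.
	fmt.Println(slotFillRates(groupFillRate, map[string]float64{"tidb1": 15000, "tidb2": 2000, "tidb3": 2000}))
	// map[tidb1:18129 tidb2:40936 tidb3:40936], close to the observed 16k / 42k / 42k.
}

The instance that needs about 15k RU/s is pinned near its own demand, while the nearly idle instances hold most of the 100k budget.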

So, while the total RU consumption rate is under 20k and the group's RU limit is 100k, one TiDB instance hit RU exhaustion and its tail latency was high. This is an obviously unexpected result.

What did you expect to see?

What did you see instead?

What version of PD are you using (pd-server -V)?

Metadata

Labels

affects-7.1 (affects the 7.1.x LTS versions), affects-7.5 (affects the 7.5.x LTS versions), affects-8.1 (affects the 8.1.x LTS versions), affects-8.5 (affects the 8.5.x LTS versions), affects-9.0 (affects the 9.0.x versions), report/customer (customers have encountered this bug), severity/major, type/bug (the issue is confirmed as a bug)
