
resourcemanager: slot token limit can be too low when the ru request rate is not even #9605

@glorv

Description

Bug Report

What did you do?

When the RU consumption rates of different TiDB instances are not evenly distributed, the current RU distributing strategy can leave the instance with the highest RU consumption rate with the minimum slot fill rate, even when the total RU consumption rate is far below the group fill rate. This causes a high RU wait duration on that instance.

// In order to have fewer tokens available to clients that are currently consuming more.
// We have the following formula:
// client1: (1 - a/N + 1/N) * 1/N
// client2: (1 - b/N + 1/N) * 1/N
// ...
// clientN: (1 - n/N + 1/N) * 1/N
// Sum is:
// (N - (a+b+...+n)/N +1) * 1/N => (N - 1 + 1) * 1/N => 1
ratio := (1 - slot.requireTokensSum/gtb.clientConsumptionTokensSum + evenRatio) * evenRatio
assignToken := elapseTokens * ratio
fillRate, burstLimit := gtb.calcRateAndBurstLimit(ratio)

This distributing strategy, which is based only on the historical consumption rate, works very poorly in the uneven scenario.
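To make the skew concrete, the per-slot ratio can be isolated into a standalone helper. This is a minimal sketch that only assumes the values appearing in the snippet above (a client's required-token sum, the group-wide consumption sum, and the slot count N); slotRatio is an illustrative name, not part of the PD code base.

// slotRatio reproduces the quoted formula for a single slot:
//   ratio = (1 - clientSum/groupSum + 1/N) * (1/N)
// The larger a client's share of the recent consumption, the smaller the
// share of the group's fill rate it is assigned.
func slotRatio(clientSum, groupSum float64, n int) float64 {
	evenRatio := 1.0 / float64(n)
	return (1 - clientSum/groupSum + evenRatio) * evenRatio
}

For example, a client that generates 80% of the group's recent consumption among 3 slots gets (1 - 0.8 + 1/3) * 1/3 ≈ 18% of the fill rate, even though it is the only slot that actually needs the tokens.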

Here is my test:

Create a benchmark cluster with 3 TiDB instances and prepare some sysbench oltp_read_write data, then create a resource group:

CREATE RESOURCE GROUP rg1 RU_PER_SEC = 100000;

  • First, run the sysbench oltp_read_write workload with 10 threads on tidb1. Its RU consumption rate is about 15k.
  • Then, additionally run the same workload with 1 thread on tidb2 for some time. From the debug metrics, the slot token fill rates for tidb1 and tidb2 are about 28k and 72k respectively.
  • Then, additionally run the same workload with 1 thread on tidb3 for some time. From the debug metrics, the slot token fill rates for tidb1, tidb2, and tidb3 are about 16k, 42k, and 42k (the sketch after this list reproduces these numbers from the formula above). At this point, since the 16k limit is close to the 15k consumption rate, some requests on tidb1 started to wait for RUs and the latency increased.
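As a rough cross-check, plugging approximate consumption rates into the formula reproduces the observed split. This is a minimal sketch assuming about 15k RU/s on tidb1 and about 2k RU/s on each of tidb2/tidb3; slotFillRates and the numbers are illustrative, not actual PD code or exact metrics.

package main

import (
	"fmt"
	"math"
)

// slotRatio repeats the sketch above so this block compiles on its own.
func slotRatio(clientSum, groupSum float64, n int) float64 {
	evenRatio := 1.0 / float64(n)
	return (1 - clientSum/groupSum + evenRatio) * evenRatio
}

// slotFillRates applies the ratio to every client and scales it by the
// group fill rate (RU_PER_SEC).
func slotFillRates(groupFillRate float64, consumption map[string]float64) map[string]float64 {
	total := 0.0
	for _, c := range consumption {
		total += c
	}
	out := make(map[string]float64, len(consumption))
	for name, c := range consumption {
		out[name] = math.Round(slotRatio(c, total, len(consumption)) * groupFillRate)
	}
	return out
}

func main() {
	const groupFillRate = 100000 // RU_PER_SEC of rg1

	// Step 2: tidb1 consumes about 15k RU/s (10 threads), tidb2 about 2k RU/s (1 thread).
	fmt.Println(slotFillRates(groupFillRate, map[string]float64{"tidb1": 15000, "tidb2": 2000}))
	// map[tidb1:30882 tidb2:69118], close to the observed 28k / 72k.

	// Step 3: tidb3 joins with about 2k RU/s as well.
	fmt.Println(slotFillRates(groupFillRate, map[string]float64{"tidb1": 15000, "tidb2": 2000, "tidb3": 2000}))
	// map[tidb1:18129 tidb2:40936 tidb3:40936], close to the observed 16k / 42k / 42k.
}

The instance that needs about 15k RU/s is pinned near its own demand, while the nearly idle instances hold most of the 100k budget.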

So, while the total RU consumption rate is under 20k and the group's RU limit is 100k, one TiDB instance hit RU exhaustion and its tail latency was high. This is an obviously unexpected result.

What did you expect to see?

What did you see instead?

What version of PD are you using (pd-server -V)?

Metadata

Labels

affects-7.1 (affects the 7.1.x LTS versions), affects-7.5 (affects the 7.5.x LTS versions), affects-8.1 (affects the 8.1.x LTS versions), affects-8.5 (affects the 8.5.x LTS versions), affects-9.0 (affects the 9.0.x versions), report/customer (customers have encountered this bug), severity/major, type/bug (the issue is confirmed as a bug)
