Strategies for Server/RPC and mixed performance machines #7468
SoftwareRenderer started this conversation in Ideas
Replies: 1 comment
-
I have a similar use-case as the one described in #6829 and wanted to share what I've done to approach it: https://github.com/SoftwareRenderer/llmwrangler. The goal is to avoid bottlenecks when including CPU instances alongside GPU ones.
I'm hoping these ideas are useful (if not already implemented) and that they can be integrated into the RPC backend. I need to brush up on my C, so it'll be a while before I can do it myself (and not be ashamed to submit a PR for the code).
In llmwrangler there are a couple of features:
@rgerganov Tagging you since it looks like you're doing most of the heavy lifting on RPC.
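The "avoid bottlenecks" goal above amounts to capacity-aware scheduling. Below is a minimal sketch (in C++, llama.cpp's language) of one way to do it: route each request to the instance with the lowest in-flight load relative to its measured throughput, so a slow CPU box never queues work while a GPU box sits idle. To be clear, this is not code from llmwrangler or the RPC backend; `Instance`, `pick_instance`, and the throughput numbers are illustrative assumptions.

```cpp
// Capacity-aware routing across mixed-speed llama.cpp server instances.
// Illustrative sketch only -- none of these names come from llmwrangler.
#include <cstdio>
#include <string>
#include <vector>

struct Instance {
    std::string addr;         // host:port of one server instance
    double      tokens_per_s; // measured throughput, e.g. from a warm-up probe
    int         in_flight;    // requests currently routed to this instance
};

// Pick the instance whose queue clears fastest if we add one more request.
// A box with 5x the throughput ends up taking roughly 5x the requests.
static Instance * pick_instance(std::vector<Instance> & pool) {
    Instance * best = nullptr;
    double best_cost = 1e300;
    for (auto & inst : pool) {
        const double cost = (inst.in_flight + 1) / inst.tokens_per_s;
        if (cost < best_cost) {
            best_cost = cost;
            best      = &inst;
        }
    }
    return best;
}

int main() {
    std::vector<Instance> pool = {
        { "10.0.0.1:8080", 50.0, 0 }, // GPU machine
        { "10.0.0.2:8080", 10.0, 0 }, // CPU machine
    };
    for (int i = 0; i < 8; i++) {
        Instance * inst = pick_instance(pool);
        inst->in_flight++; // a real router would decrement this on completion
        printf("request %d -> %s\n", i, inst->addr.c_str());
    }
    return 0;
}
```

Plain round-robin is the degenerate case of this (all weights equal), and it is exactly what bottlenecks on the slowest machine.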
-
Thanks for sharing this. Another work in this direction is the paddler project mentioned in #7369. The RPC backend is a simple proxy for existing backends. I think that combining the backend scheduler with high-level orchestration like yours can unlock the full potential of distributed LLM inference.
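"A simple proxy for existing backends" means the RPC layer exposes the same backend interface as the local backends (CPU, CUDA, Metal, ...) and only forwards calls over the network, which is why it composes cleanly with an external orchestrator. The toy sketch below shows that shape under stated assumptions: the `Backend` interface, the stubbed-out transport, and every name in it are invented for illustration and are not ggml's actual API.

```cpp
// Toy illustration of a proxy backend: same interface as a local backend,
// but it would forward the call to a remote rpc-server instead of
// computing anything itself. Not ggml's real API -- names are invented.
#include <cstdio>
#include <string>

struct Backend {
    virtual ~Backend() = default;
    virtual std::string eval(const std::string & request) = 0;
};

// A real backend (CPU, CUDA, Metal, ...) that does the actual work.
struct LocalBackend : Backend {
    std::string eval(const std::string & request) override {
        return "result(" + request + ")";
    }
};

// The proxy: serialize the request, ship it to remote_addr, return the
// reply. The network round-trip is stubbed out to keep this self-contained.
struct RpcProxyBackend : Backend {
    std::string  remote_addr;
    LocalBackend remote_stub; // stands in for the backend on the remote host

    explicit RpcProxyBackend(std::string addr) : remote_addr(addr) {}

    std::string eval(const std::string & request) override {
        return remote_stub.eval(request); // real code would go over the wire
    }
};

int main() {
    LocalBackend    local;
    RpcProxyBackend remote("10.0.0.2:50052");
    // A scheduler can treat both uniformly -- that uniformity is what lets
    // high-level orchestration sit on top of the RPC backend.
    printf("%s\n", local.eval("matmul").c_str());
    printf("%s\n", remote.eval("matmul").c_str());
    return 0;
}
```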