
[Bug]: requests are not truly concurrent with LLMClient._vllm_batch_completion #67

@HarshVaragiya

Description

Version

latest

Operating System

Linux

Python Version

3.12

What happened?

With a local vLLM server running, synthetic-data-kit configured to use it via the config file, and the generation.batch_size key set, requests should be sent to the local vLLM server in concurrent batches.

But the vLLM server logs show that requests are sent sequentially.
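For context, here is a hypothetical sketch of the difference between what the logs show and what a batch size implies. This is not the synthetic-data-kit code; the URL, model name, and helper names are placeholders, and it targets vLLM's OpenAI-compatible /v1/chat/completions endpoint. A sequential loop keeps exactly one request running on the server, while a pool of batch_size workers keeps the whole batch in flight:

```python
# Hypothetical illustration, not the project's _vllm_batch_completion.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"             # placeholder model name


def complete(prompt: str) -> str:
    """Send one chat completion request to the vLLM server."""
    resp = requests.post(
        API_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def sequential_batch(prompts):
    # One request at a time: the vLLM logs show "Running: 1 reqs" throughout,
    # which matches the log output below.
    return [complete(p) for p in prompts]


def concurrent_batch(prompts, batch_size=16):
    # batch_size requests in flight at once: the vLLM logs should show
    # "Running: 16 reqs" (or close to it) while the batch is processed.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(complete, prompts))
```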

Relevant log output

(APIServer pid=1) INFO 09-05 08:52:54 [loggers.py:123] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 20.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 09-05 08:53:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%

Steps to reproduce

  1. Run a local vLLM server.
  2. Set the generation.batch_size or curate.batch_size key to 16 or 32.
  3. Run synthetic-data-kit generation, curation, etc.
  4. Check the vLLM server logs for the number of requests in flight: it should be 16 or 32, but it stays at 1 (see the sketch after this list for a quick way to check).
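For step 4, a small hypothetical helper can scan a saved vLLM server log for the `Running: N reqs` counter (the format shown in the log output above) and report the peak. With batch_size set to 16 the peak should approach 16; with the current behaviour it stays at 1:

```python
# Hypothetical helper: report the peak number of in-flight requests
# recorded in a saved vLLM server log.
import re
import sys

RUNNING_RE = re.compile(r"Running: (\d+) reqs")


def max_in_flight(log_path: str) -> int:
    peak = 0
    with open(log_path) as fh:
        for line in fh:
            match = RUNNING_RE.search(line)
            if match:
                peak = max(peak, int(match.group(1)))
    return peak


if __name__ == "__main__":
    print(f"peak in-flight requests: {max_in_flight(sys.argv[1])}")
```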
