
Conversation


@b8zhong b8zhong commented Sep 24, 2025

Hello, thanks for this great library. We use it regularly for CI in SGLang. To speed up CI, we want to increase the batch size. However, the batch size in the openai_compatible model was not taking effect, since requests were still being processed one at a time in a loop.

Previously the run took around 5 minutes; now it takes around 2 min 45 s.

export OPENAI_API_BASE="http://0.0.0.0:30000/v1"
export OPENAI_API_KEY="dummy"

python3 -m lmms_eval \
    --model openai_compatible \
    --model_args 'model_version="google/gemma-3-4b-it"' \
    --tasks mmmu_val \
    --batch_size 256 \
    --log_samples \
    --log_samples_suffix openai_compatible \
    --output_path ./logs
/sgl-workspace/sglang/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
2025-09-23 18:39:31 | INFO     | __main__:cli_evaluate:309 - Verbosity set to INFO
WARNING:accelerate.utils.other:Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-09-23 18:39:33 | INFO     | __main__:cli_evaluate_single:398 - Evaluation tracker args: {'output_path': './logs'}
2025-09-23 18:39:33 | INFO     | __main__:cli_evaluate_single:470 - Selected Tasks: ['mmmu_val']
2025-09-23 18:39:33 | INFO     | lmms_eval.evaluator:simple_evaluate:161 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
WARNING:accelerate.utils.other:Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-09-23 18:39:34 | INFO     | lmms_eval.evaluator:evaluate:402 - Running on rank 0 (local rank 0)
2025-09-23 18:39:34 | INFO     | lmms_eval.api.task:build_all_requests:427 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 14156.13it/s]
2025-09-23 18:39:35 | INFO     | lmms_eval.evaluator:evaluate:490 - Running generate_until requests
Model Responding: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:32<00:00, 30.04s/it]2025-09-23 18:42:07 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 1975.575s, Total tokens: 2215, Avg speed: 1.1 tokens/s
Model Responding: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:32<00:00, 38.01s/it]
Postprocessing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 8340.40it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.56667}, 'Art': {'num': 30, 'acc': 0.56667}, 'Art_Theory': {'num': 30, 'acc': 0.7}, 'Design': {'num': 30, 'acc': 0.63333}, 'Music': {'num': 30, 'acc': 0.36667}, 'Overall-Business': {'num': 150, 'acc': 0.35333}, 'Accounting': {'num': 30, 'acc': 0.43333}, 'Economics': {'num': 30, 'acc': 0.3}, 'Finance': {'num': 30, 'acc': 0.23333}, 'Manage': {'num': 30, 'acc': 0.33333}, 'Marketing': {'num': 30, 'acc': 0.46667}, 'Overall-Science': {'num': 150, 'acc': 0.35333}, 'Biology': {'num': 30, 'acc': 0.4}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.4}, 'Math': {'num': 30, 'acc': 0.36667}, 'Physics': {'num': 30, 'acc': 0.3}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.52}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.63333}, 'Clinical_Medicine': {'num': 30, 'acc': 0.4}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.43333}, 'Pharmacy': {'num': 30, 'acc': 0.56667}, 'Public_Health': {'num': 30, 'acc': 0.56667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.725}, 'History': {'num': 30, 'acc': 0.73333}, 'Literature': {'num': 30, 'acc': 0.86667}, 'Sociology': {'num': 30, 'acc': 0.63333}, 'Psychology': {'num': 30, 'acc': 0.66667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.37143}, 'Agriculture': {'num': 30, 'acc': 0.56667}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.36667}, 'Computer_Science': {'num': 30, 'acc': 0.36667}, 'Electronics': {'num': 30, 'acc': 0.2}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.36667}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.33333}, 'Overall': {'num': 900, 'acc': 0.46333}}
2025-09-23 18:42:07 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:188 - Saving results aggregated
2025-09-23 18:42:07 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_samples:287 - Saving samples to logs/__google__gemma-3-4b-it__/20250924_093933_samples_mmmu_val.jsonl
openai_compatible (model_version="google/gemma-3-4b-it"), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 256
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.4633|±  |   N/A|

Summary by CodeRabbit

  • New Features

    • Concurrent batched request processing with per-batch caching.
    • Configurable base URL, API key, and batch size with URL normalization.
    • Metrics aggregation and persistent caching in continual mode.
  • Refactor

    • Parallel dispatch with per-batch retries and updated progress to track batches.
    • Batch-level token/latency accounting and payload adjustments for specialized models.
  • Bug Fixes

    • Graceful handling when video utilities are missing, providing a clear install error.

Contributor

coderabbitai bot commented Sep 24, 2025

Walkthrough

Implements batched, concurrent request processing with per-batch retries, caching, and metrics in OpenAI-compatible chat/simple models. Adds new initialization parameters and helpers in the simple model. Updates protocol video handling with optional dependency import and a guarded error when unavailable.

Changes

  • Batched concurrent processing (chat) (lmms_eval/models/chat/openai_compatible.py): Refactors the per-request loop into batched parallel processing using ThreadPoolExecutor; per-batch retries, batch-level caching (resume), aggregated latency/token accounting, progress tracked by batches, and persisted cache updates.
  • Public API + batching (simple) (lmms_eval/models/simple/openai_compatible.py): Adds batching with parallel dispatch; new init params (base_url, api_key, batch_size); URL normalization via unquote; client init from provided values or env vars; continual cache persistence; new properties/methods batch_size, tok_encode, tok_decode, eot_token_id, and rank; new imports for concurrency and URL decoding.
  • Optional video dependency guard (protocol) (lmms_eval/protocol.py): Safely imports fetch_video; ChatMessages.to_openai_messages raises an explicit ImportError if fetch_video is missing, otherwise processes video frames as before.
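
The chat-model refactor above follows a standard fan-out shape: chunk the requests, send each chunk through a thread pool, retry failures per item, and write results back by index. A minimal, self-contained sketch of that pattern is below; run_batched, _call_with_retries, and call_api are illustrative placeholders, not lmms-eval APIs.

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List


def _call_with_retries(call_api: Callable[[str], str], request: str, max_retries: int) -> str:
    # Retry a single API call a few times before giving up with an error string.
    for attempt in range(max_retries):
        try:
            return call_api(request)
        except Exception as exc:  # real code should catch narrower API/network errors
            if attempt == max_retries - 1:
                return f"error: {exc}"
    return ""


def run_batched(
    requests: List[str],
    call_api: Callable[[str], str],
    batch_size: int = 8,
    max_retries: int = 3,
) -> List[str]:
    """Dispatch requests batch by batch; calls within a batch run concurrently."""
    results: List[str] = [""] * len(requests)
    for start in range(0, len(requests), batch_size):
        batch = list(enumerate(requests))[start : start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            futures = {
                pool.submit(_call_with_retries, call_api, request, max_retries): index
                for index, request in batch
            }
            for future in as_completed(futures):
                results[futures[future]] = future.result()  # write back by original index
    return results

Each batch then costs roughly one slowest-call latency instead of the sum of all per-request latencies, which is what turns the roughly 5 minute sequential run into the roughly 2 min 45 s reported in the PR description.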

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Model as OpenAICompatible.generate_until
  participant Batcher
  participant Exec as ThreadPoolExecutor
  participant API as OpenAI API
  participant Cache

  Caller->>Model: generate_until(requests)
  Model->>Batcher: group into batches
  loop per batch
    Model->>Cache: check resume cache
    alt cache hit
      Cache-->>Model: cached responses
    else cache miss
      Model->>Exec: submit per-payload tasks
      par workers
        Exec->>API: send payload
        API-->>Exec: response/usage
      end
      Exec-->>Model: collected results
      Model->>Cache: persist results (if continual)
    end
    Model->>Model: accumulate tokens/latency
  end
  Model-->>Caller: all responses + metrics
sequenceDiagram
  autonumber
  participant ChatMsg as ChatMessages
  participant Proto as protocol.py
  participant Utils as qwen_vl_utils.fetch_video

  ChatMsg->>Proto: to_openai_messages(video content)
  alt fetch_video available
    Proto->>Utils: load frames
    Utils-->>Proto: frames (PNG data URLs)
    Proto-->>ChatMsg: OpenAI-formatted messages
  else missing dependency
    Proto-->>ChatMsg: raise ImportError
  end
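
The else branch in this diagram is the usual optional-dependency guard: import at module load if possible, keep a None sentinel otherwise, and raise an actionable ImportError at the point of use. A generic sketch of that pattern (not the exact protocol.py code; load_video_frames is a made-up wrapper) looks like this:

from typing import Any, Callable, Dict, Optional

try:
    from qwen_vl_utils import fetch_video as _fetch_video  # optional video dependency
except ImportError:
    _fetch_video = None

fetch_video: Optional[Callable[..., Any]] = _fetch_video


def load_video_frames(url: str, video_kwargs: Optional[Dict[str, Any]] = None) -> Any:
    # Fail loudly with an actionable message instead of a NameError deep in the call stack.
    if fetch_video is None:
        raise ImportError("qwen_vl_utils is required for video processing; install qwen-vl-utils")
    return fetch_video({"type": "video", "video": url, **(video_kwargs or {})})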

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Luodian

Poem

A rabbit batches hops by twos,
Queues of carrots, parallel chews.
Cache my crumbs, resume my run,
Threads all weaving, work well done.
If video’s shy, I wisely say:
“Bring the tools—or hop away!” 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title "[fix] batch size in openai compatible endpoint" is concise and accurately describes the primary change: fixing batch-size handling for the openai_compatible endpoint so requests are processed in batches rather than in a per-request loop, which matches the changes in the diff and the reported CI speedup.
  • Description Check: ✅ Passed. The PR description contains a detailed explanation of the problem, the fix, example usage, environment variables, and runtime logs demonstrating the performance improvement, which satisfies the repository template's requirement for a detailed description even though the checklist boxes are not explicitly checked.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.


@b8zhong
Author

b8zhong commented Sep 24, 2025

@kcz358 @Luodian If you could take a look, that would be great. Thank you!

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lmms_eval/models/simple/openai_compatible.py (1)

164-186: Guard video encoding when decord is unavailable

If decord isn’t installed, VideoReader is undefined and will crash with NameError. Raise a clear ImportError.

     def encode_video(self, video_path, for_get_frames_num):
+        if "VideoReader" not in globals():
+            raise ImportError("decord is required for video processing. Install with: pip install decord")
         vr = VideoReader(video_path, ctx=cpu(0))
🧹 Nitpick comments (12)
lmms_eval/protocol.py (2)

100-100: Shorten exception message to satisfy Ruff TRY003 and 88‑char limit

Keep the message concise; installation guidance belongs in docs/logs.

Apply this diff:

-                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
+                        raise ImportError(
+                            "qwen_vl_utils is required for video processing; install qwen-vl-utils"
+                        )

10-14: Guard decord import in lmms_eval/protocol.py and mark fetch_video as Optional

  • decord is imported at lmms_eval/protocol.py:7 — make it optional (try/except) or lazy-import inside the functions that use VideoReader to avoid breaking installs without video deps.
  • Annotate fetch_video as Optional (e.g., from typing import Optional, Callable; fetch_video: Optional[Callable[..., Any]] = None) so Pyright knows it may be None.
  • Repo scan shows many other files import decord at module import time — consider applying the same guard across video-related modules.
lmms_eval/models/simple/openai_compatible.py (6)

36-38: Use Optional typing for nullable params and import Optional

Type hints should not use implicit Optional. Please annotate base_url and api_key as Optional[str] and import Optional.

-from typing import List, Tuple, Union
+from typing import List, Tuple, Union, Optional
...
-        base_url: str = None,
-        api_key: str = None,
+        base_url: Optional[str] = None,
+        api_key: Optional[str] = None,

Also applies to: 7-7


121-124: Annotate property return type

Return type hints are required by the guidelines.

-    def batch_size(self):
+    def batch_size(self) -> int:
         return self.batch_size_per_gpu

131-138: Annotate property return types

Add explicit return types for properties per guidelines.

-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
         return 0
...
-    def rank(self):
+    def rank(self) -> int:
         return self._rank

229-246: File type checks: prefer robust suffix-based detection

Use Path(visual).suffix.lower() over substring checks to avoid false matches and handle uppercase extensions.

+from pathlib import Path
...
-                        if isinstance(visual, str) and (".mp4" in visual or ".avi" in visual or ".mov" in visual or ".flv" in visual or ".wmv" in visual):
+                        if isinstance(visual, str) and Path(visual).suffix.lower() in {".mp4", ".avi", ".mov", ".flv", ".wmv"}:
@@
-                        elif isinstance(visual, str) and (".jpg" in visual or ".jpeg" in visual or ".png" in visual or ".gif" in visual or ".bmp" in visual or ".tiff" in visual or ".webp" in visual):
+                        elif isinstance(visual, str) and Path(visual).suffix.lower() in {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}:

289-299: Avoid blind except Exception; consider narrower exceptions or log type

Catching all exceptions hides actionable errors. At minimum, log exception type; preferably catch OpenAI-specific exceptions.

-                    except Exception as e:
-                        error_msg = str(e)
+                    except Exception as e:
+                        error_msg = f"{e.__class__.__name__}: {e}"

50-55: Docstrings for public API

Add class/method docstrings for this public model and generate_until per guidelines.

lmms_eval/models/chat/openai_compatible.py (4)

47-48: Remove unused variable ctx

ctx is unpacked but unused. Prefix with _ or drop it.

-                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
+                _, doc_to_messages, gen_kwargs, doc_id, task, split = req.args

109-118: Avoid blind except Exception; at least log exception type

Prefer catching OpenAI-specific exceptions or log the exception class name to aid triage.

-                    except Exception as e:
-                        error_msg = str(e)
+                    except Exception as e:
+                        error_msg = f"{e.__class__.__name__}: {e}"

34-37: Prefer using self.batch_size accessor

Use the accessor added in the simple model for consistency.

-        batch_size = getattr(self, "batch_size_per_gpu", 1)
+        batch_size = getattr(self, "batch_size", getattr(self, "batch_size_per_gpu", 1))

28-33: Docstring for public API method

Add a brief docstring for generate_until explaining batched behavior and return order guarantees.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 036637e and 0bb1001.

📒 Files selected for processing (3)
  • lmms_eval/models/chat/openai_compatible.py (2 hunks)
  • lmms_eval/models/simple/openai_compatible.py (7 hunks)
  • lmms_eval/protocol.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/protocol.py
  • lmms_eval/models/chat/openai_compatible.py
  • lmms_eval/models/simple/openai_compatible.py
🧬 Code graph analysis (2)
lmms_eval/models/chat/openai_compatible.py (5)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/models/simple/openai_compatible.py (3)
  • batch_size (122-123)
  • rank (136-137)
  • process_single_request (279-299)
lmms_eval/api/task.py (1)
  • doc_to_messages (1648-1680)
lmms_eval/api/instance.py (1)
  • args (25-29)
lmms_eval/protocol.py (2)
  • ChatMessages (45-122)
  • to_openai_messages (89-109)
lmms_eval/models/simple/openai_compatible.py (3)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
lmms_eval/api/instance.py (1)
  • args (25-29)
lmms_eval/models/chat/openai_compatible.py (1)
  • process_single_request (88-119)
🪛 Ruff (0.13.1)
lmms_eval/protocol.py

100-100: Avoid specifying long messages outside the exception class

(TRY003)

lmms_eval/models/chat/openai_compatible.py

47-47: Unpacked variable ctx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Function definition does not bind loop variable batch_responses

(B023)


90-90: Function definition does not bind loop variable batch_responses

(B023)


107-107: Consider moving this statement to an else block

(TRY300)


109-109: Do not catch blind exception: Exception

(BLE001)

lmms_eval/models/simple/openai_compatible.py

36-36: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


37-37: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


128-128: Unused method argument: tokens

(ARG002)


280-280: Function definition does not bind loop variable batch_responses

(B023)


281-281: Function definition does not bind loop variable batch_responses

(B023)


287-287: Consider moving this statement to an else block

(TRY300)


289-289: Do not catch blind exception: Exception

(BLE001)

Comment on lines +41 to +87
        for batch_requests in batched_requests:
            batch_payloads = []
            batch_doc_uuids = []
            batch_responses = []

            for req in batch_requests:
                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
                doc_uuid = f"{task}___{split}___{doc_id}"
                self.response_cache[doc_uuid] = response_text
                batch_doc_uuids.append(doc_uuid)

                if self.continual_mode is True and self.cache_mode == "resume":
                    if doc_uuid in self.response_cache:
                        response_text = self.response_cache[doc_uuid]
                        if response_text:
                            batch_responses.append(response_text)
                            continue

                chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
                chat_messages: ChatMessages = ChatMessages(**{"messages": chat_messages_raw})

                payload = {"messages": chat_messages.to_openai_messages()}
                payload["model"] = self.model_version

                if "max_new_tokens" not in gen_kwargs:
                    gen_kwargs["max_new_tokens"] = 1024
                if gen_kwargs["max_new_tokens"] > 4096:
                    gen_kwargs["max_new_tokens"] = 4096
                if "temperature" not in gen_kwargs:
                    gen_kwargs["temperature"] = 0
                if "top_p" not in gen_kwargs:
                    gen_kwargs["top_p"] = None
                if "num_beams" not in gen_kwargs:
                    gen_kwargs["num_beams"] = 1

                payload["max_tokens"] = gen_kwargs["max_new_tokens"]
                payload["temperature"] = gen_kwargs["temperature"]

                if "o1" in self.model_version or "o3" in self.model_version or "o4" in self.model_version:
                    del payload["temperature"]
                    payload.pop("max_tokens")
                    payload["reasoning_effort"] = "medium"
                    payload["response_format"] = {"type": "text"}
                    payload["max_completion_tokens"] = gen_kwargs["max_new_tokens"]

                batch_payloads.append(payload)
                batch_responses.append(None)


⚠️ Potential issue

Indexing mismatch corrupts response alignment within batch

Same misalignment as the simple variant: cached responses are appended while payloads skip them, then indices from enumerate(batch_payloads) are used to update batch_responses. This leads to wrong pairings.

-            batch_payloads = []
-            batch_doc_uuids = []
-            batch_responses = []
+            batch_len = len(batch_requests)
+            batch_payloads = [None] * batch_len
+            batch_doc_uuids = [None] * batch_len
+            batch_responses = [None] * batch_len
@@
-            for req in batch_requests:
-                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
+            for i, req in enumerate(batch_requests):
+                _, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
                 doc_uuid = f"{task}___{split}___{doc_id}"
-                batch_doc_uuids.append(doc_uuid)
+                batch_doc_uuids[i] = doc_uuid
@@
-                    if doc_uuid in self.response_cache:
-                        response_text = self.response_cache[doc_uuid]
-                        if response_text:
-                            batch_responses.append(response_text)
-                            continue
+                    if doc_uuid in self.response_cache:
+                        response_text = self.response_cache[doc_uuid]
+                        if response_text:
+                            batch_responses[i] = response_text
+                            continue
@@
-                batch_payloads.append(payload)
-                batch_responses.append(None)
+                batch_payloads[i] = payload
+                # keep batch_responses[i] as None to be filled
@@
-            tasks_to_run = [(payload, i) for i, payload in enumerate(batch_payloads) if batch_responses[i] is None]
+            tasks_to_run = [
+                (payload, i)
+                for i, payload in enumerate(batch_payloads)
+                if payload is not None and batch_responses[i] is None
+            ]

Also applies to: 121-133

🧰 Tools
🪛 Ruff (0.13.1)

47-47: Unpacked variable ctx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
In lmms_eval/models/chat/openai_compatible.py around lines 41-87 (and similarly
at 121-133), cached responses are appended to batch_responses while
corresponding payloads are skipped, breaking alignment when later iterating over
enumerate(batch_payloads). Fix by keeping arrays aligned: when you short-circuit
with a cached response, append a placeholder (e.g., None) to batch_payloads at
the same index (or otherwise ensure one-to-one positions for batch_payloads and
batch_responses), and when sending requests skip None payloads or filter/zip
payloads with their response indices so responses are written back into the
correct positions. Ensure the rest of the code that sends requests and writes
back uses these preserved indices to update batch_responses and the cache
correctly.
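
Stripped of the lmms-eval specifics, the fix the comment asks for is "one positional slot per request": cache hits and fresh API responses are written back by index, so a skipped payload can never shift later responses. A minimal sketch, assuming hypothetical get_cached and call_api helpers (not lmms-eval code):

from typing import Callable, Dict, List, Optional


def answer_batch(
    payloads: List[Dict],
    get_cached: Callable[[Dict], Optional[str]],
    call_api: Callable[[Dict], str],
) -> List[str]:
    # One slot per request: cache hits fill their slot immediately...
    responses: List[Optional[str]] = [get_cached(payload) for payload in payloads]
    # ...and only the positions still holding None are sent to the API.
    pending = [(i, payload) for i, payload in enumerate(payloads) if responses[i] is None]
    for i, payload in pending:
        responses[i] = call_api(payload)  # written back at the original position
    return [r if r is not None else "" for r in responses]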

Comment on lines +95 to +99
                        response = self.client.chat.completions.create(**payload)
                        end_time = time.time()

                        response_text = response.choices[0].message.content
                        latency = end_time - start_time

🛠️ Refactor suggestion

Add per-call timeout to OpenAI request

Prevent indefinite hangs by passing timeout=self.timeout to the API call.

-                        response = self.client.chat.completions.create(**payload)
+                        response = self.client.chat.completions.create(**payload, timeout=self.timeout)
📝 Committable suggestion


Suggested change
-                        response = self.client.chat.completions.create(**payload)
-                        end_time = time.time()
-                        response_text = response.choices[0].message.content
-                        latency = end_time - start_time
+                        response = self.client.chat.completions.create(**payload, timeout=self.timeout)
+                        end_time = time.time()
+                        response_text = response.choices[0].message.content
+                        latency = end_time - start_time
🤖 Prompt for AI Agents
In lmms_eval/models/chat/openai_compatible.py around lines 95 to 99, the OpenAI
API call can hang because it doesn't pass a per-call timeout; update the chat
completion invocation to include timeout=self.timeout (i.e., call
self.client.chat.completions.create(..., timeout=self.timeout)), ensuring the
instance has a numeric self.timeout attribute initialized earlier; keep the rest
of the response handling unchanged.

Comment on lines 98 to 102
self.client = (
OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_API_BASE"), http_client=http_client)
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
if not azure_openai
else AzureOpenAI(api_key=os.getenv("AZURE_OPENAI_API_KEY"), azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"), api_version=os.getenv("AZURE_OPENAI_API_VERSION"), http_client=http_client)
)

🛠️ Refactor suggestion

Add client-level request timeout (avoid indefinite hangs)

External calls lack an explicit timeout. Pass timeout=self.timeout to the OpenAI/AzureOpenAI clients (or per-call) to prevent indefinite blocking.

-            OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
+            OpenAI(api_key=api_key, base_url=base_url, http_client=http_client, timeout=self.timeout)
             if not azure_openai
             else AzureOpenAI(
-                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
-                azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
-                api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
-                http_client=http_client
+                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
+                azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
+                api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
+                http_client=http_client,
+                timeout=self.timeout,
             )
📝 Committable suggestion


Suggested change
self.client = (
OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_API_BASE"), http_client=http_client)
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
if not azure_openai
else AzureOpenAI(api_key=os.getenv("AZURE_OPENAI_API_KEY"), azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"), api_version=os.getenv("AZURE_OPENAI_API_VERSION"), http_client=http_client)
)
self.client = (
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client, timeout=self.timeout)
if not azure_openai
else AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
http_client=http_client,
timeout=self.timeout,
)
)
🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 98 to 102, the
OpenAI/AzureOpenAI client is created without a request timeout which can cause
indefinite hangs; update the client instantiation to pass timeout=self.timeout
(or pass timeout into individual call methods if the client does not accept a
constructor timeout), i.e., include timeout=self.timeout when constructing
OpenAI and AzureOpenAI (or wire self.timeout into the per-request call options)
so all external requests honor the configured timeout.

Comment on lines +125 to +130
    def tok_encode(self, string: str):
        return list(string.encode("utf-8"))

    def tok_decode(self, tokens):
        return ""


⚠️ Potential issue

Fix tok_decode to actually decode; remove unused-arg warning

tok_decode ignores its input and always returns empty string; tokens is flagged unused. Implement the inverse of tok_encode with proper types.

-    def tok_encode(self, string: str):
-        return list(string.encode("utf-8"))
+    def tok_encode(self, string: str) -> List[int]:
+        return list(string.encode("utf-8"))
 
-    def tok_decode(self, tokens):
-        return ""
+    def tok_decode(self, tokens: List[int]) -> str:
+        return bytes(tokens).decode("utf-8", errors="ignore")
📝 Committable suggestion


Suggested change
-    def tok_encode(self, string: str):
-        return list(string.encode("utf-8"))
-
-    def tok_decode(self, tokens):
-        return ""
+    def tok_encode(self, string: str) -> List[int]:
+        return list(string.encode("utf-8"))
+
+    def tok_decode(self, tokens: List[int]) -> str:
+        return bytes(tokens).decode("utf-8", errors="ignore")
🧰 Tools
🪛 Ruff (0.13.1)

128-128: Unused method argument: tokens

(ARG002)

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 125 to 130,
tok_decode currently ignores its input and returns an empty string; implement it
as the inverse of tok_encode by accepting tokens (list/iterable of ints or
bytes/bytearray), converting them to a bytes object and decoding with utf-8 (use
errors="replace" or "strict" per project policy) and return the resulting str so
the tokens argument is used and properly typed.

Comment on lines +197 to +206
        def _collate(x):
            toks = self.tok_encode(x[0])
            return -len(toks), x[0]

        from lmms_eval import utils

        re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True)
        chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
        num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1
        pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding")

⚠️ Potential issue

Ordering bug: results not restored to original order

You reorder with Collator but return res in reordered batch order. Restore original order before returning.

-        pbar.close()
-        return res
+        pbar.close()
+        return re_ords.get_original(res)

Also applies to: 319-324

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 197-206 (and
similarly at 319-324), the code uses utils.Collator to reorder requests for
batching but returns results in the collated (reordered) sequence; restore the
original request order before returning. Capture the original indices from
Collator (or keep a mapping of request -> original position) when creating
re_ords/get_batched, then after processing batches collect outputs into a result
array sized to len(requests) and place each batch's responses at their original
indices; finally return that array in original order (do the same for both code
locations).
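
The reorder-then-restore step the comment describes can be seen in isolation below: requests are sorted (here by descending length) so similar-sized items share a batch, and the outputs are then mapped back to their original positions. This is a toy sketch of the pattern that Collator/get_original implement, not the library code; generate is a stand-in for the batched model call.

from typing import Callable, List


def generate_in_length_order(
    requests: List[str],
    generate: Callable[[List[str]], List[str]],
    batch_size: int,
) -> List[str]:
    # Sort indices by descending request length so similar-sized items batch together.
    order = sorted(range(len(requests)), key=lambda i: -len(requests[i]))
    reordered = [requests[i] for i in order]

    collated: List[str] = []
    for start in range(0, len(reordered), batch_size):
        collated.extend(generate(reordered[start : start + batch_size]))

    # The result at collated position k belongs to the request originally at order[k].
    restored = [""] * len(requests)
    for k, original_index in enumerate(order):
        restored[original_index] = collated[k]
    return restored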

Comment on lines +214 to +221
            batch_payloads = []
            batch_doc_uuids = []
            batch_responses = []

            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
                doc_uuid = f"{task}___{split}___{doc_id_single}"
                batch_doc_uuids.append(doc_uuid)


⚠️ Potential issue

Indexing mismatch corrupts mapping between payloads and responses

batch_responses appends cached entries while batch_payloads skips them, then enumerate(batch_payloads) is used as index into batch_responses. This misalignment will overwrite cached entries and mix up responses.

Apply indexed preallocation and keep indices aligned:

-            batch_payloads = []
-            batch_doc_uuids = []
-            batch_responses = []
+            batch_len = len(contexts)
+            batch_payloads: List[Optional[dict]] = [None] * batch_len
+            batch_doc_uuids: List[Optional[str]] = [None] * batch_len
+            batch_responses: List[Optional[str]] = [None] * batch_len
 
-            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
+            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
                 doc_uuid = f"{task}___{split}___{doc_id_single}"
-                batch_doc_uuids.append(doc_uuid)
+                batch_doc_uuids[i] = doc_uuid
 
                 if self.continual_mode is True and self.cache_mode == "resume":
                     if doc_uuid in self.response_cache:
                         response_text = self.response_cache[doc_uuid]
                         if response_text:
-                            batch_responses.append(response_text)
-                            continue
+                            batch_responses[i] = response_text
+                            continue
@@
-                batch_payloads.append(payload)
-                batch_responses.append(None)
+                batch_payloads[i] = payload
+                # leave batch_responses[i] as None to be filled later
@@
-            tasks_to_run = [(payload, i) for i, payload in enumerate(batch_payloads) if batch_responses[i] is None]
+            tasks_to_run = [
+                (payload, i)
+                for i, payload in enumerate(batch_payloads)
+                if payload is not None and batch_responses[i] is None
+            ]

Also applies to: 222-228, 276-300, 301-312

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 214-221 (and also
apply same fix to 222-228, 276-300, 301-312), the loop builds batch_payloads and
batch_responses out of sync because cached responses are appended to
batch_responses while payloads skip cached entries, then
enumerate(batch_payloads) is used to index into batch_responses causing
misalignment; fix by preallocating placeholders (e.g., initialize
batch_payloads, batch_doc_uuids, batch_responses to the same length with None)
or by maintaining an explicit index/position mapping so that for every input
position you either set payloads[pos]=payload and responses[pos]=None (to be
filled later) or set responses[pos]=cached_value and payloads[pos]=None,
ensuring you never change relative ordering — then when creating request batches
skip None payloads but preserve original indices for mapping responses back, and
when filling responses place them into batch_responses using the original
positions so cached and fresh responses remain aligned.

Comment on lines +99 to 101
                    if fetch_video is None:
                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})

⚠️ Potential issue

Fix None-unpacking bug in video kwargs

If video_kwargs is None (the default), **video_kwargs raises at runtime. Use a safe default.

Apply this diff:

-                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})
+                    video_input = fetch_video({
+                        "type": "video",
+                        "video": content.url,
+                        **(video_kwargs or {}),
+                    })

Additionally, consider updating the signature to reflect Optional and non‑string values (recommended for Pyright):

  • Change to: def to_openai_messages(self, video_kwargs: Optional[Dict[str, Any]] = None):
📝 Committable suggestion


Suggested change
-                    if fetch_video is None:
-                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
-                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})
+                    if fetch_video is None:
+                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
+                    video_input = fetch_video({
+                        "type": "video",
+                        "video": content.url,
+                        **(video_kwargs or {}),
+                    })
🧰 Tools
🪛 Ruff (0.13.1)

100-100: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/protocol.py around lines 99 to 101, the code unpacks video_kwargs
without guarding against None which raises a TypeError at runtime; change the
function signature to use typing Optional and Any (e.g., def
to_openai_messages(self, video_kwargs: Optional[Dict[str, Any]] = None):) and
inside the function replace **video_kwargs with **(video_kwargs or {}) so a safe
empty dict is used when None is passed; also add the necessary typing imports if
not present.
