
Conversation


@b8zhong b8zhong commented Sep 24, 2025

Hello, thanks for this great library. We use it regularly for CI in SGLang. To speed up CI, we want to increase the batch size. However, the batch size in the openai_compatible model was not taking effect, since requests were still being processed one at a time in a loop.

Previously the run took around 5 minutes; now it takes around 2 min 45 s.

export OPENAI_API_BASE="http://0.0.0.0:30000/v1"
export OPENAI_API_KEY="dummy"

python3 -m lmms_eval \
    --model openai_compatible \
    --model_args 'model_version="google/gemma-3-4b-it"' \
    --tasks mmmu_val \
    --batch_size 256 \
    --log_samples \
    --log_samples_suffix openai_compatible \
    --output_path ./logs
/sgl-workspace/sglang/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
2025-09-23 18:39:31 | INFO     | __main__:cli_evaluate:309 - Verbosity set to INFO
WARNING:accelerate.utils.other:Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-09-23 18:39:33 | INFO     | __main__:cli_evaluate_single:398 - Evaluation tracker args: {'output_path': './logs'}
2025-09-23 18:39:33 | INFO     | __main__:cli_evaluate_single:470 - Selected Tasks: ['mmmu_val']
2025-09-23 18:39:33 | INFO     | lmms_eval.evaluator:simple_evaluate:161 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
WARNING:accelerate.utils.other:Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-09-23 18:39:34 | INFO     | lmms_eval.evaluator:evaluate:402 - Running on rank 0 (local rank 0)
2025-09-23 18:39:34 | INFO     | lmms_eval.api.task:build_all_requests:427 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 14156.13it/s]
2025-09-23 18:39:35 | INFO     | lmms_eval.evaluator:evaluate:490 - Running generate_until requests
Model Responding: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:32<00:00, 30.04s/it]2025-09-23 18:42:07 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 1975.575s, Total tokens: 2215, Avg speed: 1.1 tokens/s
Model Responding: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:32<00:00, 38.01s/it]
Postprocessing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 8340.40it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.56667}, 'Art': {'num': 30, 'acc': 0.56667}, 'Art_Theory': {'num': 30, 'acc': 0.7}, 'Design': {'num': 30, 'acc': 0.63333}, 'Music': {'num': 30, 'acc': 0.36667}, 'Overall-Business': {'num': 150, 'acc': 0.35333}, 'Accounting': {'num': 30, 'acc': 0.43333}, 'Economics': {'num': 30, 'acc': 0.3}, 'Finance': {'num': 30, 'acc': 0.23333}, 'Manage': {'num': 30, 'acc': 0.33333}, 'Marketing': {'num': 30, 'acc': 0.46667}, 'Overall-Science': {'num': 150, 'acc': 0.35333}, 'Biology': {'num': 30, 'acc': 0.4}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.4}, 'Math': {'num': 30, 'acc': 0.36667}, 'Physics': {'num': 30, 'acc': 0.3}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.52}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.63333}, 'Clinical_Medicine': {'num': 30, 'acc': 0.4}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.43333}, 'Pharmacy': {'num': 30, 'acc': 0.56667}, 'Public_Health': {'num': 30, 'acc': 0.56667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.725}, 'History': {'num': 30, 'acc': 0.73333}, 'Literature': {'num': 30, 'acc': 0.86667}, 'Sociology': {'num': 30, 'acc': 0.63333}, 'Psychology': {'num': 30, 'acc': 0.66667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.37143}, 'Agriculture': {'num': 30, 'acc': 0.56667}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.36667}, 'Computer_Science': {'num': 30, 'acc': 0.36667}, 'Electronics': {'num': 30, 'acc': 0.2}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.36667}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.33333}, 'Overall': {'num': 900, 'acc': 0.46333}}
2025-09-23 18:42:07 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:188 - Saving results aggregated
2025-09-23 18:42:07 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_samples:287 - Saving samples to logs/__google__gemma-3-4b-it__/20250924_093933_samples_mmmu_val.jsonl
openai_compatible (model_version="google/gemma-3-4b-it"), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 256
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.4633|±  |   N/A|

Summary by CodeRabbit

  • New Features

    • Concurrent batched request processing with per-batch caching.
    • Configurable base URL, API key, and batch size with URL normalization.
    • Metrics aggregation and persistent caching in continual mode.
  • Refactor

    • Parallel dispatch with per-batch retries and updated progress to track batches.
    • Batch-level token/latency accounting and payload adjustments for specialized models.
  • Bug Fixes

    • Graceful handling when video utilities are missing, providing a clear install error.

Contributor

coderabbitai bot commented Sep 24, 2025

Walkthrough

Implements batched, concurrent request processing with per-batch retries, caching, and metrics in OpenAI-compatible chat/simple models. Adds new initialization parameters and helpers in the simple model. Updates protocol video handling with optional dependency import and a guarded error when unavailable.

Changes

  • Batched concurrent processing (chat) (lmms_eval/models/chat/openai_compatible.py): Refactors the per-request loop into batched parallel processing using ThreadPoolExecutor; per-batch retries, batch-level caching (resume), aggregated latency/token accounting, progress tracked by batches, and persisted cache updates.
  • Public API + batching (simple) (lmms_eval/models/simple/openai_compatible.py): Adds batching with parallel dispatch; new init params (base_url, api_key, batch_size); URL normalization via unquote; client init from provided values or env vars; continual cache persistence; new properties/methods batch_size, tok_encode, tok_decode, eot_token_id, and rank; new imports for concurrency and URL decoding.
  • Optional video dependency guard (protocol) (lmms_eval/protocol.py): Safely imports fetch_video; ChatMessages.to_openai_messages raises an explicit ImportError if fetch_video is missing, otherwise processes video frames as before.
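
The chat-model refactor above follows a standard fan-out shape: chunk the requests, send each chunk through a thread pool, retry failures per item, and write results back by index. A minimal, self-contained sketch of that pattern is below; run_batched, _call_with_retries, and call_api are illustrative placeholders, not lmms-eval APIs.

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List


def _call_with_retries(call_api: Callable[[str], str], request: str, max_retries: int) -> str:
    # Retry a single API call a few times before giving up with an error string.
    for attempt in range(max_retries):
        try:
            return call_api(request)
        except Exception as exc:  # real code should catch narrower API/network errors
            if attempt == max_retries - 1:
                return f"error: {exc}"
    return ""


def run_batched(
    requests: List[str],
    call_api: Callable[[str], str],
    batch_size: int = 8,
    max_retries: int = 3,
) -> List[str]:
    """Dispatch requests batch by batch; calls within a batch run concurrently."""
    results: List[str] = [""] * len(requests)
    for start in range(0, len(requests), batch_size):
        batch = list(enumerate(requests))[start : start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            futures = {
                pool.submit(_call_with_retries, call_api, request, max_retries): index
                for index, request in batch
            }
            for future in as_completed(futures):
                results[futures[future]] = future.result()  # write back by original index
    return results

Each batch then costs roughly one slowest-call latency instead of the sum of all per-request latencies, which is what turns the roughly 5 minute sequential run into the roughly 2 min 45 s reported in the PR description.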

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Model as OpenAICompatible.generate_until
  participant Batcher
  participant Exec as ThreadPoolExecutor
  participant API as OpenAI API
  participant Cache

  Caller->>Model: generate_until(requests)
  Model->>Batcher: group into batches
  loop per batch
    Model->>Cache: check resume cache
    alt cache hit
      Cache-->>Model: cached responses
    else cache miss
      Model->>Exec: submit per-payload tasks
      par workers
        Exec->>API: send payload
        API-->>Exec: response/usage
      end
      Exec-->>Model: collected results
      Model->>Cache: persist results (if continual)
    end
    Model->>Model: accumulate tokens/latency
  end
  Model-->>Caller: all responses + metrics
sequenceDiagram
  autonumber
  participant ChatMsg as ChatMessages
  participant Proto as protocol.py
  participant Utils as qwen_vl_utils.fetch_video

  ChatMsg->>Proto: to_openai_messages(video content)
  alt fetch_video available
    Proto->>Utils: load frames
    Utils-->>Proto: frames (PNG data URLs)
    Proto-->>ChatMsg: OpenAI-formatted messages
  else missing dependency
    Proto-->>ChatMsg: raise ImportError
  end
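
The else branch in this diagram is the usual optional-dependency guard: import at module load if possible, keep a None sentinel otherwise, and raise an actionable ImportError at the point of use. A generic sketch of that pattern (not the exact protocol.py code; load_video_frames is a made-up wrapper) looks like this:

from typing import Any, Callable, Dict, Optional

try:
    from qwen_vl_utils import fetch_video as _fetch_video  # optional video dependency
except ImportError:
    _fetch_video = None

fetch_video: Optional[Callable[..., Any]] = _fetch_video


def load_video_frames(url: str, video_kwargs: Optional[Dict[str, Any]] = None) -> Any:
    # Fail loudly with an actionable message instead of a NameError deep in the call stack.
    if fetch_video is None:
        raise ImportError("qwen_vl_utils is required for video processing; install qwen-vl-utils")
    return fetch_video({"type": "video", "video": url, **(video_kwargs or {})})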

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Luodian

Poem

A rabbit batches hops by twos,
Queues of carrots, parallel chews.
Cache my crumbs, resume my run,
Threads all weaving, work well done.
If video’s shy, I wisely say:
“Bring the tools—or hop away!” 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title "[fix] batch size in openai compatible endpoint" is concise and accurately describes the primary change: fixing batch-size handling for the openai_compatible endpoint so requests are processed in batches rather than in a per-request loop, which matches the changes in the diff and the reported CI speedup.
  • Description Check: ✅ Passed. The PR description contains a detailed explanation of the problem, the fix, example usage, environment variables, and runtime logs demonstrating the performance improvement, which satisfies the repository template's requirement for a detailed description even though the checklist boxes are not explicitly checked.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.


@b8zhong
Author

b8zhong commented Sep 24, 2025

@kcz358 @Luodian If you could take a look, that would be great. Thank you!

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lmms_eval/models/simple/openai_compatible.py (1)

164-186: Guard video encoding when decord is unavailable

If decord isn’t installed, VideoReader is undefined and will crash with NameError. Raise a clear ImportError.

     def encode_video(self, video_path, for_get_frames_num):
+        if "VideoReader" not in globals():
+            raise ImportError("decord is required for video processing. Install with: pip install decord")
         vr = VideoReader(video_path, ctx=cpu(0))
🧹 Nitpick comments (12)
lmms_eval/protocol.py (2)

100-100: Shorten exception message to satisfy Ruff TRY003 and 88‑char limit

Keep the message concise; installation guidance belongs in docs/logs.

Apply this diff:

-                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
+                        raise ImportError(
+                            "qwen_vl_utils is required for video processing; install qwen-vl-utils"
+                        )

10-14: Guard decord import in lmms_eval/protocol.py and mark fetch_video as Optional

  • decord is imported at lmms_eval/protocol.py:7 — make it optional (try/except) or lazy-import inside the functions that use VideoReader to avoid breaking installs without video deps.
  • Annotate fetch_video as Optional (e.g., from typing import Optional, Callable; fetch_video: Optional[Callable[..., Any]] = None) so Pyright knows it may be None.
  • Repo scan shows many other files import decord at module import time — consider applying the same guard across video-related modules.
lmms_eval/models/simple/openai_compatible.py (6)

36-38: Use Optional typing for nullable params and import Optional

Type hints should not use implicit Optional. Please annotate base_url and api_key as Optional[str] and import Optional.

-from typing import List, Tuple, Union
+from typing import List, Tuple, Union, Optional
...
-        base_url: str = None,
-        api_key: str = None,
+        base_url: Optional[str] = None,
+        api_key: Optional[str] = None,

Also applies to: 7-7


121-124: Annotate property return type

Return type hints are required by the guidelines.

-    def batch_size(self):
+    def batch_size(self) -> int:
         return self.batch_size_per_gpu

131-138: Annotate property return types

Add explicit return types for properties per guidelines.

-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
         return 0
...
-    def rank(self):
+    def rank(self) -> int:
         return self._rank

229-246: File type checks: prefer robust suffix-based detection

Use Path(visual).suffix.lower() over substring checks to avoid false matches and handle uppercase extensions.

+from pathlib import Path
...
-                        if isinstance(visual, str) and (".mp4" in visual or ".avi" in visual or ".mov" in visual or ".flv" in visual or ".wmv" in visual):
+                        if isinstance(visual, str) and Path(visual).suffix.lower() in {".mp4", ".avi", ".mov", ".flv", ".wmv"}:
@@
-                        elif isinstance(visual, str) and (".jpg" in visual or ".jpeg" in visual or ".png" in visual or ".gif" in visual or ".bmp" in visual or ".tiff" in visual or ".webp" in visual):
+                        elif isinstance(visual, str) and Path(visual).suffix.lower() in {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}:

289-299: Avoid blind except Exception; consider narrower exceptions or log type

Catching all exceptions hides actionable errors. At minimum, log exception type; preferably catch OpenAI-specific exceptions.

-                    except Exception as e:
-                        error_msg = str(e)
+                    except Exception as e:
+                        error_msg = f"{e.__class__.__name__}: {e}"

50-55: Docstrings for public API

Add class/method docstrings for this public model and generate_until per guidelines.

lmms_eval/models/chat/openai_compatible.py (4)

47-48: Remove unused variable ctx

ctx is unpacked but unused. Prefix with _ or drop it.

-                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
+                _, doc_to_messages, gen_kwargs, doc_id, task, split = req.args

109-118: Avoid blind except Exception; at least log exception type

Prefer catching OpenAI-specific exceptions or log the exception class name to aid triage.

-                    except Exception as e:
-                        error_msg = str(e)
+                    except Exception as e:
+                        error_msg = f"{e.__class__.__name__}: {e}"

34-37: Prefer using self.batch_size accessor

Use the accessor added in the simple model for consistency.

-        batch_size = getattr(self, "batch_size_per_gpu", 1)
+        batch_size = getattr(self, "batch_size", getattr(self, "batch_size_per_gpu", 1))

28-33: Docstring for public API method

Add a brief docstring for generate_until explaining batched behavior and return order guarantees.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 036637e and 0bb1001.

📒 Files selected for processing (3)
  • lmms_eval/models/chat/openai_compatible.py (2 hunks)
  • lmms_eval/models/simple/openai_compatible.py (7 hunks)
  • lmms_eval/protocol.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/protocol.py
  • lmms_eval/models/chat/openai_compatible.py
  • lmms_eval/models/simple/openai_compatible.py
🧬 Code graph analysis (2)
lmms_eval/models/chat/openai_compatible.py (5)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/models/simple/openai_compatible.py (3)
  • batch_size (122-123)
  • rank (136-137)
  • process_single_request (279-299)
lmms_eval/api/task.py (1)
  • doc_to_messages (1648-1680)
lmms_eval/api/instance.py (1)
  • args (25-29)
lmms_eval/protocol.py (2)
  • ChatMessages (45-122)
  • to_openai_messages (89-109)
lmms_eval/models/simple/openai_compatible.py (3)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
lmms_eval/api/instance.py (1)
  • args (25-29)
lmms_eval/models/chat/openai_compatible.py (1)
  • process_single_request (88-119)
🪛 Ruff (0.13.1)
lmms_eval/protocol.py

100-100: Avoid specifying long messages outside the exception class

(TRY003)

lmms_eval/models/chat/openai_compatible.py

47-47: Unpacked variable ctx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Function definition does not bind loop variable batch_responses

(B023)


90-90: Function definition does not bind loop variable batch_responses

(B023)


107-107: Consider moving this statement to an else block

(TRY300)


109-109: Do not catch blind exception: Exception

(BLE001)

lmms_eval/models/simple/openai_compatible.py

36-36: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


37-37: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


128-128: Unused method argument: tokens

(ARG002)


280-280: Function definition does not bind loop variable batch_responses

(B023)


281-281: Function definition does not bind loop variable batch_responses

(B023)


287-287: Consider moving this statement to an else block

(TRY300)


289-289: Do not catch blind exception: Exception

(BLE001)

Comment on lines +41 to +87
        for batch_requests in batched_requests:
            batch_payloads = []
            batch_doc_uuids = []
            batch_responses = []

            for req in batch_requests:
                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
                doc_uuid = f"{task}___{split}___{doc_id}"
                self.response_cache[doc_uuid] = response_text
                batch_doc_uuids.append(doc_uuid)

                if self.continual_mode is True and self.cache_mode == "resume":
                    if doc_uuid in self.response_cache:
                        response_text = self.response_cache[doc_uuid]
                        if response_text:
                            batch_responses.append(response_text)
                            continue

                chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
                chat_messages: ChatMessages = ChatMessages(**{"messages": chat_messages_raw})

                payload = {"messages": chat_messages.to_openai_messages()}
                payload["model"] = self.model_version

                if "max_new_tokens" not in gen_kwargs:
                    gen_kwargs["max_new_tokens"] = 1024
                if gen_kwargs["max_new_tokens"] > 4096:
                    gen_kwargs["max_new_tokens"] = 4096
                if "temperature" not in gen_kwargs:
                    gen_kwargs["temperature"] = 0
                if "top_p" not in gen_kwargs:
                    gen_kwargs["top_p"] = None
                if "num_beams" not in gen_kwargs:
                    gen_kwargs["num_beams"] = 1

                payload["max_tokens"] = gen_kwargs["max_new_tokens"]
                payload["temperature"] = gen_kwargs["temperature"]

                if "o1" in self.model_version or "o3" in self.model_version or "o4" in self.model_version:
                    del payload["temperature"]
                    payload.pop("max_tokens")
                    payload["reasoning_effort"] = "medium"
                    payload["response_format"] = {"type": "text"}
                    payload["max_completion_tokens"] = gen_kwargs["max_new_tokens"]

                batch_payloads.append(payload)
                batch_responses.append(None)


⚠️ Potential issue

Indexing mismatch corrupts response alignment within batch

Same misalignment as the simple variant: cached responses are appended while payloads skip them, then indices from enumerate(batch_payloads) are used to update batch_responses. This leads to wrong pairings.

-            batch_payloads = []
-            batch_doc_uuids = []
-            batch_responses = []
+            batch_len = len(batch_requests)
+            batch_payloads = [None] * batch_len
+            batch_doc_uuids = [None] * batch_len
+            batch_responses = [None] * batch_len
@@
-            for req in batch_requests:
-                ctx, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
+            for i, req in enumerate(batch_requests):
+                _, doc_to_messages, gen_kwargs, doc_id, task, split = req.args
                 doc_uuid = f"{task}___{split}___{doc_id}"
-                batch_doc_uuids.append(doc_uuid)
+                batch_doc_uuids[i] = doc_uuid
@@
-                    if doc_uuid in self.response_cache:
-                        response_text = self.response_cache[doc_uuid]
-                        if response_text:
-                            batch_responses.append(response_text)
-                            continue
+                    if doc_uuid in self.response_cache:
+                        response_text = self.response_cache[doc_uuid]
+                        if response_text:
+                            batch_responses[i] = response_text
+                            continue
@@
-                batch_payloads.append(payload)
-                batch_responses.append(None)
+                batch_payloads[i] = payload
+                # keep batch_responses[i] as None to be filled
@@
-            tasks_to_run = [(payload, i) for i, payload in enumerate(batch_payloads) if batch_responses[i] is None]
+            tasks_to_run = [
+                (payload, i)
+                for i, payload in enumerate(batch_payloads)
+                if payload is not None and batch_responses[i] is None
+            ]

Also applies to: 121-133

🧰 Tools
🪛 Ruff (0.13.1)

47-47: Unpacked variable ctx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
In lmms_eval/models/chat/openai_compatible.py around lines 41-87 (and similarly
at 121-133), cached responses are appended to batch_responses while
corresponding payloads are skipped, breaking alignment when later iterating over
enumerate(batch_payloads). Fix by keeping arrays aligned: when you short-circuit
with a cached response, append a placeholder (e.g., None) to batch_payloads at
the same index (or otherwise ensure one-to-one positions for batch_payloads and
batch_responses), and when sending requests skip None payloads or filter/zip
payloads with their response indices so responses are written back into the
correct positions. Ensure the rest of the code that sends requests and writes
back uses these preserved indices to update batch_responses and the cache
correctly.
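
Stripped of the lmms-eval specifics, the fix the comment asks for is "one positional slot per request": cache hits and fresh API responses are written back by index, so a skipped payload can never shift later responses. A minimal sketch, assuming hypothetical get_cached and call_api helpers (not lmms-eval code):

from typing import Callable, Dict, List, Optional


def answer_batch(
    payloads: List[Dict],
    get_cached: Callable[[Dict], Optional[str]],
    call_api: Callable[[Dict], str],
) -> List[str]:
    # One slot per request: cache hits fill their slot immediately...
    responses: List[Optional[str]] = [get_cached(payload) for payload in payloads]
    # ...and only the positions still holding None are sent to the API.
    pending = [(i, payload) for i, payload in enumerate(payloads) if responses[i] is None]
    for i, payload in pending:
        responses[i] = call_api(payload)  # written back at the original position
    return [r if r is not None else "" for r in responses]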

Comment on lines +95 to +99
                        response = self.client.chat.completions.create(**payload)
                        end_time = time.time()

                        response_text = response.choices[0].message.content
                        latency = end_time - start_time

🛠️ Refactor suggestion

Add per-call timeout to OpenAI request

Prevent indefinite hangs by passing timeout=self.timeout to the API call.

-                        response = self.client.chat.completions.create(**payload)
+                        response = self.client.chat.completions.create(**payload, timeout=self.timeout)
📝 Committable suggestion


Suggested change
-                        response = self.client.chat.completions.create(**payload)
-                        end_time = time.time()
-                        response_text = response.choices[0].message.content
-                        latency = end_time - start_time
+                        response = self.client.chat.completions.create(**payload, timeout=self.timeout)
+                        end_time = time.time()
+                        response_text = response.choices[0].message.content
+                        latency = end_time - start_time
🤖 Prompt for AI Agents
In lmms_eval/models/chat/openai_compatible.py around lines 95 to 99, the OpenAI
API call can hang because it doesn't pass a per-call timeout; update the chat
completion invocation to include timeout=self.timeout (i.e., call
self.client.chat.completions.create(..., timeout=self.timeout)), ensuring the
instance has a numeric self.timeout attribute initialized earlier; keep the rest
of the response handling unchanged.

Comment on lines 98 to 102
self.client = (
OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_API_BASE"), http_client=http_client)
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
if not azure_openai
else AzureOpenAI(api_key=os.getenv("AZURE_OPENAI_API_KEY"), azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"), api_version=os.getenv("AZURE_OPENAI_API_VERSION"), http_client=http_client)
)

🛠️ Refactor suggestion

Add client-level request timeout (avoid indefinite hangs)

External calls lack an explicit timeout. Pass timeout=self.timeout to the OpenAI/AzureOpenAI clients (or per-call) to prevent indefinite blocking.

-            OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
+            OpenAI(api_key=api_key, base_url=base_url, http_client=http_client, timeout=self.timeout)
             if not azure_openai
             else AzureOpenAI(
-                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
-                azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
-                api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
-                http_client=http_client
+                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
+                azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
+                api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
+                http_client=http_client,
+                timeout=self.timeout,
             )
📝 Committable suggestion


Suggested change
self.client = (
OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_API_BASE"), http_client=http_client)
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client)
if not azure_openai
else AzureOpenAI(api_key=os.getenv("AZURE_OPENAI_API_KEY"), azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"), api_version=os.getenv("AZURE_OPENAI_API_VERSION"), http_client=http_client)
)
self.client = (
OpenAI(api_key=api_key, base_url=base_url, http_client=http_client, timeout=self.timeout)
if not azure_openai
else AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
azure_endpoint=os.getenv("AZURE_OPENAI_API_BASE"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
http_client=http_client,
timeout=self.timeout,
)
)
🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 98 to 102, the
OpenAI/AzureOpenAI client is created without a request timeout which can cause
indefinite hangs; update the client instantiation to pass timeout=self.timeout
(or pass timeout into individual call methods if the client does not accept a
constructor timeout), i.e., include timeout=self.timeout when constructing
OpenAI and AzureOpenAI (or wire self.timeout into the per-request call options)
so all external requests honor the configured timeout.

Comment on lines +125 to +130
    def tok_encode(self, string: str):
        return list(string.encode("utf-8"))

    def tok_decode(self, tokens):
        return ""


⚠️ Potential issue

Fix tok_decode to actually decode; remove unused-arg warning

tok_decode ignores its input and always returns empty string; tokens is flagged unused. Implement the inverse of tok_encode with proper types.

-    def tok_encode(self, string: str):
-        return list(string.encode("utf-8"))
+    def tok_encode(self, string: str) -> List[int]:
+        return list(string.encode("utf-8"))
 
-    def tok_decode(self, tokens):
-        return ""
+    def tok_decode(self, tokens: List[int]) -> str:
+        return bytes(tokens).decode("utf-8", errors="ignore")
📝 Committable suggestion


Suggested change
-    def tok_encode(self, string: str):
-        return list(string.encode("utf-8"))
-
-    def tok_decode(self, tokens):
-        return ""
+    def tok_encode(self, string: str) -> List[int]:
+        return list(string.encode("utf-8"))
+
+    def tok_decode(self, tokens: List[int]) -> str:
+        return bytes(tokens).decode("utf-8", errors="ignore")
🧰 Tools
🪛 Ruff (0.13.1)

128-128: Unused method argument: tokens

(ARG002)

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 125 to 130,
tok_decode currently ignores its input and returns an empty string; implement it
as the inverse of tok_encode by accepting tokens (list/iterable of ints or
bytes/bytearray), converting them to a bytes object and decoding with utf-8 (use
errors="replace" or "strict" per project policy) and return the resulting str so
the tokens argument is used and properly typed.

Comment on lines +197 to +206
        def _collate(x):
            toks = self.tok_encode(x[0])
            return -len(toks), x[0]

        from lmms_eval import utils

        re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True)
        chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
        num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1
        pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding")

⚠️ Potential issue

Ordering bug: results not restored to original order

You reorder with Collator but return res in reordered batch order. Restore original order before returning.

-        pbar.close()
-        return res
+        pbar.close()
+        return re_ords.get_original(res)

Also applies to: 319-324

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 197-206 (and
similarly at 319-324), the code uses utils.Collator to reorder requests for
batching but returns results in the collated (reordered) sequence; restore the
original request order before returning. Capture the original indices from
Collator (or keep a mapping of request -> original position) when creating
re_ords/get_batched, then after processing batches collect outputs into a result
array sized to len(requests) and place each batch's responses at their original
indices; finally return that array in original order (do the same for both code
locations).
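
The reorder-then-restore step the comment describes can be seen in isolation below: requests are sorted (here by descending length) so similar-sized items share a batch, and the outputs are then mapped back to their original positions. This is a toy sketch of the pattern that Collator/get_original implement, not the library code; generate is a stand-in for the batched model call.

from typing import Callable, List


def generate_in_length_order(
    requests: List[str],
    generate: Callable[[List[str]], List[str]],
    batch_size: int,
) -> List[str]:
    # Sort indices by descending request length so similar-sized items batch together.
    order = sorted(range(len(requests)), key=lambda i: -len(requests[i]))
    reordered = [requests[i] for i in order]

    collated: List[str] = []
    for start in range(0, len(reordered), batch_size):
        collated.extend(generate(reordered[start : start + batch_size]))

    # The result at collated position k belongs to the request originally at order[k].
    restored = [""] * len(requests)
    for k, original_index in enumerate(order):
        restored[original_index] = collated[k]
    return restored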

Comment on lines +214 to +221
            batch_payloads = []
            batch_doc_uuids = []
            batch_responses = []

            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
                doc_uuid = f"{task}___{split}___{doc_id_single}"
                batch_doc_uuids.append(doc_uuid)


⚠️ Potential issue

Indexing mismatch corrupts mapping between payloads and responses

batch_responses appends cached entries while batch_payloads skips them, then enumerate(batch_payloads) is used as index into batch_responses. This misalignment will overwrite cached entries and mix up responses.

Apply indexed preallocation and keep indices aligned:

-            batch_payloads = []
-            batch_doc_uuids = []
-            batch_responses = []
+            batch_len = len(contexts)
+            batch_payloads: List[Optional[dict]] = [None] * batch_len
+            batch_doc_uuids: List[Optional[str]] = [None] * batch_len
+            batch_responses: List[Optional[str]] = [None] * batch_len
 
-            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
+            for i, (context, doc_id_single) in enumerate(zip(contexts, doc_id)):
                 doc_uuid = f"{task}___{split}___{doc_id_single}"
-                batch_doc_uuids.append(doc_uuid)
+                batch_doc_uuids[i] = doc_uuid
 
                 if self.continual_mode is True and self.cache_mode == "resume":
                     if doc_uuid in self.response_cache:
                         response_text = self.response_cache[doc_uuid]
                         if response_text:
-                            batch_responses.append(response_text)
-                            continue
+                            batch_responses[i] = response_text
+                            continue
@@
-                batch_payloads.append(payload)
-                batch_responses.append(None)
+                batch_payloads[i] = payload
+                # leave batch_responses[i] as None to be filled later
@@
-            tasks_to_run = [(payload, i) for i, payload in enumerate(batch_payloads) if batch_responses[i] is None]
+            tasks_to_run = [
+                (payload, i)
+                for i, payload in enumerate(batch_payloads)
+                if payload is not None and batch_responses[i] is None
+            ]

Also applies to: 222-228, 276-300, 301-312

🤖 Prompt for AI Agents
In lmms_eval/models/simple/openai_compatible.py around lines 214-221 (and also
apply same fix to 222-228, 276-300, 301-312), the loop builds batch_payloads and
batch_responses out of sync because cached responses are appended to
batch_responses while payloads skip cached entries, then
enumerate(batch_payloads) is used to index into batch_responses causing
misalignment; fix by preallocating placeholders (e.g., initialize
batch_payloads, batch_doc_uuids, batch_responses to the same length with None)
or by maintaining an explicit index/position mapping so that for every input
position you either set payloads[pos]=payload and responses[pos]=None (to be
filled later) or set responses[pos]=cached_value and payloads[pos]=None,
ensuring you never change relative ordering — then when creating request batches
skip None payloads but preserve original indices for mapping responses back, and
when filling responses place them into batch_responses using the original
positions so cached and fresh responses remain aligned.

Comment on lines +99 to 101
                    if fetch_video is None:
                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})

⚠️ Potential issue

Fix None-unpacking bug in video kwargs

If video_kwargs is None (the default), **video_kwargs raises at runtime. Use a safe default.

Apply this diff:

-                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})
+                    video_input = fetch_video({
+                        "type": "video",
+                        "video": content.url,
+                        **(video_kwargs or {}),
+                    })

Additionally, consider updating the signature to reflect Optional and non‑string values (recommended for Pyright):

  • Change to: def to_openai_messages(self, video_kwargs: Optional[Dict[str, Any]] = None):
📝 Committable suggestion


Suggested change
-                    if fetch_video is None:
-                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
-                    video_input = fetch_video({"type": "video", "video": content.url, **video_kwargs})
+                    if fetch_video is None:
+                        raise ImportError("qwen_vl_utils is required for video processing. Please install it with: pip install qwen-vl-utils")
+                    video_input = fetch_video({
+                        "type": "video",
+                        "video": content.url,
+                        **(video_kwargs or {}),
+                    })
🧰 Tools
🪛 Ruff (0.13.1)

100-100: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/protocol.py around lines 99 to 101, the code unpacks video_kwargs
without guarding against None which raises a TypeError at runtime; change the
function signature to use typing Optional and Any (e.g., def
to_openai_messages(self, video_kwargs: Optional[Dict[str, Any]] = None):) and
inside the function replace **video_kwargs with **(video_kwargs or {}) so a safe
empty dict is used when None is passed; also add the necessary typing imports if
not present.
