
Conversation

@vincentkoc
Member

@vincentkoc vincentkoc commented Oct 1, 2025

Details

Introduces a multimodal optimizer using EvolutionaryOptimizer and the DHPR dashcam dataset from Hugging Face.

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-

Testing

Documentation


Note

Introduces end-to-end multimodal support (text+image) across optimizer and evaluation, adds DHPR driving-hazard dataset and a vision LLM judge, image utilities, and updates mutation/crossover to preserve images.

  • Optimizer (Evolutionary):
    • Multimodal-aware mutations and crossovers: extract text from structured content and rebuild while preserving image_url parts.
    • LLM-driven prompts updated with multimodal guidance and stricter JSON I/O; improved JSON parsing in population initialization.
    • Reasoning calls use higher max_tokens to avoid truncation.
  • Datasets:
    • New DHPR loaders: driving_hazard_50, driving_hazard_100, driving_hazard_test; exported in datasets/__init__.py.
  • Metrics:
    • New MultimodalLLMJudge metric to evaluate outputs with image context; exported via metrics/__init__.py.
  • Prompt Config:
    • ChatPrompt now supports structured multimodal content (validation, substitution, serialization).
  • Evaluation Runtime:
    • Added model capability detection (model_capabilities) and message rendering (message_renderer) to handle structured content and flatten for non-vision models; evaluator uses these.
  • Utils:
    • New image_helpers for base64 encode/decode, structured content creation, validation, token estimation.
  • Examples:
    • Added litellm_evolutionary_hazard_detection_example.py demonstrating multimodal EO on DHPR with custom judge.
  • Tests:
    • Unit tests for ChatPrompt multimodal handling, image utilities, mutation/crossover helpers, and message rendering.
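
The image-preserving rewrite described above can be sketched as follows. This is a minimal illustration assuming OpenAI-style structured content; the helper names (extract_text_parts, rebuild_content) are hypothetical, not the PR's actual API.

```python
from typing import Any, List, Union

Content = Union[str, List[dict]]

def extract_text_parts(content: Content) -> str:
    """Collect the text segments of a structured content list."""
    if isinstance(content, str):
        return content
    # Join with newlines (not spaces) to preserve separation between parts.
    return "\n".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

def rebuild_content(original: Content, mutated_text: str) -> Content:
    """Swap in mutated text while keeping every image_url part intact."""
    if isinstance(original, str):
        return mutated_text
    rebuilt: List[dict] = [{"type": "text", "text": mutated_text}]
    rebuilt.extend(p for p in original if p.get("type") == "image_url")
    return rebuilt

content = [
    {"type": "text", "text": "Describe the hazard."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]
mutated = rebuild_content(content, extract_text_parts(content).upper())
```

The point of the pattern is that mutation and crossover only ever see (and edit) the extracted text, so image parts can never be corrupted by the text operators.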

Written by Cursor Bugbot for commit b848387. This will update automatically on new commits.

@vincentkoc vincentkoc changed the base branch from main to feat/mcp-upflift October 1, 2025 01:02
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@vincentkoc vincentkoc marked this pull request as ready for review October 1, 2025 01:03
@vincentkoc vincentkoc requested review from a team and dsblank as code owners October 1, 2025 01:03
@github-actions
Contributor

github-actions bot commented Oct 1, 2025

SDK E2E Tests Results

68 tests in 1 suite (1 file): 67 ✅ passed, 0 💤 skipped, 1 ❌ failed, in 1m 56s ⏱️

For more details on these failures, see this check.

Results for commit b848387.

Base automatically changed from feat/mcp-upflift to main October 1, 2025 14:31
Copilot AI review requested due to automatic review settings October 2, 2025 14:48
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces comprehensive multimodal support for the Opik optimizer ecosystem, enabling end-to-end optimization and evaluation of prompts that include both text and images. The implementation adds a complete multimodal workflow from dataset loading through optimization to evaluation, with specific focus on driving hazard detection using dashcam imagery.

Key changes:

  • Multimodal infrastructure: Adds structured content support, image utilities, and vision model capabilities
  • Evolutionary optimizer enhancements: Updates mutation/crossover operations to preserve images while evolving text prompts
  • New datasets and metrics: Introduces DHPR driving hazard dataset and multimodal LLM judge evaluation

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Summary per file:

sdks/python/src/opik/evaluation/models/model_capabilities.py: Vision model detection with comprehensive model support list
sdks/python/src/opik/evaluation/models/message_renderer.py: Message content rendering for vision vs non-vision models
sdks/python/src/opik/evaluation/evaluator.py: Integration of multimodal rendering in evaluation pipeline
sdks/opik_optimizer/src/opik_optimizer/optimization_config/chat_prompt.py: Structured content validation and template substitution
sdks/opik_optimizer/src/opik_optimizer/utils/image_helpers.py: Complete image processing utilities for multimodal content
sdks/opik_optimizer/src/opik_optimizer/metrics/multimodal_llm_judge.py: Vision-capable LLM judge for multimodal evaluation
sdks/opik_optimizer/src/opik_optimizer/evolutionary_optimizer/: Updated mutation/crossover operations with multimodal awareness
sdks/opik_optimizer/src/opik_optimizer/datasets/driving_hazard.py: DHPR dataset loader with image encoding and processing
sdks/opik_optimizer/scripts/litellm_evolutionary_hazard_detection_example.py: Complete example demonstrating multimodal optimization
Test files: Comprehensive unit tests for all new multimodal functionality
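
The base64 utilities that image_helpers provides can be illustrated roughly as follows; the function names and signatures here are assumptions for illustration, not the PR's exact API.

```python
import base64

def encode_image_to_data_uri(data: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a base64 data URI usable as an image_url value."""
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")

def decode_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    _, _, payload = uri.partition(";base64,")
    return base64.b64decode(payload)

raw = b"\xff\xd8\xff\xe0 fake jpeg bytes"  # placeholder, not a real JPEG
uri = encode_image_to_data_uri(raw)
```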

@@ -1,3 +1,4 @@
from typing import Any, TYPE_CHECKING, Union, List, Dict
from typing import Any, TYPE_CHECKING
Copilot AI Oct 2, 2025
There are duplicate imports from typing. The first import includes Union, List, Dict which are already imported in the second line. Remove the duplicate second import.

Suggested change
from typing import Any, TYPE_CHECKING

Comment on lines 55 to 57
# For reasoning calls (prompt generation), use higher max_tokens to avoid truncation
# For evaluation calls (task output), use user-configurable max_tokens
default_max_tokens = 8000 if is_reasoning else 1000
Copilot AI Oct 2, 2025
The variable is_reasoning is referenced but not defined in the function parameters or scope. This will cause a NameError when the function is called.

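
One way to address the undefined name, sketched under the assumption that the caller knows whether it is making a reasoning call, is to thread is_reasoning through as an explicit parameter; pick_max_tokens here is a hypothetical stand-in for the relevant logic in _call_model.

```python
def pick_max_tokens(is_reasoning: bool = False, user_max_tokens: int = None) -> int:
    # Reasoning calls (prompt generation) get a larger budget to avoid
    # truncation; evaluation calls keep the user-configurable default.
    default_max_tokens = 8000 if is_reasoning else 1000
    return user_max_tokens if user_max_tokens is not None else default_max_tokens
```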
"You are an expert prompt engineer. Your task is to generate novel, effective prompts from scratch "
"based on a task description, specifically aiming for prompts that elicit answers in the style: "
f"'{style}'. Output ONLY a raw JSON list of strings."
f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
Copilot AI Oct 2, 2025
The prompt instruction format is inconsistent with examples shown later in the file (lines 406-407) which show array of arrays format. This inconsistency could lead to parsing errors.

Suggested change
f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
f"'{style}'. Output ONLY a raw JSON list of lists of message objects (with 'role' and 'content' fields)."

text: str,
image_uri: Optional[str] = None,
image_detail: str = "auto"
) -> list[dict]:
Copilot AI Oct 2, 2025
Use List[Dict[str, Any]] instead of list[dict] for better type specificity and consistency with other type annotations in the codebase.


def extract_images_from_structured_content(
content: list[dict]
) -> list[str]:
Copilot AI Oct 2, 2025
Use List[str] instead of list[str] for consistency with the typing imports at the top of the file.

[{{"role": "<role>", "content": "<content>"}}],
[{{"role": "<role>", "content": "<content>"}}]
]
Return only valid JSON, nothing else.
Bug: JSON Encoding Issues and Inconsistent Examples

JSON inputs to several _user_prompt functions (e.g., llm_crossover_user_prompt) are double-encoded, making them unparseable. llm_crossover_user_prompt's parent message type hint is also too restrictive for multimodal content, risking serialization issues. Additionally, llm_crossover_user_prompt and fresh_start_user_prompt provide malformed or inconsistent JSON examples for expected LLM output, which could confuse the model.

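
The double-encoding failure mode this comment describes can be reproduced in isolation with the standard json module:

```python
import json

messages = [{"role": "user", "content": "Describe the hazard."}]
once = json.dumps(messages)   # valid JSON for the message list
twice = json.dumps(once)      # a JSON *string* wrapping JSON: double-encoded

# Parsing the double-encoded form yields a string, not the message list,
# so any downstream json.loads expecting a list will fail.
parsed_once = json.loads(once)
parsed_twice = json.loads(twice)
```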

else:
# Text-only: more generous token allocation
default_max_tokens = 8000 if is_reasoning else 1000


Bug: Model Call Ignores Context Window Safety

The _call_model method's max_tokens logic uses getattr(self, "max_tokens", default_max_tokens). If self.max_tokens is explicitly set, it overrides the new multimodal-aware default_max_tokens calculation. This bypasses the intended context window safety for images, potentially causing overflow.

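
The override behavior is easy to demonstrate with getattr; Optimizer here is a hypothetical stand-in for the optimizer instance:

```python
class Optimizer:
    max_tokens = 512  # explicitly set by the user at construction time

default_max_tokens = 8000  # multimodal-aware default computed per call
effective = getattr(Optimizer, "max_tokens", default_max_tokens)
# effective is 512: the explicit attribute always wins, so the
# image-aware default (and its context-window safety margin) is bypassed
```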

# FALLBACK: If GPT-5 is not available via your API provider, use GPT-4o:
# VISION_MODEL = "gpt-4o-mini" # 128k context
# JUDGE_MODEL = "gpt-4o" # 128k context
# Then reduce image quality: MAX_IMAGE_SIZE = (512, 384), IMAGE_QUALITY = 60

Bug: Incorrect Model Names in Example Script

The example script litellm_evolutionary_hazard_detection_example.py is configured with non-existent "GPT-5" model names (gpt-5-nano, gpt-5). This causes runtime errors, making the example unusable by default, even though a GPT-4o fallback is mentioned in comments.


trust_remote_code=True,
)
except Exception as inner_e:
ds.enable_progress_bar()

Bug: Dataset Streaming Issues and Security Risks

The _load_dhpr_dataset function attempts to load the HuggingFace dataset in streaming mode but then tries to index it, which isn't supported and causes the primary load path to fail. This also enables trust_remote_code=True by default, posing a security risk. Additionally, the exception handling can lead to a NameError if the initial streaming attempt fails, as the error message references an undefined exception variable.

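
A streaming-safe access pattern looks like the sketch below, with a plain generator standing in for a HuggingFace IterableDataset (which supports iteration but not item access like ds[0]):

```python
from itertools import islice

def take_first(stream, n: int) -> list:
    """Materialize the first n records without indexing the stream."""
    return list(islice(iter(stream), n))

# Stand-in for datasets.load_dataset(..., streaming=True)
fake_stream = ({"id": i, "hazard": f"hazard {i}"} for i in range(1000))
subset = take_first(fake_stream, 50)
```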


return " ".join(text_parts)

return str(content)

Bug: Multimodal Content Formatting and Rebuilding Issues

The new multimodal content handling has a couple of issues. extract_text_from_content joins multiple text parts with a single space, which can lose original formatting and semantic separation. Additionally, _word_level_mutation might not consistently rebuild multimodal content using rebuild_content_with_mutated_text when no word-level changes occur, potentially leading to inconsistent message structures.

Additional Locations (1)



result.append(part_copy)

return result

Bug: Structured Content Handling Fails

The _substitute_structured_content method incorrectly handles dataset_item values that are structured content (lists of dictionaries). It performs string replacement or converts these values to strings, rather than integrating them as actual structured content. This prevents proper multimodal message construction when a template variable represents a list of content parts.

Additional Locations (1)

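
A minimal sketch of substitution that keeps list values structured; substitute is a hypothetical simplification, not the PR's _substitute_structured_content:

```python
def substitute(value, variables: dict):
    """Fill {name} placeholders; splice list values in as structured content."""
    if isinstance(value, str):
        for key, repl in variables.items():
            placeholder = "{" + key + "}"
            if value == placeholder and isinstance(repl, list):
                return repl  # keep structured content structured, don't str() it
            if isinstance(repl, str):
                value = value.replace(placeholder, repl)
    return value

parts = [
    {"type": "text", "text": "What hazard is shown?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]
message = {"role": "user", "content": substitute("{frame}", {"frame": parts})}
```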
