
Conversation

@vincentkoc
Member

@vincentkoc vincentkoc commented Oct 1, 2025

Details

Introduces a multimodal optimizer using EvolutionaryOptimizer and the DHPR dashcam dataset from Hugging Face.

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-

Testing

Documentation


Note

Introduces end-to-end multimodal support (text+image) across optimizer and evaluation, adds DHPR driving-hazard dataset and a vision LLM judge, image utilities, and updates mutation/crossover to preserve images.

  • Optimizer (Evolutionary):
    • Multimodal-aware mutations and crossovers: extract text from structured content and rebuild while preserving image_url parts.
    • LLM-driven prompts updated with multimodal guidance and stricter JSON I/O; improved JSON parsing in population initialization.
    • Reasoning calls use higher max_tokens to avoid truncation.
  • Datasets:
    • New DHPR loaders: driving_hazard_50, driving_hazard_100, driving_hazard_test; exported in datasets/__init__.py.
  • Metrics:
    • New MultimodalLLMJudge metric to evaluate outputs with image context; exported via metrics/__init__.py.
  • Prompt Config:
    • ChatPrompt now supports structured multimodal content (validation, substitution, serialization).
  • Evaluation Runtime:
    • Added model capability detection (model_capabilities) and message rendering (message_renderer) to handle structured content and flatten for non-vision models; evaluator uses these.
  • Utils:
    • New image_helpers for base64 encode/decode, structured content creation, validation, token estimation.
  • Examples:
    • Added litellm_evolutionary_hazard_detection_example.py demonstrating multimodal EO on DHPR with custom judge.
  • Tests:
    • Unit tests for ChatPrompt multimodal handling, image utilities, mutation/crossover helpers, and message rendering.
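
The image-preserving rewrite described above can be sketched as follows. This is a minimal illustration assuming OpenAI-style structured content; the helper names (extract_text_parts, rebuild_content) are hypothetical, not the PR's actual API.

```python
from typing import Any, List, Union

Content = Union[str, List[dict]]

def extract_text_parts(content: Content) -> str:
    """Collect the text segments of a structured content list."""
    if isinstance(content, str):
        return content
    # Join with newlines (not spaces) to preserve separation between parts.
    return "\n".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

def rebuild_content(original: Content, mutated_text: str) -> Content:
    """Swap in mutated text while keeping every image_url part intact."""
    if isinstance(original, str):
        return mutated_text
    rebuilt: List[dict] = [{"type": "text", "text": mutated_text}]
    rebuilt.extend(p for p in original if p.get("type") == "image_url")
    return rebuilt

content = [
    {"type": "text", "text": "Describe the hazard."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]
mutated = rebuild_content(content, extract_text_parts(content).upper())
```

The point of the pattern is that mutation and crossover only ever see (and edit) the extracted text, so image parts can never be corrupted by the text operators.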

Written by Cursor Bugbot for commit b848387. This will update automatically on new commits.

@vincentkoc vincentkoc changed the base branch from main to feat/mcp-upflift October 1, 2025 01:02
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@vincentkoc vincentkoc marked this pull request as ready for review October 1, 2025 01:03
@vincentkoc vincentkoc requested review from a team and dsblank as code owners October 1, 2025 01:03
@github-actions
Contributor

github-actions bot commented Oct 1, 2025

SDK E2E Tests Results

68 tests in 1 suite (1 file): 67 ✅ passed, 0 💤 skipped, 1 ❌ failed, in 1m 56s ⏱️

For more details on these failures, see this check.

Results for commit b848387.

Base automatically changed from feat/mcp-upflift to main October 1, 2025 14:31
Copilot AI review requested due to automatic review settings October 2, 2025 14:48
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces comprehensive multimodal support for the Opik optimizer ecosystem, enabling end-to-end optimization and evaluation of prompts that include both text and images. The implementation adds a complete multimodal workflow from dataset loading through optimization to evaluation, with specific focus on driving hazard detection using dashcam imagery.

Key changes:

  • Multimodal infrastructure: Adds structured content support, image utilities, and vision model capabilities
  • Evolutionary optimizer enhancements: Updates mutation/crossover operations to preserve images while evolving text prompts
  • New datasets and metrics: Introduces DHPR driving hazard dataset and multimodal LLM judge evaluation

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Summary per file:

sdks/python/src/opik/evaluation/models/model_capabilities.py: Vision model detection with comprehensive model support list
sdks/python/src/opik/evaluation/models/message_renderer.py: Message content rendering for vision vs non-vision models
sdks/python/src/opik/evaluation/evaluator.py: Integration of multimodal rendering in evaluation pipeline
sdks/opik_optimizer/src/opik_optimizer/optimization_config/chat_prompt.py: Structured content validation and template substitution
sdks/opik_optimizer/src/opik_optimizer/utils/image_helpers.py: Complete image processing utilities for multimodal content
sdks/opik_optimizer/src/opik_optimizer/metrics/multimodal_llm_judge.py: Vision-capable LLM judge for multimodal evaluation
sdks/opik_optimizer/src/opik_optimizer/evolutionary_optimizer/: Updated mutation/crossover operations with multimodal awareness
sdks/opik_optimizer/src/opik_optimizer/datasets/driving_hazard.py: DHPR dataset loader with image encoding and processing
sdks/opik_optimizer/scripts/litellm_evolutionary_hazard_detection_example.py: Complete example demonstrating multimodal optimization
Test files: Comprehensive unit tests for all new multimodal functionality
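
The base64 utilities that image_helpers provides can be illustrated roughly as follows; the function names and signatures here are assumptions for illustration, not the PR's exact API.

```python
import base64

def encode_image_to_data_uri(data: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a base64 data URI usable as an image_url value."""
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")

def decode_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    _, _, payload = uri.partition(";base64,")
    return base64.b64decode(payload)

raw = b"\xff\xd8\xff\xe0 fake jpeg bytes"  # placeholder, not a real JPEG
uri = encode_image_to_data_uri(raw)
```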

@@ -1,3 +1,4 @@
from typing import Any, TYPE_CHECKING, Union, List, Dict
from typing import Any, TYPE_CHECKING
Copilot AI Oct 2, 2025
There are duplicate imports from typing. The first import includes Union, List, Dict which are already imported in the second line. Remove the duplicate second import.

Suggested change
from typing import Any, TYPE_CHECKING

Comment on lines 55 to 57
# For reasoning calls (prompt generation), use higher max_tokens to avoid truncation
# For evaluation calls (task output), use user-configurable max_tokens
default_max_tokens = 8000 if is_reasoning else 1000
Copilot AI Oct 2, 2025
The variable is_reasoning is referenced but not defined in the function parameters or scope. This will cause a NameError when the function is called.

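
One way to address the undefined name, sketched under the assumption that the caller knows whether it is making a reasoning call, is to thread is_reasoning through as an explicit parameter; pick_max_tokens here is a hypothetical stand-in for the relevant logic in _call_model.

```python
def pick_max_tokens(is_reasoning: bool = False, user_max_tokens: int = None) -> int:
    # Reasoning calls (prompt generation) get a larger budget to avoid
    # truncation; evaluation calls keep the user-configurable default.
    default_max_tokens = 8000 if is_reasoning else 1000
    return user_max_tokens if user_max_tokens is not None else default_max_tokens
```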
"You are an expert prompt engineer. Your task is to generate novel, effective prompts from scratch "
"based on a task description, specifically aiming for prompts that elicit answers in the style: "
f"'{style}'. Output ONLY a raw JSON list of strings."
f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
Copilot AI Oct 2, 2025
The prompt instruction format is inconsistent with examples shown later in the file (lines 406-407) which show array of arrays format. This inconsistency could lead to parsing errors.

Suggested change
f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
f"'{style}'. Output ONLY a raw JSON list of lists of message objects (with 'role' and 'content' fields)."

text: str,
image_uri: Optional[str] = None,
image_detail: str = "auto"
) -> list[dict]:
Copilot AI Oct 2, 2025
Use List[Dict[str, Any]] instead of list[dict] for better type specificity and consistency with other type annotations in the codebase.


def extract_images_from_structured_content(
content: list[dict]
) -> list[str]:
Copilot AI Oct 2, 2025
Use List[str] instead of list[str] for consistency with the typing imports at the top of the file.

[{{"role": "<role>", "content": "<content>"}}],
[{{"role": "<role>", "content": "<content>"}}]
]
Return only valid JSON, nothing else.
Bug: JSON Encoding Issues and Inconsistent Examples

JSON inputs to several _user_prompt functions (e.g., llm_crossover_user_prompt) are double-encoded, making them unparseable. llm_crossover_user_prompt's parent message type hint is also too restrictive for multimodal content, risking serialization issues. Additionally, llm_crossover_user_prompt and fresh_start_user_prompt provide malformed or inconsistent JSON examples for expected LLM output, which could confuse the model.

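
The double-encoding failure mode this comment describes can be reproduced in isolation with the standard json module:

```python
import json

messages = [{"role": "user", "content": "Describe the hazard."}]
once = json.dumps(messages)   # valid JSON for the message list
twice = json.dumps(once)      # a JSON *string* wrapping JSON: double-encoded

# Parsing the double-encoded form yields a string, not the message list,
# so any downstream json.loads expecting a list will fail.
parsed_once = json.loads(once)
parsed_twice = json.loads(twice)
```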

else:
# Text-only: more generous token allocation
default_max_tokens = 8000 if is_reasoning else 1000


Bug: Model Call Ignores Context Window Safety

The _call_model method's max_tokens logic uses getattr(self, "max_tokens", default_max_tokens). If self.max_tokens is explicitly set, it overrides the new multimodal-aware default_max_tokens calculation. This bypasses the intended context window safety for images, potentially causing overflow.

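
The override behavior is easy to demonstrate with getattr; Optimizer here is a hypothetical stand-in for the optimizer instance:

```python
class Optimizer:
    max_tokens = 512  # explicitly set by the user at construction time

default_max_tokens = 8000  # multimodal-aware default computed per call
effective = getattr(Optimizer, "max_tokens", default_max_tokens)
# effective is 512: the explicit attribute always wins, so the
# image-aware default (and its context-window safety margin) is bypassed
```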

# FALLBACK: If GPT-5 is not available via your API provider, use GPT-4o:
# VISION_MODEL = "gpt-4o-mini" # 128k context
# JUDGE_MODEL = "gpt-4o" # 128k context
# Then reduce image quality: MAX_IMAGE_SIZE = (512, 384), IMAGE_QUALITY = 60

Bug: Incorrect Model Names in Example Script

The example script litellm_evolutionary_hazard_detection_example.py is configured with non-existent "GPT-5" model names (gpt-5-nano, gpt-5). This causes runtime errors, making the example unusable by default, even though a GPT-4o fallback is mentioned in comments.


trust_remote_code=True,
)
except Exception as inner_e:
ds.enable_progress_bar()

Bug: Dataset Streaming Issues and Security Risks

The _load_dhpr_dataset function attempts to load the HuggingFace dataset in streaming mode but then tries to index it, which isn't supported and causes the primary load path to fail. This also enables trust_remote_code=True by default, posing a security risk. Additionally, the exception handling can lead to a NameError if the initial streaming attempt fails, as the error message references an undefined exception variable.

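
A streaming-safe access pattern looks like the sketch below, with a plain generator standing in for a HuggingFace IterableDataset (which supports iteration but not item access like ds[0]):

```python
from itertools import islice

def take_first(stream, n: int) -> list:
    """Materialize the first n records without indexing the stream."""
    return list(islice(iter(stream), n))

# Stand-in for datasets.load_dataset(..., streaming=True)
fake_stream = ({"id": i, "hazard": f"hazard {i}"} for i in range(1000))
subset = take_first(fake_stream, 50)
```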


return " ".join(text_parts)

return str(content)

Bug: Multimodal Content Formatting and Rebuilding Issues

The new multimodal content handling has a couple of issues. extract_text_from_content joins multiple text parts with a single space, which can lose original formatting and semantic separation. Additionally, _word_level_mutation might not consistently rebuild multimodal content using rebuild_content_with_mutated_text when no word-level changes occur, potentially leading to inconsistent message structures.

Additional Locations (1)



result.append(part_copy)

return result

Bug: Structured Content Handling Fails

The _substitute_structured_content method incorrectly handles dataset_item values that are structured content (lists of dictionaries). It performs string replacement or converts these values to strings, rather than integrating them as actual structured content. This prevents proper multimodal message construction when a template variable represents a list of content parts.

Additional Locations (1)

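
A minimal sketch of substitution that keeps list values structured; substitute is a hypothetical simplification, not the PR's _substitute_structured_content:

```python
def substitute(value, variables: dict):
    """Fill {name} placeholders; splice list values in as structured content."""
    if isinstance(value, str):
        for key, repl in variables.items():
            placeholder = "{" + key + "}"
            if value == placeholder and isinstance(repl, list):
                return repl  # keep structured content structured, don't str() it
            if isinstance(repl, str):
                value = value.replace(placeholder, repl)
    return value

parts = [
    {"type": "text", "text": "What hazard is shown?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]
message = {"role": "user", "content": substitute("{frame}", {"frame": parts})}
```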
