[NA] [SDK] Multimodal Opik Optimizer (EO with Dashcam Hazard Agent) #3488
Conversation
SDK E2E Tests Results: 68 tests, 67 ✅, 1m 56s ⏱️. For more details on these failures, see this check. Results for commit b848387.
Pull Request Overview
This PR introduces comprehensive multimodal support for the Opik optimizer ecosystem, enabling end-to-end optimization and evaluation of prompts that include both text and images. The implementation adds a complete multimodal workflow from dataset loading through optimization to evaluation, with specific focus on driving hazard detection using dashcam imagery.
Key changes:
- Multimodal infrastructure: Adds structured content support, image utilities, and vision model capabilities
- Evolutionary optimizer enhancements: Updates mutation/crossover operations to preserve images while evolving text prompts
- New datasets and metrics: Introduces DHPR driving hazard dataset and multimodal LLM judge evaluation
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description | 
|---|---|
| sdks/python/src/opik/evaluation/models/model_capabilities.py | Vision model detection with comprehensive model support list | 
| sdks/python/src/opik/evaluation/models/message_renderer.py | Message content rendering for vision vs non-vision models | 
| sdks/python/src/opik/evaluation/evaluator.py | Integration of multimodal rendering in evaluation pipeline | 
| sdks/opik_optimizer/src/opik_optimizer/optimization_config/chat_prompt.py | Structured content validation and template substitution | 
| sdks/opik_optimizer/src/opik_optimizer/utils/image_helpers.py | Complete image processing utilities for multimodal content | 
| sdks/opik_optimizer/src/opik_optimizer/metrics/multimodal_llm_judge.py | Vision-capable LLM judge for multimodal evaluation | 
| sdks/opik_optimizer/src/opik_optimizer/evolutionary_optimizer/ | Updated mutation/crossover operations with multimodal awareness | 
| sdks/opik_optimizer/src/opik_optimizer/datasets/driving_hazard.py | DHPR dataset loader with image encoding and processing | 
| sdks/opik_optimizer/scripts/litellm_evolutionary_hazard_detection_example.py | Complete example demonstrating multimodal optimization | 
| Test files | Comprehensive unit tests for all new multimodal functionality | 
```diff
@@ -1,3 +1,4 @@
 from typing import Any, TYPE_CHECKING, Union, List, Dict
+from typing import Any, TYPE_CHECKING
```
Copilot AI commented on Oct 2, 2025:
There are duplicate imports from `typing`. The second import (`Any`, `TYPE_CHECKING`) is fully redundant: both names are already provided by the first import. Remove the duplicate second import.
```python
from typing import Any, TYPE_CHECKING
```
```python
# For reasoning calls (prompt generation), use higher max_tokens to avoid truncation
# For evaluation calls (task output), use user-configurable max_tokens
default_max_tokens = 8000 if is_reasoning else 1000
```
Copilot AI commented on Oct 2, 2025:
The variable is_reasoning is referenced but not defined in the function parameters or scope. This will cause a NameError when the function is called.
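A minimal fix is to thread the flag through as an explicit parameter so the token-budget branch can never hit a `NameError`; a sketch (function name hypothetical, not the actual Opik helper):

```python
def resolve_default_max_tokens(is_reasoning: bool) -> int:
    # Hypothetical sketch: `is_reasoning` is passed in explicitly instead of
    # being read from an undefined name in the enclosing scope.
    # Reasoning calls (prompt generation) get a larger budget to avoid
    # truncation; evaluation calls keep the smaller default.
    return 8000 if is_reasoning else 1000
```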
```diff
 "You are an expert prompt engineer. Your task is to generate novel, effective prompts from scratch "
 "based on a task description, specifically aiming for prompts that elicit answers in the style: "
-f"'{style}'. Output ONLY a raw JSON list of strings."
+f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
```
Copilot AI commented on Oct 2, 2025:
The prompt instruction format is inconsistent with examples shown later in the file (lines 406-407) which show array of arrays format. This inconsistency could lead to parsing errors.
Suggested change:

```diff
-f"'{style}'. Output ONLY a raw JSON list of message objects (with 'role' and 'content' fields)."
+f"'{style}'. Output ONLY a raw JSON list of lists of message objects (with 'role' and 'content' fields)."
```
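Once the instruction and the examples agree, the model is expected to emit a JSON list of message lists. A small illustration of parsing that shape (the sample output string is hypothetical):

```python
import json

# Hypothetical model output in the "list of lists of message objects" shape
raw = (
    '[[{"role": "system", "content": "Spot every hazard."}],'
    ' [{"role": "user", "content": "Describe the scene."}]]'
)

candidate_prompts = json.loads(raw)
# Each candidate prompt is itself a list of {"role", "content"} messages
assert all(
    isinstance(prompt, list)
    and all({"role", "content"} <= set(msg) for msg in prompt)
    for prompt in candidate_prompts
)
```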
```python
    text: str,
    image_uri: Optional[str] = None,
    image_detail: str = "auto"
) -> list[dict]:
```
Copilot AI commented on Oct 2, 2025:
Use List[Dict[str, Any]] instead of list[dict] for better type specificity and consistency with other type annotations in the codebase.
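For illustration, the signature with the more specific annotations; the body is a hypothetical sketch modeled on OpenAI-style content parts, not the actual Opik helper:

```python
from typing import Any, Dict, List, Optional

def build_structured_content(
    text: str,
    image_uri: Optional[str] = None,
    image_detail: str = "auto",
) -> List[Dict[str, Any]]:
    # Always include the text part; append an image part only when a URI is given.
    parts: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    if image_uri is not None:
        parts.append(
            {"type": "image_url", "image_url": {"url": image_uri, "detail": image_detail}}
        )
    return parts
```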
```python
def extract_images_from_structured_content(
    content: list[dict]
) -> list[str]:
```
Copilot AI commented on Oct 2, 2025:
Use List[str] instead of list[str] for consistency with the typing imports at the top of the file.
```
    [{{"role": "<role>", "content": "<content>"}}],
    [{{"role": "<role>", "content": "<content>"}}]
]
Return only valid JSON, nothing else.
```
Bug: JSON Encoding Issues and Inconsistent Examples
JSON inputs to several _user_prompt functions (e.g., llm_crossover_user_prompt) are double-encoded, making them unparseable. llm_crossover_user_prompt's parent message type hint is also too restrictive for multimodal content, risking serialization issues. Additionally, llm_crossover_user_prompt and fresh_start_user_prompt provide malformed or inconsistent JSON examples for expected LLM output, which could confuse the model.
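The double-encoding failure mode is easy to reproduce in isolation (this is an illustration, not the Opik code itself):

```python
import json

messages = [{"role": "user", "content": "Describe the hazard."}]

# Encoding twice produces a JSON *string literal*, not a JSON array:
double = json.dumps(json.dumps(messages))
assert isinstance(json.loads(double), str)  # decodes to a str, not a list

# Encode exactly once when embedding messages in a prompt template:
single = json.dumps(messages)
assert json.loads(single) == messages
```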
```python
else:
    # Text-only: more generous token allocation
    default_max_tokens = 8000 if is_reasoning else 1000
```
Bug: Model Call Ignores Context Window Safety
The _call_model method's max_tokens logic uses getattr(self, "max_tokens", default_max_tokens). If self.max_tokens is explicitly set, it overrides the new multimodal-aware default_max_tokens calculation. This bypasses the intended context window safety for images, potentially causing overflow.
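One way to restore the safety check is to treat the multimodal-aware default as a cap rather than a fallback; a sketch under that assumption (names hypothetical):

```python
from typing import Optional

def resolve_max_tokens(
    user_max_tokens: Optional[int],
    default_max_tokens: int,
    has_images: bool,
) -> int:
    # No explicit user setting: use the context-aware default.
    if user_max_tokens is None:
        return default_max_tokens
    # Images consume context, so cap the user's value by the safe default.
    if has_images:
        return min(user_max_tokens, default_max_tokens)
    # Text-only calls honour the user's setting unchanged.
    return user_max_tokens
```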
```python
# FALLBACK: If GPT-5 is not available via your API provider, use GPT-4o:
# VISION_MODEL = "gpt-4o-mini"  # 128k context
# JUDGE_MODEL = "gpt-4o"        # 128k context
# Then reduce image quality: MAX_IMAGE_SIZE = (512, 384), IMAGE_QUALITY = 60
```
Bug: Incorrect Model Names in Example Script
The example script litellm_evolutionary_hazard_detection_example.py is configured with non-existent "GPT-5" model names (gpt-5-nano, gpt-5). This causes runtime errors, making the example unusable by default, even though a GPT-4o fallback is mentioned in comments.
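If the GPT-5 names are unavailable, the fallback the comments describe can simply become the default; a configuration sketch with the values taken from the fallback comment (availability of these model names is an assumption):

```python
# Default to widely available GPT-4o models so the example runs out of the box.
VISION_MODEL = "gpt-4o-mini"  # 128k context window
JUDGE_MODEL = "gpt-4o"        # 128k context window

# Smaller context budget: reduce the image payload accordingly.
MAX_IMAGE_SIZE = (512, 384)   # max width x height in pixels
IMAGE_QUALITY = 60            # JPEG quality
```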
```python
        trust_remote_code=True,
    )
except Exception as inner_e:
    ds.enable_progress_bar()
```
Bug: Dataset Streaming Issues and Security Risks
The _load_dhpr_dataset function attempts to load the HuggingFace dataset in streaming mode but then tries to index it, which isn't supported and causes the primary load path to fail. This also enables trust_remote_code=True by default, posing a security risk. Additionally, the exception handling can lead to a NameError if the initial streaming attempt fails, as the error message references an undefined exception variable.
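Streaming Hugging Face datasets are iterables, not sequences, so `ds[i]` fails and records must be consumed lazily. A minimal illustration of the pattern, with a plain generator standing in for `datasets.load_dataset(..., streaming=True)`:

```python
from itertools import islice

def fake_streaming_dataset():
    # Stand-in for a streaming HF dataset: yields records one at a time,
    # with no random access.
    for i in range(1000):
        yield {"id": i, "hazard": f"sample {i}"}

# Take the first 50 records lazily instead of indexing into the stream.
first_50 = list(islice(fake_streaming_dataset(), 50))
assert len(first_50) == 50 and first_50[0]["id"] == 0
```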
```python
        return " ".join(text_parts)

    return str(content)
```
Bug: Multimodal Content Formatting and Rebuilding Issues
The new multimodal content handling has a couple of issues. extract_text_from_content joins multiple text parts with a single space, which can lose original formatting and semantic separation. Additionally, _word_level_mutation might not consistently rebuild multimodal content using rebuild_content_with_mutated_text when no word-level changes occur, potentially leading to inconsistent message structures.
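A sketch of the suggested behaviour: join text parts with newlines so paragraph boundaries survive (the name mirrors the review; the implementation is hypothetical):

```python
from typing import Any

def extract_text_from_content(content: Any) -> str:
    # Newlines preserve the separation between text parts; joining with a
    # single space would collapse paragraph boundaries.
    if isinstance(content, list):
        text_parts = [
            part.get("text", "") for part in content if part.get("type") == "text"
        ]
        return "\n".join(text_parts)
    return str(content)
```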
Additional Locations (1)
```python
        result.append(part_copy)

    return result
```
Bug: Structured Content Handling Fails
The _substitute_structured_content method incorrectly handles dataset_item values that are structured content (lists of dictionaries). It performs string replacement or converts these values to strings, rather than integrating them as actual structured content. This prevents proper multimodal message construction when a template variable represents a list of content parts.
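A hedged sketch of the splicing behaviour the fix would need: when a template variable's value is itself a list of content parts, splice those parts in rather than stringifying them (function and variable names hypothetical):

```python
from typing import Any, Dict, List

def substitute_structured_content(
    template_parts: List[Dict[str, Any]],
    dataset_item: Dict[str, Any],
) -> List[Dict[str, Any]]:
    result: List[Dict[str, Any]] = []
    for part in template_parts:
        key = part.get("text", "").strip()
        # A text part that is exactly "{variable}" may be replaced wholesale.
        if part.get("type") == "text" and key.startswith("{") and key.endswith("}"):
            value = dataset_item.get(key[1:-1])
            if isinstance(value, list):
                result.extend(value)  # splice structured content parts in place
                continue
            if value is not None:
                part = {**part, "text": str(value)}
        result.append(part)
    return result
```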
Details
Introducing a multi-modal optimizer using `EvolutionaryOptimizer` and the Dashcam dataset from Hugging Face.

Change checklist
- Issues
- Testing
- Documentation
Note
Introduces end-to-end multimodal support (text+image) across optimizer and evaluation, adds DHPR driving-hazard dataset and a vision LLM judge, image utilities, and updates mutation/crossover to preserve images.
- Evolutionary optimizer: mutation/crossover extract text from structured `content` and rebuild it while preserving `image_url` parts; reasoning calls use higher `max_tokens` to avoid truncation.
- Datasets: `driving_hazard_50`, `driving_hazard_100`, `driving_hazard_test`; exported in `datasets/__init__.py`.
- Metrics: `MultimodalLLMJudge` metric to evaluate outputs with image context; exported via `metrics/__init__.py`.
- `ChatPrompt` now supports structured multimodal `content` (validation, substitution, serialization).
- Vision capability detection (`model_capabilities`) and message rendering (`message_renderer`) handle structured content and flatten it for non-vision models; the evaluator uses these.
- `image_helpers` for base64 encode/decode, structured content creation, validation, and token estimation.
- Example script `litellm_evolutionary_hazard_detection_example.py` demonstrating multimodal EO on DHPR with a custom judge.

Written by Cursor Bugbot for commit b848387. This will update automatically on new commits.