
Conversation

@kensteele
Contributor

Description

This PR adds comprehensive MLX Whisper support to the docling ASR pipeline, providing significant performance improvements on Apple Silicon devices through automatic, hardware-aware model selection. The integration is completely transparent to users: they simply use regular Whisper models and get MLX optimization automatically when beneficial.

Issue resolved by this Pull Request:
Resolves #2364

Key Features

  • Automatic Hardware Detection: Detects MPS (Apple Silicon) and MLX Whisper availability
  • Transparent Integration: Users keep using the regular WHISPER_TURBO, WHISPER_BASE, etc. model specs
  • Smart Fallback: Falls back to native Whisper on non-Apple Silicon systems
  • Complete Model Coverage: All Whisper model sizes support automatic MLX selection
  • CLI Enhancement: Automatic pipeline detection for audio files
  • Type Safety: Proper type annotations and MyPy compliance

Performance Results

Actual performance comparison on Apple Silicon (M1/M2/M3) using a 10-second audio sample:

| Model         | Native Whisper (CPU) | MLX Whisper (MPS) | Speedup      |
|---------------|----------------------|-------------------|--------------|
| whisper_tiny  | 1.24 sec             | 0.81 sec          | 1.5x faster  |
| whisper_base  | 8.55 sec             | 0.45 sec          | 19.0x faster |
| whisper_turbo | 9.50 sec             | 1.26 sec          | 7.6x faster  |
| Average       | 6.43 sec             | 0.84 sec          | 7.7x faster  |

Key insights:

  • MLX Whisper provides significant speedup across all model sizes
  • Larger models (base, turbo) show the most dramatic improvements
  • The 10-second audio sample completes in under 1 second with MLX Whisper base/turbo

Technical Implementation

1. MLX Framework Integration

from enum import Enum

class InferenceAsrFramework(str, Enum):
    MLX = "mlx"          # Now enabled (MLX Whisper on Apple Silicon)
    WHISPER = "whisper"  # Native openai-whisper

2. Automatic Model Selection

def _get_whisper_turbo_model():
    """Get the best Whisper Turbo model for the current hardware."""
    # Check if MPS is available (Apple Silicon)
    try:
        import torch
        has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available()
    except ImportError:
        has_mps = False
    
    # Check if mlx-whisper is available
    try:
        import mlx_whisper
        has_mlx_whisper = True
    except ImportError:
        has_mlx_whisper = False
    
    # Use MLX Whisper if both MPS and mlx-whisper are available
    if has_mps and has_mlx_whisper:
        return InlineAsrMlxWhisperOptions(
            repo_id="mlx-community/whisper-turbo",
            inference_framework=InferenceAsrFramework.MLX,
            # ... MLX-specific options
        )
    else:
        return InlineAsrNativeWhisperOptions(
            repo_id="turbo",
            inference_framework=InferenceAsrFramework.WHISPER,
            # ... Native Whisper options
        )
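
The selector above is presumably evaluated once at import time so that the ordinary model-spec constants remain the public surface. A minimal sketch of that wiring (the constant and helper names for sizes other than turbo are assumptions, not code from this PR):

# Sketch (assumed wiring): asr_model_specs keeps exposing the same constants,
# but each one now resolves to MLX or native options via its hardware-aware helper.
WHISPER_TURBO = _get_whisper_turbo_model()
# Analogous helpers are assumed for the other sizes, e.g.:
# WHISPER_BASE = _get_whisper_base_model()
# WHISPER_TINY = _get_whisper_tiny_model()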

3. CLI Auto-Detection

# Auto-detect pipeline based on input file formats
if pipeline == ProcessingPipeline.STANDARD:
    # Check if any input files are audio files by extension
    audio_extensions = {'.mp3', '.wav', '.m4a', '.aac', '.ogg', '.flac', '.mp4', '.avi', '.mov'}
    for path in input_doc_paths:
        if path.suffix.lower() in audio_extensions:
            pipeline = ProcessingPipeline.ASR
            _log.info(f"Auto-detected ASR pipeline for audio file: {path}")
            break

Documentation and Examples

Updated Examples

  • docs/examples/minimal_asr_pipeline.py: Updated to show automatic model selection
  • docs/examples/mlx_whisper_example.py: New example demonstrating MLX Whisper usage
  • docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script with --audio parameter

Usage Examples

Python API:

from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

# Automatically uses MLX Whisper on Apple Silicon!
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
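
A slightly fuller sketch of wiring these options into a converter. This is not code from the PR; it assumes the AudioFormatOption / AsrPipeline pattern used in docling's minimal ASR example and standard InputFormat.AUDIO handling:

from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO  # MLX picked automatically on Apple Silicon

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert(Path("Recording1.mp3"))
print(result.document.export_to_markdown())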

CLI:

# Automatically detects audio files and uses ASR pipeline with MLX Whisper!
docling ~/Recording1.mp3 --asr-model whisper_turbo
docling ~/Recording1.mp3 --asr-model whisper_medium
docling ~/Recording1.mp3 --asr-model whisper_base

Performance Comparison:

# Use default test audio file
python docs/examples/asr_pipeline_performance_comparison.py

# Use your own audio file
python docs/examples/asr_pipeline_performance_comparison.py --audio ~/Recording1.mp3

# Show help
python docs/examples/asr_pipeline_performance_comparison.py --help

Testing

Comprehensive Test Coverage

  • MLX Whisper model initialization
  • Automatic model selection logic
  • Import error handling (see the fallback sketch after this list)
  • Transcription functionality
  • Pipeline integration
  • CLI model selection
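
For illustration, the import-error fallback can be exercised roughly as follows. This is a sketch, not the PR's actual test code; it assumes _get_whisper_turbo_model and InferenceAsrFramework are importable from the modules listed under Files Changed:

import builtins

from docling.datamodel.asr_model_specs import _get_whisper_turbo_model
from docling.datamodel.pipeline_options_asr_model import InferenceAsrFramework

def test_falls_back_to_native_without_mlx(monkeypatch):
    """Hide mlx_whisper from the import machinery and check the fallback path."""
    real_import = builtins.__import__

    def fake_import(name, *args, **kwargs):
        if name == "mlx_whisper":
            raise ImportError("mlx_whisper not installed")
        return real_import(name, *args, **kwargs)

    monkeypatch.setattr(builtins, "__import__", fake_import)

    opts = _get_whisper_turbo_model()
    assert opts.inference_framework == InferenceAsrFramework.WHISPER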

Test Results

$ python -m pytest tests/test_asr_mlx_whisper.py tests/test_asr_pipeline.py -v
============================= test session starts ==============================
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_options_creation PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_whisper_models_auto_select_mlx PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_initialization PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_import_error PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_transcribe PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_asr_pipeline_with_mlx_whisper PASSED
tests/test_asr_pipeline.py::test_asr_pipeline_conversion PASSED
============================== 7 passed in 3.49s ===============================

Dependencies

Added MLX Whisper Dependency

# pyproject.toml
asr = [
    'mlx-whisper>=0.4.3 ; python_version >= "3.10" and sys_platform == "darwin" and platform_machine == "arm64"',
    "openai-whisper>=20250625",
]
  • Platform Specific: Only installed on Apple Silicon (arm64) macOS systems
  • Python Version: Requires Python 3.10+ (MLX requirement)
  • Optional: Part of the asr extra, doesn't affect core functionality
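
Installation stays the usual optional-extra flow (e.g. pip install "docling[asr]"); on non-Apple platforms the environment markers above simply skip mlx-whisper, so only openai-whisper is pulled in.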

User Experience

ASR Pipeline

# Users just use regular Whisper models - MLX is automatic!
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO  # Works everywhere

docling CLI

# Audio files automatically trigger ASR pipeline with MLX optimization
docling ~/Recording1.mp3 --asr-model whisper_turbo

Benefits

  1. Performance: Up to 19x faster ASR on Apple Silicon devices
  2. Transparency: No user configuration required
  3. Compatibility: Works on all platforms with appropriate fallbacks
  4. Scalability: Supports all Whisper model sizes
  5. Reliability: Comprehensive error handling and testing
  6. Documentation: Clear examples and usage patterns

Files Changed

  • docling/datamodel/pipeline_options_asr_model.py: Added MLX framework and options
  • docling/datamodel/asr_model_specs.py: Implemented automatic model selection
  • docling/pipeline/asr_pipeline.py: Added MLX Whisper model implementation
  • docling/cli/main.py: Added automatic pipeline detection and device configuration
  • docs/examples/minimal_asr_pipeline.py: Updated documentation
  • docs/examples/mlx_whisper_example.py: New MLX Whisper example
  • docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script
  • tests/test_asr_mlx_whisper.py: Comprehensive test suite
  • pyproject.toml: Added MLX Whisper dependency
  • uv.lock: Updated dependency lock file

Checklist

  • Examples have been added
  • Tests have been added
  • Pre-commit checks pass (Ruff formatter, Ruff linter, MyPy, uv-lock)
  • All tests pass (7/7 tests successful)
  • Type safety ensured with proper annotations
  • Backward compatibility maintained
  • Platform-specific dependencies properly configured

Conclusion

This PR delivers a complete, transparent MLX Whisper integration that provides significant performance improvements on Apple Silicon while maintaining full backward compatibility. Users get the benefits of MLX optimization without any configuration changes, making it a true "just works" enhancement to the docling ASR pipeline.

@github-actions
Contributor

github-actions bot commented Oct 2, 2025

DCO Check Passed

Thanks @kensteele, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Oct 2, 2025

Documentation Updates

Checked 3 published document(s). No updates required.


@mergify

mergify bot commented Oct 2, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewers for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@PeterStaar-IBM
Contributor

@kensteele this is an awesome PR, thanks a ton!!

@codecov

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 77.82805% with 49 lines in your changes missing coverage. Please review.

| Files with missing lines             | Patch % | Lines          |
|--------------------------------------|---------|----------------|
| docling/cli/main.py                  | 13.95%  | 37 Missing ⚠️  |
| docling/datamodel/asr_model_specs.py | 89.47%  | 12 Missing ⚠️  |


@PeterStaar-IBM
Contributor

@kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

@kensteele
Contributor Author

@kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

@PeterStaar-IBM Hopefully that should clear the tests! 🤞

@kensteele
Contributor Author

@kensteele this is an awesome PR, thanks a ton!!

@PeterStaar-IBM You bet - happy to contribute more!

Looks like all the checks have passed - possible to get a review and merge? @dolfim-ibm @cau-git

PeterStaar-IBM previously approved these changes Oct 6, 2025
@kensteele
Contributor Author

kensteele commented Oct 7, 2025

@PeterStaar-IBM Please review the latest commit f114d45 which addresses your comments in the previous review as well as:

Adds comprehensive support for the following additional audio input formats:

  • m4a
  • aac
  • ogg
  • flac
  • mp4
  • avi
  • mov

Adds support for the following additional MIME types:

  • audio/mp4
  • audio/m4a
  • audio/aac
  • audio/ogg
  • audio/flac
  • audio/x-flac
  • video/mp4
  • video/avi
  • video/x-msvideo
  • video/quicktime

Adds additional sample audio files:

  • audio and video files added to tests/data/audio/ for ASR testing:
sample_10s_audio-aac.aac
sample_10s_audio-flac.flac
sample_10s_audio-m4a.m4a
sample_10s_audio-mp3.mp3
sample_10s_audio-mp4.m4a
sample_10s_audio-mpeg.mp3
sample_10s_audio-ogg.ogg
sample_10s_audio-wav.wav
sample_10s_audio-x-flac.flac
sample_10s_audio-x-wav.wav

sample_10s_video-avi.avi
sample_10s_video-mp4.mp4
sample_10s_video-quicktime.mov
sample_10s_video-x-msvideo.avi
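
For reference, a sketch of how these formats could look in docling's centralized registries. The FormatToExtensions / FormatToMimeType names come from the commit messages further down this thread; the exact literal values here are an assumption reconstructed from the lists above:

from docling.datamodel.base_models import (
    FormatToExtensions,
    FormatToMimeType,
    InputFormat,
)

# Pre-existing mp3/wav entries plus the additions listed above (assumed values).
FormatToExtensions[InputFormat.AUDIO] = [
    "mp3", "wav", "m4a", "aac", "ogg", "flac", "mp4", "avi", "mov",
]
FormatToMimeType[InputFormat.AUDIO] = [
    "audio/mpeg", "audio/wav", "audio/x-wav",
    "audio/mp4", "audio/m4a", "audio/aac",
    "audio/ogg", "audio/flac", "audio/x-flac",
    "video/mp4", "video/avi", "video/x-msvideo", "video/quicktime",
]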

@kensteele
Contributor Author

@PeterStaar-IBM @cau-git @dolfim-ibm Looks like I need two reviewers per the Mergify requirement: "When test data is updated, we require two reviewers."

PeterStaar-IBM previously approved these changes Oct 10, 2025
Contributor

@PeterStaar-IBM left a comment

🎖️

@dolfim-ibm
Contributor

@kensteele the current CI failures seem to be caused by the lock file pulling in a new version of av which only has wheels for Python 3.10+. I suggest pinning av<16.0.0

I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: a979a68
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 9827068
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: ebbeb45
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 2f6fd3c

Signed-off-by: Ken Steele <[email protected]>
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 5e61bf1

Signed-off-by: Ken Steele <[email protected]>
…udio/sample_10s.mp3 if no args specified.

Signed-off-by: Ken Steele <[email protected]>
…els.py

- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection

Fixes issue where only .mp3 and .wav files were processed as audio despite
CLI auto-detection working for all formats. The document converter now
properly recognizes all audio formats through MIME type detection.

Addresses review comments:
- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality

All examples and tests pass with the new centralized approach.
All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.

Signed-off-by: Ken Steele <[email protected]>
…explicit model options

Review feedback addressed:
1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
   - Previously switched if ANY file was audio, now requires ALL files to be audio
   - Added warning for mixed file types with guidance to use --pipeline asr

2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
   - Users can now force specific implementations if desired
   - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
   - Added 12 new explicit model options: _MLX and _NATIVE variants for each size

CLI now supports:
- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)

Addresses reviewer comments from @dolfim-ibm

Signed-off-by: Ken Steele <[email protected]>
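
As a quick illustration of the explicit selectors described in this commit (constant names taken from the commit message above; usage mirrors the earlier Python API example):

from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

pipeline_options = AsrPipelineOptions()
# Auto-select the best implementation for the current hardware:
pipeline_options.asr_options = asr_model_specs.WHISPER_BASE
# Or force a specific implementation:
pipeline_options.asr_options = asr_model_specs.WHISPER_BASE_MLX     # MLX only
pipeline_options.asr_options = asr_model_specs.WHISPER_BASE_NATIVE  # native only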
@kensteele force-pushed the dev/add-mlx-whisper-support branch from ffc1a57 to fec4f33 on October 18, 2025 18:02
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: c60e72d
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 9480331
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 21905e8
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 96c669d
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 8371c06

Signed-off-by: Ken Steele <[email protected]>
dolfim-ibm previously approved these changes Oct 20, 2025
Contributor

@dolfim-ibm left a comment

lgtm

…ompts

- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold

Improves reliability of ASR and VLM components by validating configuration paths and helper logic.

Signed-off-by: Ken Steele <[email protected]>
PeterStaar-IBM previously approved these changes Oct 20, 2025
Contributor

@PeterStaar-IBM left a comment

🎖️

@dolfim-ibm
Contributor

note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

@kensteele dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via f3a2ba2 on October 20, 2025 08:50
@kensteele
Contributor Author

note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

@dolfim-ibm @PeterStaar-IBM While we're waiting on the HF artifacts issue, can I get two quick reviews/approvals on the code coverage additions I just committed to pass the codecov test @ f3a2ba2

dolfim-ibm previously approved these changes Oct 20, 2025
…VLM prompts

- tests/test_asr_mlx_whisper.py
  - Add MLX/native selector coverage across all Whisper sizes
  - Validate repo_id choices under MLX and Native paths
  - Cover fallback path when MPS unavailable and mlx_whisper missing

- tests/test_asr_pipeline.py
  - Relax silent-audio assertion to accept PARTIAL_SUCCESS or SUCCESS
  - Force CPU native path in helper tests to avoid torch in device selection
  - Add language handling tests for native/MLX transcribe
  - Cover native run success (BytesIO) and failure (exception) branches
  - Cover MLX run success/failure branches with mocked transcribe
  - Add init path coverage with artifacts_path

- tests/test_interfaces.py
  - Add focused VLM prompt tests (NONE/CHAT variants)

Result: all tests passing with significantly improved coverage for ASR model selectors, pipeline execution paths, and VLM prompt formulation.

Signed-off-by: Ken Steele <[email protected]>
PeterStaar-IBM previously approved these changes Oct 20, 2025
@kensteele dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via c3783d7 on October 20, 2025 11:35
@dolfim-ibm
Contributor

@kensteele don't worry about the coverage results, they are only indicative, the PR was actually good to be finalized already 😉

dolfim-ibm previously approved these changes Oct 20, 2025
@kensteele
Contributor Author

@kensteele don't worry about the coverage results, they are only indicative, the PR was actually good to be finalized already 😉

@dolfim-ibm Well, we've got a lot more code coverage now 😅 82.71% of diff hit (target 75.25%)

I'm sure you guys are sorting out the CI disk space issue:

/home/runner/work/docling/docling/.venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:801: UserWarning: Not enough free disk space to download the file. The expected file size is: 94.25 MB. The target location 
/home/runner/.cache/huggingface/hub/models--HuggingFaceTB--SmolVLM-256M-Instruct/blobs only has 0.00 MB free disk space.

@PeterStaar-IBM one last 🎖️ for c3783d7 and PR should be good to merge

PeterStaar-IBM previously approved these changes Oct 20, 2025
Contributor

@PeterStaar-IBM left a comment

🎖️

@dolfim-ibm dismissed stale reviews from PeterStaar-IBM and themself via 9367094 on October 21, 2025 05:32
Contributor

@PeterStaar-IBM left a comment

lgtm

@dolfim-ibm changed the title from "feat: Add MLX Whisper Support for Apple Silicon ASR Pipeline" to "feat(ASR): MLX Whisper Support for Apple Silicon" on Oct 21, 2025
@dolfim-ibm merged commit 657ce8b into docling-project:main on Oct 21, 2025
23 checks passed