-
Couldn't load subscription status.
- Fork 3k
feat(ASR): MLX Whisper Support for Apple Silicon #2366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ASR): MLX Whisper Support for Apple Silicon #2366
Conversation
|
✅ DCO Check Passed Thanks @kensteele, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
|
@kensteele this is an awesome PR, thanks a ton!! |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@kensteele I think you might need to add some restriction on the tests |
@PeterStaar-IBM Hopefully that should clear the tests! 🤞 |
@PeterStaar-IBM You bet - happy to contribute more! Looks like all the the checks have passed - possible to get a review and merge? @dolfim-ibm @cau-git |
|
@PeterStaar-IBM Please review the latest commit f114d45 which addresses your comments in the previous review as well as: Adds comprehensive support the following additional audio input formats:
Adds support for the following additional MIME types:
Adds additional sample audio files:
|
|
@PeterStaar-IBM @cau-git @dolfim-ibm Looks like I need two reviewers per the Mergify requirements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎖️
f114d45 to
712b586
Compare
|
@kensteele the current CI failures seem to be caused by the lock of |
…omatically if present.
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: a979a68 I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 9827068 I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: ebbeb45 I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 2f6fd3c Signed-off-by: Ken Steele <[email protected]>
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 5e61bf1 Signed-off-by: Ken Steele <[email protected]>
…udio/sample_10s.mp3 if no args specified. Signed-off-by: Ken Steele <[email protected]>
…els.py - Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO] - Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov - Update FormatToMimeType mapping to include MIME types for all audio formats - Update CLI auto-detection to use centralized FormatToExtensions mapping - Add comprehensive tests for audio file auto-detection and pipeline selection - Ensure explicit pipeline choices are not overridden by auto-detection Fixes issue where only .mp3 and .wav files were processed as audio despite CLI auto-detection working for all formats. The document converter now properly recognizes all audio formats through MIME type detection. Addresses review comments: - Centralizes audio extensions in base_models.py as suggested - Maintains existing auto-detection behavior while using centralized data - Adds proper test coverage for the audio detection functionality All examples and tests pass with the new centralized approach. All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly. Signed-off-by: Ken Steele <[email protected]>
…explicit model options Review feedback addressed: 1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio - Previously switched if ANY file was audio, now requires ALL files to be audio - Added warning for mixed file types with guidance to use --pipeline asr 2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options - Users can now force specific implementations if desired - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware - Added 12 new explicit model options: _MLX and _NATIVE variants for each size CLI now supports: - Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware) - Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX) - Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native) Addresses reviewer comments from @dolfim-ibm Signed-off-by: Ken Steele <[email protected]>
ffc1a57 to
fec4f33
Compare
I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: c60e72d I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 9480331 I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 21905e8 I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 96c669d I, Ken Steele <[email protected]>, hereby add my Signed-off-by to this commit: 8371c06 Signed-off-by: Ken Steele <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…ompts - tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids) - tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring - tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold Improves reliability of ASR and VLM components by validating configuration paths and helper logic. Signed-off-by: Ken Steele <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎖️
|
note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today. |
f3a2ba2
@dolfim-ibm @PeterStaar-IBM While we're waiting on the HF artifacts issue, can I get two quick reviews/approvals on the code coverage additions I just committed to pass the codecov test @ f3a2ba2 |
…VLM prompts - tests/test_asr_mlx_whisper.py - Add MLX/native selector coverage across all Whisper sizes - Validate repo_id choices under MLX and Native paths - Cover fallback path when MPS unavailable and mlx_whisper missing - tests/test_asr_pipeline.py - Relax silent-audio assertion to accept PARTIAL_SUCCESS or SUCCESS - Force CPU native path in helper tests to avoid torch in device selection - Add language handling tests for native/MLX transcribe - Cover native run success (BytesIO) and failure (exception) branches - Cover MLX run success/failure branches with mocked transcribe - Add init path coverage with artifacts_path - tests/test_interfaces.py - Add focused VLM prompt tests (NONE/CHAT variants) Result: all tests passing with significantly improved coverage for ASR model selectors, pipeline execution paths, and VLM prompt formulation. Signed-off-by: Ken Steele <[email protected]>
c3783d7
|
@kensteele don't worry about the coverage results, they are only indicative, the PR was actually good to be finalized already 😉 |
@dolfim-ibm Well, we've got a lot more code coverage now 😅 I'm sure you guys are sorting out the CI disk space issue: @PeterStaar-IBM one last 🎖️ for c3783d7 and PR should be good to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎖️
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
9367094
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Description
This PR adds comprehensive MLX Whisper support to the docling ASR pipeline, providing significant performance improvement on Apple Silicon devices through automatic hardware-aware model selection. The integration is completely transparent to users - they simply use regular Whisper models and get MLX optimization automatically when beneficial.
Issue resolved by this Pull Request:
Resolves #2364
Key Features
WHISPER_TURBO,WHISPER_BASE, etc.Performance Results
Actual performance comparison on Apple Silicon (M1/M2/M3) using 10-second audio sample:
Key insights:
Technical Implementation
1. MLX Framework Integration
2. Automatic Model Selection
3. CLI Auto-Detection
Documentation and Examples
Updated Examples
docs/examples/minimal_asr_pipeline.py: Updated to show automatic model selectiondocs/examples/mlx_whisper_example.py: New example demonstrating MLX Whisper usagedocs/examples/asr_pipeline_performance_comparison.py: New performance comparison script with--audioparameterUsage Examples
Python API:
CLI:
Performance Comparison:
Testing
Comprehensive Test Coverage
Test Results
Dependencies
Added MLX Whisper Dependency
asrextra, doesn't affect core functionalityUser Experience
ASR Pipeline
docling CLI
Benefits
Files Changed
docling/datamodel/pipeline_options_asr_model.py: Added MLX framework and optionsdocling/datamodel/asr_model_specs.py: Implemented automatic model selectiondocling/pipeline/asr_pipeline.py: Added MLX Whisper model implementationdocling/cli/main.py: Added automatic pipeline detection and device configurationdocs/examples/minimal_asr_pipeline.py: Updated documentationdocs/examples/mlx_whisper_example.py: New MLX Whisper exampledocs/examples/asr_pipeline_performance_comparison.py: New performance comparison scripttests/test_asr_mlx_whisper.py: Comprehensive test suitepyproject.toml: Added MLX Whisper dependencyuv.lock: Updated dependency lock fileChecklist
Conclusion
This PR delivers a complete, transparent MLX Whisper integration that provides significant performance improvements on Apple Silicon while maintaining full backward compatibility. Users get the benefits of MLX optimization without any configuration changes, making it a true "just works" enhancement to the docling ASR pipeline.