feat: Add batch processing capability for directory conversion #1372
I liked this PR, I think it's useful, and I left two suggestions for improvement in the implementation. Congratulations on the work!
@tifilipebr Addressed the feedback on reusing existing extension references and separating file validation: the hardcoded extension list is removed, and the code now leverages the existing validation system.
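Hey, really looking forward to this getting merged! I ran a quick test and found a potential issue with the current use of with_suffix('.md'): it replaces the original extension, so files that share a base name map to the same output file. I think it would be better to append '.md' to the original filename instead.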
Changed from using with_suffix('.md') to appending '.md' so that original filenames are preserved. This prevents files with the same base name but different extensions (e.g., test.txt, test.py, test.md) from overwriting each other in the output directory. Fixes an issue where batch processing would overwrite files, causing data loss.
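To make the collision concrete, here is a small standalone sketch using only pathlib. The helper names `output_name_with_suffix` and `output_name_appended` are hypothetical and are not taken from the PR; they only contrast the two naming strategies discussed above.

```python
from pathlib import Path


def output_name_with_suffix(src: Path) -> str:
    # Replaces the last extension: "test.txt" becomes "test.md".
    return src.with_suffix(".md").name


def output_name_appended(src: Path) -> str:
    # Keeps the full original name: "test.txt" becomes "test.txt.md".
    return src.name + ".md"


sources = [Path("test.txt"), Path("test.py"), Path("test.md")]

replaced = [output_name_with_suffix(p) for p in sources]
appended = [output_name_appended(p) for p in sources]

print(replaced)  # ['test.md', 'test.md', 'test.md']
print(appended)  # ['test.txt.md', 'test.py.md', 'test.md.md']
```

With suffix replacement all three inputs map to one output name, so later writes would overwrite earlier ones; appending keeps the names distinct.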
@janthmueller
Curious if this will be merged soon, as it would be a great feature to have out of the box!
Add batch processing capability for directory conversion
This PR is related to #1371
Changes Made
This PR adds batch processing functionality to the MarkItDown CLI, allowing users to convert multiple files in a directory to Markdown format in a single operation.
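As a rough illustration of what directory-level conversion involves, the sketch below walks a directory, optionally recurses, filters by extension, and writes one Markdown file per input. It is not the PR's code: `collect_files`, `convert_directory`, and the `convert` callable are hypothetical names, and `convert` stands in for whatever turns a single file into Markdown text.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def collect_files(
    root: Path,
    recursive: bool = False,
    types: Optional[Iterable[str]] = None,
) -> list[Path]:
    """Return the files under root, optionally recursing and filtering by extension."""
    pattern = "**/*" if recursive else "*"
    # Normalise "pdf" and ".PDF" to the same key.
    wanted = {t.lower().lstrip(".") for t in types} if types else None
    files = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        if wanted is not None and path.suffix.lower().lstrip(".") not in wanted:
            continue
        files.append(path)
    return files


def convert_directory(
    root: Path,
    out_dir: Path,
    convert: Callable[[Path], str],  # hypothetical: maps one file to Markdown text
    recursive: bool = False,
    types: Optional[Iterable[str]] = None,
) -> list[Path]:
    """Convert every selected file and mirror the directory layout under out_dir."""
    written = []
    for src in collect_files(root, recursive, types):
        # Append ".md" to the full name so same-stem inputs do not collide.
        target = out_dir / src.relative_to(root).parent / (src.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(src), encoding="utf-8")
        written.append(target)
    return written
```

The file list is collected before anything is written, so an output directory nested inside the input directory does not feed its own results back into the same run.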
New CLI Options
-b, --batch: Enable batch processing mode
-r, --recursive: Process subdirectories recursively
--types: Filter by specific file extensions (e.g., pdf,docx,pptx)
Implementation Details
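For readers unfamiliar with how such flags are usually wired up, the three options listed above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser from `__main__.py`: the program name, the positional `path` argument, and the `parse_types` helper are all hypothetical.

```python
import argparse


def parse_types(value: str) -> list[str]:
    # Hypothetical helper: "pdf, .DOCX,pptx" -> ["pdf", "docx", "pptx"].
    return [
        part.strip().lower().lstrip(".")
        for part in value.split(",")
        if part.strip()
    ]


def build_parser() -> argparse.ArgumentParser:
    # "batch-sketch" and the positional "path" are assumptions for illustration.
    parser = argparse.ArgumentParser(prog="batch-sketch")
    parser.add_argument("path", help="file or directory to convert")
    parser.add_argument(
        "-b", "--batch", action="store_true",
        help="enable batch processing mode",
    )
    parser.add_argument(
        "-r", "--recursive", action="store_true",
        help="process subdirectories recursively",
    )
    parser.add_argument(
        "--types", type=parse_types, default=None,
        help="comma-separated file extensions, e.g. pdf,docx,pptx",
    )
    return parser


args = build_parser().parse_args(
    ["--batch", "-r", "--types", "pdf,docx,pptx", "docs"]
)
print(args.batch, args.recursive, args.types)
# True True ['pdf', 'docx', 'pptx']
```

Parsing `--types` into a normalised list at the argparse layer keeps the later filtering step free of string handling.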
__main__.py
(--use-plugins, --use-docintel, etc.)
User Pain Points Solved
Usage Examples
Testing
All tests pass successfully:
Test Coverage
test_cli_misc.py
Backward Compatibility
This change is fully backward compatible:
Existing options (--use-plugins, --use-docintel, etc.) work seamlessly with batch mode
Files Modified
packages/markitdown/src/markitdown/__main__.py: Added batch processing logic
packages/markitdown/tests/test_cli_misc.py: Added comprehensive tests for new functionality