@HossyWorlds HossyWorlds commented Jul 19, 2025

Add batch processing capability for directory conversion

This PR is related to #1371.

Changes Made

This PR adds batch processing functionality to the MarkItDown CLI, allowing users to convert multiple files in a directory to Markdown format in a single operation.

New CLI Options

  • -b, --batch: Enable batch processing mode
  • -r, --recursive: Process subdirectories recursively
  • --types: Filter by specific file extensions (e.g., pdf,docx,pptx)
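As a rough illustration of how these three options could be declared, here is a minimal argparse sketch. It is not taken from the PR's __main__.py: the parser layout and the helper name parse_types are assumptions made for this example, and only the flag names come from the list above.

```python
import argparse
from typing import Set


def parse_types(value: str) -> Set[str]:
    """Hypothetical helper: turn 'pdf,docx,.PPTX' into {'.pdf', '.docx', '.pptx'}."""
    return {
        "." + part.strip().lower().lstrip(".")
        for part in value.split(",")
        if part.strip()
    }


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; the real CLI defines more options than shown here."""
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("filename", nargs="?", help="input file, or a directory in batch mode")
    parser.add_argument("-o", "--output", help="output file, or output directory in batch mode")
    parser.add_argument("-b", "--batch", action="store_true", help="enable batch processing mode")
    parser.add_argument("-r", "--recursive", action="store_true", help="process subdirectories recursively")
    parser.add_argument("--types", type=parse_types, default=None, help="comma-separated extensions to include")
    return parser


args = build_parser().parse_args(
    ["--batch", "./documents", "--recursive", "--types", "pdf,docx", "--output", "./converted"]
)
```

Because --batch is a plain switch in this sketch, the directory is read from the positional argument, which matches the invocation shape used in the usage examples.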

Implementation Details

  • Added batch processing logic to __main__.py
  • Maintains directory structure in output
  • Supports all existing MarkItDown file formats
  • Integrates seamlessly with existing options (--use-plugins, --use-docintel, etc.)
  • Provides progress reporting and error handling
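The points above can be sketched with pathlib as follows. This is an assumption-laden illustration, not the PR's code: collect_files, mirrored_output_dir and run_batch are hypothetical names, and the convert callable stands in for whatever conversion the CLI performs.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Set, Tuple


def collect_files(root: Path, recursive: bool = False,
                  types: Optional[Set[str]] = None) -> List[Path]:
    """Hypothetical helper: list files under root, optionally filtered by suffix."""
    pattern = "**/*" if recursive else "*"
    files = [p for p in root.glob(pattern) if p.is_file()]
    if types is not None:
        files = [p for p in files if p.suffix.lower() in types]
    return sorted(files)


def mirrored_output_dir(src: Path, root: Path, output_root: Path) -> Path:
    """Hypothetical helper: keep the source's relative directory under output_root."""
    return output_root / src.relative_to(root).parent


def run_batch(files: List[Path], convert: Callable[[Path], None]) -> Tuple[int, int]:
    """Report progress per file and keep going when a single conversion fails."""
    succeeded = failed = 0
    for index, src in enumerate(files, start=1):
        print(f"[{index}/{len(files)}] Processing: {src.name}")
        try:
            convert(src)
            succeeded += 1
        except Exception as exc:  # one bad file should not abort the whole batch
            failed += 1
            print(f"Failed: {src.name} ({exc})")
    return succeeded, failed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "documents"
    (root / "reports").mkdir(parents=True)
    (root / "a.pdf").touch()
    (root / "notes.txt").touch()
    (root / "reports" / "q1.docx").touch()

    wanted = {".pdf", ".docx"}
    top_level = [p.name for p in collect_files(root, types=wanted)]
    found = collect_files(root, recursive=True, types=wanted)
    relative = [p.relative_to(root).as_posix() for p in found]
    target = mirrored_output_dir(root / "reports" / "q1.docx", root, Path("converted"))
```

Computing the output directory from the path relative to the input root is what keeps reports/q1.docx under converted/reports rather than flattening everything into one folder.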

User Pain Points Solved

  • Efficiency: Eliminates the need to run individual commands for each file
  • Consistency: Ensures all files are processed with the same settings
  • Scalability: Handles large document collections efficiently
  • Workflow Integration: Better integration with automated processing pipelines

Usage Examples

# Basic batch processing
markitdown --batch ./documents --output ./converted

# Recursive processing with file type filter
markitdown --batch ./documents --recursive --types pdf,docx,pptx --output ./converted

# With existing options
markitdown --batch ./documents --use-plugins --output ./converted

Testing

All tests pass successfully:

  • ✅ Existing functionality tests (single file conversion, stdin processing, etc.)
  • ✅ New batch processing tests
  • ✅ Error handling tests
  • ✅ Integration tests with existing options
  • ✅ Backward compatibility verified

Test Coverage

  • Added comprehensive CLI tests in test_cli_misc.py
  • Verified existing functionality remains intact
  • Tested error cases and edge conditions
  • Confirmed proper integration with existing options

Backward Compatibility

This change is fully backward compatible:

  • All existing CLI commands continue to work as before
  • No breaking changes to the API
  • Existing options (--use-plugins, --use-docintel, etc.) work seamlessly with batch mode

Files Modified

  • packages/markitdown/src/markitdown/__main__.py: Added batch processing logic
  • packages/markitdown/tests/test_cli_misc.py: Added comprehensive tests for new functionality

@HossyWorlds

@microsoft-github-policy-service agree

@tifilipebr tifilipebr left a comment

I liked this PR, I think it's useful, and I left two suggestions for improvement in the implementation. Congratulations on the work!

@HossyWorlds

@tifilipebr
Thank you for your review.

I addressed the feedback on reusing the existing extension references and separating file validation: the hardcoded extension list is removed, and batch mode now relies on the existing validation system.

@HossyWorlds HossyWorlds requested a review from tifilipebr July 20, 2025 03:33
@janthmueller

Hey, really looking forward to this getting merged!

I ran a quick test and found a potential issue with the current use of with_suffix('.md') in _handle_batch_processing. It replaces the original file suffix, so files with the same name but different extensions overwrite each other.

Here’s the test I ran:

~/markitdown pr-1372* python-3.12.3 ❯ mkdir test
~/markitdown pr-1372* python-3.12.3 ❯ touch test/test.md test/test.txt test/test.py
~/markitdown pr-1372* python-3.12.3 ❯ markitdown -b test
Found 3 files to process
[1/3] Processing: test.md
✓ Success: test.md
[2/3] Processing: test.py
✓ Success: test.py
[3/3] Processing: test.txt
✓ Success: test.txt

Batch processing complete!
Success: 3 files
Failed: 0 files
Unsupported: 0 files
Output directory: test/converted
~/markitdown pr-1372* python-3.12.3 ❯ ls test/converted
test.md

Because with_suffix('.md') replaces the suffix, all files end up saved as test.md in the output directory, overwriting each other.

I think it would be better to append .md instead of replacing the suffix, or at least provide an option to control this behavior with proper error handling.
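The difference between the two naming strategies can be reproduced with pathlib alone. The snippet below is a small sketch using the three file names from the test above; appended_md_name is a hypothetical helper name, not a function from the PR.

```python
from pathlib import Path

sources = [Path("test.md"), Path("test.py"), Path("test.txt")]


def appended_md_name(src: Path) -> Path:
    """Hypothetical helper: keep the original suffix and add '.md' after it."""
    return src.with_name(src.name + ".md")


# Replacing the suffix maps all three inputs to the same output name.
replaced = {src.with_suffix(".md").name for src in sources}

# Appending keeps one distinct output name per input.
appended = {appended_md_name(src).name for src in sources}

print(sorted(replaced))  # ['test.md']
print(sorted(appended))  # ['test.md.md', 'test.py.md', 'test.txt.md']
```

Appending a fixed suffix never maps two different source names to the same output name, which is why it avoids the overwrite without needing an extra collision check.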

Changed from using with_suffix('.md') to appending '.md' to preserve
original filenames. This prevents files with same base name but different
extensions (e.g., test.txt, test.py, test.md) from overwriting each other
in the output directory.

Fixes issue where batch processing would overwrite files, causing data loss.
@HossyWorlds

@janthmueller
Thank you for reviewing!
I've fixed it.

@tomtom215

Curious whether this will be merged soon, as it would be a great feature to have out of the box!
