64 changes: 64 additions & 0 deletions README.md
@@ -94,6 +94,70 @@ You can also pipe content:
cat path-to-file.pdf | markitdown
```

### Bounding Boxes

Use `--emit-bbox` to generate a sidecar JSON file with page, line, and word bounding boxes for PDF and image inputs:

```bash
markitdown sample.pdf --emit-bbox
```

This writes `sample.bbox.json` alongside the Markdown output. The structure of the JSON file is:

```json
{
  "version": "1.0",
  "source": "sample.pdf",
  "pages": [{ "page": 1, "width": 612, "height": 792 }],
  "lines": [
    {
      "page": 1,
      "text": "Hello",
      "bbox_norm": [0, 0, 0, 0],
      "bbox_abs": [0, 0, 0, 0],
      "confidence": null,
      "md_span": { "start": null, "end": null }
    }
  ],
  "words": [
    {
      "page": 1,
      "text": "Hello",
      "bbox_norm": [0, 0, 0, 0],
      "bbox_abs": [0, 0, 0, 0],
      "confidence": null,
      "line_id": 0
    }
  ]
}
```

`bbox_abs` values are in pixel units of the page or image, with a top-left origin. `bbox_norm` values are normalized to the range `[0,1]`.
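
As a rough illustration of how the two coordinate systems relate, the sketch below scales a `bbox_norm` entry back to absolute units using the page dimensions from the `pages` array. It assumes each box is ordered `[x1, y1, x2, y2]`, which the schema above does not state explicitly, and `norm_to_abs` is a hypothetical helper rather than part of MarkItDown. The sample values are made up.

```python
import json


def norm_to_abs(bbox_norm, width, height):
    """Scale a normalized box to absolute units.

    Assumption: boxes are ordered [x1, y1, x2, y2] with a top-left origin.
    """
    x1, y1, x2, y2 = bbox_norm
    return [x1 * width, y1 * height, x2 * width, y2 * height]


# A tiny document in the sidecar layout (illustrative values only).
sidecar = json.loads("""
{
  "pages": [{"page": 1, "width": 612, "height": 792}],
  "lines": [{"page": 1, "text": "Hello", "bbox_norm": [0.25, 0.5, 0.75, 1.0]}]
}
""")

# Look up each line's page size, then convert its normalized box.
page_sizes = {p["page"]: (p["width"], p["height"]) for p in sidecar["pages"]}
for line in sidecar["lines"]:
    width, height = page_sizes[line["page"]]
    print(line["text"], norm_to_abs(line["bbox_norm"], width, height))
```

Comparing the result with the stored `bbox_abs` for the same entry is a quick way to check which coordinate order a given sidecar file actually uses.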

For scanned PDFs or images without embedded text, MarkItDown falls back to Tesseract OCR when `--emit-bbox` is supplied. Set `MARKITDOWN_OCR_LANG` (or pass `--ocr-lang`) to control the OCR languages. If you have installed custom language packs, set `TESSDATA_PREFIX` so that Tesseract can find them.
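
When driving the CLI from a script, the options above could be assembled along these lines. This is only a sketch: `build_bbox_command` is a hypothetical helper, it assumes `--ocr-lang` takes the language code as its next argument, and the file name and tessdata path are placeholders.

```python
import os


def build_bbox_command(path, ocr_lang=None, tessdata_dir=None):
    """Return (argv, env) for a MarkItDown run with bounding boxes enabled."""
    argv = ["markitdown", path, "--emit-bbox"]
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    if ocr_lang is not None:
        # Assumption: the flag is followed by the language code.
        argv += ["--ocr-lang", ocr_lang]
    if tessdata_dir is not None:
        # Tesseract reads its language data location from this variable.
        env["TESSDATA_PREFIX"] = tessdata_dir
    return argv, env


argv, env = build_bbox_command("scan.pdf", ocr_lang="eng")
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```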

For an example comparison with Docling outputs, see [docling_comparison.md](docling_comparison.md).
For a comprehensive evaluation on the Docling test dataset, see [docling_dataset_comparison.md](docling_dataset_comparison.md).
Across the 12 supported documents, MarkItDown's Markdown differed from the Docling ground truth by roughly **45%** on average,
with bounding box coordinates deviating by about **18%**. Right-to-left pages and scanned forms contributed most of the
discrepancies.

### Docling Test Data Timing

The following table reports the time required by `markitdown` to convert each PDF, TIFF, and PNG file from the [Docling test dataset](https://github.com/docling-project/docling/tree/main/tests/data) into Markdown and to generate bounding boxes (`--emit-bbox`). The TIFF sample was first converted to PNG for processing.

| File | Type | MD Time (s) | BBox Time (s) |
| --- | --- | --- | --- |
| 2305.03393v1-pg9-img.png | png | 2.51 | 5.56 |
| 2203.01017v2.pdf | pdf | 4.59 | 9.30 |
| 2206.01062.pdf | pdf | 4.94 | 11.21 |
| 2305.03393v1-pg9.pdf | pdf | 2.69 | 2.88 |
| 2305.03393v1.pdf | pdf | 3.71 | 6.70 |
| amt_handbook_sample.pdf | pdf | 3.14 | 3.99 |
| code_and_formula.pdf | pdf | 2.80 | 3.24 |
| multi_page.pdf | pdf | 2.89 | 3.93 |
| picture_classification.pdf | pdf | 2.68 | 2.92 |
| redp5110_sampled.pdf | pdf | 3.71 | 8.67 |
| right_to_left_01.pdf | pdf | 2.83 | 2.87 |
| right_to_left_02.pdf | pdf | 2.70 | 3.01 |
| right_to_left_03.pdf | pdf | 2.81 | 2.93 |
| 2206.01062.tif | tiff | 2.57 | 4.19 |

#### Average Times by Type

| Type | Avg MD Time (s) | Avg BBox Time (s) |
| --- | --- | --- |
| png | 2.51 | 5.56 |
| pdf | 3.29 | 5.14 |
| tiff | 2.57 | 4.19 |

#### Overall Average Times

* Average MD Time: 3.18 s
* Average BBox Time: 5.10 s
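
The per-type and overall averages can be re-derived from the per-file table. The sketch below copies the timings verbatim and recomputes the means; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (file, type, md_seconds, bbox_seconds), copied from the per-file table.
TIMINGS = [
    ("2305.03393v1-pg9-img.png", "png", 2.51, 5.56),
    ("2203.01017v2.pdf", "pdf", 4.59, 9.30),
    ("2206.01062.pdf", "pdf", 4.94, 11.21),
    ("2305.03393v1-pg9.pdf", "pdf", 2.69, 2.88),
    ("2305.03393v1.pdf", "pdf", 3.71, 6.70),
    ("amt_handbook_sample.pdf", "pdf", 3.14, 3.99),
    ("code_and_formula.pdf", "pdf", 2.80, 3.24),
    ("multi_page.pdf", "pdf", 2.89, 3.93),
    ("picture_classification.pdf", "pdf", 2.68, 2.92),
    ("redp5110_sampled.pdf", "pdf", 3.71, 8.67),
    ("right_to_left_01.pdf", "pdf", 2.83, 2.87),
    ("right_to_left_02.pdf", "pdf", 2.70, 3.01),
    ("right_to_left_03.pdf", "pdf", 2.81, 2.93),
    ("2206.01062.tif", "tiff", 2.57, 4.19),
]

# Group the (md, bbox) pairs by file type.
by_type = defaultdict(list)
for _, kind, md, bbox in TIMINGS:
    by_type[kind].append((md, bbox))

# Mean of each column per type, rounded to two decimals like the tables.
averages = {
    kind: (
        round(mean([md for md, _ in rows]), 2),
        round(mean([bbox for _, bbox in rows]), 2),
    )
    for kind, rows in by_type.items()
}

# Mean over all 14 files, regardless of type.
overall = (
    round(mean([row[2] for row in TIMINGS]), 2),
    round(mean([row[3] for row in TIMINGS]), 2),
)

print(averages)
print(overall)
```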

### Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:

27 changes: 27 additions & 0 deletions docling_comparison.md
@@ -0,0 +1,27 @@
# Docling vs MarkItDown on ocr_test.pdf

This document compares the outputs of [Docling](https://github.com/docling-project/docling) and the current MarkItDown implementation on the sample `ocr_test.pdf`.

## Markdown comparison
- Normalized similarity ratio: 1.00 (the two outputs contain the same text and differ only in line wrapping)

```diff
--- docling
+++ markitdown
@@ -1 +1,3 @@
-Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package
+Docling bundles PDF document conversion to
+JSON and Markdown in an easy self contained
+package
```
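
This report does not say how the normalized similarity ratio is computed. One plausible reading, sketched below, is to collapse all whitespace and then compare the two strings with Python's `difflib.SequenceMatcher`. Both the normalization step and the choice of `difflib` are assumptions, and `normalized_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def normalized_similarity(a, b):
    """Similarity in [0, 1] after collapsing runs of whitespace to one space."""
    norm_a = " ".join(a.split())
    norm_b = " ".join(b.split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# The two outputs from the diff above: same words, different line wrapping.
docling_text = (
    "Docling bundles PDF document conversion to JSON and Markdown "
    "in an easy self contained package"
)
markitdown_text = (
    "Docling bundles PDF document conversion to\n"
    "JSON and Markdown in an easy self contained\n"
    "package"
)

ratio = normalized_similarity(docling_text, markitdown_text)
print(f"{ratio:.2f}")
```

Under this reading, a ratio of 1.00 is consistent with a non-empty line diff, because the diff reflects only the differing line breaks.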

## Bounding box comparison (first line)
Page size (MarkItDown): 1654 x 2339 px

| coordinate | Docling (scaled) | MarkItDown | abs diff | norm diff |
|-----------:|------------------:|-----------:|---------:|----------:|
| x1 | 193.63 | 205.00 | 11.37 | 0.0069 |
| y1 | 213.92 | 217.00 | 3.08 | 0.0013 |
| x2 | 1402.98 | 1398.00 | 4.98 | 0.0030 |
| y2 | 424.81 | 268.00 | 156.81 | 0.0670 |
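
The values in the table are consistent with the absolute difference being divided by the page width for x coordinates and by the page height for y coordinates. The sketch below reproduces both columns under that reading; `bbox_diffs` is an illustrative helper, not part of either tool.

```python
PAGE_WIDTH, PAGE_HEIGHT = 1654, 2339  # MarkItDown page size in px


def bbox_diffs(reference, candidate, width, height):
    """Return {label: (abs_diff, norm_diff)} for two [x1, y1, x2, y2] boxes."""
    labels = ("x1", "y1", "x2", "y2")
    result = {}
    for index, (label, ref, cand) in enumerate(zip(labels, reference, candidate)):
        # Even positions are x coordinates, odd positions are y coordinates.
        extent = width if index % 2 == 0 else height
        abs_diff = abs(ref - cand)
        result[label] = (round(abs_diff, 2), round(abs_diff / extent, 4))
    return result


docling_box = (193.63, 213.92, 1402.98, 424.81)      # Docling (scaled)
markitdown_box = (205.00, 217.00, 1398.00, 268.00)   # MarkItDown

diffs = bbox_diffs(docling_box, markitdown_box, PAGE_WIDTH, PAGE_HEIGHT)
for label, (abs_diff, norm_diff) in diffs.items():
    print(f"{label}: abs={abs_diff:.2f} norm={norm_diff:.4f}")
```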

22 changes: 22 additions & 0 deletions docling_dataset_comparison.md
@@ -0,0 +1,22 @@
# Docling vs MarkItDown on Docling Test Dataset

This report compares the Docling ground-truth outputs (`docling_v2`) with the current MarkItDown conversion of the PDF and TIFF files from the [Docling test dataset](https://github.com/docling-project/docling/tree/main/tests/data). For each document, we compute the normalized similarity ratio between the Docling and MarkItDown Markdown outputs, and the absolute and normalized differences between the first-line bounding box coordinates.

| File | Markdown similarity | Markdown diff (%) | x1 abs | y1 abs | x2 abs | y2 abs | x1 norm | y1 norm | x2 norm | y2 norm | Avg bbox diff (%) |
|------|--------------------:|------------------:|-------:|-------:|-------:|-------:|--------:|--------:|--------:|--------:|------------------:|
| 2203.01017v2 | 0.68 | 32.00 | 0.00 | 1.44 | 0.00 | 0.00 | 0.0000 | 0.0018 | 0.0000 | 0.0000 | 0.04 |
| 2206.01062 | 0.55 | 45.00 | 0.00 | 1.77 | 0.00 | 0.18 | 0.0000 | 0.0022 | 0.0000 | 0.0002 | 0.06 |
| 2305.03393v1-pg9 | 0.78 | 22.00 | 0.00 | 0.16 | 33.04 | 0.74 | 0.0000 | 0.0002 | 0.0540 | 0.0009 | 1.38 |
| 2305.03393v1 | 0.77 | 23.00 | 0.00 | 1.67 | 0.00 | 0.00 | 0.0000 | 0.0021 | 0.0000 | 0.0000 | 0.05 |
| amt_handbook_sample | 0.48 | 52.00 | 44.91 | 658.38 | 438.61 | 656.85 | 0.0756 | 0.8506 | 0.7384 | 0.8486 | 62.83 |
| code_and_formula | 0.67 | 33.00 | 0.00 | 1.72 | 0.00 | 0.03 | 0.0000 | 0.0022 | 0.0000 | 0.0000 | 0.06 |
| multi_page | 0.97 | 3.00 | 0.00 | 1.47 | 0.00 | 0.66 | 0.0000 | 0.0017 | 0.0000 | 0.0008 | 0.06 |
| picture_classification | 0.98 | 2.00 | 0.00 | 1.72 | 0.01 | 0.03 | 0.0000 | 0.0022 | 0.0000 | 0.0000 | 0.06 |
| redp5110_sampled | 0.53 | 47.00 | 250.92 | 724.48 | 320.24 | 714.36 | 0.4100 | 0.9148 | 0.5233 | 0.9020 | 68.75 |
| right_to_left_01 | 0.05 | 95.00 | 63.72 | 1.45 | 0.00 | 0.70 | 0.1041 | 0.0018 | 0.0000 | 0.0009 | 2.67 |
| right_to_left_02 | 0.02 | 98.00 | 23.15 | 594.43 | 378.81 | 595.51 | 0.0389 | 0.7060 | 0.6364 | 0.7073 | 52.22 |
| right_to_left_03 | 0.08 | 92.00 | 419.00 | 48.07 | 238.12 | 51.77 | 0.7038 | 0.0571 | 0.4000 | 0.0615 | 30.56 |
| 2206.01062_tif | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| **Overall (avg)** | 0.55 | 45.33 | | | | | | | | | 18.23 |

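Overall, MarkItDown's Markdown output is about **54.7%** similar to the Docling ground truth (45.33% different) across the 12 supported documents. Bounding box coordinates diverge by an average of **18.23%**, with right-to-left samples and scanned forms contributing most of the error: the largest bounding box deviations are on `redp5110_sampled`, `amt_handbook_sample`, `right_to_left_02`, and `right_to_left_03`, and the three right-to-left samples have the lowest Markdown similarity.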
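
The derived columns can be checked against the raw ones. In the table, the Markdown diff matches `(1 - similarity) * 100`, and the average bounding box diff matches the mean of the four normalized differences expressed as a percentage. The sketch below recomputes two rows and the overall averages; the function names are illustrative.

```python
from statistics import mean


def markdown_diff_percent(similarity):
    """Markdown diff (%) derived from a similarity ratio in [0, 1]."""
    return round((1 - similarity) * 100, 2)


def avg_bbox_diff_percent(norm_diffs):
    """Mean of the four normalized coordinate differences, as a percentage."""
    return round(mean(norm_diffs) * 100, 2)


# Row 2305.03393v1-pg9: x1, y1, x2, y2 normalized differences.
pg9_bbox = avg_bbox_diff_percent([0.0000, 0.0002, 0.0540, 0.0009])
# Row 2203.01017v2: similarity 0.68.
first_md = markdown_diff_percent(0.68)

# Per-file columns from the table (the TIFF row is N/A and is left out).
md_diffs = [32, 45, 22, 23, 52, 33, 3, 2, 47, 95, 98, 92]
bbox_diffs_pct = [0.04, 0.06, 1.38, 0.05, 62.83, 0.06,
                  0.06, 0.06, 68.75, 2.67, 52.22, 30.56]

overall_md = round(mean(md_diffs), 2)
overall_bbox = round(mean(bbox_diffs_pct), 2)

print(pg9_bbox, first_md, overall_md, overall_bbox)
```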