Add optional page/line/word bounding boxes for PDF & image inputs (--emit-bbox) + evaluation notes #1398
@microsoft-github-policy-service agree
Pull Request Overview
This PR adds an optional bounding box emission feature to MarkItDown for PDF and image inputs, along with comprehensive performance and accuracy evaluations against the Docling test dataset. When enabled via the --emit-bbox flag, the feature produces a sidecar JSON file containing page geometries, line/word-level bounding boxes, and OCR confidence values without affecting existing Markdown outputs.
Key changes:
- New opt-in --emit-bbox CLI flag and corresponding API parameter for spatial grounding capabilities
- Implementation of OCR fallback for scanned PDFs and images using Tesseract when no text layer exists
- Comprehensive evaluation documentation comparing outputs against Docling ground truth data
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.
File | Description |
---|---|
packages/markitdown/src/markitdown/bbox.py | New data structures for bounding box information (BBoxDoc, BBoxPage, BBoxLine, BBoxWord) |
packages/markitdown/src/markitdown/_base_converter.py | Added bbox parameter to DocumentConverterResult |
packages/markitdown/src/markitdown/_markitdown.py | Updated all convert methods to support emit_bbox and ocr_lang parameters |
packages/markitdown/src/markitdown/main.py | Added CLI flags --emit-bbox and --ocr-lang with sidecar file output logic |
packages/markitdown/src/markitdown/converters/_pdf_converter.py | Implemented bbox extraction for PDFs with OCR fallback for scanned documents |
packages/markitdown/src/markitdown/converters/_image_converter.py | Added OCR-based bbox extraction for image inputs |
packages/markitdown/tests/bbox/ | Comprehensive test suite covering schema validation, PDF/image processing, and CLI integration |
packages/markitdown/pyproject.toml | Added bbox optional dependency group with pdfplumber, pytesseract, Pillow, jsonschema |
x1 / width,
y1 / height,
widthw / width,
heighth / height,
The bbox_norm calculation is incorrect. It should be normalized coordinates [x1/width, y1/height, x2/width, y2/height] but is currently calculating [x1/width, y1/height, width/width, height/height]. The third and fourth values should be x2/width and y2/height respectively.
Suggested change:
- heighth / height,
+ x2 / width,
+ y2 / height,
Copilot uses AI. Check for mistakes.
x1, y1, x2, y2 = left, top, left + w, top + h
conf = float(row.conf) if row.conf != -1 else None
bbox_abs = [x1, y1, x2, y2]
bbox_norm = [x1 / width, y1 / height, w / width, h / height]
The bbox_norm calculation is inconsistent with the absolute coordinates. The bbox_abs uses [x1, y1, x2, y2] format but bbox_norm uses [x1, y1, width, height] format. This should be [x1/width, y1/height, x2/width, y2/height] to match the absolute coordinate format.
Suggested change:
- bbox_norm = [x1 / width, y1 / height, w / width, h / height]
+ bbox_norm = [x1 / width, y1 / height, x2 / width, y2 / height]
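The review comments in this thread all ask for the same convention: bbox_norm should use the same corner format as bbox_abs. A minimal sketch of that conversion follows, assuming Tesseract-style input (left, top, w, h) in pixels. The function name `ocr_box_to_bboxes` and its parameters are hypothetical and are not taken from the PR.

```python
from typing import List, Tuple


def ocr_box_to_bboxes(
    left: float, top: float, w: float, h: float,
    page_width: float, page_height: float,
) -> Tuple[List[float], List[float]]:
    """Return (bbox_abs, bbox_norm), both as [x1, y1, x2, y2] corners.

    Hypothetical helper: converts an OCR box given as origin plus size
    into corner form, then normalizes the corners by the page size.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    # Corner form: top-left (x1, y1) and bottom-right (x2, y2).
    x1, y1, x2, y2 = left, top, left + w, top + h
    bbox_abs = [x1, y1, x2, y2]

    # Normalize corners, not sizes, so both lists share one format.
    bbox_norm = [
        x1 / page_width,
        y1 / page_height,
        x2 / page_width,
        y2 / page_height,
    ]
    return bbox_abs, bbox_norm
```

Normalizing the corners rather than the width and height keeps the two fields interchangeable: multiplying bbox_norm by the page size recovers bbox_abs.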
x1 / width,
y1 / height,
(x2 - x1) / width,
(y2 - y1) / height,
Same bbox_norm calculation error as line 141. The normalized coordinates should be [x1/width, y1/height, x2/width, y2/height] but are currently [x1/width, y1/height, (x2-x1)/width, (y2-y1)/height].
Suggested change:
- (y2 - y1) / height,
+ x2 / width,
+ y2 / height,
x0 / width,
top / height,
(x1 - x0) / width,
(bottom - top) / height,
Another instance of the same bbox_norm calculation error. Should be [x0/width, top/height, x1/width, bottom/height] instead of [x0/width, top/height, (x1-x0)/width, (bottom-top)/height].
Suggested change:
- (bottom - top) / height,
+ x1 / width,
+ bottom / height,
x1 / width,
y1 / height,
(x2 - x1) / width,
(y2 - y1) / height,
Same bbox_norm calculation error. The format is inconsistent between absolute coordinates (x1, y1, x2, y2) and normalized coordinates (x1, y1, width, height).
Suggested change:
- (y2 - y1) / height,
+ x2 / width,
+ y2 / height,
x1 / width,
y1 / height,
(x2 - x1) / width,
(y2 - y1) / height,
Same bbox_norm calculation inconsistency as previous instances. Should use [x1/width, y1/height, x2/width, y2/height] format to match the absolute coordinate format.
Suggested change:
- (y2 - y1) / height,
+ x2 / width,
+ y2 / height,
@@ -0,0 +1,34 @@
import io
from pathlib import Path
import io
Duplicate import statement. The 'io' module is imported twice (lines 1 and 4).
Suggested change:
- import io
import io
import json
from pathlib import Path
import io
Duplicate import statement. The 'io' module is imported twice (lines 1 and 4).
Suggested change:
- import io
emit_bbox=emit_bbox,
ocr_lang=ocr_lang,
**kwargs,
)

def _convert(
self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
The _convert method signature needs to be updated to accept emit_bbox and ocr_lang parameters to properly pass them to converters, but the current implementation only passes them via **kwargs which could lead to inconsistent behavior.
Suggested change:
- self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
+ self,
+ *,
+ file_stream: BinaryIO,
+ stream_info_guesses: List[StreamInfo],
+ emit_bbox: bool = False,
+ ocr_lang: Optional[str] = None,
+ **kwargs
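To illustrate the point of the last comment, here is a small sketch of explicit keyword-only options being forwarded by name instead of travelling inside **kwargs. `SketchConverter` and `dispatch` are hypothetical stand-ins, not MarkItDown's actual classes or methods; only the parameter names emit_bbox and ocr_lang come from the PR.

```python
from typing import Any, Dict, Optional


class SketchConverter:
    """Hypothetical converter used only to show parameter forwarding."""

    def convert(
        self,
        data: bytes,
        *,
        emit_bbox: bool = False,
        ocr_lang: Optional[str] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        # Report what was received so the forwarding is easy to inspect.
        return {"emit_bbox": emit_bbox, "ocr_lang": ocr_lang, "extra": dict(kwargs)}


def dispatch(
    converter: SketchConverter,
    data: bytes,
    *,
    emit_bbox: bool = False,
    ocr_lang: Optional[str] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    # Named options are passed explicitly, so their defaults are applied
    # in one place; unrelated options still flow through **kwargs.
    return converter.convert(data, emit_bbox=emit_bbox, ocr_lang=ocr_lang, **kwargs)
```

With the options named in the signature, a caller that omits them gets the same defaults on every code path, which is the consistency the comment is asking for.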
Add optional page/line/word bounding boxes for PDF & image inputs (--emit-bbox) + evaluation notes
Summary
This PR adds an opt-in capability to emit page-anchored bounding boxes for text extracted from PDFs and images. When enabled, MarkItDown produces a sidecar JSON file containing page geometry, line- and word-level bounding boxes, and OCR confidence values.
The feature is off by default and does not change any existing outputs. It is designed for downstream tasks that need spatial grounding—e.g., overlaying selections, table cell alignment checks, redaction previews, or training data generation for doc-layout models.
The fork also documents performance measurements on the Docling test set and a delta analysis vs Docling’s outputs, to help frame accuracy/robustness and cost. ([GitHub]1)
CLI / API
New CLI flags
- --emit-bbox
  When present, MarkItDown writes a sidecar JSON file next to the Markdown output (e.g., sample.pdf → sample.bbox.json). Applies to PDF and image inputs. For image-only or scanned PDFs (no text layer), bounding boxes are obtained via OCR. ([GitHub]1)
- --ocr-lang <lang-codes> (optional)
  Controls OCR language(s) for cases where OCR is used. Mirrors MARKITDOWN_OCR_LANG (see Env Vars). ([GitHub]1)
Environment variables
- MARKITDOWN_OCR_LANG – default OCR language(s) when --emit-bbox triggers OCR.
- TESSDATA_PREFIX – path to custom tessdata if needed.
(Both only matter when OCR is used, i.e., scanned PDFs / images.) ([GitHub]1)
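One way the flag and the environment variable could interact is sketched below. This is an assumption for illustration only: the description says --ocr-lang mirrors MARKITDOWN_OCR_LANG but does not state a precedence, so the order used here (CLI value, then environment, then a default) and the fallback "eng" are guesses, and `resolve_ocr_lang` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_ocr_lang(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
    default: str = "eng",
) -> str:
    """Pick the OCR language string to use.

    Assumed precedence (not confirmed by the PR):
    1. the --ocr-lang value, if given;
    2. the MARKITDOWN_OCR_LANG environment variable, if set and non-empty;
    3. a default language.
    """
    if cli_value:
        return cli_value
    return env.get("MARKITDOWN_OCR_LANG") or default
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.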
Output format (sidecar JSON)
When --emit-bbox is set, MarkItDown writes <basename>.bbox.json with the following structure:
Semantics
- bbox_abs is in pixel units of the page/image, top-left origin.
- bbox_norm is normalized [x0, y0, x1, y1] in [0,1] relative to page width/height.
- md_span links a line back into result.text_content via character offsets (start, end), enabling exact highlighting in the Markdown string.
- line_id associates word items with their parent line.
- confidence is null when unavailable (e.g., embedded text), or a numeric value when the OCR engine returns one. ([GitHub]1)
How it works (high level)
[...] --ocr-lang / MARKITDOWN_OCR_LANG. TESSDATA_PREFIX is respected for custom language packs. ([GitHub]1)
This design keeps the default UX unchanged and only introduces extra work when explicitly requested.
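To make the sidecar semantics above concrete, here is a minimal sketch of how a consumer might use a single line record for highlighting and overlays. The record layout is an assumption: md_span is treated as a dict with "start" and "end" keys and an exclusive end offset, and both helper names are hypothetical. Only the field names bbox_norm and md_span come from the description.

```python
from typing import Any, Dict, Sequence, Tuple


def highlighted_text(text_content: str, line: Dict[str, Any]) -> str:
    """Slice the Markdown string using the line's md_span offsets.

    Assumes md_span == {"start": int, "end": int} with an exclusive end.
    """
    span = line["md_span"]
    return text_content[span["start"]:span["end"]]


def norm_to_pixels(
    bbox_norm: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized [x0, y0, x1, y1] box to integer pixel corners."""
    x0, y0, x1, y1 = bbox_norm
    return (
        round(x0 * width),
        round(y0 * height),
        round(x1 * width),
        round(y1 * height),
    )
```

A viewer could call `norm_to_pixels` with the size of whatever image it renders, which is the practical reason to ship a resolution-independent bbox_norm alongside bbox_abs.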
Performance notes (Docling test data)
To help reviewers understand the runtime impact, the fork includes a small timing study on the Docling test dataset (12 documents across PDF/PNG/TIFF). Highlights:
- Average Markdown conversion time: 3.18 s
- Average bbox generation time (with --emit-bbox): 5.10 s
By type (avg MD / avg BBox):
Full per-file table is included in the README section of the fork. ([GitHub]1)
Accuracy / quality observations
On the same Docling test set, a simple delta analysis compares MarkItDown outputs to Docling’s ground truth:
Bigger discrepancies were seen on right-to-left pages and scanned forms; these are called out for future iteration. (Context and numbers are documented in the fork’s README.) ([GitHub]1)
The fork also includes comparison notes against FUNSD (see funsd_bbox_comparison.md) to illustrate layout alignment behavior on form-like documents. ([GitHub]1)
Why this belongs in MarkItDown
Backwards compatibility
[...] --emit-bbox is provided.
Documentation added
The fork’s README adds a “Bounding Boxes” section with:
- [...] (--emit-bbox)
- [...] (md_span)
These doc bits can be ported verbatim or adapted into the upstream README. ([GitHub]1)
Testing & metrics (what reviewers can look at)
Schema sanity checks (implicit in the docs):
- [...] x0 ≤ x1, y0 ≤ y1;
- [...] [0,1];
- md_span ranges map back to the same text content as the corresponding line.
Benchmarks included in README: per-file timings + averages; delta analysis vs Docling ground truth (content diff and coord deviation). ([GitHub]1)
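The schema sanity checks listed above could be expressed roughly as follows. This is a sketch under stated assumptions, not the PR's test suite: the line record is assumed to be a dict with bbox_norm, md_span ({"start", "end"}, end exclusive), and a "text" key holding the line's text. The "text" key and the function name `check_line` are hypothetical.

```python
from typing import Any, Dict, List


def check_line(line: Dict[str, Any], text_content: str) -> List[str]:
    """Return a list of human-readable problems; empty means the line passes."""
    problems: List[str] = []

    x0, y0, x1, y1 = line["bbox_norm"]
    # Corner ordering: top-left must not lie past bottom-right.
    if not (x0 <= x1 and y0 <= y1):
        problems.append("bbox_norm corners out of order")
    # Normalized values must stay inside the unit square.
    if not all(0.0 <= v <= 1.0 for v in (x0, y0, x1, y1)):
        problems.append("bbox_norm outside [0, 1]")

    # The span must reproduce the line's own text from the Markdown string.
    span = line["md_span"]
    if text_content[span["start"]:span["end"]] != line["text"]:
        problems.append("md_span does not match line text")

    return problems
```

A check of this kind would also have flagged the [x, y, width, height] normalization raised in the review whenever a box's width is smaller than its left offset.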
Limitations & next steps (proposed follow-ups)
[...] md_span is provided at line-level), enabling finer-grained highlight sync.
Checklist
[...] --emit-bbox), default behavior unchanged
Screenshots / Examples
Run:
markitdown sample.pdf --emit-bbox # Produces: sample.md and sample.bbox.json
Excerpt from the emitted JSON is shown above; see the fork README for full details and per-file timings. ([GitHub]1)
Thanks for reviewing! Happy to split this into a docs-only PR + feature PR if you prefer, or to iterate on schema details (e.g., add per-word md_span, include page DPI, or expose rotation angles).