66 changes: 66 additions & 0 deletions docs/C4-Diagrams/README.md
@@ -0,0 +1,66 @@
# MarkItDown CLI - Architecture Documentation

## Overview
This directory contains **C4 model architecture diagrams** for **MarkItDown**, a lightweight Python CLI tool that converts various document formats into clean, structured Markdown optimized for use with **Large Language Models (LLMs)** and text analysis pipelines.

MarkItDown emphasizes **structural fidelity** (headings, lists, tables, links) and **machine readability**, making it ideal for preprocessing documents for AI workflows. While output is often human-readable, the focus is on **semantic accuracy** over visual presentation.

## 📊 C4 Architecture Diagrams

### Level 1: System Context Diagram
![Context Diagram](exports/level-1.png)

Shows how **users** interact with the **MarkItDown CLI** to convert files into Markdown.

### Level 2: Container Diagram
![Container Diagram](exports/level-2.png)

Highlights the high-level components:
- **Main Module** (`__main__.py`) – CLI entry point and argument handling
- **MarkItDown Core** (`_markitdown.py`) – Orchestrates conversion logic
- **Document Converters** – Format-specific modules (PDF, DOCX, HTML, etc.)
- **Utilities** – Stream handling, math preprocessing, exception management

All components work together to support **stream-based**, **plugin-extensible** conversion.

### Level 3: Component Diagram
![Component Diagram](exports/level-3.png)

Detailed view of internal structure:
- **Conversion Orchestrator**: Manages the workflow (input → detection → conversion → output)
- **Base Converter Interface**: Defines `accepts()` and `convert()` contract
- **Format Converters**: Built-in and pluggable converters per file type
- **StreamInfo**: Metadata (mimetype, extension, encoding) for smart routing
- **Error Handling**: Structured exceptions with retry context
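
The contract and routing described above can be pictured with a minimal, self-contained sketch. It does not import MarkItDown: only the names `DocumentConverter`, `StreamInfo`, `accepts()`, `convert()`, and `register_converter()` appear in this README, while the method signatures, the `Orchestrator` class, the two toy converters, and the newest-first registration order are assumptions made purely for illustration.

```python
from __future__ import annotations

import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass(frozen=True)
class StreamInfo:
    """Routing metadata (hypothetical shape; fields follow the list above)."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    encoding: Optional[str] = None


class DocumentConverter(ABC):
    """Assumed contract: accepts() decides, convert() produces Markdown."""

    @abstractmethod
    def accepts(self, stream: BinaryIO, info: StreamInfo) -> bool: ...

    @abstractmethod
    def convert(self, stream: BinaryIO, info: StreamInfo) -> str: ...


class PlainTextConverter(DocumentConverter):
    def accepts(self, stream, info):
        return info.extension == ".txt" or info.mimetype == "text/plain"

    def convert(self, stream, info):
        return stream.read().decode(info.encoding or "utf-8")


class CsvConverter(DocumentConverter):
    def accepts(self, stream, info):
        return info.extension == ".csv" or info.mimetype == "text/csv"

    def convert(self, stream, info):
        text = stream.read().decode(info.encoding or "utf-8")
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            return ""
        # First row becomes the Markdown table header.
        lines = ["| " + " | ".join(rows[0]) + " |",
                 "| " + " | ".join("---" for _ in rows[0]) + " |"]
        lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
        return "\n".join(lines)


class Orchestrator:
    """Hypothetical stand-in for the conversion orchestrator."""

    def __init__(self) -> None:
        self._converters: list[DocumentConverter] = []

    def register_converter(self, converter: DocumentConverter) -> None:
        # Assumption: newest first, so a plugin can override a built-in.
        self._converters.insert(0, converter)

    def convert_stream(self, stream: BinaryIO, info: StreamInfo) -> str:
        # The first converter whose accepts() returns True handles the stream.
        for converter in self._converters:
            if converter.accepts(stream, info):
                return converter.convert(stream, info)
        raise ValueError(f"no converter accepts {info}")
```

With both converters registered, calling `convert_stream(io.BytesIO(b"a,b\n1,2\n"), StreamInfo(extension=".csv"))` on an `Orchestrator` would route the stream to `CsvConverter` and return a three-line Markdown table.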

Key Features:
- Plugin-based extensibility via `register_converter()`
- Streaming support for large files
- Automatic format detection
- Rich metadata and error tracking
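
To make "automatic format detection" more concrete, the sketch below guesses a MIME type and extension from the file name and falls back to sniffing the first bytes of the stream. It uses only the Python standard library and is not MarkItDown's actual detection logic; the `detect_format` function and the `_SIGNATURES` table are hypothetical names introduced here.

```python
import mimetypes
import os
from typing import Optional, Tuple

# Hypothetical signature table with two well-known magic numbers.
# DOCX, XLSX, PPTX, and EPUB are ZIP containers, so the ZIP signature
# alone cannot tell them apart; that is why the file name is tried first.
_SIGNATURES = {
    b"%PDF-": ("application/pdf", ".pdf"),
    b"PK\x03\x04": ("application/zip", ".zip"),
}


def detect_format(filename: Optional[str],
                  head: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (mimetype, extension), or (None, None) if nothing matches."""
    if filename:
        mimetype, _ = mimetypes.guess_type(filename)
        if mimetype is not None:
            return mimetype, os.path.splitext(filename)[1].lower()
    # No usable name: fall back to the leading bytes of the stream.
    for magic, result in _SIGNATURES.items():
        if head.startswith(magic):
            return result
    return None, None
```

The returned pair could then populate the `mimetype` and `extension` fields of a `StreamInfo`-like object before converters are asked whether they accept the stream.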

### Level 4: Code-Level Diagram
![Code-Level Diagram](exports/level-4.png)

Shows low-level design and relationships:
- **CLI Flow**: Argument parsing → output handling → error exit
- **Converter Inheritance**: All converters implement `DocumentConverter`
- **Exception Hierarchy**: `MarkItDownException` base with typed subclasses
- **Converter Registry**: Central registry of available converters by format
- **Math Processing**: DOCX equation (OMML) → LaTeX → Markdown

This level supports maintainers and contributors in understanding control flow and extension points.
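
The CLI flow and exception hierarchy listed above might look roughly like the following sketch. Only the `MarkItDownException` base class name comes from this README; the two subclasses, the `convert_path` helper, the `markitdown-sketch` program name, and the `-o/--output` option as written here are assumptions for illustration and may not match the real implementation.

```python
import argparse
import sys
from typing import List, Optional


class MarkItDownException(Exception):
    """Base class named in the list above."""


class UnsupportedFormatError(MarkItDownException):
    """Hypothetical subclass: no converter accepts the input."""


class ConversionFailedError(MarkItDownException):
    """Hypothetical subclass: a converter accepted the input but failed."""


def convert_path(path: str) -> str:
    """Toy converter that only handles .txt files (illustration only)."""
    if not path.lower().endswith(".txt"):
        raise UnsupportedFormatError(f"no converter for {path!r}")
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        raise ConversionFailedError(str(exc)) from exc


def main(argv: Optional[List[str]] = None) -> int:
    # Step 1: argument parsing.
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename")
    parser.add_argument("-o", "--output", help="write Markdown to this file")
    args = parser.parse_args(argv)

    # Step 2: conversion, with the typed base class caught in one place.
    try:
        markdown = convert_path(args.filename)
    except MarkItDownException as exc:
        # Step 3b: error exit with a non-zero status.
        print(f"error: {exc}", file=sys.stderr)
        return 1

    # Step 3a: output handling (file or stdout).
    if args.output:
        with open(args.output, "w", encoding="utf-8") as handle:
            handle.write(markdown)
    else:
        sys.stdout.write(markdown)
    return 0
```

A script entry point would typically end with `sys.exit(main())`, so that the integer returned by `main` becomes the process exit code.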

## 🗂️ Directory Structure
```
C4-Diagrams
|-src (Editable source files)
|-exports (Generated PNGs)
```

## 📌 Purpose
These diagrams improve:
- Onboarding new contributors
- Long-term maintainability
- Design discussions and refactoring
- Understanding of the conversion pipeline and extensibility model
Binary file added docs/C4-Diagrams/exports/level-1.png
Binary file added docs/C4-Diagrams/exports/level-2.png
Binary file added docs/C4-Diagrams/exports/level-3.png
Binary file added docs/C4-Diagrams/exports/level-4.png
43 changes: 43 additions & 0 deletions docs/C4-Diagrams/src/level-1.drawio
@@ -0,0 +1,43 @@
<mxfile host="65bd71144e">
<diagram id="PkXtOsccsb5lSiSgQKMI" name="Page-1">
<mxGraphModel dx="292" dy="235" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0"/>
<mxCell id="1" parent="0"/>
<mxCell id="6" value="" style="shape=actor;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="80" y="170" width="40" height="60" as="geometry"/>
</mxCell>
<mxCell id="9" value="User" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="70" y="230" width="60" height="30" as="geometry"/>
</mxCell>
<mxCell id="10" value="" style="endArrow=classic;html=1;exitX=1;exitY=0.75;exitDx=0;exitDy=0;entryX=0.005;entryY=0.551;entryDx=0;entryDy=0;entryPerimeter=0;" parent="1" source="6" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="360" y="280" as="sourcePoint"/>
<mxPoint x="279.5999999999999" y="216.12" as="targetPoint"/>
</mxGeometry>
</mxCell>
<mxCell id="13" value="" style="edgeStyle=none;html=1;" parent="1" source="11" target="12" edge="1">
<mxGeometry relative="1" as="geometry"/>
</mxCell>
<mxCell id="11" value="&lt;b&gt;&lt;font style=&quot;font-size: 14px;&quot;&gt;&lt;span style=&quot;color: rgb(0, 0, 0);&quot;&gt;MarkItDown CLI&lt;/span&gt;&lt;/font&gt;&lt;/b&gt;&lt;div&gt;&lt;font style=&quot;color: rgb(24, 20, 29);&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;div&gt;&lt;span style=&quot;letter-spacing: 0.32px; text-align: start; white-space-collapse: preserve-breaks;&quot;&gt;&lt;font face=&quot;Helvetica&quot; style=&quot;color: rgb(24, 20, 29); font-size: 14px;&quot;&gt;MarkItDown converts files to Markdown for LLMs, focusing on structural fidelity. Like textract, but optimized for machine-readability over human presentation.&lt;/font&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;" style="rounded=1;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="280" y="137.5" width="290" height="165" as="geometry"/>
</mxCell>
<mxCell id="12" value="&lt;b&gt;Markdown Document&lt;/b&gt;" style="shape=document;whiteSpace=wrap;html=1;boundedLbl=1;" parent="1" vertex="1">
<mxGeometry x="640" y="180" width="120" height="80" as="geometry"/>
</mxCell>
<mxCell id="16" value="A file will be provided by the user." style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="185" width="140" height="30" as="geometry"/>
</mxCell>
<mxCell id="20" value="&lt;b style=&quot;text-align: left;&quot;&gt;Supported Documents&lt;br&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Audio&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;CSV&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Doc&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Docx&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Epub&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Html&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Image&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Ipynb&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;ZIP&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry y="270" width="155" height="250" as="geometry"/>
</mxCell>
<mxCell id="23" value="&lt;ul style=&quot;color: rgb(63, 63, 63); font-weight: 700; text-align: left;&quot;&gt;&lt;li&gt;&lt;b&gt;Outlook-Msg&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;PDF&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Plain Text&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;PPTX&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;RSS&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Transcribe Audio&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Wikipedia&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;XLSX&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Youtube&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="90" y="300" width="150" height="210" as="geometry"/>
</mxCell>
<mxCell id="25" value="" style="rounded=1;whiteSpace=wrap;html=1;fillColor=none;" parent="1" vertex="1">
<mxGeometry y="270" width="250" height="240" as="geometry"/>
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>