| 1 | +--- |
| 2 | +title: PDF to Markdown API | Adobe PDF Services |
| 3 | +description: Learn about the PDF to Markdown API service that converts PDF documents into well-formatted Markdown text. |
| 4 | +--- |
| 5 | + |
| 6 | +# PDF to Markdown API |
| 7 | + |
| 8 | +The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents, whether native or scanned, into well-formatted Markdown text. The service preserves the document's structure and formatting while producing output in a format widely used for LLM workflows, content authoring, and documentation. |
| 9 | + |
| 10 | +## Structured Information Output Format |
| 11 | + |
| 12 | +The output of a PDF to Markdown operation includes: |
| 13 | + |
| 14 | +- A primary `.md` file containing the converted Markdown content |
| 15 | + |
| 16 | +### Output Structure |
| 17 | + |
| 18 | +The following is a summary of key elements in the converted Markdown: |
| 19 | + |
| 20 | +#### Elements |
| 21 | + |
| 22 | +The output is an ordered sequence of semantic elements converted from the PDF document, preserving its natural reading order and structure. The conversion handles: |
| 23 | + |
| 24 | +- Text content with proper Markdown syntax |
| 25 | +- Document hierarchy and structure |
| 26 | +- Inline formatting and emphasis |
| 27 | +- Links and references |
| 28 | +- Images and figures |
| 29 | +- Tables and complex layouts |
| 30 | + |
| 31 | +#### Content Types |
| 32 | + |
| 33 | +The API processes various content types as follows: |
| 34 | + |
| 35 | +##### Text Elements |
| 36 | + |
| 37 | +- **Headings**: Converted to appropriate Markdown heading levels (H1-H6) |
| 38 | +- **Paragraphs**: Preserved with proper spacing and formatting |
| 39 | +- **Lists**: Both ordered and unordered lists with proper nesting |
| 40 | +- **Text Emphasis**: Bold, italic, and other text formatting |
| 41 | +- **Links**: Preserved with proper Markdown link syntax |
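Because headings arrive as Markdown heading levels, the converted text can be split into sections before further processing, for example ahead of LLM ingestion. The following is a minimal sketch, not part of the API: it assumes the output uses ATX-style headings (`#` through `######`), and the function name `split_by_heading` is hypothetical.

```python
import re

# Assumption: headings in the output are ATX-style ("# Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # code fence marker, built indirectly to keep this example readable


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into sections, one per heading."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}
    in_fence = False

    def has_content(section: dict) -> bool:
        return bool(section["title"]) or any(l.strip() for l in section["lines"])

    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never treated as headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        sections.append(current)

    return [
        {"level": s["level"], "title": s["title"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Text that appears before the first heading is returned as a section with level `0` and an empty title.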
| 42 | + |
| 43 | +##### Images and Figures |
| 44 | + |
| 45 | +- Provided as base64-embedded images in the Markdown output |
| 46 | +- Referenced correctly in the Markdown output |
| 47 | +- Original quality preserved |
| 48 | +- Proper alt text and captions maintained |
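Since images are embedded as base64 in the Markdown, a consumer may want to pull them out into separate binary payloads. The sketch below assumes the images appear as standard Markdown image links with a `data:image/...;base64,` URI; the exact embedding syntax is not specified here, so treat the pattern and the name `extract_embedded_images` as illustrative assumptions.

```python
import base64
import re

# Assumed shape: ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[A-Za-z0-9.+-]+);base64,"
    r"(?P<payload>[A-Za-z0-9+/=\s]+?)\)"
)


def extract_embedded_images(markdown: str) -> list[dict]:
    """Return alt text, image subtype, and decoded bytes for each embedded image."""
    images = []
    for match in DATA_URI_IMAGE.finditer(markdown):
        # Remove any whitespace that may have been wrapped into the payload.
        payload = "".join(match.group("payload").split())
        images.append(
            {
                "alt": match.group("alt"),
                "subtype": match.group("subtype"),
                "data": base64.b64decode(payload),
            }
        )
    return images
```

The decoded bytes can then be written to files or passed to an image library, and the data URI in the Markdown replaced with a file reference if a smaller text payload is preferred.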
| 49 | + |
| 50 | +##### Tables |
| 51 | + |
| 52 | +- Converted to Markdown table syntax |
| 53 | +- Column alignment preserved |
| 54 | +- Cell content formatting maintained |
| 55 | +- Complex table structures supported |
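To consume converted tables programmatically, the Markdown table rows can be parsed back into records. This is a minimal sketch assuming GitHub-style pipe tables, with a header row followed by a delimiter row such as `| :--- | ---: |`; the helper name `parse_pipe_table` is hypothetical, and escaped pipes inside cells are not handled.

```python
def parse_pipe_table(lines: list[str]) -> dict:
    """Parse a pipe table given as [header row, delimiter row, *body rows]."""

    def cells(row: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in row.strip().strip("|").split("|")]

    header = cells(lines[0])

    # Colons in the delimiter row encode column alignment.
    align = []
    for spec in cells(lines[1]):
        left, right = spec.startswith(":"), spec.endswith(":")
        if left and right:
            align.append("center")
        elif right:
            align.append("right")
        elif left:
            align.append("left")
        else:
            align.append(None)

    rows = [dict(zip(header, cells(row))) for row in lines[2:]]
    return {"header": header, "align": align, "rows": rows}
```

Each body row becomes a dictionary keyed by the header cells, which is convenient for loading into a data frame or a database.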
| 56 | + |
| 57 | +#### Element Types and Paths |
| 58 | + |
| 59 | +The API recognizes and converts the following structural elements: |
| 60 | + |
| 61 | +| Category | Element Type | Description | |
| 62 | +| --------- | ----------------- | --------------------------------------------------------- | |
| 63 | +| Aside | Aside | Content which is not part of regular content flow | |
| 64 | +| Figure | Figure | Non-reflowable constructs like graphs, images, flowcharts | |
| 65 | +| Footnote | Footnote | Footnote | |
| 66 | +| Headings | H, H1, H2, etc | Heading levels | |
| 67 | +| List | L, Li, Lbl, Lbody | List and list item elements | |
| 68 | +| Paragraph | P, ParagraphSpan | Paragraphs and paragraph segments | |
| 69 | +| Reference | Reference | Links | |
| 70 | +| Section | Sect | Logical section of the document | |
| 71 | +| StyleSpan | StyleSpan | Styling variations within text | |
| 72 | +| Table | Table, TD, TH, TR | Table elements | |
| 73 | +| Title | Title | Document title | |
| 74 | + |
| 75 | +### Reading Order |
| 76 | + |
| 77 | +The reading order of the output Markdown accounts for: |
| 78 | + |
| 79 | +- Natural document flow |
| 80 | +- Proper content hierarchy |
| 81 | +- Column-based layouts |
| 82 | +- Page transitions |
| 83 | +- Inline elements and references |
| 84 | + |
| 85 | +## Use Cases |
| 86 | + |
| 87 | +The PDF to Markdown API is particularly valuable for: |
| 88 | + |
| 89 | +- LLM-friendly content ingestion and prompt creation |
| 90 | +- Training or fine-tuning LLMs on PDF content |
| 91 | +- Content migration from PDF to documentation platforms |
| 92 | +- Legacy document conversion |
| 93 | +- Content repurposing for modern documentation systems |
| 94 | +- Integration with Markdown-based workflows |
| 95 | +- Automated document processing pipelines |
| 96 | +- Searchable internal knowledge repositories |
| 97 | + |
| 98 | +## API Limitations |
| 99 | + |
| 100 | +### File Constraints |
| 101 | + |
| 102 | +- **File Size**: Maximum of 100 MB per file |
| 103 | +- **Page Count**: |
| 104 | + - Non-scanned PDFs: Up to 400 pages |
| 105 | + - Scanned PDFs: Up to 150 pages |
| 106 | +- **Page Dimensions**: Between 6" and 17.5" in either dimension |
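The file constraints above can be checked on the client before uploading, which avoids spending a request on a file that would be rejected. The sketch below is one possible pre-flight check, not an official validator: `PdfInfo` is a hypothetical record you would fill using a PDF library of your choice, "100 MB" is read as 100 MiB, and the page-dimension rule is interpreted as applying to both sides of every page.

```python
from dataclasses import dataclass

# Limits copied from the File Constraints list.
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB
MAX_PAGES_NATIVE = 400
MAX_PAGES_SCANNED = 150
MIN_SIDE_IN = 6.0
MAX_SIDE_IN = 17.5


@dataclass
class PdfInfo:
    """Hypothetical metadata record; populate it with any PDF library."""

    size_bytes: int
    page_count: int
    is_scanned: bool
    page_sizes_in: list[tuple[float, float]]  # (width, height) in inches


def check_limits(info: PdfInfo) -> list[str]:
    """Return human-readable problems; an empty list means no limit was hit."""
    problems = []
    if info.size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 100 MB")

    max_pages = MAX_PAGES_SCANNED if info.is_scanned else MAX_PAGES_NATIVE
    if info.page_count > max_pages:
        problems.append(f"page count {info.page_count} exceeds {max_pages}")

    for number, (width, height) in enumerate(info.page_sizes_in, start=1):
        # Assumption: both sides of every page must fall inside the range.
        if not all(MIN_SIDE_IN <= side <= MAX_SIDE_IN for side in (width, height)):
            problems.append(
                f"page {number} is {width} x {height} in, outside 6 to 17.5 in"
            )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.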
| 107 | + |
| 108 | +### Processing Limits |
| 109 | + |
| 110 | +- **Rate Limits**: Maximum 25 requests per minute |
| 111 | +- **Language Support**: Optimized for English; other Latin-based languages are also supported |
| 112 | +- **OCR Quality**: Dependent on scan quality (minimum 200 DPI recommended) |
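A client that submits many documents can stay under the 25 requests per minute limit by throttling itself. The following sketch uses a sliding 60-second window; whether the service counts requests over a sliding or a fixed window is not stated here, so this is a conservative assumption, and the class name `SlidingWindowLimiter` is hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Block until a call is allowed under max_calls per period (seconds)."""

    def __init__(self, max_calls=25, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._calls = deque()  # timestamps of calls inside the current window

    def _drop_expired(self, now: float) -> None:
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()

    def acquire(self) -> None:
        now = self._clock()
        self._drop_expired(now)
        if len(self._calls) >= self._max_calls:
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self._period - (now - self._calls[0]))
            now = self._clock()
            self._drop_expired(now)
        self._calls.append(now)
```

Calling `limiter.acquire()` immediately before each request is enough for a single-threaded client; a multi-threaded client would also need a lock around `acquire`.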
| 113 | + |
| 114 | +### Document Requirements |
| 115 | + |
| 116 | +- Files must be unprotected or allow content copying |
| 117 | +- No support for: |
| 118 | + - Hidden objects (JavaScript, OCG) |
| 119 | + - XFA and fillable forms |
| 120 | + - Complex annotations |
| 121 | + - CAD drawings or vector art |
| 122 | + - Password-protected content |
| 123 | + |
| 124 | +## REST API |
| 125 | + |
| 126 | +See our public API Reference for [PDF to Markdown API](../../../apis/#tag/PDF-to-Markdown). |