LLM-accessible output #1301

@jrturton

Feature Name

Create additional outputs from the conversion process to enable LLMs to consume published documentation

Description

We propose to add the capability to DocC to generate Markdown-formatted versions of pages, to be stored alongside the existing JSON versions in the DocC archive. These files can be generated during the convert process by including a flag:

--enable-experimental-markdown-output

This will allow downstream workflows such as serving the Markdown files at specific URLs.

A new boolean RenderMetadata key, hasGeneratedMarkdown, will be set to true if a Markdown version of a given page was generated.
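
As a rough illustration of how a downstream tool might use this key, the sketch below walks an archive's render JSON and collects the pages whose metadata reports a Markdown version. It assumes the key is serialized under the top-level `metadata` object of each render JSON file and that those files live under `data/` in the archive. The function name `pages_with_markdown` is hypothetical and not part of DocC.

```python
import json
from pathlib import Path


def pages_with_markdown(archive_root: Path) -> list[Path]:
    """Return render JSON files whose metadata says Markdown was generated.

    Assumptions (not confirmed by the proposal): render JSON lives under
    <archive>/data, and the key appears as metadata.hasGeneratedMarkdown.
    """
    matches = []
    for json_path in sorted((archive_root / "data").rglob("*.json")):
        try:
            node = json.loads(json_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        metadata = node.get("metadata") if isinstance(node, dict) else None
        # A missing key is treated as "no Markdown generated".
        if isinstance(metadata, dict) and metadata.get("hasGeneratedMarkdown") is True:
            matches.append(json_path)
    return matches
```

Where the Markdown file for each page is stored is not specified here, so the sketch stops at identifying the pages rather than resolving Markdown paths.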

A manifest containing paths to all of the generated Markdown files can also be generated at the root of the archive, to aid downstream processing. This will be controlled by a separate flag, which will be ignored if the flag above is not set:

--enable-experimental-markdown-output-manifest
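
To make the relationship between the two flags concrete, here is a minimal sketch of a build script helper that assembles a `docc convert` invocation. The helper name `docc_convert_args` and its parameters are hypothetical, and how `docc` is launched (directly, through `xcrun`, or through a package plugin) is an assumption that will vary by setup.

```python
from pathlib import Path


def docc_convert_args(catalog: Path, output: Path,
                      markdown: bool = False, manifest: bool = False) -> list[str]:
    """Build an argument list for `docc convert` using the proposed flags."""
    args = ["docc", "convert", str(catalog), "--output-path", str(output)]
    if markdown:
        args.append("--enable-experimental-markdown-output")
        # The manifest flag is ignored without the Markdown flag,
        # so it is only added when Markdown output is enabled.
        if manifest:
            args.append("--enable-experimental-markdown-output-manifest")
    return args
```

The resulting list could be passed to `subprocess.run(args, check=True)`. Dropping the manifest flag when Markdown output is disabled mirrors the behaviour described above and keeps the command line free of options that would have no effect.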

Presentation and delivery of the page Markdown and manifest files is at the discretion of the documentation host. This proposal is only concerned with the generation of these files from documentation source.

Motivation

Documentation created using DocC is not stored on disk in a format that humans or LLMs can easily read. When a documentation archive is created, the pages are stored as JSON files that rely on a JavaScript application to render them. As a result, the content served at a given documentation URL is a small HTML page that only instructs the user to turn on JavaScript, rather than the page itself.

This makes it difficult to use documentation created with DocC for any downstream processing other than the JavaScript rendering that is currently used. For example, creating training data or allowing an LLM to read a documentation page is not straightforward.

Importance

It is possible to create these outputs from the render JSON itself, but this is a fragile process and introduces a dependency on the render JSON format. The proposed changes would allow DocC users to generate Markdown versions of their documentation directly as part of their standard workflows.

Alternatives Considered

Creating Markdown from the serialized RenderNode files was considered but rejected as it adds a dependency on the render JSON format.

Static HTML output was considered but rejected as Markdown is simpler to create and more suitable for consumption by LLMs. Models consuming Markdown instead of HTML can understand and use the same information with a smaller token budget and less cognitive effort. While static HTML output may be a useful development for DocC overall, the requirements for such a format (layout, which information is rendered, formatting) are less well suited to ingestion by LLMs and may involve conflicting priorities.
