Output Formats

Lectito produces every output format in a single extraction pass.

The formats come from the same cleaned article root. That means callers can store HTML for fidelity, use Markdown for display or editing, and use plain text for search without running extraction multiple times.

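// `?` propagates extraction errors; `unwrap` assumes an article was found.
let article = extract(html, base_url, &ReadabilityOptions::default())?.unwrap();

let html = article.content;        // cleaned article HTML
let markdown = article.markdown;   // Markdown generated from the cleaned HTML
let text = article.text_content;   // normalized plain text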
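As a rough illustration of storing each format for a different purpose, the sketch below maps the three fields onto a storage record. `Article` here is a hypothetical stand-in that models only the fields named in this section, and `StoredArticle` and `to_record` are invented for the example; none of them are part of Lectito's API.

```rust
/// Hypothetical stand-in for the extracted article. Only the three fields
/// named in this section are modeled; the real type may carry more.
#[derive(Debug, Clone)]
pub struct Article {
    pub content: String,
    pub markdown: String,
    pub text_content: String,
}

/// Hypothetical storage record: one slot per downstream use.
#[derive(Debug, PartialEq)]
pub struct StoredArticle {
    pub archive_html: String,
    pub display_markdown: String,
    pub search_text: String,
}

/// Moves each format into the slot it suits best, without re-extracting.
pub fn to_record(article: Article) -> StoredArticle {
    StoredArticle {
        archive_html: article.content,
        display_markdown: article.markdown,
        // Lowercasing is an assumption about how the search index normalizes text.
        search_text: article.text_content.to_lowercase(),
    }
}
```

Because all three strings come from one extraction result, the record can be built by moving the fields rather than cloning them.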

HTML

content is cleaned article HTML. Scripts, styles, navigation, sidebars, and other page chrome are removed where possible. Relative URLs are resolved when a base URL is provided.

Use HTML when you need the closest representation of the extracted article. It keeps images, links, tables, inline markup, and other structure that can be lost in plain text.

Markdown

markdown is generated from the cleaned article HTML. It preserves common reader content:

headings
paragraphs
links and images
lists
blockquotes
code blocks
tables
math
footnotes

Markdown cleanup also strips zero-width break hints, drops empty links, keeps images intact, and removes duplicate title headings before rendering.
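The cleanup steps above can be approximated on a finished Markdown string, as in the sketch below. This is not Lectito's implementation: the function names are hypothetical, the choice of U+200B and U+00AD as break hints is an assumption, and the library applies its cleanup before rendering rather than after.

```rust
/// Removes zero-width break hints. Treating U+200B (zero-width space) and
/// U+00AD (soft hyphen) as the hints is an assumption of this sketch.
pub fn strip_break_hints(input: &str) -> String {
    input
        .chars()
        .filter(|c| !matches!(c, '\u{200B}' | '\u{00AD}'))
        .collect()
}

/// Drops links with empty text, such as `[](https://example.com)`,
/// while leaving images such as `![](pic.png)` untouched.
pub fn drop_empty_links(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find("[](") {
        let is_image = rest[..pos].ends_with('!');
        let after = &rest[pos + 3..];
        match after.find(')') {
            Some(end) if !is_image => {
                // Keep the text before the link and skip past its closing paren.
                out.push_str(&rest[..pos]);
                rest = &after[end + 1..];
            }
            _ => {
                // Image or unterminated link: copy through the marker unchanged.
                out.push_str(&rest[..pos + 3]);
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Removes a leading `# ` heading when it repeats the article title.
pub fn drop_duplicate_title(markdown: &str, title: &str) -> String {
    let trimmed = markdown.trim_start();
    if let Some((first, rest)) = trimmed.split_once('\n') {
        if let Some(heading) = first.strip_prefix("# ") {
            if heading.trim().eq_ignore_ascii_case(title.trim()) {
                return rest.trim_start().to_string();
            }
        }
    }
    markdown.to_string()
}
```

The image check matters because `![](pic.png)` contains the same `[](` marker as an empty link; only the preceding `!` tells them apart.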

The CLI Markdown output includes TOML frontmatter:

lectito article.html
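A consumer of the CLI output may want to separate the frontmatter from the Markdown body. The sketch below assumes the frontmatter is fenced by `+++` lines, the delimiter commonly used for TOML frontmatter. This section does not confirm the delimiter, and `split_frontmatter` is a hypothetical helper, not part of Lectito.

```rust
/// Splits a document into (frontmatter, body).
/// Assumption: the frontmatter is fenced by `+++` lines, which is the usual
/// convention for TOML frontmatter but is not confirmed by this section.
pub fn split_frontmatter(doc: &str) -> Option<(&str, &str)> {
    // The document must open with the fence on its own line.
    let rest = doc.strip_prefix("+++\n")?;
    // Find the closing fence at the start of a later line.
    let end = rest.find("\n+++")?;
    let frontmatter = &rest[..end];
    // Skip the closing fence (newline plus three plus signs) and blank lines.
    let body = rest[end + 4..].trim_start_matches('\n');
    Some((frontmatter, body))
}
```

The function returns `None` when no fenced frontmatter is found, so callers can fall back to treating the whole input as Markdown.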

Markdown is useful when the next step is a reader view, note-taking system, static archive, or editor. It is also easier to diff in tests than HTML.

Plain Text

text_content is normalized article text. Use it for indexing, previews, and readability checks.

Plain text should not be treated as a rendering format. It discards links, images, and most document structure.
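For the preview and indexing uses mentioned above, a minimal sketch using only the standard library might look like this. `preview` and `index_terms` are hypothetical helpers, and the tokenization rule (lowercase alphanumeric runs) is an assumption rather than anything Lectito prescribes.

```rust
/// Keeps at most `max_chars` characters of `text`, backs up to the last
/// whitespace boundary, and appends "..." when text was dropped.
pub fn preview(text: &str, max_chars: usize) -> String {
    if text.chars().count() <= max_chars {
        return text.to_string();
    }
    // Byte offset just past the first `max_chars` characters (UTF-8 safe).
    let cut = text
        .char_indices()
        .nth(max_chars)
        .map(|(i, _)| i)
        .unwrap_or(text.len());
    let head = &text[..cut];
    // Prefer ending on a whole word; fall back to the hard cut.
    let head = match head.rfind(char::is_whitespace) {
        Some(i) if i > 0 => &head[..i],
        _ => head,
    };
    format!("{}...", head.trim_end())
}

/// Splits text into lowercase alphanumeric tokens for a simple search index.
pub fn index_terms(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}
```

Both helpers operate on `text_content` alone, which is why links and images never need to be stripped at this stage.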

JSON

The CLI can serialize the article:

lectito article.html --format json --pretty

JSON is the best CLI format when another program needs metadata and content together.

Quality Expectations