Lectito

Lectito is a Rust library and CLI tool for extracting readable article content from HTML.

Most web pages contain far more than the text a reader came for: ads, navigation, related links, comment areas, tracking markup, hidden elements, and presentation wrappers. Lectito tries to identify the main content root and return a smaller document that is useful for reading, storage, search, and conversion.

It returns:

  • cleaned article HTML
  • Markdown
  • plain text
  • page metadata
  • extraction diagnostics

Lectito is parser-first. The core API accepts HTML and an optional base URL. URL fetching exists in the CLI for convenience, but the library does not require network access.

This keeps the library usable in environments that already have HTML available: crawlers, browser extensions, desktop apps, mobile apps, tests, and offline archives.

Main APIs

use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;
    let article = extract(html, Some("https://example.com/post"), &ReadabilityOptions::default())?;

    // `extract` yields `None` when the page has no useful article content.
    if let Some(article) = article {
        println!("{}", article.markdown);
    }
    Ok(())
}

Use extract_with_diagnostics when tuning extraction or debugging a bad page.

Use is_probably_readable before extraction when you only need a quick yes/no answer.

Installation

Lectito is split into a core library and a CLI. Use the library when your application already has HTML. Use the CLI for local inspection, shell scripts, and quick conversions.

Library

Add lectito to your Rust project:

[dependencies]
lectito = "0.1"

For local development against this workspace:

[dependencies]
lectito = { path = "crates/core" }

The Rust crate name is lectito.

The core crate has no runtime service requirement. It parses the string you pass in and returns an article result.

CLI

Install the CLI from crates.io:

cargo install lectito-cli

For local development against this workspace:

cargo install --path crates/cli

The binary is named lectito.

lectito --help

The CLI can read from a file, stdin, or a URL. URL support is a command-line convenience; it is not part of the core library contract.

Fixture helpers are workspace-only and are not part of the published CLI package.

For local fixture inspection, run the unpublished workspace helper:

cargo run -p lectito-fixtures --bin lectito-fixture -- sample-name

License

Lectito is licensed under MPL-2.0.

Quick Start

Extract From HTML

Start with extract for normal use. It takes the source HTML, an optional base URL, and ReadabilityOptions. The base URL lets Lectito resolve relative links, images, and metadata URLs in the extracted output.

use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"
        <html>
          <head><title>Example</title></head>
          <body>
            <article>
              <h1>Example</h1>
              <p>This is the article body.</p>
            </article>
          </body>
        </html>
    "#;

    let article = extract(html, Some("https://example.com/article"), &ReadabilityOptions::default())?;

    if let Some(article) = article {
        println!("{:?}", article.title);
        println!("{}", article.markdown);
    }

    Ok(())
}

extract returns Ok(None) when no useful article content is found. That is different from an error. An empty or navigation-only page can be parsed successfully and still have no article.
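The three-way result can be handled with a single match. The sketch below uses stand-in types and a hypothetical `extract_stub` function so that it compiles without the lectito crate; only the shape `Result<Option<Article>, Error>` is taken from the text above, and the stub's logic is purely illustrative.

```rust
// Stand-in types; the real `Article` and `Error` live in the lectito crate.
#[derive(Debug, PartialEq)]
struct Article {
    markdown: String,
}

#[derive(Debug, PartialEq)]
enum ExtractError {
    InvalidBaseUrl,
}

// Hypothetical stand-in for `lectito::extract` with the same result shape.
fn extract_stub(html: &str, base_url: Option<&str>) -> Result<Option<Article>, ExtractError> {
    if let Some(url) = base_url {
        if !url.contains("://") {
            return Err(ExtractError::InvalidBaseUrl);
        }
    }
    if html.contains("<p>") {
        Ok(Some(Article { markdown: "Article text.".to_string() }))
    } else {
        // Parsed fine, but nothing article-like was found.
        Ok(None)
    }
}

// Keep "no article" and "error" as separate cases.
fn describe(result: Result<Option<Article>, ExtractError>) -> String {
    match result {
        Ok(Some(article)) => format!("article: {}", article.markdown),
        Ok(None) => "no article".to_string(),
        Err(err) => format!("error: {:?}", err),
    }
}
```

Treating `Ok(None)` as a normal outcome keeps navigation-only pages out of error logs and retry queues.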

Check Readability

Use is_probably_readable when you only need to decide whether a page is worth running through full extraction. It is faster than full extraction and returns only a boolean.

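use lectito::{is_probably_readable, ReadableOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;

    // Cheap pre-check; run full extraction only when this returns true.
    let readable = is_probably_readable(html, &ReadableOptions::default())?;
    println!("readable: {readable}");
    Ok(())
}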

CLI

The CLI mirrors the library. The root command extracts content, and readable performs the quick readability check.

lectito article.html
lectito https://example.com/article --format json --pretty
lectito readable article.html

CLI Usage

The CLI is designed for inspecting extraction behavior and converting documents from the terminal.

The root command extracts article content. The CLI also has these subcommands:

  • readable: check whether a document looks readable
  • inspect: print extraction metadata and scoring details
  • llms: fetch, parse, and expand llms.txt files

Extract

Pass a URL, an AT URI, a file path, or - for stdin. Markdown with TOML frontmatter is the default output.

lectito article.html
lectito https://example.com/article
lectito at://did:plc:abc123/site.standard.document/xyz
lectito - < article.html

When a fetched page advertises rel="site.standard.document", the CLI resolves the ATProto record and uses the record content when it can render it. Direct at:// inputs are supported for renderable site.standard.document records. If a normal web URL cannot be resolved through Standard.site, the CLI extracts from the fetched HTML.

Output formats:

Use HTML, text, or JSON when Markdown is not the right output for the next tool.

lectito article.html --format html
lectito article.html --format text
lectito article.html --format json --pretty
lectito article.html --frontmatter=false
lectito article.html --output article.md

Useful options:

The defaults work for most article pages. Tune these flags when a page is too short, too broad, or has a known content container.

lectito article.html --char-threshold 800
lectito article.html --nb-top-candidates 8
lectito article.html --content-selector article
lectito article.html --base-url https://example.com/post --site-profile example.com.toml
lectito article.html --max-elems-to-parse 10000
lectito article.html --media article
lectito article.html --media none
lectito article.html --keep-classes --preserve-class language-rust

--content-selector is the strongest extraction hint. Use it when you know the article root for a page or fixture. Without that flag, the CLI still tries common article-body containers before falling back to generic scoring.

--media accepts none, conservative, article, or all. The default is article, which keeps figures/images that appear to be part of the article body.

--site-profile can be repeated. Each file must be a TOML site profile. User profiles take precedence over bundled profiles for the same host.

--disable-json-ld turns off JSON-LD metadata extraction and the JSON-LD article-body fast path. Use it when structured data is stale or misleading.

Diagnostics are written to stderr after the main output. This keeps stdout usable for the extracted article while still showing debug information in the terminal.

lectito article.html --diagnostic-format pretty
lectito article.html --diagnostic-format json

--inspect prints a compact extraction summary to stderr while keeping article output on stdout:

lectito article.html --inspect

Full extraction has a timeout so unusually large or hostile pages do not hang the command:

lectito article.html --timeout 10

Readable

readable checks whether the document appears to contain enough article-like text. It does not return extracted content.

lectito readable article.html
lectito readable --stdin < article.html
lectito readable https://example.com/article
lectito readable article.html --json --pretty
lectito readable article.html --timeout 10

Thresholds:

lectito readable article.html --min-content-length 140 --min-score 20

Inspect

inspect prints extraction metadata and scoring details without printing the article body.

lectito inspect article.html
lectito inspect https://example.com/article
lectito inspect article.html --json --pretty

llms.txt

Use the llms subcommands when a site publishes an llms.txt file or when you want to bundle its linked resources into one Markdown context file.

lectito llms fetch https://example.com
lectito llms parse https://example.com/llms.txt --pretty
lectito llms expand https://example.com/llms.txt --output llms-full.txt
lectito llms generate https://example.com/docs/ --output llms.txt
lectito llms generate https://example.com/docs/ --output llms.txt --full llms-full.txt
lectito llms generate --sitemap https://example.com/sitemap.xml --output llms.txt
lectito llms generate https://example.com --discover --output llms.txt

Each subcommand has one job:

  • fetch resolves a bare site URL to /llms.txt.
  • parse prints structured JSON.
  • expand reads the linked resources, keeps Markdown resources as-is, and runs HTML resources through Lectito before adding them to the bundle.
  • generate crawls same-origin links from a seed page and writes a new llms.txt index. It uses canonical links for generated entries when pages publish them, includes HTTP Last-Modified or sitemap lastmod values in notes, and ranks accepted pages so likely entry points appear first. Pass --full (or --full-output) to write the expanded Markdown context while generating the index.

Links in the special Optional section are skipped unless you pass --include-optional:

lectito llms expand https://example.com/llms.txt --include-optional

Keep generated files small by limiting crawl depth and page count:

lectito llms generate https://example.com/docs/ --max-depth 1 --max-pages 10
lectito llms generate --sitemap https://example.com/sitemap.xml --max-pages 50

Filter generated entries and add a delay between page fetches:

lectito llms generate --sitemap https://example.com/sitemap.xml \
  --filter /docs/ \
  --filter '!/docs/archive/' \
  --filter '!*/drafts/*' \
  --delay 250

Remote generation checks robots.txt before fetching page URLs. It evaluates rules as Lectito by default:

lectito llms generate https://example.com/docs/ --robots-agent Lectito
lectito llms generate https://example.com/docs/ --ignore-robots

See the llms.txt guide for the expected file shape and the tradeoffs.

Exit Codes

  • 0: article extracted, or readability check returned true
  • 1: no article was extracted, or readability check returned false
  • 2: input, file, or network error
  • 3: extraction, readability, configuration, or timeout error
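A script or service that shells out to the CLI can branch on these codes. The sketch below is one way to model them in Rust; the `CliOutcome` enum, `classify_exit_code`, and the retry policy in `should_retry` are illustrative names and choices, not part of Lectito.

```rust
/// Stand-in for the four exit-code classes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CliOutcome {
    /// 0: article extracted, or readability check returned true.
    Success,
    /// 1: no article extracted, or readability check returned false.
    Negative,
    /// 2: input, file, or network error.
    InputError,
    /// 3: extraction, readability, configuration, or timeout error.
    ProcessingError,
}

/// Maps a raw process exit code to an outcome; unknown codes yield `None`.
fn classify_exit_code(code: i32) -> Option<CliOutcome> {
    match code {
        0 => Some(CliOutcome::Success),
        1 => Some(CliOutcome::Negative),
        2 => Some(CliOutcome::InputError),
        3 => Some(CliOutcome::ProcessingError),
        _ => None,
    }
}

/// Assumed policy for this sketch: only input or network failures are retried.
fn should_retry(outcome: CliOutcome) -> bool {
    matches!(outcome, CliOutcome::InputError)
}
```

The raw code can come from `std::process::Command::new("lectito").arg("article.html").status()?.code()`, which returns an `Option<i32>`.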

llms.txt

llms.txt is a Markdown file that gives language models and agent tools a curated entry point for a site. Sites usually publish it at /llms.txt.

Lectito supports the practical parts of the convention:

  • fetching a site's llms.txt
  • parsing its sections and links
  • expanding linked pages into one Markdown context file
  • crawling a bounded set of pages to generate an llms.txt index

It does not treat llms.txt as access control. Use robots.txt, HTTP authorization, and normal server controls for that.

File Shape

A small file looks like this:

# Example Docs

> Documentation for Example's public API.

Use the current API reference when generated examples disagree with older blog
posts.

## Docs

- [Quick start](https://example.com/docs/quick-start.md): First integration
  steps.
- [API reference](https://example.com/docs/api.md): Endpoint and object
  reference.

## Optional

- [Changelog](https://example.com/docs/changelog.md)

Lectito expects:

  • one H1 title
  • an optional blockquote summary
  • optional notes before the first H2
  • H2 sections containing Markdown links

The Optional section has special handling. lectito llms expand skips those links by default so the generated context stays smaller.
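To make the expected shape concrete, here is a minimal line-based parser sketch that uses only the standard library. It is not Lectito's parser: the `LlmsTxt`, `Section`, and `Link` types and both functions are hypothetical, and it assumes every link sits on a line starting with `- [label](url)`, optionally followed by `: note` and wrapped continuation lines.

```rust
#[derive(Debug, Default, PartialEq)]
struct LlmsTxt {
    title: Option<String>,
    summary: Option<String>,
    notes: Vec<String>,
    sections: Vec<Section>,
}

#[derive(Debug, PartialEq)]
struct Section {
    name: String,
    links: Vec<Link>,
}

#[derive(Debug, PartialEq)]
struct Link {
    label: String,
    url: String,
    note: Option<String>,
}

/// Parses "[label](url): note" (the note is optional).
fn parse_link(item: &str) -> Option<Link> {
    let rest = item.strip_prefix('[')?;
    let (label, rest) = rest.split_once("](")?;
    let (url, rest) = rest.split_once(')')?;
    let note = rest
        .strip_prefix(':')
        .map(|n| n.trim().to_string())
        .filter(|n| !n.is_empty());
    Some(Link { label: label.to_string(), url: url.to_string(), note })
}

fn parse_llms_txt(input: &str) -> LlmsTxt {
    let mut doc = LlmsTxt::default();
    for raw in input.lines() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(name) = line.strip_prefix("## ") {
            // Every H2 opens a new section.
            doc.sections.push(Section { name: name.trim().to_string(), links: Vec::new() });
        } else if let Some(title) = line.strip_prefix("# ") {
            // Keep the first H1 as the title.
            if doc.title.is_none() {
                doc.title = Some(title.trim().to_string());
            }
        } else if let Some(section) = doc.sections.last_mut() {
            if let Some(item) = line.strip_prefix("- ") {
                if let Some(link) = parse_link(item) {
                    section.links.push(link);
                }
            } else if let Some(link) = section.links.last_mut() {
                // A wrapped line continues the previous link's note.
                if let Some(note) = link.note.as_mut() {
                    note.push(' ');
                    note.push_str(line);
                }
            }
        } else if let Some(summary) = line.strip_prefix('>') {
            doc.summary = Some(summary.trim().to_string());
        } else {
            // Free text before the first H2 is a note.
            doc.notes.push(line.to_string());
        }
    }
    doc
}
```

A caller that wants the default expand behavior would skip the section whose name is Optional.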

Fetch

Fetch a site's llms.txt:

lectito llms fetch https://example.com

For bare site URLs, Lectito requests /llms.txt. Explicit URLs are used as given:

lectito llms fetch https://example.com/docs/llms.txt
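The resolution rule can be sketched with plain string handling. The `llms_txt_url` function below is a hypothetical helper, not Lectito's implementation; it assumes a "bare" URL is one with no path or only `/`, and it ignores query strings and fragments.

```rust
/// Returns the URL to request for a site's llms.txt.
fn llms_txt_url(input: &str) -> String {
    // Split off the scheme so the next '/' belongs to the path.
    let (scheme, rest) = match input.split_once("://") {
        Some(parts) => parts,
        // Not an absolute URL: leave it untouched.
        None => return input.to_string(),
    };
    match rest.split_once('/') {
        // Explicit path such as /docs/llms.txt: use the URL as given.
        Some((_, path)) if !path.is_empty() => input.to_string(),
        // Trailing slash only, e.g. https://example.com/
        Some((host, _)) => format!("{scheme}://{host}/llms.txt"),
        // No path at all, e.g. https://example.com
        None => format!("{scheme}://{rest}/llms.txt"),
    }
}
```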

You can write the result to a file:

lectito llms fetch https://example.com --output llms.txt

Parse

Parse an llms.txt file into JSON:

lectito llms parse llms.txt --pretty

This is useful for checking whether section names, optional links, and notes are being read as expected.

Expand

Expand linked resources into one Markdown file:

lectito llms expand llms.txt --output llms-full.txt

Lectito keeps Markdown resources unchanged. When a linked resource looks like HTML, Lectito extracts the readable article and inserts the extracted Markdown. For remote links, Lectito checks the HTTP Content-Type header before falling back to URL suffixes and simple Markdown markers.
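The detection order described here (header first, then URL suffix, then content markers) can be sketched as a small classifier. Everything in this block is an assumption made for illustration: the `ResourceKind` enum, the MIME types and suffixes checked, the leading `# ` marker test, and the choice to fall back to HTML when nothing matches.

```rust
#[derive(Debug, PartialEq)]
enum ResourceKind {
    Markdown,
    Html,
}

fn classify_resource(content_type: Option<&str>, url: &str, body: &str) -> ResourceKind {
    // 1. Trust the Content-Type header when it names a known type.
    if let Some(header) = content_type {
        let mime = header.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
        match mime.as_str() {
            "text/markdown" => return ResourceKind::Markdown,
            "text/html" | "application/xhtml+xml" => return ResourceKind::Html,
            _ => {} // e.g. text/plain: keep looking
        }
    }

    // 2. Fall back to the URL suffix, ignoring any query string or fragment.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url)
        .to_ascii_lowercase();
    if path.ends_with(".md") || path.ends_with(".markdown") {
        return ResourceKind::Markdown;
    }
    if path.ends_with(".html") || path.ends_with(".htm") {
        return ResourceKind::Html;
    }

    // 3. Last resort: a simple Markdown marker at the start of the body.
    if body.trim_start().starts_with("# ") {
        ResourceKind::Markdown
    } else {
        ResourceKind::Html
    }
}
```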

Each resource is separated and labeled:

---
# Source: Quick start
URL: https://example.com/docs/quick-start.md
Notes: First integration steps.
...

Use --include-optional to include the Optional section:

lectito llms expand llms.txt --include-optional --output llms-full.txt

Use --max-links when you want a smaller bundle:

lectito llms expand llms.txt --max-links 10

Generate

Generate an llms.txt file from a seed page:

lectito llms generate https://example.com/docs/ --output llms.txt

The crawler is intentionally bounded. For URL seeds, Lectito follows same-origin links only. For local HTML files, it follows relative local links. Assets such as images, stylesheets, scripts, PDFs, archives, and feeds are skipped.

To write the expanded context at the same time, pass --full:

lectito llms generate https://example.com/docs/ \
  --output llms.txt \
  --full llms-full.txt

--full-output is the same option with a more explicit name.

You can also generate from a sitemap:

lectito llms generate --sitemap https://example.com/sitemap.xml \
  --output llms.txt

Or discover sitemaps from a URL seed:

lectito llms generate https://example.com --discover \
  --output llms.txt

Discovery reads Sitemap: lines from robots.txt. When no sitemap is listed there, Lectito tries /sitemap.xml.
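A minimal sketch of that discovery rule, assuming the robots.txt body has already been fetched. The function name `sitemaps_from_robots` and the case-insensitive match on the `Sitemap` field name are choices made for this example, not confirmed Lectito behavior.

```rust
/// Collects sitemap URLs from a robots.txt body, falling back to
/// `<origin>/sitemap.xml` when none are listed.
fn sitemaps_from_robots(robots_txt: &str, origin: &str) -> Vec<String> {
    let mut found: Vec<String> = robots_txt
        .lines()
        .filter_map(|line| {
            // Split at the first ':' only, so the URL's own "://" stays intact.
            let (key, value) = line.split_once(':')?;
            if key.trim().eq_ignore_ascii_case("sitemap") {
                let value = value.trim();
                (!value.is_empty()).then(|| value.to_string())
            } else {
                None
            }
        })
        .collect();

    if found.is_empty() {
        found.push(format!("{}/sitemap.xml", origin.trim_end_matches('/')));
    }
    found
}
```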

Sitemap indexes are supported. Lectito reads child sitemaps up to --max-sitemaps, then fetches page URLs up to --max-pages:

lectito llms generate --sitemap https://example.com/sitemap.xml \
  --max-sitemaps 10 \
  --max-pages 100 \
  --output llms.txt

Remote sitemap generation keeps sitemap and page URLs on the same origin as the sitemap input. Local sitemap files may list any absolute page URL.

By default, generation fetches up to 25 pages and follows links up to depth 2:

lectito llms generate https://example.com/docs/ \
  --max-pages 10 \
  --max-depth 1
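The two limits interact in a simple way: a page is fetched only while the page budget lasts, and its links are followed only while the depth budget lasts. The sketch below models this as a breadth-first walk over an in-memory link graph. It is an illustration rather than Lectito's crawler, and it assumes the seed page counts as depth 0.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Returns the pages a bounded breadth-first crawl would fetch, in order.
/// `links` maps each page to the same-origin pages it links to.
fn bounded_crawl(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_pages: usize,
    max_depth: usize,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut fetched = Vec::new();

    visited.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        if fetched.len() >= max_pages {
            break; // page budget exhausted
        }
        fetched.push(url.clone());

        if depth >= max_depth {
            continue; // fetch this page, but do not follow its links
        }
        for next in links.get(url.as_str()).into_iter().flatten() {
            // `insert` returns false for pages that are already queued.
            if visited.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    fetched
}
```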

Use --filter for the common path and glob cases. Prefix a pattern with ! to exclude it:

lectito llms generate --sitemap https://example.com/sitemap.xml \
  --filter /docs/ \
  --filter '!/docs/archive/' \
  --filter '!*/drafts/*'

Patterns that start with / match URL paths. Plain path values are prefixes. Path patterns with * or ? are globs. Other glob patterns match the full URL.
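These rules can be sketched as a small matcher. The code below is an interpretation, not Lectito's implementation. It assumes that a URL is accepted when no exclude pattern matches and either no include patterns exist or at least one matches. It also assumes `*` matches any run of characters (including `/`) and that a glob must match the whole path or URL.

```rust
/// Minimal glob: `*` matches any run of characters, `?` matches exactly one.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    let mut star: Option<(usize, usize)> = None; // last `*` and the text index it covers
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` swallow one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

/// Extracts the path part of an absolute URL ("/" when there is none).
fn url_path(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    rest.find('/').map_or("/", |i| &rest[i..])
}

fn pattern_matches(pattern: &str, url: &str, path: &str) -> bool {
    if pattern.starts_with('/') {
        if pattern.contains('*') || pattern.contains('?') {
            glob_match(pattern, path) // path glob
        } else {
            path.starts_with(pattern) // plain path prefix
        }
    } else {
        glob_match(pattern, url) // other globs see the full URL
    }
}

/// Applies include patterns and `!`-prefixed exclude patterns to one URL.
fn is_accepted(filters: &[&str], url: &str) -> bool {
    let path = url_path(url);
    let mut has_include = false;
    let mut included = false;
    for filter in filters {
        match filter.strip_prefix('!') {
            Some(pattern) => {
                if pattern_matches(pattern, url, path) {
                    return false; // an exclude always wins
                }
            }
            None => {
                has_include = true;
                included |= pattern_matches(filter, url, path);
            }
        }
    }
    !has_include || included
}
```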

Use --delay to wait between page fetches:

lectito llms generate https://example.com/docs/ --delay 250

Remote generation checks robots.txt before fetching page URLs. Lectito keeps the existing browser-like user agent for HTTP requests, but evaluates robots rules as Lectito unless you pass another token:

lectito llms generate https://example.com/docs/ \
  --robots-agent LectitoDocsBot

Use --ignore-robots only when you explicitly want to bypass those checks:

lectito llms generate https://example.com/docs/ --ignore-robots

Only pages that produce readable article content are included. Each accepted page becomes one link in the generated file. Lectito uses the extracted title as the link label, switches to a page's canonical URL when one is available, and uses the extracted excerpt as the link note.

Remote generation also reads Last-Modified response headers. Sitemap generation reads lastmod values. When either value is present, Lectito adds it to the generated note and uses it as a small ranking signal. Ranking favors likely entry points such as docs roots, guides, API references, and pages with useful notes. Archive-like URLs are pushed down.

Set the generated title, summary, or section name when the defaults are too generic:

lectito llms generate https://example.com/docs/ \
  --title "Example Docs" \
  --summary "Public documentation for Example." \
  --section "Guides" \
  --output llms.txt

When To Use It

Use llms.txt when you want agents to start from a small, curated list of important pages. It works well for docs, public APIs, policy pages, and small knowledge bases.

Do not expect every model provider or search engine to read it. The reliable use case is explicit: a developer, tool, or agent asks Lectito to fetch or expand the file.

Basic Usage

Use extract when you want article content.

The function does not fetch the page. Pass it the HTML you want parsed. This is usually cleaner in applications because networking, caching, cookies, and browser rendering are application concerns.

use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    // HTML your application already fetched or rendered.
    let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;

    let options = ReadabilityOptions::default();
    let article = extract(html, Some("https://example.com/post"), &options)?;

    match article {
        Some(article) => println!("{}", article.text_content),
        None => eprintln!("no article content found"),
    }
    Ok(())
}

The base URL is optional. Pass it when the document contains relative links, images, or metadata URLs.

Raw HTML Limits

Lectito parses the HTML string you pass in. It does not run JavaScript, keep a browser session, submit forms, attach cookies, or fetch authenticated resources. For pages that build their article body on the client, capture rendered HTML in your crawler or browser automation layer before calling extract.

The CLI fetches URLs as a convenience, but it has the same raw-HTML boundary. If a site needs login state, consent flows, or browser-specific state, fetch that page in your own application (or a browser) and pass the resulting HTML through stdin or the Rust API.

When extraction succeeds, Lectito returns Some(Article). When the page parses but does not contain a useful article, it returns None. Reserve error handling for invalid base URLs, configured size limits, and serialization failures.

Article Output

Article contains the extracted content in several forms:

if let Some(article) = article {
    println!("{}", article.content);      // cleaned article HTML
    println!("{}", article.markdown);     // Markdown rendering
    println!("{}", article.text_content); // normalized plain text
}

Use extract_with_diagnostics when you need to see how extraction chose a root. Diagnostics are meant for development and regression work. Most application code should call extract.

use lectito::{extract_with_diagnostics, ReadabilityOptions};

fn print_report(html: &str, base_url: Option<&str>) -> Result<(), lectito::Error> {
    let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;

    if let Some(article) = report.article {
        println!("{}", article.markdown);
    }

    // Diagnostics go to stderr so stdout stays clean for the article.
    eprintln!("{:?}", report.diagnostics.outcome);

    Ok(())
}

Configuration

ReadabilityOptions controls extraction.

The defaults are conservative. They favor article pages with enough text to be useful and avoid exposing internal scoring knobs unless they affect common integration cases.

use lectito::{MediaRetention, ReadabilityOptions};

let options = ReadabilityOptions {
    char_threshold: 800,
    nb_top_candidates: 8,
    content_selector: Some("article".to_string()),
    site_profiles: Vec::new(),
    media_retention: MediaRetention::Article,
    ..ReadabilityOptions::default()
};

Fields:

| Field | Default | Meaning |
| --- | --- | --- |
| max_elems_to_parse | None | Reject documents above this element count. |
| nb_top_candidates | 5 | Number of high-scoring candidates to consider. |
| char_threshold | 500 | Minimum extracted text length for an accepted attempt. |
| content_selector | None | CSS selector to force as the content root. |
| site_profiles | [] | TOML site profiles for host-scoped extraction hints. |
| mobile_viewport_width | Some(480) | Width used by recovery rules for mobile snapshots. |
| classes_to_preserve | [] | Class names kept during cleanup. |
| keep_classes | false | Keep all class attributes. |
| disable_json_ld | false | Skip JSON-LD metadata extraction. |
| link_density_modifier | 0.0 | Adjust link-density cleanup tolerance. |
| media_retention | Article | Control figure/image/media retention. |

Prefer content_selector when you already know the page shape. It bypasses root scoring for that document, then runs the normal cleanup pipeline.

When content_selector is not set, Lectito still tries a small list of common article-body containers such as #article-body and .entry-content before generic scoring. That catches many large publisher pages without site-specific profiles.

Use site_profiles when you want URL-scoped extraction hints, removal selectors, and metadata hints. Profiles are attempted before generic scoring, but weak profile output falls back to the generic extractor.

Use max_elems_to_parse as a guardrail for untrusted input. It rejects very large documents before extraction work continues.

Use media_retention when output fidelity matters. Article keeps body figures and images by default; None removes media; Conservative is text-first; All keeps media that remains in the selected article subtree.

ReadableOptions controls is_probably_readable.

Lower min_content_length for short posts or documentation pages. Raise min_score when you want the quick check to reject borderline pages.

use lectito::ReadableOptions;

let options = ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
};

Output Formats

Lectito produces all output formats during extraction.

The formats come from the same cleaned article root. That means callers can store HTML for fidelity, use Markdown for display or editing, and use plain text for search without running extraction multiple times.

// `extract` can return `Ok(None)`, so avoid `unwrap()` on the result.
if let Some(article) = extract(html, base_url, &ReadabilityOptions::default())? {
    let html = article.content;       // cleaned article HTML
    let markdown = article.markdown;  // Markdown
    let text = article.text_content;  // plain text
}

HTML

content is cleaned article HTML. Scripts, styles, navigation, sidebars, and other page chrome are removed where possible. Relative URLs are resolved when a base URL is provided.

Use HTML when you need the closest representation of the extracted article. It keeps images, links, tables, inline markup, and other structure that can be lost in plain text.

Markdown

markdown is generated from the cleaned article HTML. It preserves common reader content:

  • headings
  • paragraphs
  • links and images
  • lists
  • blockquotes
  • code blocks
  • tables
  • math
  • footnotes

Markdown cleanup also strips zero-width break hints, drops empty links, keeps images intact, and removes duplicate title headings before rendering.

The CLI Markdown output includes TOML frontmatter:

lectito article.html

Markdown is useful when the next step is a reader view, note-taking system, static archive, or editor. It is also easier to diff in tests than HTML.

Plain Text

text_content is normalized article text. Use it for indexing, previews, and readability checks.

Plain text should not be treated as a rendering format. It discards links, images, and most document structure.

JSON

The CLI can serialize the article:

lectito article.html --format json --pretty

JSON is the best CLI format when another program needs metadata and content together.

Quality Expectations

| Output | Best use | Expect | Do not expect |
| --- | --- | --- | --- |
| Markdown | Reader views, notes, archives, editing | Good preservation of headings, paragraphs, links, images, lists, blockquotes, code, tables, math, and footnotes. | Byte-for-byte source fidelity or every custom widget. |
| HTML | Rendering or post-processing extracted articles | The closest structural view of the cleaned article root, with links and media kept according to options. | A complete sanitizer policy or the original page layout. |
| Text | Search, previews, indexing, basic summaries | Normalized article text with block boundaries for headings, paragraphs, lists, code, and definition lists. | A rich rendering format with links, images, or full table structure. |
| JSON | Programmatic CLI integrations | Metadata plus HTML, Markdown, text, length, and source-related fields in one object. | Stable values for publisher metadata when source pages disagree or omit fields. |
| inspect | Debugging extraction choices | Selected root, candidate scores, cleanup counts, recovery data, and site-rule information. | A user-facing article format. |
| readable | Cheap filtering before full extraction | A boolean estimate using text length, visibility, class/id hints, and link density. | The same answer full extraction would produce on every borderline page. |

How It Works

Lectito follows the same broad approach as Mozilla Readability, with a few fast paths for common article snapshots.

The extractor starts with a full HTML document and tries to find the subtree that behaves like an article. It uses signals that tend to survive across sites: text length, paragraph density, semantic tags, class and id names, and the ratio of links to readable text.

  1. Recover useful content from raw HTML snapshots, including declarative shadow DOM.
  2. Parse the document.
  3. Recover useful content from parsed snapshots, including selected mobile and shadow-root cases.
  4. Extract metadata, including JSON-LD before scripts are stripped.
  5. Accept long JSON-LD article text when structured data contains the body.
  6. Try known article containers such as #article-body before broad scoring.
  7. Try a matching site profile or code extractor when one applies.
  8. Remove scripts, styles, hidden nodes, and unlikely content.
  9. Score candidate content roots by text length, tag type, class/id hints, and link density.
  10. Select the best root and include useful siblings.
  11. Clean the selected content.
  12. Apply schema text fallback when structured data is clearly better.
  13. Return HTML, Markdown, text, and diagnostics.
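Step 9 combines several signals, of which link density is the easiest to show in isolation. The sketch below is loosely modeled on the general Readability approach of scaling a candidate's score by the share of its text that is not link text. The weights, the `class_hint` parameter, and both function names are illustrative and are not Lectito's actual scoring.

```rust
/// Share of a node's text that sits inside links, from 0.0 to 1.0.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / text_len as f64
}

/// Illustrative score: a base value, a capped bonus for text length, an
/// optional class/id hint, all scaled down by link density.
fn candidate_score(text_len: usize, link_text_len: usize, class_hint: f64) -> f64 {
    let length_bonus = (text_len as f64 / 100.0).min(3.0);
    let base = 1.0 + length_bonus + class_hint;
    base * (1.0 - link_density(text_len, link_text_len))
}
```

With these numbers, a navigation block whose text is almost entirely links scores near zero even when it is long, while an article body of the same length keeps nearly its full score.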

Extraction runs several attempts. Later attempts relax cleanup rules when the first pass produces too little text. The first attempt that reaches char_threshold is accepted. If no attempt reaches the threshold, Lectito may return the best non-empty attempt.

This retry model matters because pages fail in different ways. Some pages hide the useful content behind classes that look like chrome. Others include enough related links or widgets to pull the score away from the main text. Relaxed attempts give Lectito another chance without making the first pass too loose.
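The acceptance rule can be sketched as a small selection function. It takes the extracted text length of each attempt in order and returns the index of the chosen attempt. The `Outcome` enum here is a stand-in, and the sketch assumes that "best" means the longest non-empty attempt, which the text above does not spell out.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Index of the first attempt that reached the threshold.
    Accepted(usize),
    /// Index of the longest non-empty attempt below the threshold.
    BestAttempt(usize),
    NoContent,
}

fn select_attempt(attempt_lengths: &[usize], char_threshold: usize) -> Outcome {
    let mut best: Option<(usize, usize)> = None; // (attempt index, text length)
    for (index, &len) in attempt_lengths.iter().enumerate() {
        if len >= char_threshold {
            // Stop early: later, more relaxed attempts never run.
            return Outcome::Accepted(index);
        }
        if len > 0 && best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((index, len));
        }
    }
    match best {
        Some((index, _)) => Outcome::BestAttempt(index),
        None => Outcome::NoContent,
    }
}
```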

content_selector can short-circuit root selection for known documents:

let options = ReadabilityOptions {
    content_selector: Some("main article".to_string()),
    ..ReadabilityOptions::default()
};

Lectito also has a small built-in list of known content containers, including #article-body, [itemprop='articleBody'], .article-body, and .entry-content. These are attempted before generic scoring. They still go through cleanup, media handling, URL rewriting, and diagnostics.

Site profiles provide URL-scoped hints without disabling generic extraction:

let options = ReadabilityOptions {
    site_profiles: vec![r#"
        name = "example"
        hosts = ["example.com"]
        content_roots = ["article"]
        remove = [".ad", "nav"]
    "#.to_string()],
    ..ReadabilityOptions::default()
};

If a profile produces content below char_threshold, Lectito records the profile decision in diagnostics and continues with generic readability attempts.

After the root is selected, cleanup removes empty nodes, normalizes links and media, preserves selected classes, and prepares the HTML for Markdown and text conversion.

Diagnostics

Use diagnostics to inspect extraction decisions.

Diagnostics are for development, fixture work, and bug reports. They explain which candidates were considered, which root was selected, and why an extraction was accepted or downgraded to a best attempt.

use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;
println!("{:?}", report.diagnostics.outcome);

ExtractionReport contains:

  • article: the extracted article, if found
  • diagnostics: details about attempts and candidate selection

Outcomes:

| Outcome | Meaning |
| --- | --- |
| Accepted | An attempt met char_threshold. |
| BestAttempt | No attempt met the threshold, but non-empty content was found. |
| NoContent | No useful content was found. |

Each attempt records:

  • cleanup flags
  • candidate count
  • top candidates
  • entry points
  • selected root
  • cleanup counts
  • recovery counts
  • extracted text length

Fast paths such as JSON-LD article text or a known content container may record an accepted attempt with candidate_count = 0. That means Lectito accepted a specific root before generic candidate scoring ran.

When a site profile or code extractor matches, diagnostics include site_rule. That record reports the matched profile or extractor, whether it was bundled, which roots were selected, how many removals ran, whether the result met char_threshold, and any fallback reason.

Start with outcome, selected_root, and text_len. If the selected root is wrong, inspect the candidate list. If the root is right but output is noisy, inspect cleanup counts and preserved classes.

CLI diagnostics:

lectito article.html --diagnostic-format pretty
lectito article.html --diagnostic-format json
lectito inspect article.html

API Overview

Lectito has two public API targets:

  • Rust Crate API for native Rust applications, CLIs, and server integrations.
  • WASM API for browser, web worker, bundler, and Node.js integrations.

Both targets use the same core extractor and Markdown conversion logic. The Rust crate is the source of truth; the WASM crate maps that API into JavaScript types and camelCase option names.
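As a rough guide to the naming convention, the sketch below converts a Rust field name to camelCase mechanically. It is only an illustration of the convention. The hypothetical `to_camel_case` helper is not part of either crate, and the authoritative JavaScript option names are the ones in the generated lectito_wasm.d.ts.

```rust
/// Converts a snake_case field name to camelCase.
fn to_camel_case(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    let mut upper_next = false;
    for ch in field.chars() {
        if ch == '_' {
            // Drop the underscore and capitalize the following character.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}
```

Under this convention, `char_threshold` would map to `charThreshold` and `nb_top_candidates` to `nbTopCandidates`.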

Rust Crate API

The crate exposes the extraction API, output structs, diagnostics, errors, and Markdown helpers.

Public exports from lectito:

pub use config::{Article, MarkdownOptions, MediaRetention, ReadabilityOptions, ReadableOptions};
pub use diagnostics::{
    AttemptDiagnostic, CandidateDiagnostic, CandidateSelection,
    CleanupDiagnostic, ContentSelectorDiagnostic, ExtractionDiagnostics,
    ExtractionOutcome, ExtractionReport, FlagDiagnostic, NodeDiagnostic,
    RecoveryDiagnostic,
};
pub use error::Error;
pub use extract::{clean_article_html, extract, extract_with_diagnostics};
pub use markdown::{html_to_markdown, markdown_to_html, markdown_with_toml_frontmatter};
pub use readable::is_probably_readable;

Extraction

Use extract for normal application code.

pub fn extract(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<Article>, Error>

Returns Ok(Some(article)) when content is found, Ok(None) when the document has no useful article content, and Err for invalid input or processing failures.

Extraction tries JSON-LD article text and common article-body containers before generic readability scoring.

Set content_selector when you already know the article root.

Set disable_json_ld when structured data is wrong for the page.

Use extract_with_diagnostics when you need extraction details in addition to the article.

pub fn extract_with_diagnostics(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<ExtractionReport, Error>

Returns the same article result with extraction diagnostics.

Use clean_article_html when you only need the cleaned article HTML.

pub fn clean_article_html(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<String>, Error>

Readability Check

Use is_probably_readable before full extraction when you are filtering many documents.

pub fn is_probably_readable(
    html: &str,
    options: &ReadableOptions,
) -> Result<bool, Error>

Returns a quick readability estimate without full extraction.

Markdown

The Markdown helpers are available separately for callers that already have a clean HTML fragment, want to render Markdown as HTML, or want CLI-style frontmatter.

pub fn html_to_markdown(html: &str) -> String

Converts HTML fragments to Markdown.

pub fn markdown_to_html(markdown: &str, options: &MarkdownOptions) -> String

Converts Markdown to HTML using CommonMark/GFM options.

pub fn markdown_with_toml_frontmatter(
    article: &Article,
    source: Option<&str>,
) -> Result<String, Error>

Formats an article as Markdown with TOML frontmatter.

WASM API

The npm package @stormlightlabs/lectito exposes Lectito to JavaScript through wasm-bindgen.

It supports browser, web worker, bundler, and Node.js use.

npm install @stormlightlabs/lectito

The Rust crate is still named lectito-wasm.

Build Targets

wasm-pack build crates/wasm --target bundler
wasm-pack build crates/wasm --target web
wasm-pack build crates/wasm --target nodejs

wasm-pack writes lectito_wasm.d.ts with the public TypeScript API.

Initialization

Bundler builds initialize when imported:

import { extract } from "@stormlightlabs/lectito";

const article = extract(html, "https://example.com/post");

The web target needs the async initializer:

import init, { extract } from "./lectito_wasm.js";

await init();

const article = extract(html, "https://example.com/post");

The nodejs target initializes when loaded:

const { extract } = require("./lectito_wasm.js");

const article = extract(html, "https://example.com/post");

Functions

export function extract(
  html: string,
  baseUrl?: string | null,
  options?: ReadabilityOptions | null,
): Article | null;

export function extractWithDiagnostics(
  html: string,
  baseUrl?: string | null,
  options?: ReadabilityOptions | null,
): ExtractionReport;

export function isProbablyReadable(html: string, options?: ReadableOptions | null): boolean;

export function cleanHtml(
  html: string,
  baseUrl?: string | null,
  options?: CleanHtmlOptions | null,
): string | null;

export function htmlToMarkdown(html: string): string;

export function markdownToHtml(markdown: string, options?: MarkdownOptions | null): string;

Types

Option fields use camelCase. Returned article fields keep the core Rust snake_case names.

export type MediaRetention = "none" | "conservative" | "article" | "all";

export interface ReadabilityOptions {
  maxElemsToParse?: number | null;
  nbTopCandidates?: number;
  charThreshold?: number;
  contentSelector?: string | null;
  siteProfiles?: string[];
  mobileViewportWidth?: number | null;
  classesToPreserve?: string[];
  keepClasses?: boolean;
  disableJsonLd?: boolean;
  linkDensityModifier?: number;
  mediaRetention?: MediaRetention;
}

export interface ReadableOptions {
  minContentLength?: number;
  minScore?: number;
}

export interface MarkdownOptions {
  gfm?: boolean;
  footnotes?: boolean;
  math?: boolean;
  allowRawHtml?: boolean;
}

export type CleanHtmlOptions = ReadabilityOptions;

export interface Article {
  title?: string | null;
  byline?: string | null;
  dir?: string | null;
  lang?: string | null;
  content: string;
  markdown: string;
  text_content: string;
  length: number;
  excerpt?: string | null;
  site_name?: string | null;
  published_time?: string | null;
  image?: string | null;
  domain?: string | null;
  favicon?: string | null;
}

export interface ExtractionReport {
  article: Article | null;
  diagnostics: unknown;
}

mediaRetention accepts "none", "conservative", "article", or "all".

Errors

Functions throw JavaScript Error objects for invalid base URLs, oversized documents, option conversion failures, and serialization failures.

Sanitization

cleanHtml performs Lectito article cleanup. It is not a complete untrusted-HTML security policy.

Browser integrations that accept arbitrary HTML should run a dedicated sanitizer such as DOMPurify before passing content into Lectito. Sanitize again before rendering returned HTML when the original input is untrusted.

Release Checks

Run the WASM tests and build all supported package targets:

pnpm --dir web exec wasm-pack test --node ../crates/wasm
pnpm --dir web exec wasm-pack build ../crates/wasm --target bundler --out-dir ../../target/wasm-pack/bundler
pnpm --dir web exec wasm-pack build ../crates/wasm --target web --out-dir ../../target/wasm-pack/web
pnpm --dir web exec wasm-pack build ../crates/wasm --target nodejs --out-dir ../../target/wasm-pack/nodejs

The build commands run wasm-opt; restricted sandboxes may need permission to execute it.

Article

Article is the extraction result.

The struct is serializable and contains both content and metadata. The content fields are generated from the selected article root; metadata can come from document metadata, JSON-LD, Open Graph tags, or the extracted content itself.

pub struct Article {
    pub title: Option<String>,
    pub byline: Option<String>,
    pub dir: Option<String>,
    pub lang: Option<String>,
    pub content: String,
    pub markdown: String,
    pub text_content: String,
    pub length: usize,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub published_time: Option<String>,
    pub image: Option<String>,
    pub domain: Option<String>,
    pub favicon: Option<String>,
}

Fields:

Field           Meaning
title           Best title from metadata or document content.
byline          Author/byline when detected.
dir             Text direction, such as ltr or rtl.
lang            Document language when detected.
content         Cleaned article HTML.
markdown        Markdown generated from content.
text_content    Plain text generated from content.
length          UTF-16 length of extracted text, matching Mozilla Readability.
excerpt         Short summary or first useful paragraph.
site_name       Publisher or site name.
published_time  Publication timestamp when detected.
image           Lead image URL when detected.
domain          Source domain when available.
favicon         Favicon URL when detected.

content, markdown, and text_content are different views of the same extracted article. Prefer content when structure matters, markdown when the article will be displayed or edited as text, and text_content when indexing or summarizing.

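length counts UTF-16 code units, following Mozilla Readability's convention. Characters outside the Basic Multilingual Plane take two code units, so the value can differ from Rust's chars().count(). It also differs from str::len(), which counts UTF-8 bytes.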
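The difference between the counting conventions is easy to see with the standard library alone. This is a small sketch of the convention, not Lectito's internal code; the helper name utf16_len is an assumption made for illustration.

```rust
/// Counts UTF-16 code units, the unit JavaScript uses for `String.length`.
fn utf16_len(text: &str) -> usize {
    text.encode_utf16().count()
}

fn main() {
    // U+00E9 sits inside the Basic Multilingual Plane: one UTF-16 code unit.
    let bmp = "caf\u{e9}";
    // U+1F600 sits outside the BMP: a surrogate pair, two UTF-16 code units.
    let astral = "hi \u{1F600}";

    for text in [bmp, astral] {
        println!(
            "utf16={} chars={} bytes={}",
            utf16_len(text),
            text.chars().count(),
            text.len()
        );
    }
}
```

For the first string the UTF-16 and char counts agree (4 and 4) while the byte length is 5. For the second, the UTF-16 count is 5, the char count is 4, and the byte length is 7.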

Options

ReadabilityOptions

ReadabilityOptions changes extraction behavior. Most callers should start with ReadabilityOptions::default() and only set fields that solve a specific problem.

pub struct ReadabilityOptions {
    pub max_elems_to_parse: Option<usize>,
    pub nb_top_candidates: usize,
    pub char_threshold: usize,
    pub content_selector: Option<String>,
    pub site_profiles: Vec<String>,
    pub mobile_viewport_width: Option<usize>,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
    pub disable_json_ld: bool,
    pub link_density_modifier: f32,
    pub media_retention: MediaRetention,
}

pub enum MediaRetention {
    None,
    Conservative,
    Article,
    All,
}

Defaults:

ReadabilityOptions {
    max_elems_to_parse: None,
    nb_top_candidates: 5,
    char_threshold: 500,
    content_selector: None,
    site_profiles: Vec::new(),
    mobile_viewport_width: Some(480),
    classes_to_preserve: Vec::new(),
    keep_classes: false,
    disable_json_ld: false,
    link_density_modifier: 0.0,
    media_retention: MediaRetention::Article,
}

content_selector is the most direct override. Use it when the caller knows where the article lives in the document. When it is unset, Lectito still tries a small built-in list of common article-body containers before generic scoring.

site_profiles accepts TOML profile strings that provide host-scoped content roots, removal selectors, metadata hints, cleanup settings, and fallback behavior. Profiles run before generic scoring, after the JSON-LD and known container fast paths.

char_threshold is the minimum text length an extraction attempt must reach before it is accepted. nb_top_candidates controls how many candidates remain in play during generic scoring.

disable_json_ld skips JSON-LD metadata extraction and the JSON-LD article-body fast path. It does not disable Open Graph, Twitter card, or DOM metadata.

media_retention controls image and media preservation in the extracted article:

  • None: remove figures, images, and embedded media from content.
  • Conservative: text-first cleanup; media survives only if the generic extractor keeps it.
  • Article: keep figures/images that look like article body content. This is the default.
  • All: keep media that remains in the selected article subtree, subject to unsafe/embed cleanup.

ReadableOptions

ReadableOptions only affects is_probably_readable. It does not change full article extraction.

pub struct ReadableOptions {
    pub min_content_length: usize,
    pub min_score: f32,
}

Use lower thresholds for short-form content. Use higher thresholds when false positives are more expensive than missed articles.

Defaults:

ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
}

Site Profiles

Site profiles are TOML extraction hints scoped by URL host. They are useful when a site has a stable content container or predictable clutter, but still returns ordinary article-shaped HTML.

Profiles run before generic readability scoring. If a profile produces text below char_threshold, Lectito records the profile decision in diagnostics and continues with generic extraction.
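The accept-or-fall-through behavior can be pictured as an ordered list of attempts checked against a threshold. The sketch below is a simplified model and not Lectito's implementation: the Attempt struct, the first_accepted function, and the skipped list (standing in for diagnostics) are all hypothetical, and measuring length in characters is an assumption.

```rust
/// One extraction attempt: a label plus the text it produced, if any.
/// Hypothetical stand-in, not a Lectito type.
struct Attempt {
    name: &'static str,
    text: Option<String>,
}

/// Returns the name of the first attempt whose text reaches `char_threshold`.
/// Rejected attempts are recorded in `skipped`, standing in for diagnostics.
fn first_accepted(
    attempts: &[Attempt],
    char_threshold: usize,
    skipped: &mut Vec<&'static str>,
) -> Option<&'static str> {
    for attempt in attempts {
        // Assumption: length is measured in characters in this sketch.
        let len = attempt.text.as_deref().map_or(0, |t| t.chars().count());
        if len >= char_threshold {
            return Some(attempt.name);
        }
        skipped.push(attempt.name);
    }
    None
}

fn main() {
    let attempts = [
        Attempt { name: "site-profile", text: Some("Too short.".to_string()) },
        Attempt { name: "generic", text: Some("x".repeat(600)) },
    ];
    let mut skipped = Vec::new();
    let winner = first_accepted(&attempts, 500, &mut skipped);
    println!("accepted: {winner:?}, skipped: {skipped:?}");
}
```

Here the profile result is shorter than the threshold of 500, so it is recorded as skipped and the generic attempt is accepted instead.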

Example

name = "example"
hosts = ["example.com"]
subdomains = true
path_prefixes = ["/blog"]
exclude_path_prefixes = ["/blog/comments"]
content_roots = ["article", "#content"]
remove = [".ad", "nav", "footer"]
remove_id_or_class = ["sidebar"]

[metadata]
title = ["h1"]
author = [".byline"]
date = ["time/@datetime"]
image = ["meta[property='og:image']/@content"]
site_name = "Example"
title_suffixes = [" - Example"]

[cleanup]
enabled = true
prune = true

[fallback]
generic_on_empty = true

Fields

Field                  Meaning
name                   Human-readable profile name used in diagnostics.
hosts                  Hosts matched by the profile. www. is ignored during matching.
subdomains             When true, subdomains of each host also match.
path_prefixes          Optional path prefixes. Omit to match every path on the host.
exclude_path_prefixes  Optional path prefixes that suppress the profile after host matching.
content_roots          CSS selectors or supported XPath selectors for article roots.
remove                 CSS selectors or supported XPath selectors to remove before extraction.
remove_id_or_class     Exact id or class tokens to remove.

Metadata fields are optional selector lists, except site_name, which is a constant. Selectors may target attributes with the supported XPath .../@attr form.

Cleanup defaults to enabled. prune controls conditional cleanup. Disabling cleanup should be reserved for sites where the profile root is already clean and generic cleanup removes useful structure.
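The host and path scoping fields described above combine into a single yes/no decision per URL. The following is a rough sketch of that decision under stated assumptions, not Lectito's matcher: the ProfileScope struct and both functions are hypothetical, host comparison is assumed to be case-insensitive, and path prefixes are assumed to be plain string prefixes.

```rust
/// Hypothetical subset of a profile's scoping fields.
struct ProfileScope {
    hosts: Vec<String>,
    subdomains: bool,
    path_prefixes: Vec<String>,
    exclude_path_prefixes: Vec<String>,
}

/// Lowercases the host and drops a leading "www." so that
/// "www.example.com" and "example.com" compare equal.
fn normalize_host(host: &str) -> String {
    let lower = host.to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(lower.as_str()).to_string()
}

fn scope_matches(scope: &ProfileScope, host: &str, path: &str) -> bool {
    let host = normalize_host(host);
    let host_ok = scope.hosts.iter().any(|candidate| {
        let candidate = normalize_host(candidate);
        // Exact match, or a subdomain match when the profile allows it.
        host == candidate
            || (scope.subdomains && host.ends_with(format!(".{candidate}").as_str()))
    });
    // An empty prefix list matches every path on the host.
    let path_ok = scope.path_prefixes.is_empty()
        || scope.path_prefixes.iter().any(|p| path.starts_with(p.as_str()));
    // Exclusions suppress the profile after the host has matched.
    let excluded = scope
        .exclude_path_prefixes
        .iter()
        .any(|p| path.starts_with(p.as_str()));
    host_ok && path_ok && !excluded
}

fn main() {
    let scope = ProfileScope {
        hosts: vec!["example.com".to_string()],
        subdomains: true,
        path_prefixes: vec!["/blog".to_string()],
        exclude_path_prefixes: vec!["/blog/comments".to_string()],
    };
    for (host, path) in [
        ("www.example.com", "/blog/post"),
        ("news.example.com", "/blog/post"),
        ("example.com", "/blog/comments/1"),
        ("example.com", "/about"),
    ] {
        println!("{host}{path}: {}", scope_matches(&scope, host, path));
    }
}
```

With the values from the example profile, the first two URLs match and the last two do not: one falls under an excluded prefix and the other is outside /blog.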

Selector Support

Profiles accept CSS selectors directly. They also accept a focused XPath subset for compatibility with rule corpuses and older bundled rules:

  • //tag
  • //*[@id='value']
  • //tag[@class='a b']
  • //tag[contains(@class, 'value')]
  • /text() suffixes
  • /@attribute suffixes for metadata selectors

Unsupported XPath expressions are ignored by selector matching, so bundled profiles should have tests that prove their roots match representative pages.

User Profiles

Rust callers pass profile TOML strings through ReadabilityOptions:

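use lectito::ReadabilityOptions;

// Each entry is the TOML text of one profile, not a file path.
// The `?` assumes an enclosing function that returns a Result.
let options = ReadabilityOptions {
    site_profiles: vec![std::fs::read_to_string("example.com.toml")?],
    ..ReadabilityOptions::default()
};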

The CLI accepts repeatable profile paths:

lectito article.html --base-url https://example.com/post --site-profile example.com.toml

User profiles take precedence over bundled profiles. More specific host and path matches win within each source group.