Configuration

ReadabilityOptions controls extraction.

The defaults are conservative: they favor article pages with enough text to be useful. Internal scoring knobs are exposed only when they affect common integration cases.

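#![allow(unused)]
fn main() {
    use lectito::{MediaRetention, ReadabilityOptions};

    let options = ReadabilityOptions {
        // Require more extracted text than the default of 500 characters.
        char_threshold: 800,
        // Consider more high-scoring candidates than the default of 5.
        nb_top_candidates: 8,
        // Force the element matching this CSS selector as the content root.
        content_selector: Some("article".to_string()),
        // No site profiles; this matches the default.
        site_profiles: Vec::new(),
        // Keep body figures and images; this matches the default.
        media_retention: MediaRetention::Article,
        // Every other field keeps its default value.
        ..ReadabilityOptions::default()
    };
}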

Fields:

Field	Default	Meaning
`max_elems_to_parse`	`None`	Reject documents above this element count.
`nb_top_candidates`	`5`	Number of high-scoring candidates to consider.
`char_threshold`	`500`	Minimum extracted text length for an accepted attempt.
`content_selector`	`None`	CSS selector to force as the content root.
`site_profiles`	`[]`	TOML site profiles for host-scoped extraction hints.
`mobile_viewport_width`	`Some(480)`	Width used by recovery rules for mobile snapshots.
`classes_to_preserve`	`[]`	Class names kept during cleanup.
`keep_classes`	`false`	Keep all class attributes.
`disable_json_ld`	`false`	Skip JSON-LD metadata extraction.
`link_density_modifier`	`0.0`	Adjust link-density cleanup tolerance.
`media_retention`	`Article`	Control figure/image/media retention.

Prefer content_selector when you already know the page shape. It bypasses root scoring for that document, then runs the normal cleanup pipeline.

When content_selector is not set, Lectito still tries a small list of common article-body containers such as #article-body and .entry-content before generic scoring. That catches many large publisher pages without site-specific profiles.

Use site_profiles when you want URL-scoped extraction hints, removal selectors, and metadata hints. Profiles are attempted before generic scoring, but weak profile output falls back to the generic extractor.
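The fallback behavior described above can be pictured as a chain of attempts in which the first sufficiently strong result wins. The following is a minimal sketch, not Lectito's implementation: the `Attempt` struct and the `first_accepted` function are hypothetical names, and the sketch assumes that a "weak" result is one shorter than `char_threshold`, which the field table suggests but does not state outright. The caller decides the order of attempts.

```rust
/// Hypothetical stand-in for one extraction attempt: the strategy that
/// produced it and the text it extracted.
struct Attempt {
    strategy: &'static str,
    text: String,
}

/// Return the strategy name of the first attempt whose text reaches
/// `char_threshold` characters, or `None` when every attempt is too weak.
/// Assumption: "weak" means "shorter than `char_threshold`".
fn first_accepted(attempts: &[Attempt], char_threshold: usize) -> Option<&'static str> {
    attempts
        .iter()
        .find(|a| a.text.chars().count() >= char_threshold)
        .map(|a| a.strategy)
}
```

In this model, a profile attempt that yields only 120 characters is skipped at a threshold of 500, and a later generic attempt with 900 characters is accepted instead.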

Use max_elems_to_parse as a guardrail for untrusted input. It rejects very large documents before extraction work continues.
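As an illustration of the guardrail, here is a minimal sketch of the kind of check it implies. The `TooManyElements` type and the `check_element_budget` function are hypothetical and not part of Lectito's API. The sketch also assumes that the limit is an `Option<usize>` and that `None` means "no limit", which is consistent with the documented default but not confirmed here.

```rust
/// Hypothetical error for a document that exceeds the element budget.
#[derive(Debug, PartialEq)]
struct TooManyElements {
    found: usize,
    limit: usize,
}

/// Reject a document whose element count is above the limit.
/// Assumption: `None` means "no limit", matching the documented default.
fn check_element_budget(
    element_count: usize,
    max_elems_to_parse: Option<usize>,
) -> Result<(), TooManyElements> {
    match max_elems_to_parse {
        // Strictly above the limit: stop before any extraction work.
        Some(limit) if element_count > limit => Err(TooManyElements {
            found: element_count,
            limit,
        }),
        // At or below the limit, or no limit configured.
        _ => Ok(()),
    }
}
```

A document with exactly the limit's number of elements passes in this sketch, because the table describes rejection for counts above the limit.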

Use media_retention when output fidelity matters. Article, the default, keeps body figures and images; None removes media; Conservative is text-first; All keeps any media that remains in the selected article subtree.

ReadableOptions controls is_probably_readable.

Lower min_content_length for short posts or documentation pages. Raise min_score when you want the quick check to reject borderline pages.

#![allow(unused)]
fn main() {
    use lectito::ReadableOptions;

    let options = ReadableOptions {
        min_content_length: 140,
        min_score: 20.0,
    };
}
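To make the interaction between the two thresholds concrete, here is a minimal sketch modeled on the quick check in Mozilla's Readability.js, whose option names these fields resemble. Whether Lectito uses the same formula is an assumption. The `QuickCheck` struct and the `probably_readable` function are hypothetical, and the input is simplified to a list of text lengths for candidate blocks such as paragraphs.

```rust
/// Hypothetical mirror of the two `ReadableOptions` fields shown above.
struct QuickCheck {
    min_content_length: usize,
    min_score: f64,
}

/// Sketch of a Readability-style quick check over the text lengths of
/// candidate blocks. Assumption: Lectito may score differently.
fn probably_readable(block_lengths: &[usize], opts: &QuickCheck) -> bool {
    let mut score = 0.0_f64;
    for &len in block_lengths {
        // Blocks shorter than the minimum contribute nothing.
        if len < opts.min_content_length {
            continue;
        }
        // Longer blocks add the square root of their surplus length.
        score += ((len - opts.min_content_length) as f64).sqrt();
        // Stop as soon as the running score exceeds the threshold.
        if score > opts.min_score {
            return true;
        }
    }
    false
}
```

Under this model, three 140-character paragraphs score nothing with `min_content_length: 140`, but pass a `min_score` of 20.0 once `min_content_length` is lowered to 40. That is why lowering the length threshold helps short posts, while raising `min_score` makes borderline pages fail.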
