Configuration
ReadabilityOptions controls extraction.
The defaults are conservative. They favor article pages with enough text to be useful and avoid exposing internal scoring knobs unless they affect common integration cases.

    use lectito::{MediaRetention, ReadabilityOptions};

    fn main() {
        // Override a few fields and take the rest from the defaults.
        let options = ReadabilityOptions {
            char_threshold: 800,
            nb_top_candidates: 8,
            content_selector: Some("article".to_string()),
            site_profiles: Vec::new(),
            media_retention: MediaRetention::Article,
            ..ReadabilityOptions::default()
        };
    }

Fields:
| Field | Default | Meaning |
|---|---|---|
| max_elems_to_parse | None | Reject documents above this element count. |
| nb_top_candidates | 5 | Number of high-scoring candidates to consider. |
| char_threshold | 500 | Minimum extracted text length for an accepted attempt. |
| content_selector | None | CSS selector to force as the content root. |
| site_profiles | [] | TOML site profiles for host-scoped extraction hints. |
| mobile_viewport_width | Some(480) | Width used by recovery rules for mobile snapshots. |
| classes_to_preserve | [] | Class names kept during cleanup. |
| keep_classes | false | Keep all class attributes. |
| disable_json_ld | false | Skip JSON-LD metadata extraction. |
| link_density_modifier | 0.0 | Adjust link-density cleanup tolerance. |
| media_retention | Article | Control figure/image/media retention. |
Prefer content_selector when you already know the page shape. It bypasses
root scoring for that document, then runs the normal cleanup pipeline.
When content_selector is not set, Lectito still tries a small list of common
article-body containers such as #article-body and .entry-content before
generic scoring. That catches many large publisher pages without site-specific
profiles.
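The lookup order described above can be pictured with a small decision function. This is a minimal sketch, not Lectito's implementation: RootChoice, choose_root, and COMMON_CONTAINERS are hypothetical names, the page is modeled as a plain set of selectors that would match, and only the two container selectors named above are listed. What happens when an explicit selector matches nothing is not modeled.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of root selection; not a Lectito type.
#[derive(Debug, PartialEq)]
enum RootChoice {
    /// The caller supplied content_selector, so root scoring is bypassed.
    Explicit(String),
    /// A well-known article-body container matched.
    CommonContainer(&'static str),
    /// Nothing matched, so candidates are scored generically.
    GenericScoring,
}

/// Illustrative subset only; the real list is internal to the library.
const COMMON_CONTAINERS: [&str; 2] = ["#article-body", ".entry-content"];

/// `matching` holds the selectors that would match something on the page.
fn choose_root(content_selector: Option<&str>, matching: &HashSet<&str>) -> RootChoice {
    // An explicit selector always wins in this sketch.
    if let Some(selector) = content_selector {
        return RootChoice::Explicit(selector.to_string());
    }
    // Otherwise try the common containers in order.
    for candidate in COMMON_CONTAINERS.iter().copied() {
        if matching.contains(candidate) {
            return RootChoice::CommonContainer(candidate);
        }
    }
    RootChoice::GenericScoring
}
```

In this model, a page that contains .entry-content and no explicit selector resolves to the common container, while a page with neither container reaches generic scoring.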
Use site_profiles when you want URL-scoped extraction hints, removal
selectors, and metadata hints. Profiles are attempted before generic scoring,
but weak profile output falls back to the generic extractor.
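The profile-then-fallback behavior can be sketched as follows. It is an illustration under stated assumptions rather than Lectito's code: extract_with_fallback and its two closures are hypothetical, and "weak" is assumed here to mean shorter than a character threshold, which the text above does not define.

```rust
/// Hypothetical helper: try a profile-driven extractor first and fall back
/// to generic extraction when the profile yields nothing or too little text.
fn extract_with_fallback<P, G>(profile: P, generic: G, char_threshold: usize) -> Option<String>
where
    P: FnOnce() -> Option<String>,
    G: FnOnce() -> Option<String>,
{
    match profile() {
        // Assumption: "weak" means fewer characters than the threshold.
        Some(text) if text.chars().count() >= char_threshold => Some(text),
        // Missing or weak profile output: run the generic extractor instead.
        _ => generic(),
    }
}
```

The generic closure only runs when it is needed, so a strong profile result skips generic scoring entirely in this model.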
Use max_elems_to_parse as a guardrail for untrusted input. It rejects very
large documents before extraction work continues.
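As a rough picture of that guardrail, the check amounts to comparing an element count against an optional limit before doing any further work. The names below (TooManyElements, check_element_limit) are hypothetical and only model the documented behavior; they are not part of Lectito's API.

```rust
/// Hypothetical error for an oversized document; not a Lectito type.
#[derive(Debug, PartialEq)]
struct TooManyElements {
    found: usize,
    limit: usize,
}

/// Reject a document whose element count exceeds the optional limit.
/// `None` means no limit, matching the documented default.
fn check_element_limit(
    found: usize,
    max_elems_to_parse: Option<usize>,
) -> Result<(), TooManyElements> {
    match max_elems_to_parse {
        // Only counts strictly above the limit are rejected.
        Some(limit) if found > limit => Err(TooManyElements { found, limit }),
        _ => Ok(()),
    }
}
```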
Use media_retention when output fidelity matters. Article, the default, keeps
body figures and images; None removes media; Conservative is text-first; All
keeps media that remains in the selected article subtree.
ReadableOptions controls is_probably_readable.
Lower min_content_length for short posts or documentation pages. Raise
min_score when you want the quick check to reject borderline pages.
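To see how the two thresholds interact, here is a simplified sketch modeled on the heuristic in Mozilla's Readability.js isProbablyReaderable, which adds the square root of each candidate node's text length beyond the minimum and succeeds once the running score exceeds the minimum score. Whether Lectito uses exactly this formula is an assumption. quick_check and its parameters are hypothetical, and a real check also filters candidates by tag, visibility, and class names.

```rust
/// Simplified readability check over candidate node text lengths.
/// Hypothetical model; not Lectito's is_probably_readable.
fn quick_check(node_text_lengths: &[usize], min_content_length: usize, min_score: f64) -> bool {
    let mut score = 0.0;
    for &len in node_text_lengths {
        // Nodes shorter than the minimum contribute nothing.
        if len < min_content_length {
            continue;
        }
        // Longer nodes add the square root of their excess length.
        score += ((len - min_content_length) as f64).sqrt();
        if score > min_score {
            return true;
        }
    }
    false
}
```

Under this model, lowering min_content_length lets short paragraphs start contributing to the score, and raising min_score demands more accumulated text before the page passes.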

    use lectito::ReadableOptions;

    fn main() {
        // Accept shorter pages while keeping the score cutoff at 20.
        let options = ReadableOptions {
            min_content_length: 140,
            min_score: 20.0,
        };
    }
