A small and fast library for extracting content from HTML.
It is an implementation of the paper "DOM Based Content Extraction via Text Density".
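To give a rough idea of the text density measure the paper is built on, here is a minimal sketch. It assumes the basic definition (characters of text in a subtree divided by the number of tags in that subtree, with a divisor of 1 when there are no tags). The `DensityNode` type and the function names are hypothetical and exist only for illustration; this is not the library's internal code, and the paper's refinements are left out.

```typescript
// Hypothetical minimal tree type for illustration only;
// this is NOT the library's internal representation.
interface DensityNode {
  tag: string;
  text?: string; // text directly owned by this node
  children?: DensityNode[];
}

// Total number of text characters in the subtree rooted at `node`.
function countChars(node: DensityNode): number {
  const own = node.text?.length ?? 0;
  return (node.children ?? []).reduce(
    (sum, child) => sum + countChars(child),
    own,
  );
}

// Number of descendant tags under `node`.
function countTags(node: DensityNode): number {
  return (node.children ?? []).reduce(
    (sum, child) => sum + 1 + countTags(child),
    0,
  );
}

// Text density: characters per tag. A node without descendant tags
// uses a divisor of 1 to avoid dividing by zero (assumed convention).
function textDensity(node: DensityNode): number {
  const tags = countTags(node);
  return countChars(node) / (tags === 0 ? 1 : tags);
}

const article: DensityNode = {
  tag: "article",
  children: [
    { tag: "p", text: "Hello world" },
    { tag: "p", text: "Second paragraph here" },
  ],
};

const nav: DensityNode = {
  tag: "nav",
  children: [
    { tag: "a", text: "Home" },
    { tag: "a", text: "About" },
    { tag: "a", text: "Blog" },
    { tag: "a", text: "Docs" },
  ],
};

console.log("article density", textDensity(article));
console.log("nav density", textDensity(nav));
```

In this toy input the article subtree scores higher than the navigation subtree, which is the intuition behind treating dense regions as content.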
To install via NPM:
npm i @wrtnlabs/web-content-extractor

import { extractContent } from "@wrtnlabs/web-content-extractor";
const { title, description, content, contentHtmls, links } =
  extractContent(html);
console.log("title", title);
console.log("description", description);
console.log("content", content); // The content of the page; string
for (const fragment of contentHtmls) {
  console.log("fragment", fragment); // The fragment of the content; string
}
for (const link of links) {
  console.log("url", link.url); // The URL of the link
  console.log("content", link.content); // The content of the link
}

It strips some tags that are considered non-content, including:
script, noscript, style, nav, header, footer, img, svg, video, audio, form, label, input, select, option, button, object, embed, iframe, canvas, map, area
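The stripping step can be pictured with a small sketch over a plain object tree. The tag names come from the list above; the `TreeNode` type and the `stripNonContent` function are hypothetical names used for illustration, and the library's actual implementation may work differently.

```typescript
// Hypothetical minimal tree type for illustration only.
interface TreeNode {
  tag: string;
  text?: string;
  children?: TreeNode[];
}

// Tags treated as non-content, taken from the list above.
const NON_CONTENT_TAGS = new Set([
  "script", "noscript", "style", "nav", "header", "footer",
  "img", "svg", "video", "audio", "form", "label", "input",
  "select", "option", "button", "object", "embed", "iframe",
  "canvas", "map", "area",
]);

// Returns a copy of the tree with every non-content subtree removed.
// The root itself is always kept; only descendants are filtered.
function stripNonContent(node: TreeNode): TreeNode {
  const children = (node.children ?? [])
    .filter((child) => !NON_CONTENT_TAGS.has(child.tag.toLowerCase()))
    .map(stripNonContent);
  return { ...node, children };
}

const page: TreeNode = {
  tag: "body",
  children: [
    { tag: "nav", children: [{ tag: "a", text: "Home" }] },
    { tag: "script", text: "console.log('tracking')" },
    {
      tag: "main",
      children: [{ tag: "p", text: "Actual article text." }, { tag: "img" }],
    },
  ],
};

const cleaned = stripNonContent(page);
console.log(JSON.stringify(cleaned, null, 2));
```

Removing whole subtrees (rather than only the tag itself) means text nested inside a `nav` or `script` element never reaches the extracted content.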