defuddle

Name: defuddle
Rating: 5 (8323 reviews)

Extract clean article content and rich metadata from any web page, delivered as HTML or Markdown with zero clutter.

8.3K stars

353 forks

MIT License

Defuddle is a TypeScript content extraction library designed to cut through the noise of modern web pages and surface only what matters: the article, post, or primary content. Created by Steph Ango (the CEO of Obsidian) as the engine powering the Obsidian Web Clipper browser extension, Defuddle was built to be more forgiving and consistent than Mozilla Readability: it removes fewer uncertain elements while producing a standardized, predictable output.

Unlike generic scraping tools, Defuddle applies a multi-pass pipeline: it scores DOM blocks for content vs. navigation signals, uses a page’s own mobile CSS to identify decorative elements, removes hidden and low-scoring blocks, then standardizes the surviving HTML into a consistent structure. Headings are normalized, footnotes are converted to a uniform format, math expressions (MathJax, KaTeX) are converted to standard MathML, code blocks are cleaned of syntax-highlighting artifacts, and callouts from GitHub, Bootstrap, and Obsidian Publish are unified into a single blockquote dialect.
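The scoring pass can be pictured with a small sketch. It is not Defuddle's code: the `Block` shape, the hint lists, and the weights are all assumptions chosen for illustration, and the real library works on live DOM elements rather than plain objects.

```typescript
// Hypothetical summary of a DOM block; Defuddle itself inspects real elements.
interface Block {
  tag: string;
  className: string;
  text: string;
  linkTextLength: number; // characters of text that sit inside <a> elements
}

// Illustrative indicator lists and weights, not Defuddle's real ones.
const CONTENT_HINTS = ["article", "content", "post", "entry"];
const NAV_HINTS = ["nav", "menu", "sidebar", "footer", "share", "related"];

function scoreBlock(block: Block): number {
  const cls = block.className.toLowerCase();
  const words = block.text.split(/\s+/).filter(Boolean).length;
  const linkDensity =
    block.text.length === 0 ? 0 : block.linkTextLength / block.text.length;

  let score = Math.min(words, 200) / 10; // reward running text, capped
  score += (block.text.match(/,/g) ?? []).length; // commas suggest prose
  if (CONTENT_HINTS.some((hint) => cls.includes(hint))) score += 10;
  if (NAV_HINTS.some((hint) => cls.includes(hint))) score -= 15;
  score -= linkDensity * 20; // link-heavy blocks look like navigation
  return score;
}

// Keep only blocks whose score clears the threshold.
function keepContentBlocks(blocks: Block[], threshold = 0): Block[] {
  return blocks.filter((block) => scoreBlock(block) > threshold);
}

const blocks: Block[] = [
  { tag: "div", className: "site-nav menu", text: "Home About Contact", linkTextLength: 18 },
  {
    tag: "div",
    className: "post-content",
    text: "Defuddle scores each block, keeps prose, and drops navigation.",
    linkTextLength: 0,
  },
];
const kept = keepContentBlocks(blocks);
```

In this sample the link-only navigation block scores well below zero and is dropped, while the prose block survives.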

Defuddle ships as three distinct bundles — a zero-dependency browser core, a full bundle with math fallback libraries, and a Node.js bundle compatible with linkedom, JSDOM, or happy-dom — so it integrates cleanly into browser extensions, server-side pipelines, and CLI workflows. A parseAsync() path handles client-side-rendered sites by falling back to third-party APIs (e.g., FxTwitter for Twitter/X content), and 27 site-specific extractors ensure platforms like YouTube, Reddit, Medium, Substack, Wikipedia, and LinkedIn are handled with precision beyond generic DOM scoring.

The CLI (npx defuddle parse) accepts a URL, a local HTML file, or stdin, and can output raw HTML, Markdown, JSON with metadata, or YAML frontmatter, which makes it composable in shell pipelines and scripting workflows where a full browser runtime is unavailable.
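To make the frontmatter output mode concrete, here is a minimal sketch of how metadata and a Markdown body might be combined into one document. The `ArticleMeta` field names and the `toFrontmatterDocument` helper are hypothetical; the actual CLI may name and order its fields differently.

```typescript
// Hypothetical subset of metadata fields; the names are chosen for illustration.
interface ArticleMeta {
  title: string;
  author?: string;
  published?: string;
  domain?: string;
  wordCount?: number;
}

// JSON string quoting is valid YAML, so colons or hashes in a title stay safe.
function yamlValue(value: string | number): string {
  return typeof value === "number" ? String(value) : JSON.stringify(value);
}

// Emit a YAML frontmatter block followed by the Markdown body.
function toFrontmatterDocument(meta: ArticleMeta, markdown: string): string {
  const lines = Object.entries(meta)
    .filter(([, value]) => value !== undefined && value !== "")
    .map(([key, value]) => `${key}: ${yamlValue(value as string | number)}`);
  return ["---", ...lines, "---", "", markdown.trim(), ""].join("\n");
}

const doc = toFrontmatterDocument(
  { title: "Notes: a primer", author: "Ada", wordCount: 3 },
  "# Hello\n\nBody text.\n",
);
```

Quoting every string value is a conservative choice: it costs a few characters but avoids having to reason about which titles would otherwise be misread by a YAML parser.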

What You Get

Three distributable bundles: a zero-dependency browser core (defuddle), a full bundle with math conversion libraries (defuddle/full), and a Node.js bundle (defuddle/node) compatible with linkedom, JSDOM, and happy-dom
A multi-pass content extraction pipeline that scores, filters hidden elements, removes low-quality blocks, and standardizes headings, footnotes, code blocks, math expressions, and callout elements
27 site-specific extractors for YouTube, Reddit, Medium, Substack, Wikipedia, LinkedIn, Twitter/X, Bluesky, Mastodon, Hacker News, GitHub, and more, each overriding generic scoring for its platform’s unique DOM structure
Rich metadata extraction including author, publication date, domain, favicon, main image, word count, language (BCP 47), and full schema.org data
A CLI tool (npx defuddle parse) accepting URLs, local files, or stdin — with output modes for HTML, Markdown, JSON, and YAML frontmatter
A parseAsync() path with configurable third-party API fallbacks for client-side-rendered pages where server HTML contains no usable content
Debug mode returning the exact CSS selector chosen as the content root and a detailed log of every removed element with its removal reason
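The callout standardization mentioned in the list above can be sketched as follows. This assumes the unified target is the Obsidian-style `> [!type]` blockquote syntax and uses the class names from GitHub Markdown alerts and Bootstrap alerts; the mapping table and the `toCallout` helper are illustrative and not taken from Defuddle's source.

```typescript
// Illustrative mapping from source class names to one callout type.
const CALLOUT_CLASSES = new Map<string, string>([
  ["markdown-alert-note", "note"], // GitHub Markdown alerts
  ["markdown-alert-tip", "tip"],
  ["markdown-alert-important", "important"],
  ["markdown-alert-warning", "warning"],
  ["markdown-alert-caution", "caution"],
  ["alert-info", "info"], // Bootstrap alerts
  ["alert-success", "success"],
  ["alert-warning", "warning"],
  ["alert-danger", "danger"],
]);

// Return the callout type for the first recognized class, if any.
function calloutType(className: string): string | undefined {
  for (const token of className.split(/\s+/)) {
    const type = CALLOUT_CLASSES.get(token);
    if (type !== undefined) return type;
  }
  return undefined;
}

// Render a recognized callout as an Obsidian-style blockquote.
function toCallout(className: string, title: string, body: string): string | undefined {
  const type = calloutType(className);
  if (type === undefined) return undefined;
  const header = title ? `> [!${type}] ${title}` : `> [!${type}]`;
  const quoted = body
    .trim()
    .split("\n")
    .map((line) => (line ? `> ${line}` : ">"));
  return [header, ...quoted].join("\n");
}

const githubAlert = toCallout(
  "markdown-alert markdown-alert-warning",
  "Warning",
  "Back up first.\n\nThen migrate.",
);
const bootstrapAlert = toCallout("alert alert-danger", "", "Stop.");
const notACallout = toCallout("card", "Title", "Plain card.");
```

Elements without a recognized class return `undefined`, so a caller can leave them untouched rather than guess.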

Common Use Cases

Browser extension content clipping — powering the Obsidian Web Clipper to save clean Markdown versions of articles directly to a personal knowledge base
Server-side web archiving — feeding URLs into a Node.js pipeline that extracts structured article data and stores it in a database or file system
CLI-based read-later workflows — piping curl output through defuddle parse --markdown to produce clean Markdown that feeds into note-taking tools
LLM pre-processing pipelines — stripping boilerplate from crawled pages before passing content to a language model for summarization or extraction
Research scraping tools — using site-specific extractors to reliably pull article bodies from paywalled or dynamic platforms like Medium, Substack, and LinkedIn
Content migration utilities — converting legacy HTML archives to Markdown with consistent heading hierarchy, footnote formatting, and code block annotations

Under The Hood

Architecture

Defuddle follows a layered, pipeline-oriented architecture centered on a single Defuddle class in src/defuddle.ts that orchestrates a sequenced series of passes over a DOM Document. The pipeline begins with attribute normalization and schema.org extraction (lazy-cached before script tags are stripped), then delegates to the ExtractorRegistry which evaluates each of the 27 site-specific extractors in src/extractors/ against the current URL. If a site extractor matches, it short-circuits the generic scoring path; otherwise, the main ContentScorer in src/removals/scoring.ts evaluates every block-level element against weighted content and navigation indicator lists. Removals cascade through dedicated modules: selectors.ts for exact and partial CSS selector lists, hidden.ts for display-none and visibility-hidden elements (with mobile CSS simulation), small-images.ts for tracking pixels and icon images, and metadata-block.ts for header/byline blocks. Surviving content is then passed through standardize.ts, which applies an ordered rule set (mathRules, codeBlockRules, headingRules, imageRules) plus standalone normalizers for footnotes (elements/footnotes.ts) and callouts (elements/callouts.ts). The separation between scoring/removal and standardization is clean and each concern lives in its own module, making the pipeline easy to extend at any stage.
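The short-circuit between site extractors and generic scoring can be illustrated with a small dispatch sketch. The `ExtractorRegistrySketch` class, the `SiteExtractor` shape, and the example URLs are assumptions made for this illustration; Defuddle's actual ExtractorRegistry may match and construct extractors differently.

```typescript
interface Extracted {
  content: string;
  extractor: string; // which path produced the result
}

// Hypothetical minimal extractor shape for this sketch.
interface SiteExtractor {
  name: string;
  patterns: RegExp[]; // tested against the full URL
  extract(html: string): Extracted;
}

class ExtractorRegistrySketch {
  private readonly extractors: SiteExtractor[] = [];

  register(extractor: SiteExtractor): this {
    this.extractors.push(extractor);
    return this;
  }

  // First registered extractor with a matching pattern wins.
  find(url: string): SiteExtractor | undefined {
    return this.extractors.find((e) => e.patterns.some((p) => p.test(url)));
  }
}

// A matching site extractor short-circuits the generic path.
function extractPage(
  url: string,
  html: string,
  registry: ExtractorRegistrySketch,
  generic: (html: string) => Extracted,
): Extracted {
  const site = registry.find(url);
  return site ? site.extract(html) : generic(html);
}

const registry = new ExtractorRegistrySketch().register({
  name: "video-site",
  patterns: [/^https?:\/\/(www\.)?video\.example\/watch/],
  extract: (html) => ({ content: html, extractor: "video-site" }),
});
const genericScoring = (html: string): Extracted => ({ content: html, extractor: "generic" });

const matched = extractPage("https://video.example/watch?v=1", "<p>clip</p>", registry, genericScoring);
const fallback = extractPage("https://blog.example/post", "<p>post</p>", registry, genericScoring);
```

Keeping the generic path as a plain function argument means the fallback stays testable on its own, independent of which site extractors are registered.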

Tech Stack

Defuddle is written entirely in TypeScript 5.3 with strict mode enabled and targets ES6 output via three separate tsconfig files for browser, declarations, and Node.js builds. The browser bundles are produced by Webpack 5 using ts-loader and TerserPlugin, while the Node.js bundle uses tsc directly. The single runtime production dependency is commander (^12.1.0) for CLI argument parsing. Optional dependencies include linkedom (^0.18.12) as the preferred DOM implementation for Node.js, turndown (^7.2.0) for HTML-to-Markdown conversion, and mathml-to-latex plus temml for math conversion in the full bundle. Testing runs on Vitest 3.x with fixtures sourced from real-world HTML snapshots in tests/fixtures/, and the test suite spans 15 dedicated test files covering CLI behavior, Markdown output, media removal, schema.org fallback, SVG sanitization, async conversation parsing, and platform-specific extractors for YouTube, Reddit, Bilibili, and Twitter/X.

Code Quality

The codebase demonstrates strong TypeScript discipline: all public APIs are fully typed through interfaces in src/types/ and src/types/extractors.ts, the BaseExtractor abstract class enforces a consistent canExtract() / extract() / canExtractAsync() contract across all 27 extractors, and strict mode is enforced across all tsconfig targets. Error handling is explicit: DOM operations are wrapped in try/catch blocks with console.warn fallbacks, URL parsing failures degrade gracefully, and the async extraction path is guarded by canExtractAsync() before any network call is made. The test suite is fixture-based: real HTML snapshots from target platforms are stored in tests/fixtures/ and compared against tests/expected/ outputs, so regressions are caught at the HTML level rather than only by unit-testing internals. A custom ESLint script (check-no-innerhtml.mjs) enforces that innerHTML is never used directly in production source, a meaningful security-conscious constraint given that the library processes untrusted web content.
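The extractor contract and the async guard can be sketched as below. Only the method names canExtract(), extract(), and canExtractAsync() come from the description above; the `extractAsync()` method, the constructor arguments, the injected `Fetcher`, and the `ThreadExtractor` example are assumptions added for illustration.

```typescript
interface ExtractResult {
  contentHtml: string;
}

// Injected network function, so the sketch never touches a real network.
type Fetcher = (url: string) => Promise<string>;

abstract class BaseExtractorSketch {
  protected readonly url: string;
  protected readonly html: string;

  constructor(url: string, html: string) {
    this.url = url;
    this.html = html;
  }

  abstract canExtract(): boolean;
  abstract extract(): ExtractResult;

  // By default an extractor has no async path.
  canExtractAsync(): boolean {
    return false;
  }

  // Hypothetical async counterpart; subclasses override it when supported.
  async extractAsync(): Promise<ExtractResult> {
    throw new Error("async extraction is not supported by this extractor");
  }
}

// Hypothetical extractor for a client-side-rendered site.
class ThreadExtractor extends BaseExtractorSketch {
  private readonly fetcher: Fetcher;

  constructor(url: string, html: string, fetcher: Fetcher) {
    super(url, html);
    this.fetcher = fetcher;
  }

  canExtract(): boolean {
    return this.html.includes("<article");
  }

  extract(): ExtractResult {
    return { contentHtml: this.html };
  }

  canExtractAsync(): boolean {
    return this.url.includes("social.example/");
  }

  async extractAsync(): Promise<ExtractResult> {
    const body = await this.fetcher(this.url);
    return { contentHtml: `<article>${body}</article>` };
  }
}

// Prefer the synchronous path; reach the network only after the guard passes.
async function runExtractor(extractor: BaseExtractorSketch): Promise<ExtractResult | null> {
  if (extractor.canExtract()) return extractor.extract();
  if (extractor.canExtractAsync()) return extractor.extractAsync();
  return null;
}
```

Because both guards are synchronous, a page with usable server HTML, or a URL the extractor does not recognize, never triggers a fetch.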

What Makes It Unique

Defuddle’s most distinctive technical choices separate it clearly from Mozilla Readability and generic scraping tools. The use of a page’s own mobile CSS as a removal heuristic is particularly clever: rather than maintaining a static blocklist of CSS class names, Defuddle simulates a narrow viewport and reads the site’s own responsive design intent to identify what the site itself considers non-essential at small widths. The HTML standardization layer is unusually deep: while most content extractors stop at selecting the content node, Defuddle normalizes math equations (MathJax, KaTeX, raw LaTeX delimiters) to standard MathML, converts four callout dialects to a unified Obsidian-compatible format, and strips syntax-highlighting markup from code blocks while preserving language annotations, which makes the output directly suitable for downstream tools like Markdown converters and knowledge base applications without further post-processing. The per-platform extractor model for 27 major sites, including async transcript fetching for YouTube and Bilibili, addresses the fundamental limitation of pure DOM scoring on dynamic or opaque platforms.
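The mobile CSS heuristic can be sketched on pre-parsed rules. This is an assumption-laden illustration: the `MediaRule` shape, the 600px default viewport, and the `selectorsHiddenOnMobile` helper are hypothetical, and a real implementation would read media rules from the page's stylesheets instead of a hand-built array.

```typescript
// Hypothetical, pre-parsed view of one rule inside `@media (max-width: ...)`.
interface MediaRule {
  maxWidthPx: number;
  selector: string;
  declarations: Record<string, string>;
}

// Collect selectors the site itself hides when the viewport is narrow.
function selectorsHiddenOnMobile(rules: MediaRule[], viewportPx = 600): string[] {
  const hidden = new Set<string>();
  for (const rule of rules) {
    const applies = viewportPx <= rule.maxWidthPx; // rule is active at this width
    const hides =
      rule.declarations["display"] === "none" ||
      rule.declarations["visibility"] === "hidden";
    if (applies && hides) hidden.add(rule.selector);
  }
  return Array.from(hidden);
}

const rules: MediaRule[] = [
  { maxWidthPx: 768, selector: ".sidebar", declarations: { display: "none" } },
  { maxWidthPx: 480, selector: ".hero-video", declarations: { display: "none" } },
  { maxWidthPx: 768, selector: ".article", declarations: { "font-size": "18px" } },
];
const hiddenOnMobile = selectorsHiddenOnMobile(rules);
```

In this sample only `.sidebar` is reported: `.hero-video` is hidden only below 480px, which the simulated 600px viewport does not reach, and `.article` is merely restyled.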