Extract clean article content and rich metadata from any web page, delivered as HTML or Markdown with zero clutter.
Defuddle is a TypeScript content extraction library designed to cut through the noise of modern web pages and surface only what matters: the article, post, or primary content. Created by Steph Ango (the CEO of Obsidian) as the engine powering the Obsidian Web Clipper browser extension, Defuddle was built to be more forgiving and consistent than Mozilla Readability, removing fewer uncertain elements while producing a standardized, predictable output.
Unlike generic scraping tools, Defuddle applies a multi-pass pipeline: it scores DOM blocks for content vs. navigation signals, uses a page’s own mobile CSS to identify decorative elements, removes hidden and low-scoring blocks, then standardizes the surviving HTML into a consistent structure. Headings are normalized, footnotes are converted to a uniform format, math expressions (MathJax, KaTeX) are converted to standard MathML, code blocks are cleaned of syntax-highlighting artifacts, and callouts from GitHub, Bootstrap, and Obsidian Publish are unified into a single blockquote dialect.
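To make the scoring pass concrete, here is a minimal sketch of content-versus-navigation scoring over simplified block records. The `Block` shape, the `NAV_HINTS` list, and the weights are assumptions chosen for illustration; they are not Defuddle's actual indicator lists or numbers.

```typescript
// Simplified stand-in for a DOM block (hypothetical shape, not Defuddle's types).
interface Block {
  text: string;      // all visible text in the block
  linkText: string;  // the portion of that text that sits inside links
  className: string; // class attribute of the block element
}

// Hypothetical navigation hints; a real scorer would use much longer lists.
const NAV_HINTS = ["nav", "menu", "sidebar", "footer", "share"];

function scoreBlock(block: Block): number {
  const words = block.text.split(/\s+/).filter(Boolean).length;
  const commas = (block.text.match(/,/g) ?? []).length;

  // Prose tends to have many words and commas.
  let score = words + commas;

  // Blocks that are mostly link text look like navigation, so scale them down.
  const linkDensity =
    block.text.length > 0 ? block.linkText.length / block.text.length : 0;
  score *= 1 - Math.min(linkDensity, 1);

  // Class names that hint at navigation incur a flat penalty.
  const cls = block.className.toLowerCase();
  if (NAV_HINTS.some((hint) => cls.includes(hint))) {
    score -= 10;
  }
  return score;
}

// Keep only blocks that score above a threshold (threshold is an assumption).
function keepContentBlocks(blocks: Block[], threshold = 0): Block[] {
  return blocks.filter((block) => scoreBlock(block) > threshold);
}
```

In this sketch a paragraph of prose with no links scores positive, while a short block made entirely of links with a `site-nav` class scores negative and is dropped.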
Defuddle ships as three distinct bundles — a zero-dependency browser core, a full bundle with math fallback libraries, and a Node.js bundle compatible with linkedom, JSDOM, or happy-dom — so it integrates cleanly into browser extensions, server-side pipelines, and CLI workflows. A parseAsync() path handles client-side-rendered sites by falling back to third-party APIs (e.g., FxTwitter for Twitter/X content), and 27 site-specific extractors ensure platforms like YouTube, Reddit, Medium, Substack, Wikipedia, and LinkedIn are handled with precision beyond generic DOM scoring.
The CLI interface (npx defuddle parse) accepts a URL, a local HTML file, or stdin, and can output raw HTML, Markdown, JSON with metadata, or YAML frontmatter — making it composable in shell pipelines and scripting workflows where a full browser runtime is unavailable.
- A core browser bundle (defuddle), a full bundle with math conversion libraries (defuddle/full), and a Node.js bundle (defuddle/node) compatible with linkedom, JSDOM, and happy-dom
- A CLI (npx defuddle parse) accepting URLs, local files, or stdin, with output modes for HTML, Markdown, JSON, and YAML frontmatter
- A parseAsync() path with configurable third-party API fallbacks for client-side-rendered pages where server HTML contains no usable content
- curl output piped through defuddle parse --markdown to produce clean Markdown that feeds into note-taking tools
Architecture
Defuddle follows a layered, pipeline-oriented architecture centered on a single Defuddle class in src/defuddle.ts that orchestrates a sequenced series of passes over a DOM Document. The pipeline begins with attribute normalization and schema.org extraction (lazy-cached before script tags are stripped), then delegates to the ExtractorRegistry which evaluates each of the 27 site-specific extractors in src/extractors/ against the current URL. If a site extractor matches, it short-circuits the generic scoring path; otherwise, the main ContentScorer in src/removals/scoring.ts evaluates every block-level element against weighted content and navigation indicator lists. Removals cascade through dedicated modules: selectors.ts for exact and partial CSS selector lists, hidden.ts for display-none and visibility-hidden elements (with mobile CSS simulation), small-images.ts for tracking pixels and icon images, and metadata-block.ts for header/byline blocks. Surviving content is then passed through standardize.ts, which applies an ordered rule set (mathRules, codeBlockRules, headingRules, imageRules) plus standalone normalizers for footnotes (elements/footnotes.ts) and callouts (elements/callouts.ts). The separation between scoring/removal and standardization is clean and each concern lives in its own module, making the pipeline easy to extend at any stage.
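The short-circuit between site-specific extractors and generic scoring can be sketched as follows. This is an illustration under stated assumptions, not the library's source: `SiteExtractor`, `VideoSiteExtractor`, `findExtractor`, `extractContent`, and the `video.example` URL pattern are all hypothetical names, and the HTML is handled as a plain string rather than a DOM Document to keep the example self-contained.

```typescript
// Hypothetical result shape; the real library returns richer metadata.
interface ExtractResult {
  content: string;
  source: string; // which path produced the content
}

// Hypothetical stand-in for the extractor contract described above.
abstract class SiteExtractor {
  constructor(
    protected readonly url: string,
    protected readonly html: string,
  ) {}
  abstract canExtract(): boolean;
  abstract extract(): ExtractResult;
}

// Example extractor for an imaginary video site.
class VideoSiteExtractor extends SiteExtractor {
  canExtract(): boolean {
    return this.html.includes('id="description"');
  }
  extract(): ExtractResult {
    const match = /<div id="description">([\s\S]*?)<\/div>/.exec(this.html);
    return { content: match ? match[1].trim() : "", source: "video-site" };
  }
}

type ExtractorCtor = new (url: string, html: string) => SiteExtractor;

// URL patterns mapped to extractor classes (one entry here, for brevity).
const registry: Array<{ pattern: RegExp; ctor: ExtractorCtor }> = [
  { pattern: /^https?:\/\/video\.example\//, ctor: VideoSiteExtractor },
];

// Return the first extractor whose URL pattern matches and which can extract.
function findExtractor(url: string, html: string): SiteExtractor | null {
  for (const { pattern, ctor } of registry) {
    if (!pattern.test(url)) continue;
    const extractor = new ctor(url, html);
    if (extractor.canExtract()) return extractor;
  }
  return null;
}

// A matching site extractor wins; otherwise fall back to generic scoring.
function extractContent(
  url: string,
  html: string,
  genericScoring: (html: string) => string,
): ExtractResult {
  const extractor = findExtractor(url, html);
  if (extractor) return extractor.extract();
  return { content: genericScoring(html), source: "generic" };
}
```

Adding support for another site in this sketch means adding one class and one registry entry, which mirrors the extensibility argument made above.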
Tech Stack
Defuddle is written entirely in TypeScript 5.3 with strict mode enabled and targets ES6 output via three separate tsconfig files for browser, declarations, and Node.js builds. The browser bundles are produced by Webpack 5 using ts-loader and TerserPlugin, while the Node.js bundle uses tsc directly. The single runtime production dependency is commander (^12.1.0) for the CLI argument parser. Optional dependencies include linkedom (^0.18.12) as the preferred DOM implementation for Node.js, turndown (^7.2.0) for HTML-to-Markdown conversion, and mathml-to-latex plus temml for math conversion in the full bundle. Testing runs on Vitest 3.x with fixtures sourced from real-world HTML snapshots in tests/fixtures/, and the test suite spans 15 dedicated test files covering CLI behavior, Markdown output, media removal, schema.org fallback, SVG sanitization, async conversation parsing, and platform-specific extractors for YouTube, Reddit, Bilibili, and Twitter/X.
Code Quality
The codebase demonstrates strong TypeScript discipline — all public APIs are fully typed through interfaces in src/types/ and src/types/extractors.ts, the BaseExtractor abstract class enforces a consistent canExtract() / extract() / canExtractAsync() contract across all 27 extractors, and strict mode is enforced across all tsconfig targets. Error handling is explicit: DOM operations are wrapped in try/catch blocks with console.warn fallbacks, URL parsing failures degrade gracefully, and the async extraction path is guarded by canExtractAsync() before any network call is made. The test suite is comprehensive with extensive fixture-based coverage — real HTML snapshots from target platforms are stored in tests/fixtures/ and compared against tests/expected/ outputs, ensuring regressions are caught at the HTML level rather than just unit-testing internals. A custom ESLint script (check-no-innerhtml.mjs) enforces that innerHTML is never used directly in production source, a meaningful security-conscious constraint given the library processes untrusted web content.
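A guard of the innerHTML kind can be approximated with a simple line scanner. The sketch below is an assumption about how such a check might work, not the contents of check-no-innerhtml.mjs; `findInnerHtmlUse` and the `Violation` shape are hypothetical, and the comment handling is deliberately naive.

```typescript
// One reported use of innerHTML (hypothetical shape).
interface Violation {
  line: number; // 1-based line number
  text: string; // the offending line, trimmed
}

// Scan source text line by line and report property accesses to innerHTML.
function findInnerHtmlUse(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    // Naive: drop everything after "//" so commented-out code is ignored.
    const code = text.replace(/\/\/.*$/, "");
    if (/\.innerHTML\b/.test(code)) {
      violations.push({ line: index + 1, text: text.trim() });
    }
  });
  return violations;
}
```

A build step could fail whenever the returned list is non-empty. A production check would more likely work on a parsed syntax tree, since a line scanner cannot see block comments or string literals reliably.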
What Makes It Unique
Defuddle’s most distinctive technical choices separate it clearly from Mozilla Readability and generic scraping tools. The use of a page’s own mobile CSS as a removal heuristic is particularly clever: rather than maintaining a static blocklist of CSS class names, Defuddle simulates a narrow viewport and reads the site’s own responsive design intent to identify what the site itself considers non-essential at small widths. The HTML standardization layer is unusually deep. While most content extractors stop at selecting the content node, Defuddle normalizes math equations (MathJax, KaTeX, raw LaTeX delimiters) to standard MathML, converts 4 callout dialects to a unified Obsidian-compatible format, and strips syntax-highlighting markup from code blocks while preserving language annotations, making the output directly suitable for downstream tools like Markdown converters and knowledge base applications without further post-processing. The per-platform extractor model for 27 major sites, including async transcript fetching for YouTube and Bilibili, addresses the fundamental limitation of pure DOM scoring on dynamic or opaque platforms.
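The mobile CSS heuristic can be illustrated with a small string-based sketch. It assumes the stylesheet is available as raw text and only looks at `max-width` media queries that hide elements with `display: none`; the function name `selectorsHiddenOnMobile` and the 600px default width are assumptions, and the library itself may read rules from the browser's stylesheet objects instead.

```typescript
// Collect selectors that a stylesheet hides at a narrow viewport width.
function selectorsHiddenOnMobile(css: string, mobileWidth = 600): string[] {
  const hidden: string[] = [];
  const mediaStart = /@media([^{]*)\{/g;
  let media: RegExpExecArray | null;

  while ((media = mediaStart.exec(css)) !== null) {
    const maxWidth = /max-width\s*:\s*(\d+)px/.exec(media[1]);

    // Walk forward to the brace that closes this @media block.
    const bodyStart = mediaStart.lastIndex;
    let depth = 1;
    let i = bodyStart;
    while (i < css.length && depth > 0) {
      if (css[i] === "{") depth++;
      else if (css[i] === "}") depth--;
      i++;
    }
    const body = css.slice(bodyStart, i - 1);
    mediaStart.lastIndex = i; // continue scanning after this block

    // The query applies only if the simulated viewport fits under max-width.
    if (!maxWidth || Number(maxWidth[1]) < mobileWidth) continue;

    // Inside the block, find rules whose declarations hide the element.
    const rule = /([^{}]+)\{([^{}]*)\}/g;
    let match: RegExpExecArray | null;
    while ((match = rule.exec(body)) !== null) {
      if (/display\s*:\s*none/.test(match[2])) {
        for (const selector of match[1].split(",")) {
          hidden.push(selector.trim());
        }
      }
    }
  }
  return hidden;
}
```

The returned selectors would then be candidates for removal from the desktop DOM, on the reasoning that whatever the site hides on small screens is unlikely to be primary content.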
Licensing Model
MIT licensed: all features are available in self-hosted deployments with no restrictions or license keys required.