liteparse

Rating: 5 (11361 reviews)

A fast, lightweight, open-source document parser that extracts spatial text, bounding boxes, and Markdown from PDFs and Office files — entirely on your machine.

11.4K stars

760 forks

Apache License 2.0

Rust


LiteParse is a standalone open-source document parsing library built in Rust by the LlamaIndex team. It extracts text from PDFs with precise spatial positioning and bounding boxes, runs OCR on image-based pages, reconstructs spatial layouts into readable Markdown, and generates high-quality page screenshots — all without requiring cloud APIs or proprietary dependencies.

Unlike heavier parsing platforms, LiteParse deliberately keeps its scope narrow: fast, local, and dependency-light. It uses PDFium for native PDF text extraction and bundles Tesseract for zero-configuration OCR, but also supports plugging in any HTTP-based OCR server such as EasyOCR or PaddleOCR when higher accuracy is needed for complex documents.

The library ships with first-class bindings for Python, Node.js/TypeScript, WebAssembly (browser), and a CLI tool called lit — all sharing the same underlying Rust core. This means developers can use LiteParse in a Python RAG pipeline, a Node.js backend service, or a client-side browser app without changing the parsing logic.

For documents that exceed local parsing capability — dense tables, multi-column layouts, handwritten text, or scanned PDFs with complex structure — LiteParse is designed as a complementary open-source tier to the cloud-based LlamaParse service, giving teams a clear upgrade path without vendor lock-in at the lower tier.

What You Get

A lit CLI tool installable via npm, pip, or cargo that parses PDFs, Office files, and images into text, JSON, or Markdown from the command line
Spatial text extraction with precise bounding boxes for every text item, preserving reading order and layout across complex page structures
Built-in Tesseract OCR bundled with the library — no separate installation needed — with support for HTTP-based OCR servers (EasyOCR, PaddleOCR, custom) for higher accuracy
Markdown output that reconstructs headings, tables, lists, images, and hyperlinks from spatial layout — formatted for direct ingestion into LLMs and RAG systems
Page screenshot generation as PNG files at configurable DPI, useful for visual document inspection or feeding images to multimodal LLM agents
Language bindings for Python (via PyO3), Node.js/TypeScript (via napi-rs), and browser environments (via WASM/wasm-bindgen) — all sharing the same Rust core
Batch parsing mode that processes entire directories of documents concurrently across CPU cores
Support for PDF, DOCX, XLSX, PPTX, and image input formats through automatic LibreOffice/ImageMagick conversion
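As a rough illustration of the batch mode described in the list above, the sketch below walks a directory, keeps files with supported extensions, and hands them to a worker pool. It is not LiteParse's API: `find_documents`, `parse_one`, and `parse_directory` are hypothetical names, `parse_one` is a placeholder that only reports file size, and the image extensions in `SUPPORTED` are an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Office/PDF extensions come from the format list above.
# The image extensions are an assumption for illustration.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".jpeg"}


def find_documents(root):
    """Return supported files under root, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def parse_one(path):
    """Hypothetical stand-in for a real parse call; reports file size only."""
    return {"path": str(path), "bytes": path.stat().st_size}


def parse_directory(root, workers=None):
    """Run parse_one over every supported file using a pool of workers."""
    docs = find_documents(root)
    # Fall back to the executor's default sizing if cpu_count() is unknown.
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        # map() preserves input order, so results line up with docs.
        return list(pool.map(parse_one, docs))
```

In a real pipeline the placeholder would be swapped for an actual parse call; the directory walk and the pool would stay the same.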

Common Use Cases

Ingesting large PDF corpora into RAG pipelines and vector databases where clean, structured Markdown output reduces chunking noise
Building document Q&A applications that need bounding box metadata to highlight source locations back in the original PDF
Processing mixed document batches (PDFs, Word files, spreadsheets) through a unified CLI or API in automated data pipelines
Running privacy-sensitive document parsing on-premises where sending documents to external APIs is not permitted
Combining local text extraction with LLM-agent screenshot workflows, where LiteParse generates page images for multimodal reasoning on charts and figures
Integrating document parsing into browser-based tools via the WASM build, enabling client-side PDF extraction without server round-trips
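For the highlighting use case above, a bounding box usually has to be mapped from page coordinates to pixel coordinates on a rendered screenshot. The sketch below assumes boxes are expressed in PDF points (1/72 inch) with a top-left origin; both the `BBox` shape and that coordinate convention are assumptions for illustration, not LiteParse's documented output format.

```python
import math
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user-space unit


@dataclass(frozen=True)
class BBox:
    """Hypothetical box in points; origin assumed at the top-left of the page."""
    x: float
    y: float
    width: float
    height: float


def to_pixels(box, dpi):
    """Map a box in points to (left, top, right, bottom) pixels at the given DPI.

    Left/top are floored and right/bottom are ceiled so the pixel rectangle
    always covers the original box.
    """
    def scale(value):
        return value * dpi / POINTS_PER_INCH

    left = math.floor(scale(box.x))
    top = math.floor(scale(box.y))
    right = math.ceil(scale(box.x + box.width))
    bottom = math.ceil(scale(box.y + box.height))
    return (left, top, right, bottom)
```

If the real output uses a bottom-left origin, as native PDF coordinates do, the y values would need to be flipped against the page height before scaling.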

Under The Hood

Architecture

LiteParse follows a clean layered pipeline architecture where each stage has a single responsibility and hands off well-typed data to the next. The Rust core defines the full pipeline: format conversion (LibreOffice/ImageMagick for non-PDF inputs), PDF text extraction via PDFium FFI, selective OCR rendering and merging, grid projection for spatial layout reconstruction, and output formatting. A global PDFium lock serializes the FFI-unsafe extraction stage, while the OCR pass and grid projection, which dominate runtime for OCR-heavy documents, run fully concurrently outside the lock. This design makes LiteParse structs Send + Sync and safe to share across async tasks, enabling genuine parallelism in multi-threaded tokio runtimes while respecting PDFium’s threading constraints. Language bindings for Node.js, Python, and WASM re-export the same core types with idiomatic wrappers rather than reimplementing logic.

Tech Stack

The core is written in Rust (2024 edition), using tokio for async execution and serde/serde_json for structured output. PDF rendering and text extraction rely on a custom PDFium wrapper crate that bundles Google's PDFium library (from the Chromium project) behind a Rust FFI layer. OCR is handled through an abstract OcrEngine trait with two built-in implementations: a bundled Tesseract engine via the tesseract-rs crate (compiled as an optional feature), and a lightweight HTTP engine that sends page images as multipart POST requests to any conforming OCR server. Node.js bindings are built with napi-rs, Python bindings with PyO3, and browser support with wasm-bindgen targeting wasm32-unknown-unknown. The CLI binary (lit) is the same across all installation paths. Multi-format input conversion depends on LibreOffice and ImageMagick as external system dependencies, and a Docker image bundles everything for self-contained deployment.
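The pluggable OCR design can be illustrated with a small Python analogy of the trait-plus-implementations shape. Only the name OcrEngine comes from the text above. The `recognize` method, the `FixedTextEngine` stand-in, the `file` form-field name, and the plain-text response handling are all assumptions made for this sketch; the actual trait methods and the HTTP API specification are not reproduced here.

```python
import uuid
from abc import ABC, abstractmethod
from urllib import request


class OcrEngine(ABC):
    """Minimal interface: turn one page image into text."""

    @abstractmethod
    def recognize(self, png_bytes: bytes) -> str:
        ...


class FixedTextEngine(OcrEngine):
    """Stand-in for a bundled local engine; always returns the same text."""

    def __init__(self, text):
        self.text = text

    def recognize(self, png_bytes):
        return self.text


def build_multipart(field, filename, payload, boundary):
    """Encode a single file part as a multipart/form-data body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: image/png\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


class HttpEngine(OcrEngine):
    """Posts the page image to an OCR server (field name is an assumption)."""

    def __init__(self, url, field="file"):
        self.url = url
        self.field = field

    def recognize(self, png_bytes):
        boundary = uuid.uuid4().hex
        body = build_multipart(self.field, "page.png", png_bytes, boundary)
        headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
        req = request.Request(self.url, data=body, headers=headers, method="POST")
        with request.urlopen(req, timeout=30) as resp:
            # Assumes the server replies with plain UTF-8 text.
            return resp.read().decode("utf-8")


def ocr_page(engine: OcrEngine, png_bytes: bytes) -> str:
    """Pipeline code depends only on the interface, not on a specific engine."""
    return engine.recognize(png_bytes)
```

Because `ocr_page` sees only the interface, switching from the local engine to an HTTP server is a change at construction time rather than in the pipeline.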

Code Quality

The codebase demonstrates strong Rust idioms throughout: typed error enums via thiserror, exhaustive pattern matching, and explicit Result propagation with no silent failures. Test coverage is comprehensive: extensive unit tests are embedded inline throughout all core modules (projection, OCR merge, markdown layout blocks), and a separate integration test suite validates end-to-end parse results against real PDF fixtures. CI runs separate workflow files for Rust, Python, Node.js, WASM, and end-to-end output validation. Inline documentation is abundant, with rustdoc comments on all public types and methods, including thread-safety guarantees on LiteParse. The markdown layout module is notably well-tested, with a dedicated test_helpers.rs module and per-feature test files covering headings, tables, lists, and inline reconstruction.

What Makes It Unique

LiteParse’s primary technical differentiator is its grid projection algorithm, which reconstructs a 2D spatial grid from raw bounding box data rather than relying on PDF tag structure or heuristic line-sorting alone. This allows it to recover multi-column layouts, preserve indentation and whitespace relationships, and produce readable text output from untagged PDFs that confuse simpler text extractors. The pluggable OCR trait abstraction, with a well-defined API specification allowing any HTTP server to be plugged in, is a practical design that separates the parsing pipeline from OCR model choice. The cross-platform WASM build with a callback-based OCR API for browser environments, enabling client-side document parsing without server infrastructure, is unusual in the local PDF parsing space and reflects deliberate architecture choices rather than an afterthought port.
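To make the idea of projecting bounding boxes onto a grid concrete, here is a deliberately simplified sketch. It is not LiteParse's algorithm: it snaps each item to a row using a fixed line height and to a column using a fixed character width, and both constants and the `TextItem` shape are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """Hypothetical text item: top-left corner in points plus its text."""
    x: float
    y: float  # assumed to grow downward from the top of the page
    text: str


def project_to_grid(items, char_width=6.0, line_height=12.0):
    """Place items on a character grid so horizontal gaps become spaces."""
    rows = {}
    for item in items:
        # Items whose y values round to the same row share a text line.
        row = round(item.y / line_height)
        rows.setdefault(row, []).append(item)

    lines = []
    for row in sorted(rows):
        line = ""
        for item in sorted(rows[row], key=lambda i: i.x):
            col = round(item.x / char_width)
            # Never overwrite earlier text; keep at least one space between items.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + item.text
        lines.append(line)
    return "\n".join(lines)
```

For example, two rows with items at x = 0 and x = 60 come out as two aligned text columns, which is the property that lets column structure survive in plain-text output. A production approach would also need to handle variable font sizes, blank rows, and rotated text, none of which this sketch attempts.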

Self-Hosting

LiteParse is released under the Apache License 2.0, a permissive open-source license. You can use it commercially, modify the source, distribute it, and embed it in proprietary software without any obligation to open-source your own code. There are no copyleft requirements, no contributor license agreements that restrict commercial use, and no open-core feature gating: the entire codebase is public under the same license. The main obligations apply when you redistribute: include a copy of the license, preserve existing copyright and attribution notices (including any NOTICE file), and mark files you have modified.

Running LiteParse yourself is straightforward for the CLI use case — install via npm, pip, or cargo, and it works out of the box with Tesseract bundled. The library build requires a Rust toolchain, and native format conversion (DOCX, XLSX, PPTX) depends on LibreOffice and ImageMagick being present on the host. For production use at scale, the Docker image handles these dependencies cleanly. The library is stateless and single-binary (for the CLI), so horizontal scaling is as simple as distributing the binary. There is no database, no persistent state, and no background services — operational burden is minimal compared to most self-hosted document platforms.

The trade-off against the cloud-based LlamaParse service is accuracy on hard documents. LiteParse intentionally scopes itself to fast, heuristics-driven parsing and acknowledges it does not handle dense tables, multi-column academic papers, handwritten text, or low-quality scans with the same fidelity as a cloud pipeline backed by dedicated models. LlamaParse adds managed document queuing, SLA-backed uptime, advanced table extraction, and multimodal model integration — features that require significant infrastructure to replicate yourself. For straightforward PDFs and Office documents, LiteParse self-hosted is a fully capable and cost-free alternative.
