A fast, lightweight, open-source document parser that extracts spatial text, bounding boxes, and Markdown from PDFs and Office files — entirely on your machine.
LiteParse is a standalone open-source document parsing library built in Rust by the LlamaIndex team. It extracts text from PDFs with precise spatial positioning and bounding boxes, runs OCR on image-based pages, reconstructs spatial layouts into readable Markdown, and generates high-quality page screenshots — all without requiring cloud APIs or proprietary dependencies.
Unlike heavier parsing platforms, LiteParse deliberately keeps its scope narrow: fast, local, and dependency-light. It uses PDFium for native PDF text extraction and bundles Tesseract for zero-configuration OCR, but also supports plugging in any HTTP-based OCR server such as EasyOCR or PaddleOCR when higher accuracy is needed for complex documents.
The library ships with first-class bindings for Python, Node.js/TypeScript, WebAssembly (browser), and a CLI tool called lit — all sharing the same underlying Rust core. This means developers can use LiteParse in a Python RAG pipeline, a Node.js backend service, or a client-side browser app without changing the parsing logic.
For documents that exceed local parsing capability — dense tables, multi-column layouts, handwritten text, or scanned PDFs with complex structure — LiteParse is designed as a complementary open-source tier to the cloud-based LlamaParse service, giving teams a clear upgrade path without vendor lock-in at the lower tier.
lit CLI tool installable via npm, pip, or cargo that parses PDFs, Office files, and images into text, JSON, or Markdown from the command line
Architecture
LiteParse follows a clean layered pipeline architecture where each stage has a single responsibility and hands off well-typed data to the next. The Rust core defines the full pipeline: format conversion (LibreOffice/ImageMagick for non-PDF inputs), PDF text extraction via PDFium FFI, selective OCR rendering and merging, grid projection for spatial layout reconstruction, and output formatting. A global PDFium lock serializes the FFI-unsafe extraction stage while the OCR pass and grid projection — which dominate runtime for OCR-heavy documents — run fully concurrently outside the lock. This design makes LiteParse structs Send + Sync and safe to share across async tasks, enabling genuine parallelism in multi-threaded tokio runtimes while respecting PDFium’s threading constraints. Language bindings for Node.js, Python, and WASM re-export the same core types with idiomatic wrappers rather than reimplementing logic.
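The locking pattern described above can be illustrated with a minimal standard-library sketch. This is not LiteParse's code: `Extractor`, `run_ocr`, and `parse_pages` are hypothetical names, plain threads stand in for tokio tasks, and the "extraction" and "OCR" steps are placeholders. It only shows the shape of the idea: hold the lock for the non-thread-safe call, release it, then do the expensive work concurrently.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a native handle that must not be called concurrently.
struct Extractor {
    pages_extracted: usize,
}

impl Extractor {
    /// Placeholder for the serialized native text-extraction call.
    fn extract_text(&mut self, page: usize) -> String {
        self.pages_extracted += 1;
        format!("text of page {page}")
    }
}

/// Placeholder for the OCR and layout work that is safe to run in parallel.
fn run_ocr(native_text: &str) -> String {
    format!("{native_text} + ocr")
}

/// Parses `pages` pages, one worker thread per page (an assumption made for brevity).
fn parse_pages(pages: usize) -> (Vec<String>, usize) {
    let extractor = Arc::new(Mutex::new(Extractor { pages_extracted: 0 }));

    let handles: Vec<_> = (0..pages)
        .map(|page| {
            let extractor = Arc::clone(&extractor);
            thread::spawn(move || {
                // Hold the lock only for the extraction call.
                let native = {
                    let mut guard = extractor.lock().expect("extractor lock poisoned");
                    guard.extract_text(page)
                }; // The guard is dropped here, before the slow work starts.

                // Runs outside the lock, so workers overlap on this step.
                run_ocr(&native)
            })
        })
        .collect();

    // Joining in spawn order keeps the results in page order.
    let results: Vec<String> = handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .collect();

    let count = extractor.lock().expect("extractor lock poisoned").pages_extracted;
    (results, count)
}

fn main() {
    let (pages, extracted) = parse_pages(3);
    for line in &pages {
        println!("{line}");
    }
    println!("pages extracted under the lock: {extracted}");
}
```

Scoping the guard in its own block is the important detail: if the lock were held across `run_ocr`, the workers would run one at a time and the concurrency would be lost.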
Tech Stack
The core is written in Rust (2024 edition), using tokio for async execution and serde/serde_json for structured output. PDF rendering and text extraction rely on a custom PDFium wrapper crate that bundles Google's PDFium C library (the PDF engine used in Chromium) behind a Rust FFI layer. OCR is handled through an abstract OcrEngine trait with two built-in implementations: a bundled Tesseract engine via the tesseract-rs crate (compiled as an optional feature), and a lightweight HTTP engine that sends page images as multipart POST requests to any conforming OCR server. Node.js bindings are built with napi-rs, Python bindings with PyO3, and browser support with wasm-bindgen, compiled for the wasm32-unknown-unknown target. The CLI binary (lit) is the same across all installation paths. Multi-format input conversion depends on LibreOffice and ImageMagick as external system dependencies; a Docker image bundles everything for self-contained deployment.
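To make the pluggable-OCR idea concrete, here is a small sketch of what a trait-based engine abstraction can look like. Only the trait name `OcrEngine` comes from the description above; the method name `recognize`, the `OcrWord` and `OcrError` types, and both example engines are assumptions made for illustration and may differ from the real crate.

```rust
use std::fmt;

/// Hypothetical OCR result: recognized text plus a bounding box as [x, y, width, height].
#[derive(Debug, Clone, PartialEq)]
struct OcrWord {
    text: String,
    bbox: [f32; 4],
}

/// Hypothetical typed error, written by hand here to stay dependency-free.
#[derive(Debug, PartialEq)]
enum OcrError {
    EmptyImage,
    Backend(String),
}

impl fmt::Display for OcrError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OcrError::EmptyImage => write!(f, "image buffer is empty"),
            OcrError::Backend(msg) => write!(f, "OCR backend failed: {msg}"),
        }
    }
}

impl std::error::Error for OcrError {}

/// The pipeline only depends on this trait, not on a concrete engine.
trait OcrEngine {
    fn recognize(&self, image: &[u8]) -> Result<Vec<OcrWord>, OcrError>;
}

/// Example engine that returns canned words (stands in for a bundled engine).
struct CannedEngine {
    words: Vec<OcrWord>,
}

impl OcrEngine for CannedEngine {
    fn recognize(&self, image: &[u8]) -> Result<Vec<OcrWord>, OcrError> {
        if image.is_empty() {
            return Err(OcrError::EmptyImage);
        }
        Ok(self.words.clone())
    }
}

/// Example engine that always fails (stands in for an unreachable HTTP server).
struct UnreachableEngine;

impl OcrEngine for UnreachableEngine {
    fn recognize(&self, _image: &[u8]) -> Result<Vec<OcrWord>, OcrError> {
        Err(OcrError::Backend("server unreachable".to_string()))
    }
}

/// Pipeline step that works with any engine and propagates errors to the caller.
fn page_text(engine: &dyn OcrEngine, image: &[u8]) -> Result<String, OcrError> {
    let words = engine.recognize(image)?;
    Ok(words.iter().map(|w| w.text.as_str()).collect::<Vec<_>>().join(" "))
}

fn main() {
    let engine = CannedEngine {
        words: vec![
            OcrWord { text: "Hello".to_string(), bbox: [0.0, 0.0, 30.0, 10.0] },
            OcrWord { text: "world".to_string(), bbox: [36.0, 0.0, 30.0, 10.0] },
        ],
    };
    for word in &engine.words {
        println!("{} at {:?}", word.text, word.bbox);
    }
    match page_text(&engine, b"fake image bytes") {
        Ok(text) => println!("{text}"),
        Err(err) => eprintln!("{err}"),
    }
    if let Err(err) = page_text(&UnreachableEngine, b"fake image bytes") {
        eprintln!("{err}");
    }
}
```

Because `page_text` takes `&dyn OcrEngine`, swapping the local engine for a remote one changes only the value passed in, not the pipeline code.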
Code Quality
The codebase demonstrates strong Rust idioms throughout: typed error enums via thiserror, exhaustive pattern matching, and explicit Result propagation with no silent failures. Test coverage is comprehensive: unit tests are embedded inline throughout all core modules (projection, OCR merge, markdown layout blocks), and a separate integration test suite validates end-to-end parse results against real PDF fixtures. CI runs separate workflows for Rust, Python, Node.js, WASM, and end-to-end output validation. Inline documentation is abundant, with rustdoc comments on all public types and methods, including thread-safety guarantees on LiteParse. The markdown layout module is notably well tested, with a dedicated test_helpers.rs module and per-feature test files covering headings, tables, lists, and inline reconstruction.
What Makes It Unique
LiteParse’s primary technical differentiator is its grid projection algorithm, which reconstructs a 2D spatial grid from raw bounding box data rather than relying on PDF tag structure or heuristic line-sorting alone. This allows it to recover multi-column layouts, preserve indentation and whitespace relationships, and produce readable text output from untagged PDFs that confuse simpler text extractors. The pluggable OCR trait abstraction, with a well-defined API specification allowing any HTTP server to be plugged in, is a practical design that separates the parsing pipeline from OCR model choice. The cross-platform WASM build with a callback-based OCR API for browser environments, enabling client-side document parsing without server infrastructure, is unusual in the local PDF parsing space and reflects deliberate architecture choices rather than an afterthought port.
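The general idea of projecting bounding boxes onto a character grid can be shown with a deliberately simplified sketch. This is not LiteParse's algorithm, whose details are not given here. The names `TextBox` and `project_to_grid` are hypothetical, and the sketch assumes a fixed character width and line height and a y axis that grows downward.

```rust
use std::collections::BTreeMap;

/// Hypothetical text fragment positioned by the top-left corner of its bounding box.
#[derive(Debug, Clone)]
struct TextBox {
    text: String,
    x: f32,
    y: f32,
}

/// Snaps each box to a (row, column) cell and pads with spaces so that
/// horizontal positions on the page survive in the plain-text output.
fn project_to_grid(boxes: &[TextBox], char_width: f32, line_height: f32) -> String {
    // BTreeMap keeps the rows sorted from top to bottom.
    let mut rows: BTreeMap<i64, Vec<(usize, &str)>> = BTreeMap::new();
    for b in boxes {
        let row = (b.y / line_height).round() as i64;
        let col = (b.x / char_width).round().max(0.0) as usize;
        rows.entry(row).or_default().push((col, b.text.as_str()));
    }

    let mut lines = Vec::new();
    for (_, mut cells) in rows {
        // Order fragments left to right within the row.
        cells.sort_by_key(|(col, _)| *col);
        let mut line = String::new();
        for (col, text) in cells {
            let len = line.chars().count();
            if len < col {
                // Pad up to the fragment's column.
                line.extend(std::iter::repeat(' ').take(col - len));
            } else if len > 0 {
                // The previous fragment reached this column: keep one separating space.
                line.push(' ');
            }
            line.push_str(text);
        }
        lines.push(line);
    }
    lines.join("\n")
}

fn main() {
    // Fragments arrive in arbitrary order, as they often do in untagged PDFs.
    let boxes = vec![
        TextBox { text: "4".to_string(), x: 60.0, y: 12.0 },
        TextBox { text: "Name".to_string(), x: 0.0, y: 0.0 },
        TextBox { text: "Bolt".to_string(), x: 0.0, y: 12.0 },
        TextBox { text: "Qty".to_string(), x: 60.0, y: 0.0 },
    ];
    println!("{}", project_to_grid(&boxes, 6.0, 12.0));
}
```

A real implementation has to cope with variable font sizes, proportional glyph widths, and rotated text, none of which this sketch attempts. It only demonstrates why working from coordinates rather than stream order can keep columns aligned.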
LiteParse is released under the Apache License 2.0, one of the most permissive open-source licenses available. You can use it commercially, modify the source, distribute it, and embed it in proprietary software without any obligation to open-source your own code. There are no copyleft requirements, no contributor license agreements that restrict commercial use, and no open-core feature gating — the entire codebase is public under the same license. Attribution is required only in the standard sense of preserving copyright notices in distributed copies.
Running LiteParse yourself is straightforward for the CLI use case — install via npm, pip, or cargo, and it works out of the box with Tesseract bundled. The library build requires a Rust toolchain, and native format conversion (DOCX, XLSX, PPTX) depends on LibreOffice and ImageMagick being present on the host. For production use at scale, the Docker image handles these dependencies cleanly. The library is stateless and single-binary (for the CLI), so horizontal scaling is as simple as distributing the binary. There is no database, no persistent state, and no background services — operational burden is minimal compared to most self-hosted document platforms.
The trade-off against the cloud-based LlamaParse service is accuracy on hard documents. LiteParse intentionally scopes itself to fast, heuristics-driven parsing and acknowledges it does not handle dense tables, multi-column academic papers, handwritten text, or low-quality scans with the same fidelity as a cloud pipeline backed by dedicated models. LlamaParse adds managed document queuing, SLA-backed uptime, advanced table extraction, and multimodal model integration — features that require significant infrastructure to replicate yourself. For straightforward PDFs and Office documents, LiteParse self-hosted is a fully capable and cost-free alternative.