magika


AI-powered file type detection that identifies 200+ content types with ~99% accuracy in milliseconds using a compact deep learning model.

17.2K stars

1.1K forks

Apache License 2.0

Python


Magika is a novel file type detection tool developed by Google that replaces heuristic-based approaches with deep learning. Rather than relying on magic bytes or file extensions, Magika analyzes the actual byte content of a file through a custom-trained neural network, achieving approximately 99% average precision and recall across more than 200 content types spanning both binary and text formats.

The tool is used at Google’s production scale to route files through security and content policy scanners across Gmail, Drive, and Safe Browsing—processing hundreds of billions of samples weekly. It has also been integrated with VirusTotal and abuse.ch, demonstrating its value in security-critical pipelines where misclassification has real consequences.

Magika is designed for speed and portability: after a one-time model load, inference takes approximately 5 milliseconds per file even on a single CPU, and the model itself weighs only a few megabytes. It reads only a limited subset of each file’s bytes, meaning inference time remains near-constant regardless of file size. Files can be processed individually or in bulk batches of thousands at a time.
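
The near-constant cost follows from reading fixed-size regions of a file rather than its full contents. The sketch below illustrates that idea only; the helper name read_head_and_tail and the 1024-byte block size are assumptions made for the example, and it is not Magika's actual feature extractor.

```python
import io
from typing import BinaryIO, Tuple


def read_head_and_tail(stream: BinaryIO, block_size: int = 1024) -> Tuple[bytes, bytes]:
    """Return up to block_size bytes from the start and the end of a seekable stream.

    At most 2 * block_size bytes are read, whatever the stream size.
    block_size=1024 is an illustrative value, not Magika's real setting.
    """
    stream.seek(0, io.SEEK_END)   # jump to the end to learn the total size
    size = stream.tell()
    n = min(block_size, size)     # short inputs yield fewer bytes

    stream.seek(0)
    head = stream.read(n)         # first n bytes

    stream.seek(size - n)
    tail = stream.read(n)         # last n bytes (overlaps head for short inputs)
    return head, tail
```

Because the function only seeks and reads two bounded regions, a multi-gigabyte file costs about the same to sample as a small one.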

The project ships as a Rust-based CLI tool, a Python library with ONNX Runtime inference, and experimental JavaScript/TypeScript bindings that run entirely in the browser via WebAssembly. Go bindings are in progress. A configurable prediction mode system (high-confidence, medium-confidence, best-guess) lets users tune the trade-off between precision and recall.

What You Get

A Rust-powered CLI tool, installable via pipx, Homebrew, cargo, or a one-line installer script, that scans individual files, walks directories recursively, or reads input piped through stdin
A Python library exposing identify_path(), identify_bytes(), and identify_stream() methods with typed result objects and ONNX-based inference
A JavaScript/TypeScript npm package that runs the model fully client-side via WebAssembly, powering the interactive browser demo with no server round-trips
Three prediction modes—high-confidence, medium-confidence, and best-guess—that let you tune precision vs. recall trade-offs for your use case
Rich output formats including plain text, JSON, JSONL, MIME type, label-only, and a custom printf-style format string with per-field placeholders
A per-content-type threshold system where each of the 200+ types has its own confidence threshold, falling back to generic labels when the model is uncertain
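
The interaction between prediction modes and per-type thresholds can be sketched as a small post-inference policy. Everything below is illustrative: the threshold values, the generic labels "txt" and "unknown", and the exact behavior assigned to each mode are assumptions for the example, not values taken from Magika's model config.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical thresholds, for illustration only.
PER_TYPE_THRESHOLD: Dict[str, float] = {"markdown": 0.75, "python": 0.90}
DEFAULT_THRESHOLD = 0.50   # used for types without their own entry
MEDIUM_THRESHOLD = 0.50    # single cutoff assumed for medium-confidence mode


def choose_label(label: str, score: float, is_text: bool, mode: Mode) -> str:
    """Apply a mode-dependent threshold to a raw (label, score) model output."""
    if mode is Mode.BEST_GUESS:
        return label  # always trust the model's top prediction

    if mode is Mode.HIGH_CONFIDENCE:
        threshold = PER_TYPE_THRESHOLD.get(label, DEFAULT_THRESHOLD)
    else:
        threshold = MEDIUM_THRESHOLD

    if score >= threshold:
        return label
    # Below the cutoff: fall back to a generic label (names assumed here).
    return "txt" if is_text else "unknown"
```

Keeping this logic outside the model means the same raw score can produce a specific label in one mode and a generic one in another, without retraining.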

Common Use Cases

Security pipeline routing — classifying uploaded files before sending them to the appropriate malware scanner, content policy engine, or safe-browsing check
Bulk file auditing — recursively scanning directories of thousands of mixed-format files to inventory content types without relying on potentially-spoofed extensions
CI/CD validation — integrating into build pipelines to verify that generated or committed artifacts are the correct file type before publishing or deploying
SIEM and threat intelligence platforms — enriching file metadata in abuse databases, sandboxes, or forensic tooling where accurate MIME classification matters
Browser-side file inspection — using the JavaScript package to classify user-uploaded files in a web application without sending bytes to a server
Stream-based processing — identifying content type from arbitrary binary streams or in-memory byte buffers without writing temporary files to disk

Under The Hood

Architecture

Magika follows a pipeline architecture: input abstraction, then feature extraction, then batched ONNX inference, then confidence-threshold post-processing. Each stage has a single concern. A Seekable abstraction handles all input types (file paths, byte buffers, BinaryIO streams) uniformly, without leaking I/O specifics into the inference layer. Feature extraction operates on seekable content without loading entire files into memory. The prediction mode system (high-confidence, medium-confidence, best-guess) is applied as a post-inference policy rather than baked into the model itself. The overwrite_map in the model config remaps model outputs to canonical labels, providing a seam for model evolution without breaking API contracts. The multi-language structure (Rust CLI, Python library, JS package, Go WIP) shares the same ONNX model weights, so the model’s behavior carries over consistently across runtimes.
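
Two of the seams described above, uniform input handling and label remapping, can be sketched in a few lines. This is a simplified illustration under stated assumptions: open_seekable and canonical_label are hypothetical helper names, the map entries are example values, and none of it is copied from Magika's source.

```python
import io
from pathlib import Path
from typing import BinaryIO, Dict, Union

Source = Union[Path, bytes, BinaryIO]


def open_seekable(source: Source) -> BinaryIO:
    """Return a seekable binary stream for a path, a bytes object, or an open stream."""
    if isinstance(source, Path):
        return source.open("rb")
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(source)
    if hasattr(source, "seek") and hasattr(source, "read"):
        return source  # already a stream; hand it back unchanged
    raise TypeError(f"unsupported input type: {type(source).__name__}")


# Hypothetical remapping from raw model labels to canonical output labels.
OVERWRITE_MAP: Dict[str, str] = {"randombytes": "unknown", "randomtxt": "txt"}


def canonical_label(model_label: str) -> str:
    """Map a raw model label to its canonical form, leaving unmapped labels as-is."""
    return OVERWRITE_MAP.get(model_label, model_label)
```

Later stages only ever see a seekable stream and a canonical label, which is what keeps I/O details and model-internal label names out of the public API.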

Tech Stack

The Python library (v1.x) uses ONNX Runtime for CPU-based inference against a custom-trained model stored as a bundled .onnx file. This eliminates any server dependency and limits cold-start overhead to a one-time session initialization. The Python package is built with Maturin (a Rust/Python build bridge), which suggests that the CLI binary installed via pipx is the Rust implementation packaged for Python distribution. The Rust CLI is built on Tokio for async file I/O, Clap for argument parsing, ORT (Rust bindings to ONNX Runtime) for inference, Serde for JSON serialization, and Colored for terminal output. The JavaScript/TypeScript package compiles the ONNX model to WebAssembly for browser execution. Development tooling includes Ruff for Python linting and formatting, mypy for type checking, pytest for testing, and GitHub Actions for CI with CodeQL security scanning.

Code Quality

The Python codebase shows strong typing discipline, with fully annotated public APIs, a rich domain type hierarchy (ContentTypeLabel, MagikaResult, MagikaPrediction, ModelFeatures, Status), and Google-style docstrings enforced by Ruff. The test suite is comprehensive, with dedicated test files for feature extraction, inference correctness against reference outputs, the Python module API surface, and client behavior. Error handling is explicit: custom MagikaError exceptions, typed Status enum returns for file-not-found and permission errors, and input validation that raises descriptive TypeError messages rather than failing silently. The repository has CodeQL scanning, an OpenSSF Best Practices badge, and a CONTRIBUTING.md with a CLA requirement. The Rust layer follows idiomatic Rust patterns, with anyhow for error propagation and async-first file handling.

What Makes It Unique

Magika’s core innovation is replacing the decades-old magic-byte heuristics (libmagic, the file command) with a purpose-built, compact deep learning model trained on approximately 100 million samples. Unlike general-purpose LLMs applied to file classification, Magika’s model is small enough to ship bundled inside a package (a few MB), fast enough for sub-10ms inference on a CPU, and accurate enough to outperform rule-based approaches, especially on textual content types where magic bytes are absent or ambiguous. The per-content-type threshold system, in which each of the 200+ types has its own confidence cutoff derived from training statistics, provides finer control than a single global threshold can. The same model runs production workloads of hundreds of billions of classifications per week inside Google and is also available as a WebAssembly module for in-browser use, which places it at a distinctive point on the accuracy-versus-efficiency frontier for file type detection.

Self-Hosting

Magika is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, distribution, and sublicensing without copyleft requirements. You are free to integrate it into proprietary products and services. The main obligations are to include the license text and existing attribution notices, to mark any files you modify, and to refrain from using Google’s trademarks. This makes it a business-friendly choice for security tooling.

Operationally, Magika is a library and CLI tool rather than a service, so running it means bundling it into your own application or pipeline. The model weights ship with the Python package as an ONNX file of a few MB, so there is no external model server to operate. Dependencies are minimal: the Python library requires only onnxruntime and click, and the Rust CLI is statically compiled. Updates follow the package release cadence: you pin versions in your dependency manager and upgrade on your own schedule. There is no infrastructure to provision, no uptime SLA to maintain, and no data leaves your environment.

Because this is not a managed service, you own the operational burden entirely: you handle model updates as new versions of the magika package are released, you instrument and monitor false positive and false negative rates in your own pipeline, and you manage scaling if you need higher throughput than a single-node ONNX session provides. There is no paid cloud tier, no enterprise support contract, and no SLA; support is community-driven through GitHub Issues. Organizations that need guaranteed response times, dedicated SLAs, or managed inference infrastructure would have to build and operate that layer themselves on top of the open-source library.
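
Monitoring misclassification rates in your own pipeline usually starts with a labeled sample and a per-label precision and recall report. A minimal sketch, assuming you can collect (expected, predicted) label pairs from a spot-checked subset of traffic; the function name and data shape are choices made for this example:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def precision_recall(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[float, float]]:
    """Compute per-label (precision, recall) from (expected, predicted) label pairs."""
    true_pos: Counter = Counter()
    predicted: Counter = Counter()
    expected: Counter = Counter()
    for want, got in pairs:
        expected[want] += 1
        predicted[got] += 1
        if want == got:
            true_pos[want] += 1

    report: Dict[str, Tuple[float, float]] = {}
    for label in set(expected) | set(predicted):
        # Precision: of everything predicted as `label`, how much was right.
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        # Recall: of everything that truly was `label`, how much was found.
        r = true_pos[label] / expected[label] if expected[label] else 0.0
        report[label] = (p, r)
    return report
```

Tracking these numbers per release of the magika package gives an early signal when a model update shifts behavior on the content types that matter to your pipeline.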
