Voicebox is a desktop application that provides a complete local voice I/O stack — combining voice cloning, text-to-speech, speech-to-text, and agent voice output in one privacy-first app. Designed for creators, developers, podcasters, and accessibility users, it eliminates reliance on cloud services like ElevenLabs and WisprFlow by running all models and processing entirely on your machine. Built with Tauri (Rust) and powered by MLX, CUDA, and ROCm, it supports macOS, Windows, Linux, and Docker with native GPU acceleration.
The app integrates Whisper for transcription, seven diverse TTS engines including Qwen3-TTS, Chatterbox Turbo, and Kokoro, and a full audio effects pipeline via pedalboard. It also features a stories editor for multi-track audio narratives and an MCP-compatible API to connect voice I/O to AI agents like Claude Code or Cursor.
What You Get
- 7 TTS Engines - Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro — each with distinct strengths in language support, speed, or expressiveness.
- Zero-Shot Voice Cloning - Clone any voice from a 3–5 second audio sample without needing training data or cloud APIs.
- Global Dictation with Accessibility Integration - Use a configurable hotkey to dictate anywhere on macOS/Windows; paste directly into focused text fields with clipboard preservation and accessibility verification.
- Post-Processing Audio Effects - Apply 8 professional-grade effects (pitch shift, reverb, delay, chorus, compressor, gain, high-pass, low-pass) with real-time preview and per-profile preset saving.
- Stories Editor - Multi-track timeline for building podcasts, conversations, or narratives with drag-and-drop clips, inline trimming, and versioned audio tracks.
- Agent Voice Output via MCP - Enable any MCP-aware AI agent (Claude Code, Cursor, Cline) to speak to you in your cloned voice using the voicebox.speak tool call.
- Voice Profiles with Personas - Attach custom personas to voice clones (e.g., “Dry wit, composed British AI assistant”) and use them to refine speech via a bundled local LLM.
- Async Generation Queue - Submit multiple speech generations without blocking; failed jobs auto-retry and stale generations recover on app restart.
- Whisper-Based Speech-to-Text - Run OpenAI Whisper locally on MLX (Apple Silicon), CUDA, ROCm, or CPU for transcription in the Captures tab and dictation pipeline.
- Cross-Platform Native Performance - Built with Tauri (not Electron) for low memory usage, with native GPU acceleration through MLX on macOS, CUDA on Windows, and ROCm on Linux, plus support for Intel Arc GPUs.
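
The non-blocking queue with automatic retries described in the list above can be sketched as follows. This is a minimal illustration, not Voicebox's implementation: the names `run_jobs`, `worker`, and `MAX_ATTEMPTS`, the retry limit of 3, and the job dictionary layout are all assumptions made for the example.

```python
import asyncio

MAX_ATTEMPTS = 3  # assumed retry limit; the real value is not stated in this overview


async def worker(queue, synthesize, results):
    """Drain the queue, re-queueing failed jobs until MAX_ATTEMPTS is reached."""
    while True:
        job = await queue.get()
        try:
            results[job["id"]] = await synthesize(job["text"])
        except Exception as exc:
            job["attempts"] += 1
            if job["attempts"] < MAX_ATTEMPTS:
                queue.put_nowait(job)  # retry later without blocking other jobs
            else:
                results[job["id"]] = f"failed: {exc}"
        finally:
            queue.task_done()


async def run_jobs(texts, synthesize):
    """Submit every text at once and wait until each job succeeds or gives up."""
    queue = asyncio.Queue()
    results = {}
    for i, text in enumerate(texts):
        queue.put_nowait({"id": i, "text": text, "attempts": 0})
    task = asyncio.create_task(worker(queue, synthesize, results))
    await queue.join()  # returns once every queued job has been marked done
    task.cancel()
    return results
```

Here `synthesize` stands in for any coroutine that turns text into audio. A failed job goes back to the end of the queue, so one flaky generation does not hold up the jobs behind it.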
Common Use Cases
- Creating AI-powered podcasts - A podcaster clones their own voice and uses the Stories Editor to generate multi-character dialogues with different TTS engines and effects.
- Building accessible voice interfaces - A developer integrates Voicebox’s MCP API to let visually impaired users interact with AI agents using their own cloned voice.
- Local voice cloning for privacy-sensitive work - A journalist clones a source’s voice for audio clips without uploading recordings to third-party cloud services.
- Developing voice-enabled AI agents - A researcher uses Voicebox’s REST API and voicebox.speak to give LLM-powered bots natural, personalized speech output in local environments.
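
For readers wiring an agent to the voicebox.speak tool mentioned above, the sketch below builds the kind of message an MCP client would send. The JSON-RPC 2.0 envelope and the `tools/call` method come from the Model Context Protocol. The argument names (`text`, `profile`) and the helper `build_speak_request` are assumptions made for illustration, so check the tool's published input schema before relying on them.

```python
import json


def build_speak_request(text, request_id=1, **arguments):
    """Build a JSON-RPC 2.0 `tools/call` message for the voicebox.speak tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "voicebox.speak",
            # Argument names are assumptions; the real schema may differ.
            "arguments": {"text": text, **arguments},
        },
    }


message = build_speak_request("Build finished.", profile="narrator")
print(json.dumps(message, indent=2))
```

In practice an MCP-aware agent builds this message itself. The sketch is only meant to show what travels over the wire.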
Under The Hood
Architecture
- The repository exhibits a layered architecture with a clear separation between frontend and backend concerns.
- Frontend development is organized into distinct workspaces, suggesting a multi-platform strategy.
- API interactions are well-defined, utilizing dedicated clients and data types.
- A reactive approach to UI updates is evident through the use of Server-Sent Events and custom hooks.
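
To make the Server-Sent Events point concrete, here is a small parser for the SSE wire format (one `field: value` pair per line, a blank line ending each event). It is a simplified sketch that handles only the `event` and `data` fields. The `progress` event name and the JSON payload in the example are hypothetical, not taken from the Voicebox codebase.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: one progress update followed by a keep-alive comment.
stream = ["event: progress\n", 'data: {"percent": 40}\n', "\n", ": ping\n"]
for name, payload in parse_sse(stream):
    print(name, json.loads(payload)["percent"])
```

A frontend hook would typically consume the same stream through the browser's EventSource API and update UI state on each event.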
Tech Stack
- The backend is written in Python, using FastAPI for the API layer and SQLAlchemy for database access.
- The frontend leverages React and TypeScript, incorporating component libraries and state management solutions.
- A monorepo structure managed by Bun facilitates organization and dependency management.
- Docker Compose is used for streamlined deployment and resource management.
Code Quality
- Comprehensive testing practices are in place, including unit and integration tests.
- Robust error handling is prioritized with custom error classes and extensive exception handling.
- Type safety is enforced through TypeScript and Python type hinting.
- A configuration system ensures code quality and consistency.
What Makes It Unique
- A local Model Context Protocol (MCP) server exposes voice I/O to AI agents, so agent speech output stays on the user's machine.
- Tauri (Rust) is used instead of Electron for the desktop shell, which keeps memory usage low.
- Detailed instrumentation and monitoring of model download progress via SSE is a standout feature.
- A sophisticated audio processing pipeline with LLM-powered refinement capabilities sets it apart.