Speakr is a self-hosted web application that transcribes audio recordings into intelligent, searchable notes using AI-powered speech recognition with speaker diarization. Designed for privacy-conscious users—from researchers and legal teams to families and book clubs—it ensures sensitive conversations remain on your infrastructure while offering advanced organization features like AI tagging and retention policies.
Built with Python and Docker, Speakr supports multiple transcription backends including OpenAI, WhisperX, Mistral Voxtral, and VibeVoice. It offers full REST API access, OIDC SSO, and integrations with Obsidian, Logseq, and wikis. It runs on any system with Docker; GPU support is recommended for optimal performance.
What You Get
- AI Transcription with Speaker Diarization - Uses WhisperX, OpenAI gpt-4o-transcribe-diarize, or Mistral Voxtral to accurately transcribe audio and identify distinct speakers with voice embeddings.
- Voice Profiles - AI-powered speaker recognition that learns and labels voices over time, enabling consistent speaker labeling across recordings.
- Interactive Chat - Ask natural language questions about your recordings and receive AI-generated answers based on transcript content.
- Inquire Mode - Semantic search across all transcripts using natural language queries, not just keyword matching.
- Smart Tagging with Prompt Stacking - Apply custom AI prompts to tags (e.g., “Recipe” or “Code Review”) to auto-format transcripts into structured outputs like step-by-step recipes or action item lists.
- Tag-Driven Auto-Processing - Automatically apply tags and trigger transcription workflows when files are uploaded via watch folders or API.
- Group Management with Granular Permissions - Create groups with shared access, edit/reshare controls, and automatic sharing via group-scoped tags.
- Auto-Deletion with Retention Policies - Set custom retention periods per group or tag to automatically delete old recordings for compliance or storage management.
- REST API v1 with Swagger UI - Full programmatic access to upload, transcribe, and retrieve recordings with metadata support for title and meeting_date.
- Obsidian/Logseq Auto-Export - Automatically export completed transcripts to your note-taking system using custom templates without manual intervention.
- Fullscreen Video Mode with Live Subtitles - Double-click to enter fullscreen playback with synchronized speaker-labeled subtitles and keyboard shortcuts.
- Custom Vocabulary & Initial Prompts - Improve transcription accuracy by defining domain-specific hotwords or context prompts per user, tag, or folder.
- Video Passthrough Mode - Send raw video files directly to ASR backends that support video input (e.g., VibeVoice), bypassing audio extraction.
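As a rough illustration of the prompt stacking idea described in the list above, the sketch below combines a base prompt with the custom prompts of each tag on a recording. The `Tag` class, the `stack_prompts` function, and the bracketed label format are hypothetical names chosen for this example; Speakr's actual prompt assembly may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag model: a name plus an optional custom AI prompt."""
    name: str
    prompt: str = ""


def stack_prompts(base_prompt: str, tags: list[Tag]) -> str:
    """Combine the base prompt with each tag's prompt, in tag order.

    Tags without a prompt are skipped, so they organize recordings
    without changing how the transcript is formatted.
    """
    parts = [base_prompt.strip()]
    for tag in tags:
        if tag.prompt.strip():
            parts.append(f"[{tag.name}] {tag.prompt.strip()}")
    return "\n\n".join(parts)


tags = [
    Tag("Recipe", "Format the transcript as a step-by-step recipe."),
    Tag("Family"),  # no prompt: contributes nothing to the stack
    Tag("Code Review", "List action items with an owner for each."),
]
stacked = stack_prompts("Summarize the recording.", tags)
print(stacked)
```

Keeping the stack ordered by tag position makes the result predictable when several prompt-bearing tags are applied to the same recording.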
Common Use Cases
- Family memories - Families record trips and events; Speakr auto-shares them via a “Family” group with protected tags to preserve recordings forever.
- Book club discussions - Members tag monthly meetings with “Book Club” to auto-generate organized discussion summaries with personal notes.
- Legal consultations - Law firms use group tags with 7-year retention policies to meet compliance requirements and keep client conversations for the full retention period.
- Research interviews - Academics apply “Protected” tags and Obsidian export to preserve raw audio and transcripts for long-term analysis.
- Sales calls - Sales teams share recordings with view-only permissions and auto-tag with “Sales Call” and 1-year retention to review performance.
- Architecture decisions - Engineering teams use protected tags to preserve technical discussions permanently as reference material.
- Daily standups - Teams apply a “Standup” tag with 14-day retention to auto-share and auto-delete routine meetings.
- Lecture notes - Students tag lectures with “Study Notes” to convert spoken content into structured outlines with concepts and definitions.
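The retention behavior in these use cases (14-day standups, 1-year sales calls, protected family recordings) can be sketched as a small date calculation. This is a minimal sketch under stated assumptions: `TagPolicy` and `deletion_date` are hypothetical names, and the rules that a protected tag always wins and that the shortest tag period applies are assumptions for illustration, not confirmed Speakr behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical per-tag policy; retention_days=None means no limit."""
    name: str
    retention_days: Optional[int] = None
    protected: bool = False


def deletion_date(recorded_at: datetime,
                  tags: list[TagPolicy]) -> Optional[datetime]:
    """Return when a recording becomes eligible for deletion, or None.

    Assumed rules: any protected tag blocks deletion entirely; otherwise
    the shortest retention period among the tags applies. A recording
    with no retention period on any tag is kept.
    """
    if any(t.protected for t in tags):
        return None
    periods = [t.retention_days for t in tags if t.retention_days is not None]
    if not periods:
        return None
    return recorded_at + timedelta(days=min(periods))


recorded = datetime(2024, 1, 1)
standup = TagPolicy("Standup", retention_days=14)
sales = TagPolicy("Sales Call", retention_days=365)
family = TagPolicy("Family", protected=True)

print(deletion_date(recorded, [standup]))        # 2024-01-15 00:00:00
print(deletion_date(recorded, [sales, family]))  # None
```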
Under The Hood
Architecture
- Flask-based monolithic structure with tightly coupled routes, models, and services, lacking clear layer separation between HTTP, business logic, and data access
- Modular transcription pipeline built on abstract base classes and plugin-like specifications, enabling extensible support for multiple ASR backends with capability flags
- Dynamic audio chunking system that adapts file segmentation based on provider constraints and overlap-aware reassembly to preserve speaker diarization
- Environment-driven configuration and lazy-loaded services provide lightweight dependency injection without formal containers
- Modular asset handling via Docker multi-stage builds isolates media processing from application logic
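To make the capability flags and overlap-aware chunking described above more concrete, here is a minimal sketch of how a backend's declared limits could drive chunk planning. `ConnectorSpec`, its fields, and `plan_chunks` are hypothetical names invented for this example, and the 600-second limit and 30-second overlap are illustrative values rather than real provider constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorSpec:
    """Hypothetical capability declaration for an ASR backend."""
    name: str
    supports_diarization: bool
    max_chunk_seconds: float  # longest audio span the provider accepts
    overlap_seconds: float    # audio shared by neighbouring chunks


def plan_chunks(duration: float,
                spec: ConnectorSpec) -> list[tuple[float, float]]:
    """Split [0, duration] into (start, end) windows that respect the spec.

    Each window after the first starts overlap_seconds before the previous
    window ends, so speech at a boundary appears in both chunks and can be
    reconciled during reassembly.
    """
    if spec.overlap_seconds >= spec.max_chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    if duration <= 0:
        return []
    step = spec.max_chunk_seconds - spec.overlap_seconds
    chunks = []
    start = 0.0
    while True:
        end = min(start + spec.max_chunk_seconds, duration)
        chunks.append((start, end))
        if end >= duration:
            break
        start += step
    return chunks


spec = ConnectorSpec("example-asr", supports_diarization=True,
                     max_chunk_seconds=600.0, overlap_seconds=30.0)
print(plan_chunks(1500.0, spec))
# [(0.0, 600.0), (570.0, 1170.0), (1140.0, 1500.0)]
```

Reassembly, which has to deduplicate the overlapping text and keep speaker labels consistent across chunks, is the harder half of the problem and is not shown here.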
Tech Stack
- Python 3.11 backend with Flask, SQLAlchemy, and Flask-Login for core functionality and authentication
- SQLite as primary storage with dedicated directories for transcriptions and Hugging Face model caching
- Dockerized deployment using custom FFmpeg and ffprobe binaries to minimize image size and avoid system package bloat
- Gunicorn as production WSGI server with optimized timeout and worker settings
- MkDocs with Material theme and advanced Markdown extensions for rich, interactive documentation
- Offline asset bundling via custom scripts to eliminate runtime dependencies for JS/CSS and fonts
Code Quality
- Extensive test coverage including unit, integration, and edge-case scenarios with real database and API interactions
- Robust error handling with custom exceptions, fallback parsing, and comprehensive try/except patterns for unstable LLM outputs
- Modular service layers for audio processing, transcription, and sharing logic
- Strong type safety enforced through dataclasses, typing annotations, and runtime validation across APIs and processing pipelines
- Innovative JSON recovery utilities that auto-correct malformed LLM responses with multiple fallback strategies
- Comprehensive linting and validation embedded in tests to enforce API contracts, database constraints, and configuration defaults
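The JSON recovery idea mentioned above can be illustrated with a small fallback chain. This is a sketch of the general technique, not Speakr's actual utility: `recover_json` is a hypothetical name, and the three strategies (direct parse, unwrapping a Markdown code fence, extracting the outermost braces, each retried with trailing commas removed) are assumptions about what such a utility might try.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def recover_json(raw: str) -> Optional[Any]:
    """Try progressively more forgiving strategies to parse LLM output."""
    # Strategy 1: the raw text as-is.
    candidates = [raw.strip()]

    # Strategy 2: the body of a Markdown code fence, if one is present.
    fenced = FENCED_BLOCK.search(raw)
    if fenced:
        candidates.append(fenced.group(1).strip())

    # Strategy 3: the outermost {...} or [...] span, dropping chatter
    # before and after the payload.
    for open_ch, close_ch in ("{}", "[]"):
        start, end = raw.find(open_ch), raw.rfind(close_ch)
        if 0 <= start < end:
            candidates.append(raw[start:end + 1])

    for text in candidates:
        # Retry each candidate with trailing commas removed. This naive
        # regex would also touch commas inside string values.
        for attempt in (text, TRAILING_COMMA.sub(r"\1", text)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    return None


print(recover_json('Sure! {"ok": true} Hope that helps.'))  # {'ok': True}
print(recover_json("no json here"))                         # None
```

Returning `None` instead of raising lets the caller decide whether to retry the LLM request or fall back to plain-text output.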
What Makes It Unique
- Plug-and-play transcription connectors allow seamless switching between OpenAI, GPT-4o, and custom ASR endpoints with dynamic fallbacks
- Adaptive audio chunking preserves speaker diarization integrity by intelligently respecting provider-specific limits and overlap requirements
- Tag-based retention policies with hierarchical inheritance enable granular, user-defined audio lifecycle management
- Built-in API token system with per-token rate limiting and one-time plaintext issuance provides secure, audit-ready access without external auth
- Semantic search via optional sentence-transformers enables context-aware querying, with graceful fallback to keyword search
- Real-time UI feedback on token usage surfaces backend policy directly in the user experience, supporting compliance awareness
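One common way to implement the one-time plaintext issuance mentioned in the list above is to store only a hash of each token and show the plaintext a single time at creation. The sketch below assumes that approach; `TokenStore` and its methods are hypothetical names, the in-memory dictionary stands in for a database table, and per-token rate limiting is left out for brevity.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class TokenStore:
    """Hypothetical in-memory store; only token hashes are retained."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}  # sha256 hex digest -> token label

    def issue(self, label: str) -> str:
        """Create a token, keep only its hash, and return the plaintext once."""
        plaintext = secrets.token_urlsafe(32)
        digest = hashlib.sha256(plaintext.encode()).hexdigest()
        self._hashes[digest] = label
        return plaintext

    def verify(self, presented: str) -> Optional[str]:
        """Return the label of a valid token, or None if it is unknown."""
        digest = hashlib.sha256(presented.encode()).hexdigest()
        for stored, label in self._hashes.items():
            # Constant-time comparison avoids leaking match position.
            if hmac.compare_digest(stored, digest):
                return label
        return None


store = TokenStore()
token = store.issue("ci-uploader")  # shown to the user exactly once
print(store.verify(token))          # ci-uploader
print(store.verify("not-a-token"))  # None
```

Because only the digest is stored, a leaked database does not expose usable tokens, and a lost token has to be revoked and reissued rather than recovered.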