ktx

ktx builds a self-improving context layer over your data warehouse so AI agents like Claude Code and Codex query it with approved metric definitions instead of reinventing SQL logic from scratch.

1.5K stars
86 forks
Apache License 2.0
TypeScript

ktx is an executable context layer for data and analytics agents. Instead of letting a general-purpose AI agent re-explore a warehouse schema on every question, invent its own metric logic, and return numbers that don’t match approved definitions, ktx builds and continuously maintains that context up front: it samples tables, detects joinable columns, ingests wiki and documentation knowledge, and compiles all of it into a searchable semantic layer that agents can query declaratively.

The project is a pnpm + uv monorepo split into a TypeScript CLI (packages/cli) and two Python packages: ktx-sl, a semantic-layer engine that loads YAML source definitions, builds a join graph, and compiles semantic queries into dialect-specific SQL, and ktx-daemon, a portable compute service the CLI talks to. The standout technical piece is the join graph’s explicit detection and resolution of fan-out and chasm traps — the classic semantic-layer bug where a one-to-many join silently double-counts a metric — backed by real regression tests, not just a README claim.
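To make the fan-out trap concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns, and values are invented for illustration and are not taken from ktx or its tests; the point is only to show how a one-to-many join inflates a SUM, and how aggregating the "many" side to the parent's grain first avoids it.

```python
import sqlite3

# Hypothetical toy schema: one order has many order_items.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (10, 1, 'A'), (11, 1, 'B'), (12, 1, 'C'), (13, 2, 'A');
""")

# Ground truth: revenue measured at the order grain.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Naive join: each order row is repeated once per item, so order 1's
# amount is counted three times.
naive_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
""").fetchone()[0]

# Safer shape: collapse order_items to one row per order before joining.
safe_revenue, item_count = conn.execute("""
    SELECT SUM(o.amount), SUM(c.n_items)
    FROM orders o
    JOIN (
        SELECT order_id, COUNT(*) AS n_items
        FROM order_items
        GROUP BY order_id
    ) c ON c.order_id = o.order_id
""").fetchone()

print(true_revenue, naive_revenue, safe_revenue, item_count)
```

The naive query returns 350.0 against a true revenue of 150.0, with no error or warning, which is why this class of bug tends to go unnoticed when an agent writes the join ad hoc.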

On the serving side, ktx ships both a CLI (ktx setup, ktx ingest, ktx sl, ktx wiki) and an MCP server (ktx mcp start) exposing tools like entity_details, discover_data, and wiki_search that Claude Code, Codex, Cursor, or OpenCode can call directly. It supports PostgreSQL, Snowflake, BigQuery, ClickHouse, MySQL, SQL Server, SQLite, DuckDB, Amazon Athena, and MongoDB as data sources, plus integrations with dbt, MetricFlow, LookML, Looker, Metabase, Sigma, Notion, and Google Drive for absorbing existing business knowledge.

ktx is Apache-2.0 licensed, runs entirely locally against a self-supplied LLM key or an existing Claude Code/Codex session (no hosted service and no extra usage billing), and enforces read-only warehouse connections by design. The project is built by Kaelio (Y Combinator P25) and is under very active early-stage development, with 21 releases and a fast-growing star count in its first months on GitHub.

What You Get

  • Automatic warehouse context ingestion — samples tables, captures metadata and usage patterns, and detects joinable columns without manual schema documentation
  • A semantic layer with join-graph fan-out/chasm-trap resolution, so metrics don’t get silently double-counted across one-to-many joins
  • Wiki knowledge ingestion from Notion, Google Drive, and local Markdown, with deduplication and contradiction flagging for human review
  • An MCP server (ktx mcp start) exposing tools like entity_details, discover_data, and wiki_search for Claude Code, Codex, Cursor, and OpenCode
  • A CLI workflow (ktx setup, ktx status, ktx ingest, ktx sl, ktx wiki) for building and inspecting context from the terminal
  • Read-only enforcement on every database connection — ktx never writes to the source warehouse

Common Use Cases

  • A data team wants Claude Code or Codex to answer ad hoc analytics questions against the warehouse without hallucinating metric definitions or reinventing joins each time
  • An analytics engineer has metric logic scattered across dbt, LookML, and Metabase and wants one searchable, agent-queryable semantic layer instead of three disconnected ones
  • A company’s tribal knowledge about refund policy, table quirks, or business definitions lives in Notion and gets lost on every new agent session — ktx ingests and organizes it
  • A team keeps hitting double-counted revenue or fan-out bugs when agents write ad hoc joins, and wants that class of error caught automatically instead of manually reviewed

Under The Hood

Architecture: The repo is a pnpm + uv monorepo split into a TypeScript CLI (packages/cli) and two Python packages (python/ktx-sl for the semantic-layer engine, python/ktx-daemon for a portable compute service). The CLI layers cleanly: cli-program.ts wires Commander subcommands (setup, connection, ingest, wiki, sl, sql, status, mcp) to commands/*-commands.ts modules, which call into framework-agnostic context/* packages (context/scan, context/ingest, context/wiki, context/sl, context/mcp) behind small port interfaces, so the same domain logic backs both the CLI and the MCP server’s tool handlers. Database access is abstracted behind per-engine connector modules sharing a common interface. The semantic layer itself follows a linear pipeline: a SourceLoader parses YAML source definitions, a JoinGraph builds join edges and computes shortest join paths while explicitly detecting fan-out and chasm traps, and a SqlGenerator compiles the resolved plan into dialect-specific SQL. The two runtimes stay decoupled: the TypeScript CLI talks to the Python engine through a daemon process rather than importing it directly, which limits the blast radius if either side’s core abstraction changes.
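The join-graph step described above can be sketched roughly as follows. This is not ktx's code: the JoinEdge type, the cardinality strings, and the shortest_path and has_fan_out methods are hypothetical names chosen for illustration, and the real engine's data model may differ. The sketch uses breadth-first search for the shortest join path and flags a fan-out when that path crosses a one-to-many hop away from the table that owns the measure.

```python
from collections import deque
from dataclasses import dataclass

# Cardinality is read left -> right; flipping direction swaps the two "many" cases.
_FLIP = {"one_to_many": "many_to_one", "many_to_one": "one_to_many"}


@dataclass(frozen=True)
class JoinEdge:  # hypothetical type, not ktx's model
    left: str
    right: str
    cardinality: str  # "one_to_many", "many_to_one", or "one_to_one"


class JoinGraph:
    def __init__(self, edges):
        self._adj = {}
        for e in edges:
            # Store each edge in both directions with the matching cardinality.
            self._adj.setdefault(e.left, []).append((e.right, e.cardinality))
            flipped = _FLIP.get(e.cardinality, e.cardinality)
            self._adj.setdefault(e.right, []).append((e.left, flipped))

    def shortest_path(self, start, goal):
        """Breadth-first search; returns a list of (table, cardinality) hops, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            table, hops = queue.popleft()
            if table == goal:
                return hops
            for nxt, card in self._adj.get(table, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + [(nxt, card)]))
        return None

    def has_fan_out(self, measure_table, other_table):
        """True if joining from the measure's table multiplies its rows."""
        path = self.shortest_path(measure_table, other_table)
        if path is None:
            raise ValueError(f"no join path from {measure_table} to {other_table}")
        return any(card == "one_to_many" for _, card in path)


graph = JoinGraph([
    JoinEdge("orders", "customers", "many_to_one"),
    JoinEdge("orders", "order_items", "one_to_many"),
])
print(graph.has_fan_out("orders", "customers"))    # grouping orders by customer is safe
print(graph.has_fan_out("orders", "order_items"))  # joining items repeats order rows
```

A SQL generator sitting after such a graph could use the flag to decide when a measure has to be pre-aggregated at its own grain before the join.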

Tech Stack: The CLI targets Node 22+ with TypeScript in strict ESM mode, using @commander-js/extra-typings and Ink/clack for the terminal UI, Zod for input validation, and the @modelcontextprotocol/sdk for the MCP server. Agent execution is built on ai/@ai-sdk packages (Anthropic, Google Vertex), @anthropic-ai/claude-agent-sdk, and @openai/codex-sdk, letting ktx run on a user’s own API keys or an existing local agent session. Warehouse access goes through dedicated driver packages per engine (pg, mysql2, mssql, mongodb, snowflake-sdk, @clickhouse/client, @google-cloud/bigquery, @aws-sdk/client-athena, @duckdb/node-api), with better-sqlite3 for local project state. The Python side runs on Python 3.13 via uv, with dialect-aware SQL parsing feeding the join graph. Build tooling includes Biome for lint/format, Knip for dead-code detection, Vitest for TypeScript tests, pytest for Python tests, and semantic-release for versioning, all wired into GitHub Actions CI with pre-commit hooks.

Code Quality: Testing is extensive and clearly a first-class concern. The CLI package has roughly as many test files as source files, split into fast and slow (test:slow) suites, and the Python packages carry dozens of dedicated test modules covering the semantic-layer engine and daemon. Error handling favors typed sentinel errors with explicit type guards over generic thrown-and-caught exceptions, and Zod schemas validate MCP tool inputs at the boundary. Naming is consistent and the Python semantic-layer models are dataclass-based, giving reasonable type safety on both sides of the language split. CI enforces pre-commit hooks, dead-code checks, full type-checking (source and test configs separately), pytest, coverage reporting via Codecov, and a dedicated CLI smoke-test job, which is a rigorous gate for a project still in its early months.

What Makes It Unique: The clearest technical differentiator, and one substantiated directly in the code rather than only in marketing copy, is automatic fan-out and chasm-trap detection in the join graph, backed by dedicated regression tests for cross-grain measure edge cases. This is a real, historically hard semantic-layer problem (a one-to-many join silently double-counting a metric) that ad hoc “let the agent write SQL” setups get wrong without anyone noticing. Combined with wiki ingestion that actively flags contradictions across sources, and a bring-your-own-LLM model that piggybacks on an existing Claude Code or Codex session instead of billing its own API usage, the overall package is a reasonably differentiated take on an underserved problem, even though each individual piece (semantic layers, MCP tool servers, wiki-backed RAG) exists elsewhere in some form.
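A chasm trap is the two-sided version of the same problem: two independent one-to-many tables joined through a shared parent, so each side's rows are multiplied by the other's. The sketch below reproduces it with Python's built-in sqlite3 module and shows one common resolution, pre-aggregating each child table to the parent's grain. The schema and numbers are invented for illustration, and this is not necessarily the SQL shape ktx generates.

```python
import sqlite3

# Hypothetical schema: a customer has many orders and, independently, many tickets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0);
    INSERT INTO tickets VALUES (1, 1), (2, 1), (3, 1);
""")

# Chasm trap: 2 orders x 3 tickets yields 6 joined rows, inflating both measures.
chasm = conn.execute("""
    SELECT SUM(o.amount), COUNT(t.ticket_id)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    JOIN tickets t ON t.customer_id = c.customer_id
""").fetchone()

# Resolution: aggregate each child to one row per customer, then join.
resolved = conn.execute("""
    SELECT o.revenue, t.n_tickets
    FROM customers c
    JOIN (
        SELECT customer_id, SUM(amount) AS revenue
        FROM orders GROUP BY customer_id
    ) o ON o.customer_id = c.customer_id
    JOIN (
        SELECT customer_id, COUNT(*) AS n_tickets
        FROM tickets GROUP BY customer_id
    ) t ON t.customer_id = c.customer_id
""").fetchone()

print(chasm, resolved)
```

Here the trapped query reports 450.0 in revenue and 6 tickets, while the pre-aggregated query reports the correct 150.0 and 3.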

Self-Hosting

Licensing Model: Apache License 2.0. All features, connectors, the CLI, and the MCP server are available in self-hosted use with no license keys or feature gates.

Self-Hosting Restrictions: None found. A search of the source tree turned up no ee/, enterprise/, pro/, or cloud/ directories, and no isPro/requiresLicense/FEATURE_FLAGS-style checks anywhere in the codebase.

Enterprise Features: None. There is no separate paid tier. ktx does not operate a hosted service at all; ingestion, the semantic layer, and the MCP server all run locally against your own LLM API keys or an existing Claude Code/Codex session.

Cloud vs Self-Hosted: Not applicable. Per the project’s own FAQ, there is no hosted service, and the only data leaving your machine is what you send to the LLM provider you configure.

License Key Required: No.
