Local AI voice tools come in three shapes: full voice studios, Python libraries, and model workflows.

Use this page as a map before choosing what to run. The detailed posts include the exact commands, model names, generated files, and local failure modes.

The Short Version

Tool | Best For | What It Is Not
Voicebox | A local voice studio with web UI, REST endpoints, MCP tools, transcription, TTS, and voice cloning | A tiny Python library
KittenTTS | Lightweight CPU text-to-speech from Python scripts | A voice cloning studio or Docker app
Chatterbox | Local model workflows for speech generation and voice conversion | A one-command low-resource web app

Voicebox

Voicebox is the full local voice-studio option.

Use Voicebox when you want:

  • a Docker-backed local UI
  • multiple local TTS and STT engines
  • voice cloning tests with reference audio
  • REST endpoints and MCP tools for agents
  • a practical way to generate audio for videos or assistant workflows

The practical mental model: Voicebox is the app you run when you want to experiment interactively. It is also the best fit when you want to compare voice-cloning models with the same input audio.

KittenTTS

KittenTTS is the small Python-library option.

Use KittenTTS when you want:

  • CPU speech generation from a script
  • small ONNX model downloads
  • a simple Python API
  • generated audio for notifications, demos, or agents
  • a lower-resource path than the heavier voice-cloning stacks

The practical mental model: KittenTTS is not a self-hosted app. It is a library you embed in scripts or your own service.
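
To make "a library you embed in scripts" concrete, here is a minimal sketch of a notification helper. It does not call KittenTTS itself: the `synthesize` callable, the `speak_to_file` helper, and the `fake_tone` stand-in are all hypothetical names for illustration, and the sketch assumes the engine returns float samples in [-1, 1] plus a sample rate. Swap `fake_tone` for the real engine call from the detailed post.

```python
import math
import struct
import wave
from typing import Callable, List, Sequence, Tuple

# Hypothetical engine signature: text in, (float samples in [-1, 1], sample rate) out.
Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def write_wav(path: str, samples: Sequence[float], sample_rate: int) -> float:
    """Write mono 16-bit PCM audio and return its duration in seconds."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(samples) / sample_rate


def speak_to_file(text: str, path: str, synthesize: Synthesizer) -> float:
    """Run any engine that matches Synthesizer and save the result as a WAV."""
    samples, rate = synthesize(text)
    return write_wav(path, samples, rate)


def fake_tone(text: str) -> Tuple[List[float], int]:
    """Stand-in engine: 0.05 s of a 440 Hz tone per character, not real speech."""
    rate = 16000
    count = (rate // 20) * len(text)  # rate // 20 samples = 0.05 s
    samples = [0.3 * math.sin(2 * math.pi * 440 * i / rate) for i in range(count)]
    return samples, rate
```

Calling `speak_to_file("Backup finished", "notify.wav", fake_tone)` writes a short tone and returns its duration. Keeping the engine behind a plain callable means the same script can be pointed at a different TTS library later without touching the file-writing code.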

Chatterbox

Chatterbox is a model-workflow option from Resemble AI.

Use Chatterbox when you want:

  • local text-to-speech experiments
  • multilingual speech generation
  • voice conversion workflows
  • a Python/model-checkpoint setup where GPU access can matter
  • a research-style toolkit rather than a polished Docker Compose app

The practical mental model: Chatterbox is closer to a local ML package than a homelab service. Treat setup, model downloads, VRAM/RAM, and generated output paths as part of the workflow.
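
One way to treat disk space and output paths as part of the workflow is a small preflight check before downloading checkpoints or generating audio. The sketch below uses only the Python standard library; the `preflight` function and the `min_free_gb` threshold are hypothetical, and it deliberately does not check VRAM, since that depends on the ML framework you have installed.

```python
import os
import shutil
import tempfile


def preflight(output_dir: str, min_free_gb: float) -> dict:
    """Check that output_dir exists, is writable, and has enough free disk space."""
    os.makedirs(output_dir, exist_ok=True)
    free_gb = shutil.disk_usage(output_dir).free / 1024**3

    # Creating a temporary file is a direct test that writes will succeed.
    try:
        with tempfile.NamedTemporaryFile(dir=output_dir):
            writable = True
    except OSError:
        writable = False

    return {
        "output_dir": os.path.abspath(output_dir),
        "writable": writable,
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
    }
```

For example, `preflight("outputs", min_free_gb=10)` returns a small report you can print or paste into your notes before starting a long run. The 10 GB figure is only a placeholder; pick a threshold that matches the model sizes you actually plan to download.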

Choosing One

If You Need… | Start With
A local web UI for voice cloning and transcription | Voicebox
A Python one-liner-style TTS library | KittenTTS
CPU-friendly generated speech | KittenTTS
Higher-quality voice cloning experiments | Voicebox, then compare models inside it
Research-style TTS and voice conversion workflows | Chatterbox
Agent voice via MCP or REST | Voicebox

Validation Notes

Tool | What Was Validated Locally | Remaining Caveat
Voicebox | Docker UI access, model selection, reference-audio cloning tests, generated audio, and video assembly with generated WAV | audio quality depended heavily on model choice and reference-audio length
KittenTTS | Python setup, nano model download, CPU generation, and a short generated WAV embedded in the post | not a voice cloning tool
Chatterbox | local setup exploration and model workflow notes | heavier model paths need more careful hardware planning

Practical Advice

Start with a short sentence before generating long audio. Keep the exact text, reference-audio file, model name, output path, duration, and rough file size in your notes.
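
The note-keeping habit above can be sketched as a small logger that reads the duration and file size straight from the generated file. This is an illustration rather than part of any of the three tools: `describe_run`, `append_note`, and the JSON Lines notes file are hypothetical, and the sketch assumes the output is a PCM WAV that Python's standard `wave` module can read.

```python
import json
import os
import wave
from typing import Optional


def describe_run(text: str, model_name: str, output_path: str,
                 reference_audio: Optional[str] = None) -> dict:
    """Collect the details worth keeping for one generated WAV file."""
    with wave.open(output_path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {
        "text": text,
        "model": model_name,
        "reference_audio": reference_audio,  # None for plain TTS runs
        "output_path": output_path,
        "duration_s": round(duration, 2),
        "size_kb": round(os.path.getsize(output_path) / 1024, 1),
    }


def append_note(notes_path: str, note: dict) -> None:
    """Append one run as a single JSON line so the log stays easy to diff."""
    with open(notes_path, "a", encoding="utf-8") as notes:
        notes.write(json.dumps(note, ensure_ascii=False) + "\n")
```

After each generation, something like `append_note("runs.jsonl", describe_run(text, model_name, "out.wav", "ref.wav"))` leaves one line per attempt, which makes it easier to see later which model and reference clip produced which file.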

For voice cloning, longer and cleaner reference audio usually helps more than tweaking settings at random. For a scriptable notification voice, a small TTS library is often enough.