ComfyUI-Qwen-TTS - Qwen3 Voice Nodes for ComfyUI

ComfyUI is not only for images.

With the right custom nodes, it becomes a visual workflow canvas for audio generation too.

ComfyUI-Qwen-TTS adds Qwen3-TTS nodes for speech synthesis, voice cloning, voice design, saved speaker prompts, and multi-role dialogue.

It is a plugin for an existing ComfyUI install, not a standalone server or Docker app.

What is ComfyUI-Qwen-TTS?

ComfyUI-Qwen-TTS is a custom-node package by flybirdxx for using Qwen3-TTS inside ComfyUI.

The upstream repository describes it as a simple implementation of Qwen3-TTS for ComfyUI, with nodes for speech synthesis, zero-shot voice cloning, and voice design.

ComfyUI-Qwen-TTS GitHub

License: Apache-2.0, according to the plugin metadata.

The model weights are separate from the plugin code and follow the upstream Qwen3-TTS license agreement.

It is worth being precise here: this is not a generic ComfyUI wrapper for many unrelated TTS engines.

It is a ComfyUI node pack for the Qwen3-TTS model family.

The choices it offers are between Qwen3-TTS variants: 0.6B or 1.7B, Base or CustomVoice, plus the 1.7B VoiceDesign model.

Example Workflow

The repository ships ComfyUI workflow examples and screenshots. I copied the main example image locally for reference:

Open the example workflow image

Why It Is Interesting

The useful part is not just “text goes in, audio comes out.” ComfyUI-Qwen-TTS exposes several voice workflows as graph nodes:

Voice Design: describe a voice in text and synthesize speech with that style.
Voice Clone: provide short reference audio and target text.
Custom Voice: use preset or saved speaker voices.
Voice Prompt Extraction: extract reusable voice features once and reuse them.
Role Bank: collect multiple voices for a dialogue workflow.
Dialogue Inference: synthesize multi-speaker scripts.
Save / Load Speaker: build a reusable local voice library.
Train: experiment with custom fine-tuning.

That makes it fit the ComfyUI mental model: reusable blocks, saved workflows, preview nodes, and repeatable generation chains.

Tech Overview

The repository is a Python package with a ComfyUI registration file, node implementation, bundled Qwen3-TTS runtime code, model download helper, workflow examples, and fine-tuning scripts.

Important files:

Path	Purpose
`__init__.py`	Registers ComfyUI node IDs and display names
`nodes.py`	Main speech, cloning, role-bank, and persistence nodes
`train.py`	Experimental training node
`download_models.py`	Downloads Qwen3-TTS model folders from Hugging Face
`qwen_tts/`	Bundled Qwen3-TTS model, tokenizer, inference, and fine-tuning code
`example/`	ComfyUI workflow JSON files and screenshots

The project metadata currently marks the package as qwen3-tts-comfyui version 1.0.7, requiring Python >=3.9.

Important Dependency Pin

The README calls out one dependency rule very clearly: Qwen3-TTS is incompatible with transformers >= 5.0.

Use the plugin’s supported range:

pip install "transformers>=4.57.0,<5.0.0"

Or pin the README recommendation:

pip install transformers==4.57.3

This matters in ComfyUI because custom nodes often share the same Python environment. If another node upgrades Transformers to version 5 later, this plugin may fail during model loading or generation.
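Because a later install can silently move Transformers out of range, it helps to check the version from the same Python environment ComfyUI uses. The following is a minimal sketch, not part of the plugin: the function names are my own, and the version parsing is deliberately simplified (it looks only at the numeric major, minor, and patch parts and ignores pre-release suffixes).

```python
from importlib import metadata

# Bounds copied from the supported range above: >=4.57.0,<5.0.0
SUPPORTED_MIN = (4, 57, 0)  # inclusive
SUPPORTED_MAX = (5, 0, 0)   # exclusive


def parse_release(version: str) -> tuple:
    """Turn '4.57.3' or '4.57.0.dev0' into a (major, minor, patch) tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(version: str) -> bool:
    """True if the version falls inside the supported Transformers range."""
    return SUPPORTED_MIN <= parse_release(version) < SUPPORTED_MAX


def check_installed_transformers() -> str:
    """Describe the Transformers version found in the current environment."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed in this environment"
    if is_supported(installed):
        return f"transformers {installed} is inside the supported range"
    return f"transformers {installed} is outside >=4.57.0,<5.0.0"


if __name__ == "__main__":
    print(check_installed_transformers())
```

Run it with the interpreter that launches ComfyUI, ideally after installing or updating any other custom node.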

Installing in ComfyUI

Install it inside an existing ComfyUI checkout, using the same Python environment that ComfyUI runs in:

cd ComfyUI/custom_nodes
git clone https://github.com/flybirdxx/ComfyUI-Qwen-TTS.git
cd ComfyUI-Qwen-TTS
pip install -r requirements.txt
pip install transformers==4.57.3

Then restart ComfyUI and look for the Qwen3-TTS node category.

Downloading the Models

The plugin expects Qwen3-TTS model variants under ComfyUI’s model tree:

ComfyUI/
└── models/
    └── qwen-tts/
        ├── Qwen3-TTS-Tokenizer-12Hz/
        ├── Qwen3-TTS-12Hz-1.7B-Base/
        ├── Qwen3-TTS-12Hz-0.6B-Base/
        ├── Qwen3-TTS-12Hz-1.7B-VoiceDesign/
        └── voices/

Use the helper script from the custom-node folder:

python download_models.py

For smaller models where available:

python download_models.py --small

For everything:

python download_models.py --all

You can also target a custom model directory:

python download_models.py --target /path/to/models/qwen-tts
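After a download, it is easy to verify that the folders from the tree above actually exist. This is a small sketch of such a check, assuming the folder names shown earlier; the helper name `missing_model_folders` is hypothetical and not something the plugin provides.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
# Trim this list to the variants you actually downloaded.
EXPECTED_FOLDERS = [
    "Qwen3-TTS-Tokenizer-12Hz",
    "Qwen3-TTS-12Hz-1.7B-Base",
    "Qwen3-TTS-12Hz-0.6B-Base",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
]


def missing_model_folders(model_root, expected=EXPECTED_FOLDERS):
    """Return the expected folder names that are not directories under model_root."""
    root = Path(model_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Default location relative to the directory that contains ComfyUI/.
    missing = missing_model_folders("ComfyUI/models/qwen-tts")
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("All expected model folders are present.")
```

If you used --target, pass that path instead of the default one.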

Using extra_model_paths.yaml

The plugin honors ComfyUI’s extra_model_paths.yaml, so the qwen-tts model folder can live outside the ComfyUI tree:

qwen-tts: /mnt/models/Qwen

That is useful when ComfyUI sits on a small disk and the models live on a separate, larger one.

Choosing Nodes

Goal	Node
Generate TTS from a described voice	`Qwen3-TTS VoiceDesign`
Clone a voice from reference audio	`Qwen3-TTS VoiceClone`
Use preset/custom speakers	`Qwen3-TTS CustomVoice`
Extract reusable voice features	`Qwen3-TTS VoiceClonePrompt`
Store several voices for dialogue	`Qwen3-TTS RoleBank`
Generate multi-speaker dialogue	`Qwen3-TTS DialogueInference`
Save a reusable voice	`Qwen3-TTS SaveVoice`
Load a saved voice	`Qwen3-TTS LoadSpeaker`
Experiment with single-speaker training	`Qwen3-TTS Train`

For most users, I would start with VoiceDesign, then VoiceClone, then SaveVoice / LoadSpeaker once a voice is worth reusing.

Performance and VRAM

The nodes expose an attention selector:

auto
sage_attn
flash_attn
sdpa
eager

The automatic path checks for faster optional attention backends first, then falls back to PyTorch SDPA and finally eager attention.
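The fallback order can be pictured with a short sketch. This is not the plugin's code: the import names `sageattention` and `flash_attn` are assumptions about how the optional packages are imported, the order between the two optional backends is assumed from the selector list, and the function names are hypothetical.

```python
import importlib.util

# (selector label, assumed import name) for the optional fast backends.
# The plugin may detect these differently.
OPTIONAL_BACKENDS = [
    ("sage_attn", "sageattention"),
    ("flash_attn", "flash_attn"),
]


def module_available(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def has_torch_sdpa():
    """True if PyTorch is installed and exposes scaled_dot_product_attention."""
    if not module_available("torch"):
        return False
    import torch.nn.functional as F
    return hasattr(F, "scaled_dot_product_attention")


def pick_attention(requested="auto",
                   is_available=module_available,
                   sdpa_available=has_torch_sdpa):
    """Resolve the attention selector to a concrete backend label."""
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    for label, module in OPTIONAL_BACKENDS:
        if is_available(module):
            return label  # first optional backend that is installed
    if sdpa_available():
        return "sdpa"  # PyTorch's built-in attention
    return "eager"  # last resort


if __name__ == "__main__":
    print("auto resolves to:", pick_attention("auto"))
```

The two availability checks are passed in as parameters so the selection logic can be exercised without installing any of the backends.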

The important low-VRAM toggle is unload_model_after_generate. Enable it if you are running near your memory limit or switching between models. Leave it disabled if you are generating many clips with the same model and want faster repeat runs.

Field Note: Local Repo Test

I cloned the repository and ran a limited local check rather than a full generation run.

What worked:

python3 -m compileall -q /tmp/foss-post-repos/ComfyUI-Qwen-TTS

That passed on Python 3.12.3, which means the checked-out Python files are syntactically valid in this environment.
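The same syntax check can be scripted with the standard library's compileall module, which is handy if you want to run it from a larger sanity script. A minimal sketch, with a helper name of my own choosing:

```python
import compileall


def sources_compile(repo_path):
    """Byte-compile every .py file under repo_path; True if all of them compile.

    This only proves the files parse. It does not import the nodes,
    load a model, or generate audio.
    """
    # quiet=1 prints errors only, similar to the -q flag on the command line.
    return bool(compileall.compile_dir(repo_path, quiet=1))


if __name__ == "__main__":
    # Path from the command above; change it to your own checkout.
    ok = sources_compile("/tmp/foss-post-repos/ComfyUI-Qwen-TTS")
    print("syntax check passed" if ok else "syntax check failed")
```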

What I did not run:

Full ComfyUI startup with the node installed.
Qwen model download.
Actual audio generation.
Fine-tuning.

I deliberately stopped a dependency-resolution attempt once uv detected the repo’s pyproject.toml and began resolving the full ML stack, including Torch and CUDA-related wheels. That is a useful practical warning: run this inside the ComfyUI environment you actually plan to use, and expect the model and runtime dependencies to be heavy.

Voice Cloning Safety

Voice cloning is powerful enough to deserve a warning. Use reference audio that you own or have explicit permission to use. Do not clone the voices of private individuals, public figures, coworkers, clients, or family members without their consent.

For internal workflows, label generated audio clearly and keep reference clips out of public repositories.

Conclusion

ComfyUI-Qwen-TTS is worth a look if you already use ComfyUI and want voice generation in the same workflow canvas as your other AI pipelines.

The main setup risks are predictable: Transformers version compatibility, large model downloads, GPU/VRAM pressure, and shared-environment conflicts with other ComfyUI nodes. Handle those carefully and the node set gives you a practical path from simple TTS to reusable cloned voices and multi-role dialogue.