Industrial-grade speech synthesis in seconds
GLM-TTS is an industrial-grade open-source TTS system by Zhipu AI (zai-org).
Zero-shot voice cloning from ~3s prompt audio, RL-enhanced emotion, and phoneme-level control.
🎧 Natural, expressive text-to-speech
Web demo built with a modern stack

What is GLM-TTS
GLM-TTS is an industrial-grade open-source text-to-speech system. It combines an LLM (text-to-token) with Flow Matching (token-to-wav) to produce human-like, emotionally expressive speech.
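At a high level, synthesis is three hand-offs: text to discrete speech tokens, tokens to an acoustic representation, and acoustics to a waveform. The sketch below traces that data flow in Python; the function names, shapes, and stub bodies are placeholders for illustration, not the project's real API, so consult the official repo for actual usage.

```python
# Minimal data-flow sketch of the two-stage design described above.
# All functions are stand-ins with placeholder outputs.
import numpy as np

def lm_text_to_tokens(text: str, prompt_text: str, prompt_wav: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: a Llama-based LM predicts discrete speech tokens."""
    return np.zeros(len(text) * 4, dtype=np.int64)  # placeholder token ids

def flow_tokens_to_mel(tokens: np.ndarray, prompt_wav: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a DiT flow-matching model decodes tokens to acoustics."""
    return np.zeros((80, tokens.shape[0]), dtype=np.float32)  # placeholder mel

def vocos_mel_to_wav(mel: np.ndarray) -> np.ndarray:
    """Vocoder stand-in: 2D-Vocos renders the final waveform."""
    return np.zeros(mel.shape[1] * 256, dtype=np.float32)  # placeholder audio

def synthesize(text: str, prompt_wav: np.ndarray, prompt_text: str) -> np.ndarray:
    """Text plus a ~3 s speaker prompt in, waveform out, via the two stages."""
    tokens = lm_text_to_tokens(text, prompt_text, prompt_wav)
    mel = flow_tokens_to_mel(tokens, prompt_wav)
    return vocos_mel_to_wav(mel)
```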
- Zero-shot Voice Cloning: Clone a speaker's timbre and prosody using only ~3 seconds of prompt audio.
- Emotion & Paralinguistics: RL-enhanced emotions (happy, sad, angry) plus natural sounds like laughter and breathing.
- Pronunciation Control: Hybrid phoneme + text input (Phoneme-in) to handle polyphones and rare words precisely.
Why GLM-TTS
Designed to overcome the “mechanical” feel of traditional TTS while staying controllable and production-ready.



Quickstart
Run GLM-TTS locally in minutes:
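A minimal sketch of the steps from the official quickstart (install requirements, download checkpoints, run inference), assuming a standard clone-install-infer flow. The repository URL and directory layout are assumptions; defer to the official README for the exact commands and any required script arguments.

```python
# Quickstart sketch; paths and the repo URL are assumptions, the script name
# glmtts_inference.py comes from the official quickstart.
import subprocess

# 1. Get the code and install its requirements.
subprocess.run(["git", "clone", "https://github.com/zai-org/GLM-TTS.git"], check=True)
subprocess.run(["pip", "install", "-r", "requirements.txt"], cwd="GLM-TTS", check=True)

# 2. Download the pretrained checkpoints into the repo (see the README for links).
# 3. Run offline inference with the script named in the official quickstart.
subprocess.run(["python", "glmtts_inference.py"], cwd="GLM-TTS", check=True)
```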
Core Capabilities
Key capabilities highlighted in the GLM-TTS technical reference.
Zero-shot Voice Cloning
Clone timbre and prosody from ~3 seconds of prompt audio (no fine-tuning required).
Emotion Control (RL)
GRPO-based RL improves expressiveness and enables emotions plus natural laughter/breathing.
Phoneme-in Control
Hybrid phoneme + text input for precise pronunciation (polyphones, rare words, education use).
Two-stage Architecture
LLM text-to-token (Llama-based) + Flow Matching token-to-wav (DiT) for quality and speed.
High-fidelity Vocoder
2D-Vocos vocoder improves sub-band modeling and stability across dynamic ranges.
Apache 2.0 License
Commercial-friendly open source license for self-hosted deployments and integration.
Built for production TTS
Highlights from the GLM-TTS technical reference.
Training data
100k+
Hours
Voice prompt
3s
Zero-shot
Error rate
0.89%
CER
What builders say
From education to audiobooks to customer service—teams use GLM-TTS for natural, controllable speech.
Lin Chen
Education App Team
Phoneme control makes polyphones and mixed Chinese/English content reliable—perfect for reading and tutoring scenarios.
Maya Singh
Audiobook Producer
The emotional range feels human—crying, laughter, and subtle tone shifts land naturally in long-form narration.
Alex Johnson
Customer Service Lead
Warm, professional speech without exaggerated performance—great for templated messages with variable inserts.
Sofia Garcia
Indie Game Studio
Zero-shot voice cloning from a few seconds of reference audio accelerates multi-character prototyping dramatically.
James Wilson
ML Engineer
Two-stage LLM + Flow Matching is a clean design: strong semantics with high-quality acoustics and stable synthesis.
Anna Zhang
Product Builder
Apache 2.0 keeps it simple for commercial integration—self-hosting and customization are straightforward.
Frequently asked questions
Need more details? Check the official repo and technical reference.
What is GLM-TTS?
GLM-TTS is an industrial-grade open-source TTS system by Zhipu AI. It uses an LLM for semantic modeling and Flow Matching for acoustic generation.
Is GLM-TTS open source and can I use it commercially?
Yes. GLM-TTS is released under Apache 2.0, which permits commercial use.
How does zero-shot voice cloning work?
Provide ~3 seconds of prompt audio and GLM-TTS can adapt timbre and prosody without fine-tuning.
How do I control pronunciation?
Use the Phoneme-in mechanism (hybrid phoneme + text input) to pin down pronunciations for polyphones and rare words.
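As a toy illustration of the idea (the annotation syntax below is invented for this sketch, not the actual Phoneme-in markup; check the repo docs for the real format), most of the sentence stays plain text and only the ambiguous characters carry explicit phonemes:

```python
# Hypothetical hybrid input: the polyphones 行 and 重 carry explicit pinyin
# so the model cannot pick the wrong readings (xing2 / zhong4).
plain_text = "他把这行代码重写了一遍。"
hybrid_input = "他把这行(hang2)代码重(chong2)写了一遍。"
print(hybrid_input)
```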
Does it support emotion and laughter?
Yes. RL (GRPO) is used to improve emotional expressiveness and paralinguistic sounds like laughter.
How do I run inference?
Follow the official quickstart: install requirements, download checkpoints, run glmtts_inference.py, and optionally launch the Gradio app.
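For the optional web demo step, a hedged sketch building on the quickstart above; the app script name is an assumption, so use the Gradio entry point named in the repo.

```python
# Launch the bundled Gradio demo; "app.py" is an assumed entry point.
import subprocess
subprocess.run(["python", "app.py"], cwd="GLM-TTS", check=True)
```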
Build with GLM-TTS
Get the code, run the demo, and generate your first expressive sample.

