A library for generating speech from text. Easily integrate AI-powered text-to-speech into your applications.
Choose your platform and synthesize your first voice clip.
Install the ailia Voice Python package together with the librosa and soundfile helpers used by the sample.
pip3 install ailia_voice librosa soundfile
View on PyPI
Download example_ailia_voice.py from ailia-models and run it. It clones a target voice from a short reference clip and writes output.wav. Models are downloaded into ./models/ automatically.
wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/gpt-sovits/example_ailia_voice.py
python3 example_ailia_voice.py
ailia Voice runs on desktop and mobile platforms with CPU and GPU acceleration.
Voice synthesis capabilities are provided across the C, C#, and Python APIs.
Minimal examples for synthesizing speech in your own application.
import librosa
import soundfile
import ailia_voice

# "reference.wav" is a placeholder; supply a short clip of the target speaker
ref_audio, rate = librosa.load("reference.wav", sr=None)
ref_text = "リファレンス音声の書き起こし。"  # transcript of the clip
voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path="./models/")
voice.set_reference_audio(
    ref_text, ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA, ref_audio, rate,
)
buf, sr = voice.synthesize_voice("こんにちは。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA)
soundfile.write("output.wav", buf, sr)
Common questions about ailia Voice.
Two families: Tacotron2 (English baseline) and GPT-SoVITS (zero-shot voice cloning).
GPT-SoVITS comes in four versions (v1, v2, v3, and v2-pro), each with Japanese, English, and Chinese variants. Pick a sample model with the CLI argument (tacotron2, gpt-sovits, gpt-sovits-v2-en, etc.); a sample invocation follows the version list.
v1: lightest and fastest, no Japanese accent support.
v2: adds Japanese pitch / accent and playback-speed control. Good real-time default.
v3: highest audio quality (CFM + DiT + BigVGAN diffusion), but slower.
v2-pro: combines v3's text analysis with v2's fast vocoder plus speaker verification embeddings — recommended for the best quality/speed balance.
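For example, a run against the v2 English variant might look like the line below. The --model flag name is an assumption for illustration; check python3 example_ailia_voice.py --help for the actual argument.
python3 example_ailia_voice.py --model gpt-sovits-v2-en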
GPT-SoVITS clones the voice characteristics of a target speaker from about 10 seconds of clean reference audio plus the matching transcript. Pass both to set_reference_audio() before calling synthesize_voice(); a sketch follows below.
Tacotron2 does not require reference audio — it speaks in a fixed voice.
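As a minimal sketch of the cloning flow, reusing the voice object from the quick-start example (the file name and transcript below are placeholders), the reference clip can be loaded and trimmed to roughly 10 seconds with librosa:
import librosa

# Placeholder path; use ~10 s of clean audio from the target speaker
ref_audio, rate = librosa.load("speaker.wav", sr=None, duration=10.0)
voice.set_reference_audio(
    "Transcript matching speaker.wav.",  # must match the recording
    ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN, ref_audio, rate,
)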
ailia Voice integrates OpenJTalk for Japanese phoneme conversion. To override pronunciations, prepare a userdic.csv in MeCab format (a trailing accent field such as 0/5 means accent type 0 over 5 morae) and convert it to a binary .dic with pyopenjtalk:
import pyopenjtalk
# Compile the CSV user dictionary into OpenJTalk's binary format
pyopenjtalk.mecab_dict_index("userdic.csv", "userdic.dic")
Then pass user_dict_path to initialize_model() (Python) or call ailiaVoiceSetUserDictionary (C). A standard user dictionary for v3 is also available.
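In Python, wiring the compiled dictionary in could look like the sketch below; userdic.dic is a placeholder path, and user_dict_path is the parameter named above.
import ailia_voice

voice = ailia_voice.GPTSoVITS()
# Register the user dictionary when the model is initialized
voice.initialize_model(model_path="./models/", user_dict_path="userdic.dic")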
Japanese, English, and Chinese, selected via the AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA / _EN / _ZH constants passed to set_reference_audio() and synthesize_voice().
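For example, assuming an English reference was registered as in the cloning sketch above, English output only requires swapping the constant (the sample text is illustrative):
buf, sr = voice.synthesize_voice("Hello, world.", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN)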
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
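On Linux, for instance, installing the license file is a single copy, assuming ailia.lic sits in the current directory:
mkdir -p ~/.shalo && cp ailia.lic ~/.shalo/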
On macOS / iOS, Metal is used automatically. On Windows / Linux, install CUDA Toolkit and cuDNN. See the CUDA Toolkit / cuDNN Installation Guide for detailed instructions.
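Device selection from Python is sketched below under an assumption: other ailia SDK bindings pick the compute environment via an env_id argument, and GPTSoVITS is assumed here to follow the same convention. Verify the exact parameter against the ailia Voice API reference.
import ailia_voice

# Assumption: an env_id argument selects the compute device (GPU vs. CPU),
# matching the convention of other ailia SDK bindings.
voice = ailia_voice.GPTSoVITS(env_id=1)  # hypothetical GPU environment id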
Yes, after the first run. Model weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.