ailia Voice

A library for generating speech from text. Easily integrate AI-powered text-to-speech into your applications.

Getting Started

Choose your platform and synthesize your first voice clip.

1. Install

Install the ailia Voice Python package together with the librosa and soundfile helpers used by the sample.

pip3 install ailia_voice librosa soundfile

2. Run a Sample

Download example_ailia_voice.py from ailia-models and run it. It clones a target voice from a short reference clip and writes output.wav. Models are downloaded into ./models/ automatically.

wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/gpt-sovits/example_ailia_voice.py
python3 example_ailia_voice.py

System Requirements

ailia Voice runs on desktop and mobile platforms with CPU and GPU acceleration.

Operating Systems

  • Windows 10 / 11
  • macOS 11 or later
  • Linux (Ubuntu 20.04+)
  • iOS 13+ / Android 7+

Languages & Compilers

  • Python 3.6+, Dart / Flutter 3.19+
  • C++17 (VS 2019+ / Xcode 14.2+ / clang) + CMake
  • C# / Unity 2021.3.10f1+

Supported Models

  • Tacotron2 (English)
  • GPT-SoVITS v1 / v2 / v2-pro / v3
  • JA / EN / ZH variants per version
  • Output: 32 kHz mono

Language Coverage

  • Japanese (G2P_TYPE_GPT_SOVITS_JA)
  • English (G2P_TYPE_GPT_SOVITS_EN)
  • Chinese (G2P_TYPE_GPT_SOVITS_ZH)

Features

Voice synthesis capabilities provided across the C, C#, and Python APIs.

TTS Models

  • Tacotron2 — fast English baseline
  • GPT-SoVITS v1 / v2 / v2-pro / v3

Voice Cloning

  • Clone any timbre from ~10 s reference audio
  • Reference audio + transcript pairing
  • Speaker Verification embeddings (v2-pro)

Multi-Language

  • Japanese accent support (v2 onward)
  • Open JTalk built in for JA phonemes
  • g2pw + jieba for Chinese (v2 onward)

Customization

  • User dictionary (pyopenjtalk format)
  • Standard v3 user dictionary downloadable
  • Playback speed control (v2 onward)

Use the API in Your Project

A minimal Python example for synthesizing speech in your own application. The reference file name and transcript below are placeholders; replace them with your own clip and its matching text.

import librosa
import soundfile
import ailia_voice

# Load a clean ~10 s reference clip and pair it with its exact transcript
ref_audio, rate = librosa.load("reference.wav", sr=None)
ref_text = "リファレンス音声と一致する書き起こし。"

voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path="./models/")  # weights download here on first run
voice.set_reference_audio(
    ref_text, ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA, ref_audio, rate,
)
buf, sr = voice.synthesize_voice("こんにちは。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA)
soundfile.write("output.wav", buf, sr)  # 32 kHz mono

API Reference by Platform

  • Python
  • C++
  • Unity
  • Flutter
  • JNI

FAQ

Common questions about ailia Voice.

Which TTS models are supported?

Two families: Tacotron2 (English baseline) and GPT-SoVITS (zero-shot voice cloning).

GPT-SoVITS comes in four versions — v1, v2, v2-pro, and v3 — each with Japanese, English, and Chinese variants. Pick a sample model with the CLI argument (tacotron2, gpt-sovits, gpt-sovits-v2-en, etc.).

Which GPT-SoVITS version should I use?

v1: lightest and fastest, no Japanese accent support.
v2: adds Japanese pitch / accent and playback-speed control. Good real-time default.
v3: highest audio quality (CFM + DiT + BigVGAN diffusion), but slower.
v2-pro: combines v3's text analysis with v2's fast vocoder plus speaker verification embeddings — recommended for the best quality/speed balance.

What does "reference audio" mean and why is it required?

GPT-SoVITS clones the voice characteristics of a target speaker from about 10 seconds of clean reference audio plus the matching transcript. Pass both to set_reference_audio() before calling synthesize_voice().
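
A minimal sketch of preparing the pair with librosa (the file name and transcript are placeholders; voice is the initialized wrapper from the example above):

import librosa

ref_audio, rate = librosa.load("speaker.wav", sr=None)
print(len(ref_audio) / rate)  # aim for roughly 10 seconds of clean speech
voice.set_reference_audio(
    "Transcript matching the clip.", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN, ref_audio, rate,
)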

Tacotron2 does not require reference audio — it speaks in a fixed voice.

How do I create a custom pronunciation dictionary?

ailia Voice integrates Open JTalk for Japanese phoneme conversion. To override pronunciations, prepare a userdic.csv in MeCab format; the trailing accent field encodes accent type over mora count, so 0/5 means accent type 0 (heiban, no downstep) across 5 morae.
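
For illustration, a hypothetical entry might look like the following (the surface form, reading, and cost are invented; the column layout is assumed to follow the Open JTalk NAIST-jdic CSV format):

NVIDIA,,,1,名詞,固有名詞,一般,*,*,*,NVIDIA,エヌビディア,エヌビディア,0/5,*

Then convert the CSV to a binary .dic with pyopenjtalk: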

import pyopenjtalk

# Compile the MeCab-format CSV into a binary user dictionary
pyopenjtalk.mecab_dict_index("userdic.csv", "userdic.dic")

Then pass user_dict_path to initialize_model() (Python) or call ailiaVoiceSetUserDictionary (C). A standard user dictionary for v3 is also available.
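
In Python, loading the compiled dictionary looks like this (a sketch reusing the wrapper class from the example above):

import ailia_voice

voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path="./models/", user_dict_path="userdic.dic")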

Which languages can I synthesize?

Japanese, English, and Chinese, selected via the AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA / _EN / _ZH constants passed to set_reference_audio() and synthesize_voice().
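
For example, with the voice object initialized and a reference set as in the example above, only the constant changes per output language:

buf_en, sr = voice.synthesize_voice("Hello.", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN)
buf_zh, sr = voice.synthesize_voice("你好。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_ZH)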

Where do I place the license file when using C++?

The C++ binding requires ailia.lic next to the runtime libraries:

  • Windows: same folder as ailia.dll (or in cpp/ for the sample)
  • macOS: ~/Library/SHALO/
  • Linux: ~/.shalo/

Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.

How do I enable GPU acceleration?

On macOS / iOS, Metal is used automatically. On Windows / Linux, install the CUDA Toolkit and cuDNN, then add the cuDNN libraries to your PATH (Windows) or LD_LIBRARY_PATH (Linux).

Can I run it offline?

Yes, after the first run. Model weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.
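
A sketch of an offline-friendly setup: populate the model directory once on a connected machine, ship it with your application, and point initialize_model at it (the path below is a placeholder):

import ailia_voice

voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path="/opt/myapp/models/")  # already populated, so nothing is downloaded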

How does licensing work?

An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.

Materials