ailia_voice  1.5.0.0
About feature

Features of ailia AI Voice

This page presents the features provided by both the C and the C# APIs.

Text-to-speech conversion

With ailia AI Voice, it is possible to use the Tacotron2 and GPT-SoVITS (v1/v2/v2-pro/v3) algorithms for speech synthesis.

GPT-SoVITS Model Comparison

GPT-SoVITS is available in multiple versions, each suited for different use cases.

Feature                | v1            | v2                   | v3                          | v2-pro
-----------------------|---------------|----------------------|-----------------------------|--------------------------
Highlights             | Initial model | Added accent support | High-quality synthesis      | Fast and high-quality
Japanese accent        | No            | Yes                  | Yes                         | Yes
Chinese G2P            | jieba         | g2pw + jieba         | g2pw + jieba                | g2pw + jieba
Playback speed control | No            | Yes                  | Yes                         | Yes
Inference speed        | Fast          | Fast                 | Slow                        | Fast
Synthesis method       | HiFi-GAN      | HiFi-GAN             | CFM+DiT+BigVGAN (diffusion) | HiFi-GAN + speaker vector
Output sample rate     | 32 kHz        | 32 kHz               | 32 kHz                      | 32 kHz
  • v1/v2: Lightweight and fast, ideal for real-time applications. v2 improves Japanese accent (pitch) reproduction compared to v1.
  • v3: The highest quality model using diffusion (CFM+DiT) and BigVGAN. Produces the best audio quality but requires more inference time.
  • v2-pro: The latest GPT-SoVITS model. It combines v3's text analysis (T2S) pipeline with v2's fast vocoder, enhanced with Speaker Verification embeddings for better voice cloning. Recommended when you need a balance of speed and quality.
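The trade-offs above can be encoded as data. Below is an illustrative helper (our own names, not part of the ailia API) that picks a GPT-SoVITS version from the comparison table based on whether Japanese accent support and real-time speed are required:

```python
# Illustrative sketch only: encodes the comparison table above.
# PROFILES and pick_version are hypothetical helpers, not ailia APIs.
PROFILES = {
    "v1":     {"accent": False, "speed": "fast", "quality": "base"},
    "v2":     {"accent": True,  "speed": "fast", "quality": "base"},
    "v3":     {"accent": True,  "speed": "slow", "quality": "highest"},
    "v2-pro": {"accent": True,  "speed": "fast", "quality": "high"},
}

def pick_version(need_accent=True, need_realtime=True):
    """Return the first version satisfying the constraints."""
    for name, profile in PROFILES.items():
        if need_accent and not profile["accent"]:
            continue
        if need_realtime and profile["speed"] != "fast":
            continue
        return name
    return "v2-pro"  # fallback: balanced speed and quality

print(pick_version())                    # real-time Japanese -> "v2"
print(pick_version(need_accent=False))   # fastest, accent not needed -> "v1"
```

In practice, choose v1/v2 for real-time use, v3 when audio quality matters most, and v2-pro for a balance of both.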

Japanese speech synthesis

To synthesize Japanese speech, the text must first be converted into phonemes; OpenJtalk, which is integrated into the ailia AI Voice library, performs this conversion.

Voice synthesis in any tone of voice

When using GPT-SoVITS, it is possible to synthesize speech in any voice timbre by providing a reference audio file of about 10 seconds.

User Dictionary

By defining a user dictionary, it is possible to correct the pronunciation of Japanese. It is also possible to use the standard user dictionary of GPT-SoVITS v3.

GPU usage

On Windows and Linux, it is possible to perform inference on the GPU using cuDNN. To use cuDNN, install the CUDA Toolkit and cuDNN from the NVIDIA website:

Install the CUDA Toolkit by following the installer instructions. For cuDNN, after downloading and uncompressing it, add its location to the PATH environment variable. You need to register as an NVIDIA developer in order to download these libraries.
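A quick way to confirm the setup took effect is to check that the CUDA Toolkit is reachable and that a cuDNN library directory is on PATH. The snippet below is our own sanity-check helper, not an ailia tool:

```python
import os
import shutil

# Sanity check (our own helper, not part of ailia):
# verify the CUDA Toolkit compiler is on PATH.
nvcc = shutil.which("nvcc")
print("CUDA Toolkit (nvcc) found:", nvcc is not None)

# cuDNN is loaded via PATH; list PATH directories that contain a
# cudnn DLL/shared library so you can confirm PATH was set correctly.
for d in os.environ.get("PATH", "").split(os.pathsep):
    try:
        if any(n.lower().startswith(("cudnn", "libcudnn")) for n in os.listdir(d)):
            print("cuDNN candidate directory:", d)
    except OSError:
        pass  # skip unreadable or nonexistent PATH entries
```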

Creating a user dictionary

To create a user dictionary, prepare a userdic.csv like the one below. The 0/5 near the end indicates that the word has 5 morae and that the accent is of type 0 (flat, heiban).

超電磁砲,,,1,名詞,固有名詞,一般,*,*,*,超電磁砲,レールガン,レールガン,0/5,*

The user dictionary is converted from the CSV file to a binary dic file using pyopenjtalk:

import pyopenjtalk

# Compile the CSV user dictionary into the binary format used by OpenJTalk.
pyopenjtalk.mecab_dict_index("userdic.csv", "userdic.dic")

The converted dic file can then be loaded by calling the ailiaVoiceSetUserDictionary API.