ailia_voice  1.5.0.0
About feature

Features of ailia AI Voice

This page presents the features provided by both the C and the C# APIs.

Text-to-speech conversion

With ailia AI Voice, it is possible to use the Tacotron2 and GPT-SoVITS (v1/v2/v2-pro/v3) algorithms for speech synthesis.

GPT-SoVITS Model Comparison

GPT-SoVITS is available in multiple versions, each suited for different use cases.

Feature                | v1            | v2                   | v3                          | v2-pro
-----------------------|---------------|----------------------|-----------------------------|--------------------------
Highlights             | Initial model | Added accent support | High-quality synthesis      | Fast and high-quality
Japanese accent        | No            | Yes                  | Yes                         | Yes
Chinese G2P            | jieba         | g2pw + jieba         | g2pw + jieba                | g2pw + jieba
Playback speed control | No            | Yes                  | Yes                         | Yes
Inference speed        | Fast          | Fast                 | Slow                        | Fast
Synthesis method       | HiFi-GAN      | HiFi-GAN             | CFM+DiT+BigVGAN (diffusion) | HiFi-GAN + speaker vector
Output sample rate     | 32 kHz        | 32 kHz               | 32 kHz                      | 32 kHz
  • v1/v2: Lightweight and fast, ideal for real-time applications. v2 improves Japanese accent (pitch) reproduction compared to v1.
  • v3: The highest quality model using diffusion (CFM+DiT) and BigVGAN. Produces the best audio quality but requires more inference time.
  • v2-pro: The latest GPT-SoVITS model. It combines v3's text analysis (T2S) pipeline with v2's fast vocoder, enhanced with Speaker Verification embeddings for better voice cloning. Recommended when you need a balance of speed and quality.
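The trade-offs above can be encoded as data. Below is an illustrative helper (our own names, not part of the ailia API) that picks a GPT-SoVITS version from the comparison table based on whether Japanese accent support and real-time speed are required:

```python
# Illustrative sketch only: encodes the comparison table above.
# PROFILES and pick_version are hypothetical helpers, not ailia APIs.
PROFILES = {
    "v1":     {"accent": False, "speed": "fast", "quality": "base"},
    "v2":     {"accent": True,  "speed": "fast", "quality": "base"},
    "v3":     {"accent": True,  "speed": "slow", "quality": "highest"},
    "v2-pro": {"accent": True,  "speed": "fast", "quality": "high"},
}

def pick_version(need_accent=True, need_realtime=True):
    """Return the first version satisfying the constraints."""
    for name, profile in PROFILES.items():
        if need_accent and not profile["accent"]:
            continue
        if need_realtime and profile["speed"] != "fast":
            continue
        return name
    return "v2-pro"  # fallback: balanced speed and quality

print(pick_version())                    # real-time Japanese -> "v2"
print(pick_version(need_accent=False))   # fastest, accent not needed -> "v1"
```

In practice, choose v1/v2 for real-time use, v3 when audio quality matters most, and v2-pro for a balance of both.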

Japanese speech synthesis

To synthesize Japanese speech, the text must first be converted into phonemes; OpenJtalk, which is integrated into the ailia AI Voice library, performs this conversion.

Voice synthesis in any tone of voice

When using GPT-SoVITS, it is possible to synthesize speech in any voice timbre by providing a reference audio file of about 10 seconds.

User Dictionary

By defining a user dictionary, it is possible to correct the pronunciation of Japanese. It is also possible to use the standard user dictionary of GPT-SoVITS v3.

GPU usage

On Windows and Linux, it is possible to perform inference on the GPU using cuDNN. To use cuDNN, install the CUDA Toolkit and cuDNN from the NVIDIA website:

Install the CUDA Toolkit by following the installer instructions. For cuDNN, after downloading and uncompressing it, add its location to the PATH environment variable. You need to register as an NVIDIA developer in order to download these libraries.
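A quick way to confirm the setup took effect is to check that the CUDA Toolkit is reachable and that a cuDNN library directory is on PATH. The snippet below is our own sanity-check helper, not an ailia tool:

```python
import os
import shutil

# Sanity check (our own helper, not part of ailia):
# verify the CUDA Toolkit compiler is on PATH.
nvcc = shutil.which("nvcc")
print("CUDA Toolkit (nvcc) found:", nvcc is not None)

# cuDNN is loaded via PATH; list PATH directories that contain a
# cudnn DLL/shared library so you can confirm PATH was set correctly.
for d in os.environ.get("PATH", "").split(os.pathsep):
    try:
        if any(n.lower().startswith(("cudnn", "libcudnn")) for n in os.listdir(d)):
            print("cuDNN candidate directory:", d)
    except OSError:
        pass  # skip unreadable or nonexistent PATH entries
```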

Creating a user dictionary

To create a user dictionary, prepare a userdic.csv like the one below. The 0/5 near the end indicates that the word has 5 morae and that the accent is of type 0 (flat, heiban).

超電磁砲,,,1,名詞,固有名詞,一般,*,*,*,超電磁砲,レールガン,レールガン,0/5,*

The user dictionary is converted from the CSV file to a binary dic file using pyopenjtalk:

import pyopenjtalk

# Compile the CSV user dictionary into the binary format used by OpenJTalk.
pyopenjtalk.mecab_dict_index("userdic.csv", "userdic.dic")

The converted dic file can then be loaded by calling the ailiaVoiceSetUserDictionary API.