ailia Speech

A library for speech recognition. It uses the ailia SDK and ailia.audio to perform Speech-to-Text.

Getting Started

Choose your platform and run your first speech-to-text inference.

1. Install

Install the ailia Speech Python package and the librosa audio loader used by the sample.

pip3 install ailia_speech librosa
2. Run a Sample

Download example_ailia_speech.py from ailia-models together with a sample WAV file (saved as ax.wav to match what the script expects), then run it. Whisper weights are downloaded automatically into ./models/.

wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/example_ailia_speech.py
wget -O ax.wav https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/demo.wav
python3 example_ailia_speech.py

System Requirements

ailia Speech runs on desktop and mobile platforms with CPU and GPU acceleration.

Operating Systems

  • Windows 10 / 11
  • macOS 11 or later
  • Linux (Ubuntu 20.04+)
  • iOS 13+ / Android 7+

Languages & Compilers

  • Python 3.6+, Dart / Flutter 3.19+
  • C++17 (VS 2019+ / Xcode 14.2+ / clang)
  • C# / Unity 2021.3.10f1+

Supported Models

  • Whisper Tiny / Base / Small (bundled)
  • Whisper Medium / Large / Large V3
  • Whisper Large V3 Turbo
  • SenseVoice (small)

Audio & Languages

  • Mono float32 PCM (auto-resampled)
  • 99 languages (Whisper)
  • 5 languages (SenseVoice: zh / en / yue / ja / ko)
  • Batch, Live (real-time), and streaming

Features

Capabilities that ship with ailia Speech, beyond basic transcription.

Real-Time & Streaming

  • Live mode for instant previews
  • AI-based VAD silence detection (Silero VAD)
  • Volume-based silent threshold
  • Intermediate callback / interruption

Translation & Multilingual

  • Transcribe and Translate (to English) modes
  • Auto language detection per segment
  • Fixed-language mode via SetLanguage

Accuracy Boosters

  • Prompt context for rare names / terms
  • Character / vocabulary constraints
  • Autocorrection dictionary (CSV)
  • Post-processing (T5 medical, FuguMT)

Output & Memory

  • Confidence score per segment
  • Begin / end timestamps
  • Speaker Diarization (pyannote.audio)
  • Virtual Memory Mode (~55% RAM saving)

Use the API in Your Project

Minimal examples for transcribing audio in your own application.

import librosa
import ailia_speech

# Load the sample as mono float32 PCM; any sample rate is accepted
# and resampled internally.
audio_waveform, sample_rate = librosa.load("ax.wav", mono=True)

speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO,
)
text = speech.transcribe(audio_waveform, sample_rate)
print(text)

API Reference by Platform

  • Python
  • C++
  • Unity
  • Flutter

FAQ

Common questions about ailia Speech.

Which speech recognition models are supported?

ailia Speech ships with Whisper variants (Tiny, Base, Small, Medium, Large, and Large-V3-Turbo) and SenseVoice. Pick a model with the AILIA_SPEECH_MODEL_TYPE_* constants when calling initialize_model().

Does it support real-time / streaming transcription?

Yes. Use transcribe_step() together with set_silent_threshold() to feed audio in chunks and emit partial transcripts as they become available, instead of waiting for the full file with transcribe().

Enable Live mode (AILIA_SPEECH_FLAG_LIVE) to also preview tentative results before the next silence boundary is detected — useful for low-latency UX.
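The chunking side of this loop can be sketched in plain NumPy. The generator below only splits a waveform into fixed-size chunks; the commented calls show where transcribe_step() and set_silent_threshold() (names from this FAQ; their exact Python signatures are assumptions) would consume each chunk.

```python
import numpy as np

def audio_chunks(waveform, chunk_size):
    """Yield successive fixed-size chunks of a mono PCM array."""
    for start in range(0, len(waveform), chunk_size):
        yield waveform[start:start + chunk_size]

# Example: 3 seconds of silence at 16 kHz, fed in 0.5 s chunks.
waveform = np.zeros(48000, dtype=np.float32)
chunks = list(audio_chunks(waveform, 8000))

# In a real application each chunk would go to the streaming API,
# roughly (signatures are assumptions, not the documented binding):
#   speech.set_silent_threshold(...)          # tune silence detection
#   for chunk in audio_chunks(mic_buffer, 8000):
#       partial = speech.transcribe_step(chunk, sample_rate)
```

Feeding short, regular chunks keeps latency low while the library decides segment boundaries from silence.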

How does Translate mode work?

Pass AILIA_SPEECH_TASK_TRANSLATE instead of AILIA_SPEECH_TASK_TRANSCRIBE to transcribe non-English audio and emit the English translation in one pass. Note: Translate mode is not supported on Whisper Large V3 Turbo or SenseVoice.

For the reverse direction (English → Japanese), chain Whisper Translate with the FuguMT post-processing model.

How do I improve accuracy on rare names or domain-specific terms?

Three complementary tools:

  • Prompt: pass a short list of expected terms (e.g. "hardware software") to bias decoding.
  • Constraints: restrict decoding to a character set or vocabulary (e.g. "command1,command2") for voice-command UIs.
  • Autocorrection dictionary: load a CSV of phonetic,correct substitutions via the OpenDictionary API.
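A minimal sketch of building such a CSV with Python's csv module. The two-column phonetic,correct layout follows this FAQ; the example entries are hypothetical, and the loading call in the comment uses the OpenDictionary name from the text with an assumed Python binding.

```python
import csv

# Hypothetical substitution pairs: misrecognized form -> correct term.
entries = [
    ("eye Leah", "ailia"),
    ("sense voice", "SenseVoice"),
]

with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(entries)  # one "phonetic,correct" pair per row

# The dictionary would then be loaded through the OpenDictionary API
# (call name from the FAQ; the Python method name is an assumption):
#   speech.open_dictionary("dictionary.csv")
```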

How do I enable VAD for accurate silence detection?

Use AILIA_SPEECH_VAD_TYPE_SILERO together with the OpenVadFile API. AI-based VAD avoids the "Thank you for your attention" hallucinations that volume-based detection can trigger on near-silent audio.

What input audio format is required?

Mono float32 PCM. The library accepts arbitrary sample rates (commonly produced by librosa.load(..., mono=True)) and resamples internally. Stereo and integer-PCM inputs should be converted before being passed in.
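That conversion can be done in a few lines of NumPy. This is a generic sketch (not part of the ailia Speech API): scale int16 samples into [-1.0, 1.0) and average stereo channels down to mono.

```python
import numpy as np

def to_mono_float32(pcm):
    """Convert int16 PCM (mono shape (n,) or stereo shape (n, 2))
    to the mono float32 waveform the library expects."""
    x = np.asarray(pcm)
    if x.dtype == np.int16:
        # Scale the int16 range [-32768, 32767] into [-1.0, 1.0).
        x = x.astype(np.float32) / 32768.0
    else:
        x = x.astype(np.float32)
    if x.ndim == 2:
        # Average the channels to get mono.
        x = x.mean(axis=1)
    return x

stereo = np.array([[16384, -16384], [0, 32767]], dtype=np.int16)
mono = to_mono_float32(stereo)
```

The result can then be passed to transcribe() alongside its sample rate.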

Where do I place the license file when using C++?

The C++ binding requires ailia.lic next to the runtime libraries:

  • Windows: same folder as ailia.dll (or in cpp/ for the sample)
  • macOS: ~/Library/SHALO/
  • Linux: ~/.shalo/

Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.

How do I enable GPU acceleration?

On macOS / iOS, Metal is used automatically. On Windows / Linux, install CUDA Toolkit and cuDNN, then pass an environment ID corresponding to the GPU back-end (or use auto in the sample CLI). See the CUDA Toolkit / cuDNN Installation Guide for detailed instructions.

Can I run it offline?

Yes, after the first run. Whisper / SenseVoice weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.

How does licensing work?

An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.

Materials