ailia Speech

A library for speech recognition. It uses the ailia SDK and ailia.audio to perform Speech-to-Text.

Getting Started

Choose your platform and run your first speech-to-text inference.

1. Install

Install the ailia Speech Python package and the librosa audio loader used by the sample.

pip3 install ailia_speech librosa
2. Run a Sample

Download example_ailia_speech.py from ailia-models together with a sample WAV file (saved as ax.wav to match what the script expects), then run it. Whisper weights are downloaded automatically into ./models/.

wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/example_ailia_speech.py
wget -O ax.wav https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/demo.wav
python3 example_ailia_speech.py

System Requirements

ailia Speech runs on desktop and mobile platforms with CPU and GPU acceleration.

Operating Systems

  • Windows 10 / 11
  • macOS 11 or later
  • Linux (Ubuntu 20.04+)
  • iOS 13+ / Android 7+

Languages & Compilers

  • Python 3.6+, Dart / Flutter 3.19+
  • C++17 (VS 2019+ / Xcode 14.2+ / clang)
  • C# / Unity 2021.3.10f1+

Supported Models

  • Whisper Tiny / Base / Small (bundled)
  • Whisper Medium / Large / Large V3
  • Whisper Large V3 Turbo
  • SenseVoice (small)

Audio & Languages

  • Mono float32 PCM (auto-resampled)
  • 99 languages (Whisper)
  • 5 languages (SenseVoice: zh / en / yue / ja / ko)
  • Batch, Live (real-time), and streaming

Features

Capabilities that ship with ailia Speech, beyond basic transcription.

Real-Time & Streaming

  • Live mode for instant previews
  • AI-based VAD silence detection (Silero VAD)
  • Volume-based silent threshold
  • Intermediate callback / interruption

Translation & Multilingual

  • Transcribe and Translate (to English) modes
  • Auto language detection per segment
  • Fixed-language mode via SetLanguage

Accuracy Boosters

  • Prompt context for rare names / terms
  • Character / vocabulary constraints
  • Autocorrection dictionary (CSV)
  • Post-processing (T5 medical, FuguMT)

Output & Memory

  • Confidence score per segment
  • Begin / end timestamps
  • Speaker Diarization (pyannote.audio)
  • Virtual Memory Mode (~55% RAM saving)

Use the API in Your Project

Minimal examples for transcribing audio in your own application.

import librosa
import ailia_speech

# Load the sample as mono float32 PCM; any sample rate is accepted
# and resampled internally.
audio_waveform, sample_rate = librosa.load("ax.wav", mono=True)

speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO,
)
text = speech.transcribe(audio_waveform, sample_rate)
print(text)

API Reference by Platform

  • Python
  • C++
  • Unity
  • Flutter

FAQ

Common questions about ailia Speech.

Which speech recognition models are supported?

ailia Speech ships with Whisper variants (Tiny, Base, Small, Medium, Large, and Large-V3-Turbo) and SenseVoice. Pick a model with the AILIA_SPEECH_MODEL_TYPE_* constants when calling initialize_model().

Does it support real-time / streaming transcription?

Yes. Use transcribe_step() together with set_silent_threshold() to feed audio in chunks and emit partial transcripts as they become available, instead of waiting for the full file with transcribe().

Enable Live mode (AILIA_SPEECH_FLAG_LIVE) to also preview tentative results before the next silence boundary is detected — useful for low-latency UX.
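The chunking side of this loop can be sketched in plain NumPy. The generator below only splits a waveform into fixed-size chunks; the commented calls show where transcribe_step() and set_silent_threshold() (names from this FAQ; their exact Python signatures are assumptions) would consume each chunk.

```python
import numpy as np

def audio_chunks(waveform, chunk_size):
    """Yield successive fixed-size chunks of a mono PCM array."""
    for start in range(0, len(waveform), chunk_size):
        yield waveform[start:start + chunk_size]

# Example: 3 seconds of silence at 16 kHz, fed in 0.5 s chunks.
waveform = np.zeros(48000, dtype=np.float32)
chunks = list(audio_chunks(waveform, 8000))

# In a real application each chunk would go to the streaming API,
# roughly (signatures are assumptions, not the documented binding):
#   speech.set_silent_threshold(...)          # tune silence detection
#   for chunk in audio_chunks(mic_buffer, 8000):
#       partial = speech.transcribe_step(chunk, sample_rate)
```

Feeding short, regular chunks keeps latency low while the library decides segment boundaries from silence.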

How does Translate mode work?

Pass AILIA_SPEECH_TASK_TRANSLATE instead of AILIA_SPEECH_TASK_TRANSCRIBE to transcribe non-English audio and emit the English translation in one pass. Note: Translate mode is not supported on Whisper Large V3 Turbo or SenseVoice.

For the reverse direction (English → Japanese), chain Whisper Translate with the FuguMT post-processing model.

How do I improve accuracy on rare names or domain-specific terms?

Three complementary tools:

  • Prompt: pass a short list of expected terms (e.g. "hardware software") to bias decoding.
  • Constraints: restrict decoding to a character set or vocabulary (e.g. "command1,command2") for voice-command UIs.
  • Autocorrection dictionary: load a CSV of phonetic,correct substitutions via the OpenDictionary API.
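A minimal sketch of building such a CSV with Python's csv module. The two-column phonetic,correct layout follows this FAQ; the example entries are hypothetical, and the loading call in the comment uses the OpenDictionary name from the text with an assumed Python binding.

```python
import csv

# Hypothetical substitution pairs: misrecognized form -> correct term.
entries = [
    ("eye Leah", "ailia"),
    ("sense voice", "SenseVoice"),
]

with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(entries)  # one "phonetic,correct" pair per row

# The dictionary would then be loaded through the OpenDictionary API
# (call name from the FAQ; the Python method name is an assumption):
#   speech.open_dictionary("dictionary.csv")
```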

How do I enable VAD for accurate silence detection?

Use AILIA_SPEECH_VAD_TYPE_SILERO together with the OpenVadFile API. AI-based VAD avoids the "Thank you for your attention" hallucinations that volume-based detection can trigger on near-silent audio.

What input audio format is required?

Mono float32 PCM. The library accepts arbitrary sample rates (commonly produced by librosa.load(..., mono=True)) and resamples internally. Stereo and integer-PCM inputs should be converted before being passed in.
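That conversion can be done in a few lines of NumPy. This is a generic sketch (not part of the ailia Speech API): scale int16 samples into [-1.0, 1.0) and average stereo channels down to mono.

```python
import numpy as np

def to_mono_float32(pcm):
    """Convert int16 PCM (mono shape (n,) or stereo shape (n, 2))
    to the mono float32 waveform the library expects."""
    x = np.asarray(pcm)
    if x.dtype == np.int16:
        # Scale the int16 range [-32768, 32767] into [-1.0, 1.0).
        x = x.astype(np.float32) / 32768.0
    else:
        x = x.astype(np.float32)
    if x.ndim == 2:
        # Average the channels to get mono.
        x = x.mean(axis=1)
    return x

stereo = np.array([[16384, -16384], [0, 32767]], dtype=np.int16)
mono = to_mono_float32(stereo)
```

The result can then be passed to transcribe() alongside its sample rate.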

Where do I place the license file when using C++?

The C++ binding requires ailia.lic next to the runtime libraries:

  • Windows: same folder as ailia.dll (or in cpp/ for the sample)
  • macOS: ~/Library/SHALO/
  • Linux: ~/.shalo/

Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.

How do I enable GPU acceleration?

On macOS / iOS, Metal is used automatically. On Windows / Linux, install CUDA Toolkit and cuDNN, then pass an environment ID corresponding to the GPU back-end (or use auto in the sample CLI). See the CUDA Toolkit / cuDNN Installation Guide for detailed instructions.

Can I run it offline?

Yes, after the first run. Whisper / SenseVoice weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.

How does licensing work?

An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.

Materials