A library for speech recognition. It uses the ailia SDK and ailia.audio to perform Speech-to-Text.
Choose your platform and run your first speech-to-text inference.
Install the ailia Speech Python package and the librosa audio loader used by the sample.
pip3 install ailia_speech librosa
Download example_ailia_speech.py from ailia-models together with a sample WAV file (saved as ax.wav to match what the script expects), then run it. Whisper weights are downloaded automatically into ./models/.
wget https://raw.githubusercontent.com/axinc-ai/ailia-models/master/audio_processing/whisper/example_ailia_speech.py
wget -O ax.wav https://raw.githubusercontent.com/axinc-ai/ailia-models/master/audio_processing/whisper/demo.wav
python3 example_ailia_speech.py
ailia Speech runs on desktop and mobile platforms with CPU and GPU acceleration.
Beyond basic transcription — capabilities that ship with ailia Speech.
Minimal examples for transcribing audio in your own application.
import librosa
import ailia_speech

# Weights are downloaded into model_path on first use
speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO,
)

# Mono float32 PCM at any sample rate; ailia Speech resamples internally
audio_waveform, sample_rate = librosa.load("ax.wav", mono=True)
text = speech.transcribe(audio_waveform, sample_rate)
print(text)
Common questions about ailia Speech.
ailia Speech ships with Whisper variants (Tiny, Base, Small, Medium, Large, and Large-V3-Turbo) and SenseVoice. Pick a model with the AILIA_SPEECH_MODEL_TYPE_* constants when calling initialize_model().
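For example, to trade accuracy for speed you might pick the Small variant instead of Large-V3-Turbo. This is a minimal sketch; the exact constant name below is an assumption that follows the AILIA_SPEECH_MODEL_TYPE_* naming pattern:

speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",
    # Constant name assumed from the AILIA_SPEECH_MODEL_TYPE_* pattern
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL,
)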
Yes. Use transcribe_step() together with set_silent_threshold() to feed audio in chunks and emit partial transcripts as they become available, instead of waiting for the full file with transcribe().
Enable Live mode (AILIA_SPEECH_FLAG_LIVE) to also preview tentative results before the next silence boundary is detected — useful for low-latency UX.
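A minimal streaming sketch, reusing speech, audio_waveform, and sample_rate from the example above; the transcribe_step() argument list and return value, and the set_silent_threshold() parameter order, are assumptions:

# Parameter order (threshold level, min speech sec, min silence sec) is assumed
speech.set_silent_threshold(0.01, 0.25, 0.1)
chunk_size = sample_rate // 10  # feed roughly 100 ms of audio per step
for start in range(0, len(audio_waveform), chunk_size):
    chunk = audio_waveform[start:start + chunk_size]
    # Assumed to return a partial transcript once a silence boundary is reached
    partial = speech.transcribe_step(chunk, sample_rate)
    if partial:
        print(partial)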
Pass AILIA_SPEECH_TASK_TRANSLATE instead of AILIA_SPEECH_TASK_TRANSCRIBE to transcribe non-English audio and emit the English translation in one pass. Note: Translate mode is not supported on Whisper Large V3 Turbo or SenseVoice.
For the reverse direction (English → Japanese), chain Whisper Translate with the FuguMT post-processing model.
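A sketch of one-pass translation, assuming the task is selected at initialization; the task keyword argument and the Medium constant name are assumptions, while AILIA_SPEECH_TASK_TRANSLATE comes from the answer above:

speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",
    # Medium rather than Large-V3-Turbo, which does not support Translate
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_MEDIUM,
    # Passing the task here is an assumption
    task=ailia_speech.AILIA_SPEECH_TASK_TRANSLATE,
)
english_text = speech.transcribe(audio_waveform, sample_rate)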
Three complementary tools (a combined sketch follows the list):
Prompt — pass a short list of expected terms (e.g. "hardware software") to bias decoding.
Constraints — restrict decoding to a character set or vocabulary (e.g. "command1,command2") for voice-command UIs.
Autocorrection dictionary — load a CSV of phonetic,correct substitutions via the OpenDictionary API.
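A combined sketch of all three; the Python method names set_prompt(), set_constraint(), and open_dictionary(), and the constants passed to them, are assumptions mirroring the C API names:

# Bias decoding toward expected terms (assumed wrapper for the Prompt API)
speech.set_prompt("hardware software")
# Restrict decoding to a fixed vocabulary (assumed wrapper for the Constraints API)
speech.set_constraint("command1,command2", ailia_speech.AILIA_SPEECH_CONSTRAINT_WORDS)
# Load phonetic,correct substitution pairs (assumed wrapper for OpenDictionary)
speech.open_dictionary("corrections.csv", ailia_speech.AILIA_SPEECH_DICTIONARY_TYPE_REPLACE)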
Use AILIA_SPEECH_VAD_TYPE_SILERO together with the OpenVadFile API. AI-based VAD avoids the "Thank you for your attention" hallucinations that volume-based detection can trigger on near-silent audio.
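A one-line sketch of enabling Silero VAD; the open_vad() wrapper name and the model file name are assumptions (the underlying API is OpenVadFile):

# Assumed Python wrapper around the OpenVadFile API
speech.open_vad("silero_vad.onnx", ailia_speech.AILIA_SPEECH_VAD_TYPE_SILERO)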
Mono float32 PCM. The library accepts arbitrary sample rates and resamples internally; librosa.load(..., mono=True) produces this format directly. Stereo and integer-PCM inputs should be converted before being passed in.
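For instance, interleaved int16 stereo PCM can be converted to the expected layout like this (raw_bytes stands in for your captured buffer):

import numpy as np

# Interleaved int16 stereo -> mono float32 in [-1, 1]
pcm = np.frombuffer(raw_bytes, dtype=np.int16).reshape(-1, 2)
audio_waveform = pcm.mean(axis=1).astype(np.float32) / 32768.0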
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
On macOS / iOS, Metal is used automatically. On Windows / Linux, install CUDA Toolkit and cuDNN, then pass an environment ID corresponding to the GPU back-end (or use auto in the sample CLI). See the CUDA Toolkit / cuDNN Installation Guide for detailed instructions.
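To check whether a GPU back-end is visible, the core ailia package can look one up; how the resulting id is handed to ailia Speech (for example, an env_id argument to initialize_model()) is an assumption:

import ailia

# Picks a GPU-capable environment id when one is available
env_id = ailia.get_gpu_environment_id()
print(env_id)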
Yes, after the first run. Whisper / SenseVoice weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.