A speech recognition library. It uses the ailia SDK and ailia.audio to perform speech-to-text.
Choose your platform and run your first speech-to-text inference.
Install the ailia Speech Python package and the librosa audio loader used by the sample.
pip3 install ailia_speech librosa
Haven't installed Python or git yet? Start with Setting up your Python environment (Windows / Mac / Linux).
Download example_ailia_speech.py from ailia-models together with a sample WAV file (saved as ax.wav to match what the script expects), then run it. Whisper weights are downloaded automatically into ./models/.
wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/example_ailia_speech.py
wget -O ax.wav https://raw.githubusercontent.com/ailia-ai/ailia-models/master/audio_processing/whisper/demo.wav
python3 example_ailia_speech.py
On Windows, use python instead of python3.
ailia Speech runs on desktop and mobile platforms with CPU and GPU acceleration.
Beyond basic transcription — capabilities that ship with ailia Speech.
Minimal examples for transcribing audio in your own application. The API is designed around streaming microphone input, but wrapper abstractions differ by language, so the API model is described per platform tab below.
The Python wrapper hides the internal state machine. Just hand the full audio buffer to transcribe() — push / finalize / drain are handled internally.
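import ailia_speech
import librosa

# Load the sample as mono float32 PCM; librosa.load returns (waveform, sample_rate).
audio_waveform, sample_rate = librosa.load("ax.wav", mono=True)

speech = ailia_speech.Whisper()
speech.initialize_model(
    model_path="./models/",  # weights are downloaded here on first use
    model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO,
)
text = speech.transcribe(audio_waveform, sample_rate)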
Common questions about ailia Speech.
ailia Speech ships with Whisper variants (Tiny, Base, Small, Medium, Large, and Large-V3-Turbo) and SenseVoice. Pick a model with the AILIA_SPEECH_MODEL_TYPE_* constants when calling initialize_model().
Yes. Use transcribe_step() together with set_silent_threshold() to feed audio in chunks and emit partial transcripts as they become available, instead of waiting for the full file with transcribe().
Enable Live mode (AILIA_SPEECH_FLAG_LIVE) to also preview tentative results before the next silence boundary is detected — useful for low-latency UX.
Internally, buffered is set whenever either (1) 30 seconds of audio have accumulated, or (2) VAD detects a speech boundary.
For an AI-agent flow that streams from a microphone and ends after one utterance, keep calling ailiaSpeechPushInputData in a loop while polling ailiaSpeechBuffered. The first time buffered == 1 (VAD has detected the end of the utterance), call ailiaSpeechFinalizeInputData to close mic input. This drives ailiaSpeechComplete to 1 and exits the transcribe loop.
Enable VAD via ailiaSpeechOpenVadFile with AILIA_SPEECH_VAD_TYPE_SILERO.
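The single-utterance flow described above can be sketched in Python. FakeSpeech is a hypothetical stand-in, not part of ailia Speech: it only mimics the buffered / finalize / complete flags, using a toy amplitude check in place of real VAD, so the control flow can be run without the library or a microphone. With the real library, each method would correspond to the matching ailiaSpeech* call.

```python
class FakeSpeech:
    """Hypothetical stand-in for a speech instance (NOT the ailia Speech API)."""

    def __init__(self):
        self._pending = []        # chunks pushed but not yet transcribed
        self._heard_voice = False
        self._boundary = False    # toy "VAD": silence right after speech
        self._finalized = False

    def push_input_data(self, chunk):
        if self._finalized:
            raise RuntimeError("input was already finalized")
        self._pending.append(chunk)
        if any(abs(sample) > 0.1 for sample in chunk):
            self._heard_voice = True
        elif self._heard_voice:
            self._boundary = True  # end of utterance detected

    def buffered(self):
        ready = self._boundary or self._finalized
        return 1 if ready and self._pending else 0

    def finalize_input_data(self):
        self._finalized = True

    def transcribe(self):
        text = f"<text for {len(self._pending)} chunks>"
        self._pending.clear()
        self._boundary = False
        return text

    def complete(self):
        return 1 if self._finalized and not self._pending else 0


def capture_one_utterance(speech, mic_chunks):
    """Push mic chunks until the first boundary, then finalize and drain."""
    texts = []
    mic_open = True
    chunks = iter(mic_chunks)
    while not speech.complete():
        if mic_open:
            chunk = next(chunks, None)
            if chunk is None:
                # The microphone stream ended before any boundary was seen.
                speech.finalize_input_data()
                mic_open = False
            else:
                speech.push_input_data(chunk)
        if speech.buffered():
            if mic_open:
                # First boundary: stop accepting mic input.
                speech.finalize_input_data()
                mic_open = False
            texts.append(speech.transcribe())
    return texts


if __name__ == "__main__":
    mic = [[0.5, 0.4], [0.3, 0.2], [0.0, 0.0], [0.9, 0.9]]
    print(capture_one_utterance(FakeSpeech(), mic))
```

The last chunk in the usage example is never read: once the first boundary is seen, the loop finalizes input and only drains what was already pushed.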
Pass AILIA_SPEECH_TASK_TRANSLATE instead of AILIA_SPEECH_TASK_TRANSCRIBE to transcribe non-English audio and emit the English translation in one pass. Note: Translate mode is not supported on Whisper Large V3 Turbo or SenseVoice.
For the reverse direction (English → Japanese), chain Whisper Translate with the FuguMT post-processing model.
Three complementary tools:
Prompt — pass a short list of expected terms (e.g. "hardware software") to bias decoding.
Constraints — restrict decoding to a character set or vocabulary (e.g. "command1,command2") for voice-command UIs.
Autocorrection dictionary — load a CSV of phonetic,correct substitutions via the OpenDictionary API.
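To make the dictionary idea concrete, here is a minimal sketch of what a phonetic,correct substitution table does to a transcript. It is an illustration only: the sample entries are invented, the helper names load_corrections and apply_corrections are hypothetical, and the exact CSV layout the library expects (header row, encoding, quoting) as well as its matching logic are not specified here and may differ.

```python
import csv
import io


def load_corrections(csv_text):
    """Parse 'phonetic,correct' rows into a list of (phonetic, correct) pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        phonetic, correct = (cell.strip() for cell in row)
        if phonetic:
            pairs.append((phonetic, correct))
    return pairs


def apply_corrections(text, pairs):
    """Apply plain substring substitutions, longest key first."""
    # Longest-first ordering keeps a short key from clobbering a longer one.
    for phonetic, correct in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(phonetic, correct)
    return text


# Invented sample entries, for illustration only.
SAMPLE_CSV = "eye lia,ailia\nwisper,Whisper\n"

if __name__ == "__main__":
    pairs = load_corrections(SAMPLE_CSV)
    print(apply_corrections("eye lia speech runs wisper", pairs))
```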
Use AILIA_SPEECH_VAD_TYPE_SILERO together with the OpenVadFile API. AI-based VAD avoids the "Thank you for your attention" hallucinations that volume-based detection can trigger on near-silent audio.
Mono float32 PCM, as commonly produced by librosa.load(..., mono=True). The library accepts arbitrary sample rates and resamples internally. Convert stereo and integer-PCM inputs before passing them in.
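If the audio arrives as interleaved 16-bit integer PCM (for example, straight from a capture device), the conversion can be sketched as below. This is a standard-library-only illustration; the helper name is hypothetical, and in practice you would probably do the same arithmetic with NumPy and pass a float32 array to the library.

```python
from array import array


def int16_interleaved_to_mono_float(samples, channels):
    """Average interleaved int16 channels into mono floats in [-1.0, 1.0)."""
    if channels < 1:
        raise ValueError("channels must be >= 1")
    if len(samples) % channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    mono = array("f")  # 32-bit float storage
    for start in range(0, len(samples), channels):
        frame = samples[start:start + channels]
        # Average the channels, then scale int16 full range to [-1.0, 1.0).
        mono.append(sum(frame) / (channels * 32768.0))
    return mono


if __name__ == "__main__":
    stereo = [16384, 16384, -32768, -32768]  # two stereo frames (L, R, L, R)
    print(list(int16_interleaved_to_mono_float(stereo, channels=2)))
```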
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
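For deployment scripts around the native C++ binding, the locations listed above can be captured in a small helper. This is a sketch: the function name is hypothetical, the platform strings follow Python's sys.platform values, and on Windows the caller has to supply the folder that holds ailia.dll.

```python
from pathlib import PurePosixPath, PureWindowsPath


def expected_license_path(platform, home, dll_dir=None):
    """Return where ailia.lic is expected for the native C++ binding."""
    if platform.startswith("win"):
        if dll_dir is None:
            raise ValueError("dll_dir (the folder holding ailia.dll) is required on Windows")
        return PureWindowsPath(dll_dir) / "ailia.lic"
    if platform == "darwin":
        return PurePosixPath(home) / "Library" / "SHALO" / "ailia.lic"
    if platform.startswith("linux"):
        return PurePosixPath(home) / ".shalo" / "ailia.lic"
    raise ValueError(f"no documented license location for platform {platform!r}")


if __name__ == "__main__":
    print(expected_license_path("linux", "/home/me"))
```

A real check would pass sys.platform and the user's home directory, then test whether the returned path exists before launching the application.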
On macOS / iOS, Metal is used automatically. On Windows / Linux, install CUDA Toolkit and cuDNN, then pass an environment ID corresponding to the GPU back-end (or use auto in the sample CLI). See the CUDA Toolkit / cuDNN Installation Guide for detailed instructions.
Yes, after the first run. Whisper / SenseVoice weights are downloaded into the directory passed to initialize_model(model_path=...) on first use, and the evaluation license is fetched automatically. Subsequent runs work without an internet connection.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.
Model deep dives, release notes, and tutorials from the ailia tech blog.