A library for running local LLMs. It loads GGUF model files and makes it easy to add chat functionality to your application.
Choose your platform and run your first local chat completion.
Download example_ailia_llm.py from ailia-models and run it. The script downloads the Gemma 3 4B GGUF file on first run and streams a chat completion. For a multimodal (vision) variant, fetch example_ailia_llm_mtmd.py from the same folder instead.
wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/large_language_model/gemma3/example_ailia_llm.py
python3 example_ailia_llm.py
example_ailia_llm.py
example_ailia_llm_mtmd.py (multimodal)
ailia LLM runs local language models on desktop and mobile platforms. Memory requirements scale with model size and quantization.
Minimal examples for streaming a chat completion in your own application.
import ailia_llm
model = ailia_llm.AiliaLLM()
model.open("gemma-2-2b-it-Q4_K_M.gguf")
messages = [{"role": "user", "content": "What is your name?"}]
for delta in model.generate(messages):
    print(delta, end="", flush=True)
Common questions about ailia LLM.
ailia LLM loads GGUF files, the format used by llama.cpp. You can convert Hugging Face checkpoints to GGUF with the convert_hf_to_gguf.py script bundled with llama.cpp, or download pre-converted GGUF weights from Hugging Face.
ailia LLM supports Llama, Gemma, Mistral, Qwen, Phi, DeepSeek, and the other architectures handled by llama.cpp; compatibility tracks the upstream llama.cpp project.
model.generate(messages) returns an iterator that yields delta strings as the model decodes each token. Iterate over it and append to a buffer (or print directly) for streaming UX.
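The buffering pattern described above can be sketched with a small helper. `collect_stream` is a hypothetical name, and the list of strings in the usage example stands in for the iterator returned by `model.generate(messages)`:

```python
def collect_stream(deltas, on_delta=lambda d: None):
    """Accumulate streamed delta strings into the full response.

    `deltas` is any iterable of strings, e.g. the iterator returned by
    model.generate(messages) in ailia LLM. `on_delta` is called once per
    delta so the UI can update incrementally.
    """
    buf = []
    for delta in deltas:
        on_delta(delta)      # e.g. print(delta, end="", flush=True)
        buf.append(delta)
    return "".join(buf)

# Stand-in for model.generate() output:
full = collect_stream(["Hel", "lo", "!"])
# full == "Hello!"
```

Keeping the accumulated buffer alongside the incremental callback lets you both stream tokens to the screen and store the complete reply in the chat history.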
Memory usage roughly equals the GGUF file size plus the KV cache and intermediate tensors. As a rule of thumb at Q4 quantization: 2B models ≈ 2 GB, 7B ≈ 5 GB, 13B ≈ 9 GB. Use smaller models or more aggressive quantization (e.g. Q4 → Q3) on memory-constrained devices.
Yes. ailia LLM uses Metal on iOS and macOS, and Vulkan on Windows. (Unlike ailia SDK / Speech / Voice, ailia LLM does not require cuDNN.) Inference falls back to CPU when no GPU is available.
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.