ailia LLM

A library for running local LLMs. It loads GGUF model files and makes it easy to add chat functionality to your application.

Getting Started

Choose your platform and run your first local chat completion.

1. Install

Install the ailia LLM Python package from PyPI.

pip3 install ailia_llm
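
To verify the installation, import the module (a quick sanity check; it exits silently on success):

python3 -c "import ailia_llm"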
2. Run a Sample

Download example_ailia_llm.py from ailia-models and run it. The script downloads the Gemma 3 4B GGUF file on first run and streams a chat completion. For a multimodal (vision) variant, fetch example_ailia_llm_mtmd.py from the same folder instead.

wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/large_language_model/gemma3/example_ailia_llm.py
python3 example_ailia_llm.py
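
For the multimodal sample, the commands follow the same pattern (assuming the same GitHub folder as the script above):

wget https://raw.githubusercontent.com/ailia-ai/ailia-models/master/large_language_model/gemma3/example_ailia_llm_mtmd.py
python3 example_ailia_llm_mtmd.py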

System Requirements

ailia LLM runs local language models on desktop and mobile platforms. Memory requirements scale with model size and quantization.

Operating Systems

  • Windows 10 / 11
  • macOS 11 or later
  • Linux (Ubuntu 20.04+)
  • iOS 13+ / Android 7+

Languages & Compilers

  • Python 3.6+, Dart / Flutter 3.19+
  • C++17 (VS 2019+ / Xcode 14.2+ / clang)
  • C# / Unity 2021.3.10f1+
  • GPU: Metal (iOS / macOS), Vulkan (Windows)

Model Format

  • GGUF (llama.cpp compatible)
  • Llama / Gemma / Mistral / Qwen
  • Phi / DeepSeek and more
  • Q4 / Q5 / Q8 quantization

Memory Guidance

  • 2B Q4: ~2 GB RAM
  • 7B Q4: ~5 GB RAM
  • 13B Q4: ~9 GB RAM
  • Streaming token-by-token output

Use the API in Your Project

Minimal examples for streaming a chat completion in your own application.

import ailia_llm

# Load a GGUF model from disk.
model = ailia_llm.AiliaLLM()
model.open("gemma-2-2b-it-Q4_K_M.gguf")

# generate() yields delta strings as each token is decoded.
messages = [{"role": "user", "content": "What is your name?"}]
for delta in model.generate(messages):
    print(delta, end="")

API Reference by Platform

  • Python
  • C++
  • Unity
  • Flutter

FAQ

Common questions about ailia LLM.

What model formats are supported?

ailia LLM loads GGUF files, the format used by llama.cpp. You can convert Hugging Face checkpoints to GGUF with the convert_hf_to_gguf.py script bundled with llama.cpp, or download pre-converted GGUF weights from Hugging Face.
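
For example, converting a local checkpoint (a sketch; assumes a llama.cpp checkout with its Python requirements installed, and uses llama.cpp's llama-quantize tool for the final Q4 step; ./my-hf-model is a placeholder path):

# convert the HF checkpoint to an f16 GGUF
python3 llama.cpp/convert_hf_to_gguf.py ./my-hf-model --outfile my-model-f16.gguf --outtype f16
# quantize to Q4_K_M for lower memory use
llama.cpp/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M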

Which model architectures are supported?

Llama, Gemma, Mistral, Qwen, Phi, DeepSeek, and other architectures supported by llama.cpp. Compatibility tracks the upstream llama.cpp project.

How do I stream tokens as they are generated?

model.generate(messages) returns an iterator that yields delta strings as the model decodes each token. Iterate over it and append to a buffer (or print directly) for streaming UX.
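
For example, building on the snippet in "Use the API in Your Project" (a minimal sketch; model and messages are the objects created there):

chunks = []
for delta in model.generate(messages):
    print(delta, end="", flush=True)  # render tokens as they arrive
    chunks.append(delta)              # accumulate the full reply
reply = "".join(chunks)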

How much memory do I need?

Memory usage roughly equals the GGUF file size plus the KV cache and intermediate tensors. As a rule of thumb at Q4 quantization: 2B models ≈ 2 GB, 7B ≈ 5 GB, 13B ≈ 9 GB. Use smaller models or more aggressive quantization (Q4 → Q3) on memory-constrained devices.
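
As a back-of-envelope calculation (illustrative constants, not measured values; Q4_K_M averages roughly 4.5 bits per weight):

def estimate_ram_gb(params_billion, bits_per_weight=4.5, overhead=0.25):
    # weights (≈ GGUF file size) + KV cache/activations + fixed runtime cost
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead) + 0.5

print(round(estimate_ram_gb(7), 1))  # ≈ 5 GB, matching the rule of thumb above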

Does it support GPU acceleration?

Yes. ailia LLM uses Metal on iOS and macOS, and Vulkan on Windows. (Unlike ailia SDK / Speech / Voice, ailia LLM does not require cuDNN.) Inference falls back to CPU when no GPU is available.

Where do I place the license file when using C++?

The C++ binding requires ailia.lic next to the runtime libraries:

  • Windows: same folder as ailia.dll (or in cpp/ for the sample)
  • macOS: ~/Library/SHALO/
  • Linux: ~/.shalo/
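
For example, on Linux (a minimal sketch using the path above):

mkdir -p ~/.shalo && cp ailia.lic ~/.shalo/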

Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.

How does licensing work?

An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.

Materials