A library for encoding text into NLP tokens and decoding tokens back into text. Supports tokenization without Python transformers.
Choose your platform and tokenize without depending on Python transformers.
Install the ailia Tokenizer Python package from PyPI.
pip3 install ailia_tokenizer
Clone ailia-models and run the multilingual-MiniLMv2 zero-shot classification sample. Its tokenizer/ directory ships sentencepiece.bpe.model + tokenizer_config.json, the file layout XLMRobertaTokenizer.from_pretrained() expects.
git clone https://github.com/ailia-ai/ailia-models.git
cd ailia-models/natural_language_processing/multilingual-minilmv2
pip3 install -r requirements.txt
python3 multilingual-minilmv2.py
tokenizer/ file layout
ailia Tokenizer is a cross-platform replacement for the Python transformers tokenizer suite, usable from C++, Unity, and Flutter without a Python runtime.
Twelve tokenizer types with Hugging Face-compatible APIs.
Minimal examples for encoding and decoding text in your own application.
import ailia_tokenizer
# Load from a directory containing vocab.txt and tokenizer_config.json
tok = ailia_tokenizer.BertTokenizer.from_pretrained("./tokenizer/")
ids = tok.encode("Hello, world!")  # text -> token IDs
text = tok.decode(ids)  # token IDs -> text
Common questions about ailia Tokenizer.
Why not just use transformers? transformers is Python-only. ailia Tokenizer ships the same tokenizers as a native library callable from C++, Unity (C#), Flutter (Dart), JNI, and Python, letting you tokenize on iOS, Android, and embedded targets without bundling a Python runtime.
The Python API mirrors transformers (from_pretrained(), encode(), decode(), etc.) so existing code typically requires only an import change.
Twelve types covering the most common modern model families: Whisper, CLIP, XLM-RoBERTa, Marian, BERT (English), BERT Japanese WordPiece, BERT Japanese Character, T5, RoBERTa, GPT-2, Llama, and Gemma.
encode() matches tokenizer(sents, split_special_tokens=True) (special tokens encoded as text, no padding/truncation).
encodeWithSpecialTokens() matches tokenizer(sents) (special tokens encoded as IDs).
decode() matches tokenizer.decode(ids, skip_special_tokens=True); decodeWithSpecialTokens() keeps the special tokens.
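As a rough illustration of these conventions (a toy sketch, not the library's implementation), the snippet below mimics the special-token behavior with an invented two-word vocabulary. All token IDs, names, and helpers here are made up for illustration:

```python
# Toy model of the special-token conventions described above.
# The vocabulary, IDs, and function names are invented; they are
# NOT the ailia Tokenizer API.
CLS, SEP = 101, 102                      # made-up special-token IDs
VOCAB = {"hello": 7592, "world": 2088}   # made-up word IDs
ID2TOK = {v: k for k, v in VOCAB.items()}
ID2TOK.update({CLS: "[CLS]", SEP: "[SEP]"})

def encode(words):
    # Like encode(): no special-token IDs are added.
    return [VOCAB[w] for w in words]

def encode_with_special_tokens(words):
    # Like encodeWithSpecialTokens(): wrap the sequence in special-token IDs.
    return [CLS] + encode(words) + [SEP]

def decode(ids, skip_special_tokens=True):
    # Like decode() (skips specials) vs. decodeWithSpecialTokens() (keeps them).
    specials = {CLS, SEP}
    kept = [i for i in ids if not (skip_special_tokens and i in specials)]
    return " ".join(ID2TOK[i] for i in kept)

ids = encode_with_special_tokens(["hello", "world"])
# ids == [101, 7592, 2088, 102]
# decode(ids) == "hello world"
# decode(ids, skip_special_tokens=False) == "[CLS] hello world [SEP]"
```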
Whisper / CLIP / GPT-2 are self-contained. Other tokenizers need their model files placed alongside the tokenizer:
SentencePiece (T5, XLM-RoBERTa, Marian, Llama, Gemma): spiece.model / tokenizer.model / source.spm.
BERT (English): vocab.txt + tokenizer_config.json.
BERT Japanese: ipadic dictionary + vocab.txt (NFKC normalization is automatic).
RoBERTa: vocab.json + merges.txt.
Python: place all required files in one directory and pass that directory path to from_pretrained() (e.g. BertTokenizer.from_pretrained("./tokenizer/")).
C / C++ / Unity / Flutter / JNI: open each file individually with the corresponding OpenModelFile / OpenVocabFile / OpenMergeFile / OpenDictionaryFile API.
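When loading from Python, a short helper can verify that the files listed above are present before calling from_pretrained(). The helper and its dictionary are hypothetical conveniences, not part of ailia Tokenizer; only the BERT and RoBERTa file sets are included since the SentencePiece model filename varies by model:

```python
from pathlib import Path

# Required-file sets taken from the layout described above. This mapping
# and the helper are hypothetical, not part of the ailia Tokenizer API.
REQUIRED_FILES = {
    "bert": ["vocab.txt", "tokenizer_config.json"],
    "roberta": ["vocab.json", "merges.txt"],
}

def missing_files(tokenizer_dir, kind):
    """Return the required files that are absent from tokenizer_dir."""
    d = Path(tokenizer_dir)
    return [name for name in REQUIRED_FILES[kind] if not (d / name).is_file()]
```

Usage: if missing_files("./tokenizer/", "bert") returns a non-empty list, report the missing files instead of letting from_pretrained() fail later.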
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.