A library for encoding text into NLP tokens and decoding tokens back into text. Supports tokenization without Python transformers.
Choose your platform and tokenize without depending on Python transformers.
Install the ailia Tokenizer Python package from PyPI.
pip3 install ailia_tokenizer
Clone ailia-models and run the multilingual-MiniLMv2 zero-shot classification sample. Its tokenizer/ directory ships sentencepiece.bpe.model + tokenizer_config.json, the file layout XLMRobertaTokenizer.from_pretrained() expects.
git clone https://github.com/ailia-ai/ailia-models.git
cd ailia-models/natural_language_processing/multilingual-minilmv2
pip3 install -r requirements.txt
python3 multilingual-minilmv2.py
tokenizer/ file layout
ailia Tokenizer is a cross-platform replacement for the Python transformers tokenizer suite, usable from C++, Unity, and Flutter without a Python runtime.
Twelve tokenizer types with Hugging Face-compatible APIs.
Minimal examples for encoding and decoding text in your own application.
import ailia_tokenizer
# Load from a directory containing vocab.txt and tokenizer_config.json
tok = ailia_tokenizer.BertTokenizer.from_pretrained("./tokenizer/")
ids = tok.encode("Hello, world!")  # text -> token IDs
text = tok.decode(ids)  # token IDs -> text
Common questions about ailia Tokenizer.
Why not just use transformers? transformers is Python-only. ailia Tokenizer ships the same tokenizers as a native library callable from C++, Unity (C#), Flutter (Dart), JNI, and Python, letting you tokenize on iOS, Android, and embedded targets without bundling a Python runtime.
The Python API mirrors transformers (from_pretrained(), encode(), decode(), etc.) so existing code typically requires only an import change.
Twelve types covering the most common modern model families: Whisper, CLIP, XLM-RoBERTa, Marian, BERT (English), BERT Japanese WordPiece, BERT Japanese Character, T5, RoBERTa, GPT-2, Llama, and Gemma.
encode() matches tokenizer(sents, split_special_tokens=True) (special tokens encoded as text, no padding/truncation).
encodeWithSpecialTokens() matches tokenizer(sents) (special tokens encoded as IDs).
decode() matches tokenizer.decode(ids, skip_special_tokens=True); decodeWithSpecialTokens() keeps the special tokens.
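As a rough illustration of these conventions (a toy sketch, not the library's implementation), the snippet below mimics the special-token behavior with an invented two-word vocabulary. All token IDs, names, and helpers here are made up for illustration:

```python
# Toy model of the special-token conventions described above.
# The vocabulary, IDs, and function names are invented; they are
# NOT the ailia Tokenizer API.
CLS, SEP = 101, 102                      # made-up special-token IDs
VOCAB = {"hello": 7592, "world": 2088}   # made-up word IDs
ID2TOK = {v: k for k, v in VOCAB.items()}
ID2TOK.update({CLS: "[CLS]", SEP: "[SEP]"})

def encode(words):
    # Like encode(): no special-token IDs are added.
    return [VOCAB[w] for w in words]

def encode_with_special_tokens(words):
    # Like encodeWithSpecialTokens(): wrap the sequence in special-token IDs.
    return [CLS] + encode(words) + [SEP]

def decode(ids, skip_special_tokens=True):
    # Like decode() (skips specials) vs. decodeWithSpecialTokens() (keeps them).
    specials = {CLS, SEP}
    kept = [i for i in ids if not (skip_special_tokens and i in specials)]
    return " ".join(ID2TOK[i] for i in kept)

ids = encode_with_special_tokens(["hello", "world"])
# ids == [101, 7592, 2088, 102]
# decode(ids) == "hello world"
# decode(ids, skip_special_tokens=False) == "[CLS] hello world [SEP]"
```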
Whisper / CLIP / GPT-2 are self-contained. Other tokenizers need their model files placed alongside the tokenizer:
SentencePiece (T5, XLM-RoBERTa, Marian, Llama, Gemma): spiece.model / tokenizer.model / source.spm.
BERT (English): vocab.txt + tokenizer_config.json.
BERT Japanese: ipadic dictionary + vocab.txt (NFKC normalization is automatic).
RoBERTa: vocab.json + merges.txt.
Python: place all required files in one directory and pass that directory path to from_pretrained() (e.g. BertTokenizer.from_pretrained("./tokenizer/")).
C / C++ / Unity / Flutter / JNI: open each file individually with the corresponding OpenModelFile / OpenVocabFile / OpenMergeFile / OpenDictionaryFile API.
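When loading from Python, a short helper can verify that the files listed above are present before calling from_pretrained(). The helper and its dictionary are hypothetical conveniences, not part of ailia Tokenizer; only the BERT and RoBERTa file sets are included since the SentencePiece model filename varies by model:

```python
from pathlib import Path

# Required-file sets taken from the layout described above. This mapping
# and the helper are hypothetical, not part of the ailia Tokenizer API.
REQUIRED_FILES = {
    "bert": ["vocab.txt", "tokenizer_config.json"],
    "roberta": ["vocab.json", "merges.txt"],
}

def missing_files(tokenizer_dir, kind):
    """Return the required files that are absent from tokenizer_dir."""
    d = Path(tokenizer_dir)
    return [name for name in REQUIRED_FILES[kind] if not (d / name).is_file()]
```

Usage: if missing_files("./tokenizer/", "bert") returns a non-empty list, report the missing files instead of letting from_pretrained() fail later.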
The C++ binding requires ailia.lic next to the runtime libraries:
Windows: same folder as ailia.dll (or in cpp/ for the sample).
macOS: ~/Library/SHALO/
Linux: ~/.shalo/
Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.
An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.