ailia Tokenizer

A library for encoding text into NLP tokens and decoding tokens back into text. Supports tokenization without Python transformers.

Getting Started

Choose your platform and tokenize without depending on Python transformers.

1. Install

Install the ailia Tokenizer Python package from PyPI.

pip3 install ailia_tokenizer
2. Run a Sample

Clone ailia-models and run the multilingual-MiniLMv2 zero-shot classification sample. Its tokenizer/ directory ships sentencepiece.bpe.model + tokenizer_config.json, the file layout XLMRobertaTokenizer.from_pretrained() expects.

git clone https://github.com/ailia-ai/ailia-models.git
cd ailia-models/natural_language_processing/multilingual-minilmv2
pip3 install -r requirements.txt
python3 multilingual-minilmv2.py

System Requirements

ailia Tokenizer is a cross-platform replacement for the Python transformers tokenizer suite, usable from C++, Unity, and Flutter without a Python runtime.

Operating Systems

  • Windows 10 / 11
  • macOS 11 or later
  • Linux (Ubuntu 20.04+)
  • iOS 13+ / Android 7+

Languages & Compilers

  • Python 3.6+, Dart / Flutter 3.19+
  • C++17 (VS 2019+ / Xcode 14.2+ / clang)
  • C# / Unity 2021.3.10f1+
  • Kotlin / Java (JNI)

Built-in Components

  • SentencePiece
  • MeCab + ipadic
  • BPE (optimized)
  • NFKC normalization
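NFKC normalization folds width and compatibility variants of a character into one canonical form before tokenization. Python's standard unicodedata module applies the same transform, which is a quick way to preview what the normalizer does (a stdlib illustration, not the ailia API):

```python
import unicodedata

# Full-width Latin letters and punctuation fold to ASCII.
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ，　ｗｏｒｌｄ！"))  # → Hello, world!

# Half-width katakana composes into standard full-width katakana.
print(unicodedata.normalize("NFKC", "ﾄｰｸﾅｲｻﾞ"))  # → トークナイザ
```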

Output Formats

  • Token IDs (int)
  • Decoded text (str / UTF-8)
  • convert_tokens_to_ids / ids_to_tokens
  • encode_plus with attention masks
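These output formats follow Hugging Face conventions. As an illustration only (toy vocabulary and helper names, not the ailia API), an encode_plus-style result pairs token IDs with an attention mask that marks real tokens versus padding:

```python
# Toy illustration of the output formats above (hypothetical vocab, not the ailia API).
vocab = {"[PAD]": 0, "hello": 1, ",": 2, "world": 3, "!": 4}
inv_vocab = {i: t for t, i in vocab.items()}

def toy_encode_plus(tokens, max_len=8):
    """Return token IDs padded to max_len plus a matching attention mask."""
    ids = [vocab[t] for t in tokens]                    # convert_tokens_to_ids
    mask = [1] * len(ids) + [0] * (max_len - len(ids))  # 1 = real token, 0 = padding
    ids = ids + [vocab["[PAD]"]] * (max_len - len(ids))
    return {"input_ids": ids, "attention_mask": mask}

enc = toy_encode_plus(["hello", ",", "world", "!"])
print(enc["input_ids"])       # [1, 2, 3, 4, 0, 0, 0, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]

# Decoding maps IDs back to tokens, skipping padding.
print([inv_vocab[i] for i in enc["input_ids"] if i != 0])  # ['hello', ',', 'world', '!']
```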

Features

Twelve tokenizer types with Hugging Face-compatible APIs.

Speech / Vision

  • Whisper (multilingual)
  • CLIP (text-image)

Translation / Summarization

  • Marian (FuguMT EN ↔ JA)
  • T5 (sentencepiece)
  • XLM-RoBERTa

BERT Family

  • BERT (English)
  • BERT Japanese WordPiece
  • BERT Japanese Character
  • RoBERTa

LLM Family

  • GPT-2
  • Llama
  • Gemma

Use the API in Your Project

A minimal example of encoding and decoding text in your own application.

import ailia_tokenizer

# Load vocab and config files from the tokenizer/ directory
tok = ailia_tokenizer.BertTokenizer.from_pretrained("./tokenizer/")
ids = tok.encode("Hello, world!")  # text -> token IDs
text = tok.decode(ids)             # token IDs -> text

API Reference by Platform

  • Python
  • C++
  • Unity
  • Flutter
  • JNI

FAQ

Common questions about ailia Tokenizer.

Why use ailia Tokenizer instead of Hugging Face transformers?

transformers is Python-only. ailia Tokenizer ships the same tokenizers as a native library callable from C++, Unity (C#), Flutter (Dart), JNI, and Python — letting you tokenize on iOS, Android, and embedded targets without bundling a Python runtime.

The Python API mirrors transformers (from_pretrained(), encode(), decode(), etc.) so existing code typically requires only an import change.

Which tokenizer types are supported?

Twelve types covering the most common modern model families: Whisper, CLIP, XLM-RoBERTa, Marian, BERT (English), BERT Japanese WordPiece, BERT Japanese Character, T5, RoBERTa, GPT-2, Llama, and Gemma.

How does ailia Tokenizer match transformers' behaviour?

encode() matches tokenizer(sents, split_special_tokens=True) (special tokens encoded as text, no padding/truncation).

encodeWithSpecialTokens() matches tokenizer(sents) (special tokens encoded as IDs).

decode() matches tokenizer.decode(ids, skip_special_tokens=True); decodeWithSpecialTokens() keeps the special tokens.
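The distinction can be pictured with a toy decoder (hypothetical vocabulary and IDs, not the ailia API): dropping special-token IDs mirrors decode(), while keeping them mirrors decodeWithSpecialTokens():

```python
# Toy illustration of skip_special_tokens behaviour (hypothetical vocab/IDs).
id_to_token = {0: "[CLS]", 1: "hello", 2: "world", 3: "[SEP]"}
SPECIAL = {"[CLS]", "[SEP]"}

def toy_decode(ids, skip_special_tokens=True):
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens)

ids = [0, 1, 2, 3]  # what an encodeWithSpecialTokens()-style call would produce
print(toy_decode(ids))                             # hello world
print(toy_decode(ids, skip_special_tokens=False))  # [CLS] hello world [SEP]
```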

What extra files do I need for each tokenizer?

Whisper / CLIP / GPT-2 are self-contained. Other tokenizers need their model files placed alongside the tokenizer:

  • SentencePiece (T5, XLM-RoBERTa, Marian, Llama, Gemma): spiece.model / tokenizer.model / source.spm
  • BERT (English): vocab.txt + tokenizer_config.json
  • BERT Japanese: ipadic dictionary + vocab.txt (NFKC normalization is automatic)
  • RoBERTa: vocab.json + merges.txt

In Python, place all required files in one directory and pass the directory path to from_pretrained() (e.g. BertTokenizer.from_pretrained("./tokenizer/")). In C / C++ / Unity / Flutter / JNI, open each file individually with the corresponding OpenModelFile / OpenVocabFile / OpenMergeFile / OpenDictionaryFile API.
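A quick sanity check before calling from_pretrained() is to verify the directory contains the files listed above. A small sketch (the helper and the file lists here restate the FAQ; nothing in it is part of the ailia API):

```python
import tempfile
from pathlib import Path

# Required files per tokenizer family, as listed above
# (hypothetical helper, not part of the ailia API).
REQUIRED = {
    "bert": {"vocab.txt", "tokenizer_config.json"},
    "roberta": {"vocab.json", "merges.txt"},
}

def missing_files(tokenizer_dir, kind):
    """Return the required files that are absent from tokenizer_dir."""
    present = {p.name for p in Path(tokenizer_dir).iterdir()}
    return REQUIRED[kind] - present

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "vocab.txt").touch()
    print(missing_files(d, "bert"))  # {'tokenizer_config.json'}
```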

Where do I place the license file when using C++?

The C++ binding requires ailia.lic next to the runtime libraries:

  • Windows: same folder as ailia.dll (or in cpp/ for the sample)
  • macOS: ~/Library/SHALO/
  • Linux: ~/.shalo/

Python, Unity, Flutter, and JNI bindings auto-download the license on first run, so this only applies to the native C++ binding.

How does licensing work?

An evaluation license is downloaded automatically at runtime, suitable for development and trial. For commercial deployment, request a production license. See the ailia license terms.
