{/* 此页面由 website/scripts/generate-skill-docs.py 从技能的 SKILL.md 自动生成。请编辑源文件 SKILL.md,而非此页面。 */}

Huggingface Tokenizers

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

技能元数据

来源可选 — 通过 hermes skills install official/mlops/huggingface-tokenizers
路径optional-skills/mlops/huggingface-tokenizers
版本1.0.0
作者Orchestra Research
许可证MIT
依赖项tokenizers, transformers, datasets
平台linux, macos, windows
标签Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production

参考:完整 SKILL.md

:::info 以下是 Hermes 在触发此技能时加载的完整技能定义。这是技能激活时代理所看到的指令。 :::

HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

何时使用 HuggingFace Tokenizers

Use HuggingFace Tokenizers when:

  • Need extremely fast tokenization (<20s per GB of text)
  • Training custom tokenizers from scratch
  • Want alignment tracking (token → original text position)
  • Building production NLP pipelines
  • Need to tokenize large corpora efficiently

Performance:

  • Speed: <20 seconds to tokenize 1GB on CPU
  • Implementation: Rust core with Python/Node.js bindings
  • Efficiency: 10-100× faster than pure Python implementations

Use alternatives instead:

  • SentencePiece: Language-independent, used by T5/ALBERT
  • tiktoken: OpenAI’s BPE tokenizer for GPT models
  • transformers AutoTokenizer: Loading pretrained only (uses this library internally)

快速开始

安装

# Install tokenizers
pip install tokenizers
 
# With transformers integration
pip install tokenizers transformers

Load pretrained tokenizer

from tokenizers import Tokenizer
 
# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
 
# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]
 
# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"

Train custom BPE tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
 
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
 
# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)
 
# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)
 
# Save
tokenizer.save("my-tokenizer.json")

Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB

Batch encoding with padding

# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
 
# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
 
for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]

分词算法

BPE (Byte-Pair Encoding)

工作原理

  1. Start with character-level vocabulary
  2. Find most frequent character pair
  3. Merge into new token, add to vocabulary
  4. Repeat until vocabulary size reached

Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
 
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
 
trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)
 
tokenizer.train(files=["data.txt"], trainer=trainer)

Advantages:

  • Handles OOV words well (breaks into subwords)
  • Flexible vocabulary size
  • Good for morphologically rich languages

Trade-offs:

  • Tokenization depends on merge order
  • May split common words unexpectedly

WordPiece

工作原理

  1. Start with character vocabulary
  2. Score merge pairs: frequency(pair) / (frequency(first) × frequency(second))
  3. Merge highest scoring pair
  4. Repeat until vocabulary size reached

Used by: BERT, DistilBERT, MobileBERT

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer
 
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()
 
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
 
tokenizer.train(files=["corpus.txt"], trainer=trainer)

Advantages:

  • Prioritizes meaningful merges (high score = semantically related)
  • Used successfully in BERT (state-of-the-art results)

Trade-offs:

  • Unknown words become [UNK] if no subword match
  • Saves vocabulary, not merge rules (larger files)

Unigram

工作原理

  1. Start with large vocabulary (all substrings)
  2. Compute loss for corpus with current vocabulary
  3. Remove tokens with minimal impact on loss
  4. Repeat until vocabulary size reached

Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
 
tokenizer = Tokenizer(Unigram())
 
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)
 
tokenizer.train(files=["data.txt"], trainer=trainer)

Advantages:

  • Probabilistic (finds most likely tokenization)
  • Works well for languages without word boundaries
  • Handles diverse linguistic contexts

Trade-offs:

  • Computationally expensive to train
  • More hyperparameters to tune

分词流水线

Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing

Normalization

Clean and standardize text:

from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
 
tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents()   # Remove accents
])
 
# Input: "Héllo WORLD"
# After normalization: "hello world"

Common normalizers:

  • NFD, NFC, NFKD, NFKC - Unicode normalization forms
  • Lowercase() - Convert to lowercase
  • StripAccents() - Remove accents (é → e)
  • Strip() - Remove whitespace
  • Replace(pattern, content) - Regex replacement

Pre-tokenization

Split text into word-like units:

from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
 
# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])
 
# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]

Common pre-tokenizers:

  • Whitespace() - Split on spaces, tabs, newlines
  • ByteLevel() - GPT-2 style byte-level splitting
  • Punctuation() - Isolate punctuation
  • Digits(individual_digits=True) - Split digits individually
  • Metaspace() - Replace spaces with ▁ (SentencePiece style)

Post-processing

Add special tokens for model input:

from tokenizers.processors import TemplateProcessing
 
# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

Common patterns:

# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)
 
# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)

对齐跟踪

Track token positions in original text:

output = tokenizer.encode("Hello, world!")
 
# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
 
# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'

Use cases:

  • Named entity recognition (map predictions back to text)
  • Question answering (extract answer spans)
  • Token classification (align labels to original positions)

Integration with transformers

Load with AutoTokenizer

from transformers import AutoTokenizer
 
# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
 
# Check if using fast tokenizer
print(tokenizer.is_fast)  # True
 
# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>

Convert custom tokenizer to transformers

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
 
# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")
 
# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)
 
# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

常用模式

Train from iterator (large datasets)

from datasets import load_dataset
 
# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
 
# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]
 
# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)

Performance: Processes 1GB in ~10-20 minutes

Enable truncation and padding

# Enable truncation
tokenizer.enable_truncation(max_length=512)
 
# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)
 
# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512

Multi-processing

from tokenizers import Tokenizer
from multiprocessing import Pool
 
# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
 
def encode_batch(texts):
    return tokenizer.encode_batch(texts)
 
# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]
 
    # Encode in parallel
    results = pool.map(encode_batch, chunks)

Speedup: 5-8× with 8 cores

性能 benchmarks

Training speed

Corpus SizeBPE (30k vocab)WordPiece (30k)Unigram (8k)
10 MB15 sec18 sec25 sec
100 MB1.5 min2 min4 min
1 GB15 min20 min40 min

Hardware: 16-core CPU, tested on English Wikipedia

Tokenization speed

Implementation1 GB corpusThroughput
Pure Python~20 minutes~50 MB/min
HF Tokenizers~15 seconds~4 GB/min
Speedup80×80×

Test: English text, average sentence length 20 words

Memory usage

TaskMemory
Load tokenizer~10 MB
Train BPE (30k vocab)~200 MB
Encode 1M sentences~500 MB

支持的模型

Pre-trained tokenizers available via from_pretrained():

BERT family:

  • bert-base-uncased, bert-large-cased
  • distilbert-base-uncased
  • roberta-base, roberta-large

GPT family:

  • gpt2, gpt2-medium, gpt2-large
  • distilgpt2

T5 family:

  • t5-small, t5-base, t5-large
  • google/flan-t5-xxl

Other:

  • facebook/bart-base, facebook/mbart-large-cc25
  • albert-base-v2, albert-xlarge-v2
  • xlm-roberta-base, xlm-roberta-large

Browse all: https://huggingface.co/models?library=tokenizers

References

资源