AI Glossary
The Complete Dictionary of Artificial Intelligence
Byte Pair Encoding (BPE)
A data compression algorithm adapted for tokenization that starts from individual characters and iteratively merges the most frequent adjacent symbol pairs to build a compact subword vocabulary.
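A minimal sketch of the core BPE merge loop; the toy corpus and helper names are illustrative, not taken from any particular library:

    from collections import Counter

    def get_pair_counts(words):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, words):
        # Rewrite every word, fusing each occurrence of the chosen pair.
        merged = {}
        for word, freq in words.items():
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[" ".join(out)] = freq
        return merged

    # Toy corpus: words pre-split into characters, with frequencies.
    words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
    for _ in range(5):                     # each merge adds one vocabulary entry
        pairs = get_pair_counts(words)
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        words = merge_pair(best, words)
        print("merged:", best)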
WordPiece
A variant of BPE developed by Google that selects merges by how much they increase the likelihood of the training corpus, rather than by raw pair frequency; notably used in BERT and its variants.
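The selection criterion is often sketched as a likelihood ratio; the toy counts below are invented for illustration:

    def wordpiece_score(pair, pair_counts, token_counts):
        # score = freq(a, b) / (freq(a) * freq(b)): a pair is merged when its
        # parts co-occur far more often than their individual frequencies
        # would predict, which is what raises corpus likelihood.
        a, b = pair
        return pair_counts[pair] / (token_counts[a] * token_counts[b])

    pair_counts = {("hu", "gging"): 20, ("t", "h"): 50}
    token_counts = {"hu": 25, "gging": 22, "t": 900, "h": 800}
    print(wordpiece_score(("hu", "gging"), pair_counts, token_counts))  # ~0.036
    print(wordpiece_score(("t", "h"), pair_counts, token_counts))       # ~0.00007: frequent, yet merged later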
Unigram Language Model
A tokenization approach that treats tokens as independent draws from a unigram language model and selects the segmentation of a sequence that maximizes the product of the individual token probabilities.
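A small Viterbi-style sketch of that selection, using made-up token probabilities:

    import math

    # Toy unigram vocabulary with illustrative probabilities.
    probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.06,
             "hu": 0.02, "ug": 0.02, "hug": 0.01, "hugs": 0.008}

    def best_segmentation(text, probs):
        # best[i] holds the highest log-probability split of text[:i].
        best = [(0.0, [])] + [(-math.inf, None)] * len(text)
        for end in range(1, len(text) + 1):
            for start in range(end):
                piece = text[start:end]
                if piece in probs and best[start][1] is not None:
                    score = best[start][0] + math.log(probs[piece])
                    if score > best[end][0]:
                        best[end] = (score, best[start][1] + [piece])
        return best[len(text)]

    print(best_segmentation("hugs", probs))  # the single token 'hugs' wins here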
SentencePiece
A language-independent tokenization library that treats text as a raw Unicode sequence, eliminating the need for language-specific preprocessing such as whitespace splitting.
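A typical usage sketch with the sentencepiece Python package; the corpus file and model prefix are placeholders:

    import sentencepiece as spm

    # Train directly on raw text; no language-specific pre-tokenization needed.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="spm_demo",
        vocab_size=8000, model_type="unigram",
    )

    sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
    print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world']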
Vocabulary Size
A critical parameter setting the total number of unique tokens in a model's vocabulary; it directly determines the size of the embedding matrix and the model's ability to cover linguistic diversity.
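A back-of-the-envelope illustration of that size impact, using BERT-base-like dimensions chosen purely as an example:

    # The embedding matrix alone scales linearly with vocabulary size.
    vocab_size, hidden_dim = 30000, 768
    embedding_params = vocab_size * hidden_dim
    print(f"{embedding_params:,}")  # 23,040,000 weights before any transformer layer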
Special Tokens
Reserved tokens like [CLS], [SEP], [MASK], [PAD] used to delimit sequences, mask elements, or pad batches to a uniform length.
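A sketch of how these tokens frame a BERT-style input pair; token strings and the maximum length are illustrative:

    # Delimit two segments with [CLS]/[SEP], then pad the batch entry.
    tokens = ["[CLS]", "how", "are", "you", "[SEP]", "fine", "thanks", "[SEP]"]
    max_len = 12
    padded = tokens + ["[PAD]"] * (max_len - len(tokens))
    attention_mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    print(padded)
    print(attention_mask)  # 0 marks padding positions the model should ignore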
Tokenizer Training
The process of learning a vocabulary and segmentation rules from a text corpus, optimizing the representation for a specific task or domain.
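A sketch using the Hugging Face tokenizers library; the corpus and output file names are placeholders:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=20000,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # learns merges from the corpus
    tokenizer.save("tokenizer.json")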
Subword Regularization
A data augmentation technique that exposes the model to different valid segmentations of the same text during training, improving robustness and generalization.
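SentencePiece supports this directly via sampled encoding; the model file below is the placeholder trained in the SentencePiece entry:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
    for _ in range(3):
        # enable_sampling draws a different segmentation on each call;
        # alpha smooths the distribution, nbest_size=-1 samples from all candidates.
        print(sp.encode("unbelievable", out_type=str,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))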
Vocabulary Truncation
The process of limiting a vocabulary to its N most frequent tokens, mapping rarer tokens to subwords or an [UNK] token to keep computation efficient.
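A minimal sketch of frequency-based truncation with an [UNK] fallback; the counts are invented:

    from collections import Counter

    def truncate_vocab(token_counts, n, unk="[UNK]"):
        # Keep the N most frequent tokens; everything else maps to unk.
        keep = {tok for tok, _ in Counter(token_counts).most_common(n)}
        return lambda tok: tok if tok in keep else unk

    counts = {"the": 900, "cat": 40, "sat": 35, "axolotl": 1}
    lookup = truncate_vocab(counts, n=3)
    print([lookup(t) for t in ["the", "cat", "axolotl"]])  # ['the', 'cat', '[UNK]']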
Tokenization Pipeline
The sequential chain of processing steps that produces the final tokens: normalization, pre-tokenization, model segmentation, and post-processing.
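A sketch of wiring up those four stages with the Hugging Face tokenizers library; the special-token IDs are illustrative:

    from tokenizers import Tokenizer, normalizers
    from tokenizers.models import WordPiece
    from tokenizers.normalizers import NFD, Lowercase, StripAccents
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.processors import TemplateProcessing

    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    # 1. normalization: Unicode decomposition, lowercasing, accent stripping
    tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
    # 2. pre-tokenization: split on whitespace and punctuation boundaries
    tokenizer.pre_tokenizer = Whitespace()
    # 3. the WordPiece model then segments each pre-token into subwords
    # 4. post-processing: wrap the sequence with special tokens
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )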
Tokenizer Config
A JSON configuration file containing all the hyperparameters and metadata needed to reproduce a specific tokenizer's behavior exactly.
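A quick way to inspect such a file (the path is a placeholder; the listed sections are those typically found in a Hugging Face tokenizer.json):

    import json

    with open("tokenizer.json", encoding="utf-8") as f:
        config = json.load(f)
    print(list(config.keys()))  # typically: version, truncation, padding, added_tokens,
                                # normalizer, pre_tokenizer, post_processor, decoder, model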
Fast Tokenizers
Optimized tokenizer implementations written in Rust with efficient data structures, offering 10-100x speedups over pure-Python implementations.
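In the Hugging Face transformers library, the Rust-backed implementation is selected as below; the model name is just a common example:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
    print(tok.is_fast)  # True when the Rust implementation is loaded
    # Fast tokenizers also expose character offsets per token.
    enc = tok("Fast tokenizers also track offsets", return_offsets_mapping=True)
    print(enc["offset_mapping"][:3])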
Tokenizer Inference
The phase in which a trained tokenizer is applied to new text, converting raw strings into token sequences ready for processing by the model.
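A sketch of that round trip with the tokenizers library, loading the placeholder file saved in the training entry:

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("tokenizer.json")
    enc = tokenizer.encode("Tokenization turns text into model-ready IDs.")
    print(enc.tokens)                # subword strings
    print(enc.ids)                   # integer IDs fed to the model
    print(tokenizer.decode(enc.ids)) # approximate reconstruction of the input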