AI Glossary
The complete dictionary of Artificial Intelligence
Unigram Language Model Tokenization
Tokenization method that starts from a large candidate vocabulary and iteratively prunes it, removing the subwords whose removal reduces the unigram model's likelihood on the training corpus the least, until a target vocabulary size is reached.
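The pruning step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production trainer: the piece probabilities are hand-picked and kept fixed (real trainers typically re-estimate them with EM and drop a batch of pieces per round), and the names best_segmentation, corpus_loglik, and prune_once are hypothetical.

```python
import math

def best_segmentation(word, logp):
    """Viterbi search: highest-scoring split of `word` into pieces found in `logp`."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # (score, start index of last piece)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None, -math.inf
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n][0]

def corpus_loglik(words, logp):
    """Sum of best-segmentation log-probabilities over the corpus."""
    return sum(best_segmentation(w, logp)[1] for w in words)

def prune_once(words, logp):
    """Drop the multi-character piece whose removal costs the least likelihood.
    Single characters are always kept so every word stays segmentable."""
    base = corpus_loglik(words, logp)
    candidates = [p for p in logp if len(p) > 1]

    def loss(piece):
        reduced = {k: v for k, v in logp.items() if k != piece}
        return base - corpus_loglik(words, reduced)

    victim = min(candidates, key=loss)
    return {k: v for k, v in logp.items() if k != victim}

# Toy vocabulary with assumed (not learned) probabilities.
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.1, "ug": 0.1, "hug": 0.4}
logp = {piece: math.log(p) for piece, p in probs.items()}
corpus = ["hug", "hugs", "hug"]

pruned = prune_once(corpus, logp)
```

In this toy corpus the piece "hug" carries most of the likelihood, so it survives pruning, while a piece that no best segmentation uses is removed at zero cost.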
Vocabulary
Static and predefined set of all unique tokens that a language model can recognize and process, whose size directly influences the model's capabilities and computational complexity.
Special Token
Predefined token with a reserved structural function, such as [CLS] for classification, [SEP] for separating segments, or [PAD] for padding sequences to a common length, used to structure model inputs.
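As a rough illustration of how these tokens structure an input, the sketch below assembles a BERT-style sentence pair and pads it to a fixed length. The function name build_pair_input and the choice to raise an error instead of truncating are assumptions made for the example.

```python
def build_pair_input(tokens_a, tokens_b, max_len):
    """Wrap two token lists with [CLS]/[SEP] and pad with [PAD] up to max_len."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len; truncation is not handled here")
    n_pad = max_len - len(seq)
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(seq) + [0] * n_pad
    return seq + ["[PAD]"] * n_pad, attention_mask

tokens, mask = build_pair_input(["how", "are", "you"], ["fine", "thanks"], max_len=10)
print(tokens)
print(mask)
```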
Embedding Matrix
Learned weight matrix of shape (vocabulary size × embedding dimension) in which each row is the dense vector representation of one vocabulary token; it acts as a lookup table that maps token identifiers to vectors.
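The lookup can be illustrated with plain Python lists. In this sketch the weights are random stand-ins for learned values, and the tiny vocabulary, the dimension, and the function names are assumptions; it also shows that selecting a row is equivalent to multiplying a one-hot vector by the matrix.

```python
import random

vocab = ["[PAD]", "the", "cat", "sat"]
dim = 3
rng = random.Random(0)

# One row per vocabulary token; random values stand in for learned weights.
embedding = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]

def embed(token_ids):
    """Row lookup: each token id selects one row of the matrix."""
    return [embedding[i] for i in token_ids]

def one_hot_matmul(token_id):
    """Same result computed as one_hot(token_id) @ embedding."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(vocab))]
    return [sum(one_hot[i] * embedding[i][j] for i in range(len(vocab)))
            for j in range(dim)]

vectors = embed([1, 2, 3])  # "the cat sat"
```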
Subword Tokenization
Tokenization strategy that divides words into smaller units (subwords), allowing a finite vocabulary to represent an open-ended set of words, including neologisms and typos.
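One common way to apply a fixed subword vocabulary is greedy longest-match-first splitting, shown below as a minimal sketch. The "##" continuation prefix follows the WordPiece convention; the vocabulary contents and the function name greedy_subwords are made up for the example.

```python
def greedy_subwords(word, vocab, unk="[UNK]"):
    """Split `word` by repeatedly taking the longest vocabulary piece that fits."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no piece matched at this position
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"token", "##ization", "##iz", "##ation", "un", "##believ", "##able"}
print(greedy_subwords("tokenization", vocab))  # ['token', '##ization']
print(greedy_subwords("unbelievable", vocab))  # ['un', '##believ', '##able']
```

Neither full word is in the vocabulary, yet both are represented from known pieces; only a word with no matching pieces at all falls back to the unknown token.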
Character-level Tokenization
Granular tokenization approach where each character becomes a token, eliminating the out-of-vocabulary word problem but generating very long sequences and increasing computational complexity.
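The trade-off is easy to see on a short string. The sketch below builds a toy character vocabulary from the text itself, which is an assumption made for brevity; a practical vocabulary would cover the full alphabet or byte range.

```python
text = "tokenization matters"

# Map every distinct character to an integer id, and back.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
id_to_char = {i: ch for ch, i in char_vocab.items()}

def encode(s):
    return [char_vocab[ch] for ch in s]

def decode(ids):
    return "".join(id_to_char[i] for i in ids)

char_ids = encode(text)
word_tokens = text.split()

print(len(char_ids), len(word_tokens))  # 20 2
```

The same text takes 20 character tokens against 2 word tokens, which is the sequence-length cost the definition refers to.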
Word-level Tokenization
Segmentation method where each entire word, delimited by spaces or punctuation, is treated as a single token, simple but vulnerable to the out-of-vocabulary (OOV) word problem.
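The OOV problem shows up as soon as the input contains a word missing from the training corpus. The following sketch assumes lowercasing and a simple word-or-punctuation pattern; the helper names and the two-sentence corpus are hypothetical.

```python
import re

WORD_PATTERN = re.compile(r"\w+|[^\w\s]")

def split_words(sentence):
    """Lowercase, then split into words and standalone punctuation marks."""
    return WORD_PATTERN.findall(sentence.lower())

def build_vocab(corpus):
    vocab = {"[UNK]": 0}  # id 0 is reserved for unknown words
    for sentence in corpus:
        for word in split_words(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    return [vocab.get(word, vocab["[UNK]"]) for word in split_words(sentence)]

vocab = build_vocab(["the cat sat", "the dog sat"])
print(encode("The bird sat", vocab))  # [1, 0, 3]
```

The unseen word "bird" collapses to the [UNK] id, so its identity is lost to the model.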
Tokenization Method
Specific set of rules and algorithms (e.g., BPE, WordPiece) that define how raw text is split into tokens, directly influencing model performance and robustness.
Whitespace Tokenization
Simple tokenization technique that segments text based solely on whitespace characters, often used as a first step before more sophisticated methods.
Regular Expression Tokenization (Regex Tokenization)
Segmentation method that uses regular expression patterns to define complex tokenization rules, allowing for controlled separation of words, punctuation, and other symbols.
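A short comparison with plain whitespace splitting shows what the pattern controls. The pattern used here (runs of word characters, or any single non-space symbol) is only one possible rule set, chosen for illustration.

```python
import re

text = "Hello, world! It's 2024."

# Whitespace splitting leaves punctuation attached to the words.
whitespace_tokens = text.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello,', 'world!', "It's", '2024.']
print(regex_tokens)       # ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```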
SentencePiece Tokenization
Tokenizer implementation that operates directly on raw text as a sequence of Unicode characters, treating whitespace as an ordinary symbol, and applies a subword algorithm (such as BPE or unigram) to build a vocabulary that is language-independent and losslessly decodable.
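The reason decoding can be exact is that whitespace is kept inside the pieces, conventionally as the marker "▁" (U+2581). The sketch below imitates only that idea in plain Python and does not call the sentencepiece library; it assumes the input contains no "▁" already, skips normalization, and uses whole words as pieces where a real model would split further.

```python
MARKER = "\u2581"  # the "▁" symbol used to stand for a space

def to_pieces(text):
    """Prefix every space-separated chunk with the marker, keeping all spaces."""
    return [MARKER + chunk for chunk in text.split(" ")]

def from_pieces(pieces):
    """Concatenate, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(MARKER, " ")[1:]

pieces = to_pieces("Hello world")
restored = from_pieces(pieces)
```

Because the marker travels with the pieces, joining them is enough to recover the original string, with no language-specific rules about where spaces belong.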
Character Pair Encoding
BPE variant operating at the character level rather than byte level, merging the most frequent adjacent character pairs to build a subword vocabulary.
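The merge loop can be sketched as follows. This is a minimal illustration that starts from single characters and uses a made-up word-frequency table; the function names are hypothetical, and ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_freqs, num_merges):
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_merges(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges the frequent word "hug" has become a single symbol, while rarer words such as "pug" remain split into "p" and "ug".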
N-gram Tokenization
Approach that segments text into contiguous sequences of n items (characters or words), capturing local context information but suffering from combinatorial vocabulary explosion.
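A sliding window is enough to produce n-grams over either characters or words. The helper name ngrams is hypothetical, and the closing count assumes a 26-letter lowercase alphabet purely to illustrate how the space of possible n-grams grows.

```python
def ngrams(items, n):
    """All contiguous length-n windows over a sequence (string or list)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

char_bigrams = ngrams("token", 2)
word_bigrams = ngrams("the cat sat down".split(), 2)

print(char_bigrams)  # [('t', 'o'), ('o', 'k'), ('k', 'e'), ('e', 'n')]
print(word_bigrams)  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]

# The number of possible n-grams grows as alphabet_size ** n.
possible_char_trigrams = 26 ** 3  # 17576 for lowercase ASCII letters
```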