Tokenization - Artificial Intelligence Glossary

Byte Pair Encoding (BPE)

A data compression algorithm adapted for tokenization: starting from individual characters or bytes, it repeatedly merges the most frequent adjacent pair of symbols into a new symbol until the subword vocabulary reaches the desired size.
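
The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: the toy word frequencies, the function name `learn_bpe`, and the rule for breaking ties between equally frequent pairs are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} mapping (toy sketch)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumption: ties are broken by the lexicographically largest pair,
        # only so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> frequency.
merges, corpus = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

Each learned merge adds one symbol to the vocabulary, so the number of merges directly controls the final vocabulary size.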

WordPiece

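A subword algorithm developed at Google that, like BPE, builds its vocabulary through merges, but chooses the merge that most increases the likelihood of the training data rather than the most frequent pair. It is used in BERT and many of its variants.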
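
The difference from BPE shows up in how candidate pairs are scored. The sketch below uses a commonly described approximation, the pair count divided by the product of the counts of its two parts; this scoring rule, the function name, and the toy corpus are assumptions for illustration and are not taken from Google's original training code.

```python
from collections import Counter

def wordpiece_pair_scores(corpus):
    """Score adjacent pairs in a {tuple_of_symbols: frequency} corpus."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    # Assumed score: count(pair) / (count(first) * count(second)).
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Hypothetical toy corpus: "t" is frequent but appears with different partners,
# while "q" and "u" are rare but always appear together.
toy_corpus = {("t", "h"): 10, ("t", "o"): 10, ("q", "u"): 2}
scores = wordpiece_pair_scores(toy_corpus)
best_pair = max(scores, key=scores.get)
```

Under this score the rare but inseparable pair ("q", "u") wins, whereas plain frequency counting would prefer a pair containing "t".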

Unigram Language Model

A tokenization approach that assigns each token an independent probability under a unigram language model and selects the segmentation that maximizes the product of the token probabilities in the sequence.
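
Finding the highest-probability segmentation is a shortest-path problem that dynamic programming (Viterbi) solves efficiently. The following is a minimal sketch; the token probabilities are made-up values and `viterbi_segment` is a hypothetical helper name.

```python
import math

def viterbi_segment(text, token_probs):
    """Return the most probable segmentation of text and its log-probability."""
    n = len(text)
    # best_score[i] is the best log-probability of any segmentation of text[:i].
    best_score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in token_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(token_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best_score[n]

# Made-up probabilities, for illustration only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}
tokens, log_prob = viterbi_segment("hugs", probs)
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on long inputs.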

SentencePiece

A language-independent tokenization library that treats text as a raw Unicode sequence, whitespace included, eliminating the need for language-specific preprocessing such as word splitting. It implements both BPE and the unigram language model.

Vocabulary Size

A key hyperparameter: the total number of unique tokens in a model's vocabulary. It sets the size of the embedding table, and therefore part of the model size, and it governs how finely text is split, which affects the model's ability to handle linguistic diversity.
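
The effect on model size is simple arithmetic: the embedding table holds one vector per token. The sketch below uses the vocabulary size and hidden dimension of BERT-base (30,522 and 768) as the example; the function name is hypothetical.

```python
def embedding_parameters(vocab_size, hidden_dim):
    """Number of weights in a token embedding table."""
    return vocab_size * hidden_dim

# BERT-base figures: 30,522 tokens, 768-dimensional embeddings.
bert_base = embedding_parameters(30522, 768)
# Doubling the vocabulary doubles the embedding table.
doubled = embedding_parameters(2 * 30522, 768)
```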

Special Tokens

Reserved tokens such as [CLS], [SEP], [MASK], and [PAD], used to mark sequence boundaries, mask positions for prediction, or pad batches to a uniform length.
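
As an illustration, a BERT-style input for one or two segments can be assembled as follows. This is a sketch: `build_inputs` and `max_len` are hypothetical names, and real tokenizers also map the tokens to integer IDs.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=8):
    """Wrap token lists in special tokens and pad to max_len."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b:
        seq += tokens_b + ["[SEP]"]
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    padding = max_len - len(seq)
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(seq) + [0] * padding
    return seq + ["[PAD]"] * padding, attention_mask

padded, mask = build_inputs(["hello", "world"], ["hi"], max_len=8)
```

The attention mask lets the model ignore the [PAD] positions that exist only to make every sequence in a batch the same length.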

Tokenizer Training

The process of learning a vocabulary and segmentation rules from a text corpus, so that the resulting tokens represent the text of a specific task or domain efficiently.

Subword Regularization

A data augmentation technique that samples different valid segmentations of the same text during training, improving model robustness and generalization.
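
A minimal sketch of the idea: enumerate the valid segmentations of a short string and sample one in proportion to its probability raised to a smoothing power `alpha`. Exhaustive enumeration is only feasible for toy inputs, and the probabilities, function names, and `alpha` value are assumptions made for the example.

```python
import math
import random

def all_segmentations(text, vocab):
    """List every way to split text into tokens from vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(text, token_probs, alpha=0.5, rng=None):
    """Sample one segmentation with weight P(segmentation) ** alpha."""
    rng = rng or random
    candidates = all_segmentations(text, token_probs)
    weights = [
        math.prod(token_probs[t] for t in seg) ** alpha for seg in candidates
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Made-up probabilities, for illustration only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}
candidates = all_segmentations("hugs", probs)
sampled = sample_segmentation("hugs", probs, rng=random.Random(0))
```

A smaller `alpha` flattens the distribution so that less likely segmentations are sampled more often; a larger `alpha` concentrates sampling on the most probable one.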

Vocabulary Truncation

The process of limiting the vocabulary to the N most frequent tokens; rarer tokens are split into subwords or replaced with an [UNK] token, which reduces computational cost.
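
The [UNK] variant can be sketched with a frequency counter. The helper names and the toy sentence are hypothetical, and the sketch assumes that [UNK] itself occupies one slot of the size budget.

```python
from collections import Counter

def truncate_vocab(token_stream, max_size, unk="[UNK]"):
    """Keep the most frequent tokens; reserve ID 0 for the unknown token."""
    counts = Counter(token_stream)
    vocab = {unk: 0}
    # One slot is taken by [UNK], so keep max_size - 1 real tokens.
    for token, _ in counts.most_common(max_size - 1):
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk="[UNK]"):
    """Map tokens to IDs, sending out-of-vocabulary tokens to [UNK]."""
    return [vocab.get(token, vocab[unk]) for token in tokens]

vocab = truncate_vocab("the cat sat on the mat the cat".split(), max_size=3)
ids = encode(["the", "dog", "cat"], vocab)
```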

Tokenization Pipeline

A sequence of processing steps (normalization, pre-tokenization, model segmentation, and post-processing) that turns raw text into the final tokens.
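
The four stages can be sketched as four small functions chained together. Everything here is illustrative: the normalization choices, the regular expression, the toy vocabulary, and the greedy longest-match segmentation with "##" continuation markers (in the style of BERT's WordPiece inference) are assumptions, not the behavior of any particular library.

```python
import re
import unicodedata

def normalize(text):
    """Stage 1: Unicode normalization, lowercasing, trimming."""
    return unicodedata.normalize("NFKC", text).lower().strip()

def pre_tokenize(text):
    """Stage 2: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def segment(word, vocab, unk="[UNK]"):
    """Stage 3: greedy longest-match-first subword segmentation."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            # Pieces that continue a word carry the "##" prefix.
            candidate = piece if start == 0 else "##" + piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def post_process(pieces):
    """Stage 4: add the special tokens the model expects."""
    return ["[CLS]"] + pieces + ["[SEP]"]

def tokenize(text, vocab):
    pieces = []
    for word in pre_tokenize(normalize(text)):
        pieces.extend(segment(word, vocab))
    return post_process(pieces)

# Hypothetical toy vocabulary.
VOCAB = {"token", "##ization", "is", "fun", "!"}
result = tokenize("  Tokenization is FUN! ", VOCAB)
```

Keeping the stages separate makes it possible to swap one of them, for example the segmentation model, without touching the others.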

Tokenizer Config

A JSON configuration file containing the hyperparameters and metadata needed to reproduce the behavior of a specific tokenizer exactly.

Fast Tokenizers

Optimized tokenizer implementations, typically written in Rust with efficient data structures, that are often reported to run 10-100x faster than pure Python implementations.

Tokenizer Inference

The phase in which a trained tokenizer is applied to new text, converting raw text into token sequences ready for processing by the model.
