AI Terminology
A complete dictionary of Artificial Intelligence
Masked Language Modeling (MLM)
Pre-training objective where 15% of tokens are randomly masked and the model must predict them using bidirectional context. This technique enables BERT to learn deep contextual representations by forcing the model to understand semantic relationships between words.
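A minimal sketch of the masking step in plain Python, using toy tokens and a hypothetical mask_tokens helper (the full BERT recipe also replaces some selected positions with random tokens or leaves them unchanged, which is omitted here):

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Randomly select ~15% of positions; the model must predict the originals."""
        masked, labels = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if tok in ("[CLS]", "[SEP]"):          # never mask special tokens
                continue
            if random.random() < mask_prob:
                labels[i] = tok                    # prediction target is the original token
                masked[i] = mask_token
        return masked, labels

    tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    masked, labels = mask_tokens(tokens)
    print(masked)   # e.g. ['[CLS]', 'the', '[MASK]', 'sat', 'on', 'the', 'mat', '[SEP]']
    print(labels)   # e.g. [None, None, 'cat', None, None, None, None, None]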
Next Sentence Prediction (NSP)
Binary pre-training task where the model predicts whether two given sentences are consecutive in the original text. Although later work (e.g., RoBERTa) questioned its usefulness, this objective was intended to help BERT model inter-sentence relationships for tasks such as question answering (QA) and natural language inference (NLI).
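A small sketch, assuming a toy corpus and a hypothetical make_nsp_pairs helper, of how NSP training pairs are typically built: about half of the pairs are truly consecutive, the other half pair a sentence with a random one:

    import random

    def make_nsp_pairs(sentences, num_pairs=4):
        """Build (sentence_a, sentence_b, is_next) examples: ~50% consecutive, ~50% random."""
        pairs = []
        for _ in range(num_pairs):
            i = random.randrange(len(sentences) - 1)
            if random.random() < 0.5:
                pairs.append((sentences[i], sentences[i + 1], 1))   # IsNext
            else:
                j = random.randrange(len(sentences))
                while j == i + 1:                                   # avoid the true next sentence
                    j = random.randrange(len(sentences))
                pairs.append((sentences[i], sentences[j], 0))       # NotNext
        return pairs

    corpus = ["BERT is a language model.", "It was released in 2018.",
              "Paris is in France.", "The Seine flows through it."]
    for a, b, label in make_nsp_pairs(corpus):
        print(label, "|", a, "||", b)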
WordPiece Tokenization
Segmentation algorithm that splits words into data-driven subword units, with word-internal pieces marked by a "##" prefix, in order to handle out-of-vocabulary words with a fixed-size vocabulary. This approach allows BERT to process rare words and neologisms efficiently by breaking them down into known tokens instead of mapping them to an unknown-word symbol.
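A simplified greedy longest-match-first segmentation in plain Python with a toy vocabulary; the real WordPiece vocabulary is learned from data, but the "##" continuation prefix shown here follows BERT's convention:

    def wordpiece_tokenize(word, vocab, unk="[UNK]"):
        """Greedy longest-match-first split; word-internal pieces carry a '##' prefix."""
        pieces, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while start < end:                        # try the longest remaining substring first
                sub = word[start:end]
                if start > 0:
                    sub = "##" + sub
                if sub in vocab:
                    piece = sub
                    break
                end -= 1
            if piece is None:                         # nothing matches: the whole word is unknown
                return [unk]
            pieces.append(piece)
            start = end
        return pieces

    vocab = {"un", "##break", "##able", "play", "##ing"}
    print(wordpiece_tokenize("unbreakable", vocab))   # ['un', '##break', '##able']
    print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']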
Self-Attention Mechanism
Fundamental mechanism where each token computes attention weights over every token in the sequence, including itself. This operation enables BERT to capture long-distance dependencies and build rich contextual representations.
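A single-head, NumPy-only sketch of scaled dot-product self-attention; the toy dimensions and random projection matrices are assumptions for illustration, and BERT runs this with multiple heads per layer:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """X has shape (seq_len, d_model); every token attends to every token."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])             # (seq_len, seq_len) pairwise affinities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
        return weights @ V                                   # each token is a weighted mix of all values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                              # 5 tokens, toy width 8
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 8)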
Segment Embeddings
Specialized embeddings that distinguish different segments in the input, typically used to separate sentences A and B in sentence pair tasks. These embeddings allow the model to differentiate the context of each segment.
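A tiny NumPy illustration, assuming a toy hidden size and a random embedding table: segment ids are 0 for sentence A (up to and including the first [SEP]) and 1 for sentence B, and each id simply indexes a learned row that is added to the token and position embeddings:

    import numpy as np

    tokens   = ["[CLS]", "how", "old", "are", "you", "[SEP]", "i", "am", "ten", "[SEP]"]
    segments = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]        # 0 = sentence A, 1 = sentence B

    hidden = 8                                        # toy hidden size (768 in bert-base)
    segment_table = np.random.default_rng(0).normal(size=(2, hidden))   # one learned row per segment id
    segment_embeddings = segment_table[segments]      # (10, hidden), added to token + position embeddings
    print(segment_embeddings.shape)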
Transformer Encoder Block
Fundamental computational unit of BERT composed of multi-head self-attention followed by a position-wise feed-forward network, each wrapped with a residual connection and layer normalization. Each block processes the entire sequence in parallel, preserving global relationships.
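A compact PyTorch sketch of one such block, using bert-base-like sizes as assumed defaults; the real implementation differs in details (e.g., dropout placement), but it shows the attention, add & norm, feed-forward, add & norm structure:

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """Multi-head self-attention + feed-forward, each with residual connection and layer norm."""
        def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x, pad_mask=None):
            attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
            x = self.norm1(x + self.drop(attn_out))       # residual + norm around attention
            x = self.norm2(x + self.drop(self.ffn(x)))    # residual + norm around feed-forward
            return x

    x = torch.randn(2, 16, 768)                           # (batch, seq_len, hidden)
    print(EncoderBlock()(x).shape)                        # torch.Size([2, 16, 768])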
Pooling Layer
Final layer that aggregates token representations into a single vector for sequence-level classification. BERT's standard pooler applies a dense layer with tanh activation to the [CLS] token's final hidden state; mean pooling over all token representations is a common alternative.
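A short PyTorch sketch of the two common choices on top of the last layer's hidden states (random tensors stand in for real model outputs); the masked mean shown here averages only over non-padding tokens:

    import torch

    hidden_states  = torch.randn(2, 16, 768)           # (batch, seq_len, hidden) from the last layer
    attention_mask = torch.ones(2, 16)                  # 1 = real token, 0 = padding

    cls_vector = hidden_states[:, 0]                    # [CLS] pooling: take the first token's vector

    mask = attention_mask.unsqueeze(-1)                 # (batch, seq_len, 1)
    mean_vector = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens only

    print(cls_vector.shape, mean_vector.shape)          # both (2, 768)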
Hidden States
High-dimensional vector representations produced at each layer of the Transformer for each token in the sequence. These hidden states progressively capture increasingly abstract semantic features.
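An example using the Hugging Face transformers library (assuming a recent version and a download of the bert-base-uncased weights): requesting output_hidden_states returns one tensor per layer, each of shape (batch, seq_len, hidden):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    inputs = tokenizer("Hidden states get more abstract layer by layer.", return_tensors="pt")
    outputs = model(**inputs)

    # Tuple of 13 tensors for bert-base: the embedding output plus the 12 encoder layers
    for i, layer in enumerate(outputs.hidden_states):
        print(i, tuple(layer.shape))                     # each is (1, seq_len, 768)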
Pre-training
Self-supervised training phase on large unlabeled corpora where BERT learns general linguistic representations via MLM and NSP. This step establishes the model's knowledge foundation before task-specific fine-tuning.
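A rough sketch, with random tensors standing in for real model outputs, of how the two objectives are typically combined into a single training loss: masked positions carry the original token id as label, all other positions use -100 so cross-entropy ignores them, and the MLM and NSP losses are summed:

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len, batch = 30522, 16, 2

    mlm_logits = torch.randn(batch, seq_len, vocab_size)      # token-level predictions
    nsp_logits = torch.randn(batch, 2)                        # IsNext / NotNext prediction

    mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)   # -100 = position not masked
    mlm_labels[:, 3] = 1037                                   # pretend position 3 was masked
    nsp_labels = torch.tensor([1, 0])

    loss = (F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1))
            + F.cross_entropy(nsp_logits, nsp_labels))        # the two losses are simply added
    print(loss.item())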
Encoder-Only Architecture
Structure of BERT using only the encoder blocks of the Transformer, unlike encoder-decoder models. This architecture is optimized for text understanding and classification tasks.
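For illustration, a bert-base-sized stack can be approximated with PyTorch's built-in encoder modules; this is not BERT's actual implementation, only a shape-compatible sketch of an encoder-only stack:

    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                                       activation="gelu", batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=12)     # 12 identical blocks, no decoder

    x = torch.randn(2, 16, 768)        # (batch, seq_len, hidden) token embeddings
    print(encoder(x).shape)            # torch.Size([2, 16, 768]): same shape, richer representations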
[CLS] Token
Special token added at the beginning of each input sequence whose final representation is used for classification tasks. This token aggregates the contextual information of the entire sequence to make global-level decisions.
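A minimal PyTorch sketch of how the [CLS] vector feeds a task head during fine-tuning; the random tensor stands in for the encoder's last-layer output and the linear classifier is the assumed task-specific head:

    import torch
    import torch.nn as nn

    hidden_size, num_labels = 768, 2
    last_hidden = torch.randn(4, 32, hidden_size)     # (batch, seq_len, hidden) from the top layer

    cls_state = last_hidden[:, 0]                     # position 0 is always [CLS]
    classifier = nn.Linear(hidden_size, num_labels)   # task-specific head added during fine-tuning
    logits = classifier(cls_state)
    print(logits.shape)                               # torch.Size([4, 2])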