AI Glossary

The complete dictionary of artificial intelligence

162 categories · 2,032 subcategories · 23,060 terms

Masked Language Modeling (MLM)

Pre-training objective where 15% of tokens are randomly masked and the model must predict them using bidirectional context. This technique enables BERT to learn deep contextual representations by forcing the model to understand semantic relationships between words.
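
A minimal Python sketch of the masking step, assuming PyTorch is available. It only performs the [MASK] replacement; the full BERT recipe additionally leaves some selected tokens unchanged or swaps them for random ids, and it skips special tokens. The mask id 103 matches bert-base-uncased and is an assumption here.

```python
import torch

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    """Select ~15% of positions as prediction targets and replace them with [MASK]."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob    # positions the model must predict
    labels[~selected] = -100                               # -100 is ignored by the MLM loss
    masked = input_ids.clone()
    masked[selected] = mask_token_id                       # simplified: always substitute [MASK]
    return masked, labels

# Toy batch of token ids
masked, labels = mask_tokens(torch.tensor([[101, 7592, 2088, 2003, 2307, 102]]))
```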

Next Sentence Prediction (NSP)

Binary pre-training task where the model predicts whether two given sentences are consecutive in the original text. Although controversial, this objective helps BERT understand inter-sentence relationships for tasks like QA and NLI.
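
An illustrative sketch of how NSP training pairs are commonly built; the 50/50 split follows the task description above, while the helper function itself is hypothetical.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Return (sentence_a, sentence_b, label): 1 = IsNext, 0 = NotNext."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], 1           # B really follows A
    return sentence_a, random.choice(corpus_sentences), 0    # B drawn at random from the corpus
```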

WordPiece Tokenization

Segmentation algorithm that divides words into morphological sub-units to handle unknown vocabulary and optimize representation. This approach allows BERT to efficiently process rare words and neologisms by breaking them down into known tokens.
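
For example, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary, a word missing from the vocabulary is split into known pieces (the "##" prefix marks word-internal sub-units):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("hello"))               # a known word stays whole: ['hello']
print(tokenizer.tokenize("electroglottograph"))  # a rare word is broken into sub-word pieces
```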

Self-Attention Mechanism

Fundamental mechanism where each token calculates attention weights relative to all other tokens in the sequence. This operation enables BERT to capture long-distance dependencies and create rich contextual representations.
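
A single-head, unmasked sketch of the underlying computation, softmax(QKᵀ/√d)·V, in PyTorch; the projection sizes are illustrative, not BERT's actual configuration.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # similarity of every token to every other
    weights = torch.softmax(scores, dim=-1)                    # attention weights, each row sums to 1
    return weights @ v                                         # context-mixed token representations

x = torch.randn(6, 64)                                         # 6 tokens, hidden size 64
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                         # shape: (6, 64)
```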

Segment Embeddings

Specialized embeddings that distinguish different segments in the input, typically used to separate sentences A and B in sentence pair tasks. These embeddings allow the model to differentiate the context of each segment.
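
Assuming the transformers library, the segment ids appear as `token_type_ids` (0 for sentence A, 1 for sentence B); these ids index into the segment-embedding table that is added to the token and position embeddings.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("How old are you?", "I am six years old.", return_tensors="pt")
print(encoded["token_type_ids"])  # 0s over [CLS] + sentence A + [SEP], 1s over sentence B + [SEP]
```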

Transformer Encoder Block

Fundamental computational unit of BERT composed of multi-head attention followed by a feed-forward network with residual connections and normalization. Each block processes the entire sequence simultaneously, preserving global relationships.
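
A condensed PyTorch sketch of one such block (post-layer-norm variant, dropout omitted); the 768/12/3072 sizes mirror bert-base and are illustrative.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden=768, heads=12, ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):                            # x: (batch, seq_len, hidden)
        x = self.norm1(x + self.attn(x, x, x)[0])    # multi-head attention + residual + norm
        return self.norm2(x + self.ffn(x))           # feed-forward + residual + norm
```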

Pooling Layer

Final layer that aggregates token representations into a single vector for classification tasks. BERT typically uses the [CLS] token representation or performs mean pooling over all tokens.
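
A sketch of both strategies; `hidden` is assumed to be the final layer's output of shape (batch, seq_len, hidden_size) and `mask` the attention mask of shape (batch, seq_len).

```python
import torch

def cls_pool(hidden):
    return hidden[:, 0]                                   # vector of the [CLS] token at position 0

def mean_pool(hidden, mask):
    mask = mask.unsqueeze(-1).float()                     # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # average over non-padding tokens only
```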

Hidden States

High-dimensional vector representations produced at each layer of the Transformer for each token in the sequence. These hidden states progressively capture increasingly abstract semantic features.
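
Assuming the transformers library, the per-layer hidden states can be inspected directly; for bert-base this yields 13 tensors (the embedding output plus 12 layers), each of shape (batch, seq_len, 768).

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

outputs = model(**tokenizer("Hidden states example", return_tensors="pt"))
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```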

Pre-training

Unsupervised training phase on large corpora where BERT learns general linguistic representations via MLM and NSP. This step establishes the knowledge foundations of the model before task-specific fine-tuning.
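
Assuming the transformers library, BertForPreTraining bundles both objectives' heads: token-level MLM predictions and the binary NSP classifier.

```python
from transformers import BertForPreTraining, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

outputs = model(**tokenizer("Sentence A.", "Sentence B.", return_tensors="pt"))
print(outputs.prediction_logits.shape)        # MLM head: (batch, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # NSP head: (batch, 2)
```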

Encoder-Only Architecture

Structure of BERT using only the encoder blocks of the Transformer, unlike encoder-decoder models. This architecture is optimized for text understanding and classification tasks.
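
As a rough structural sketch (not BERT's own implementation), an encoder-only model is just a stack of identical encoder blocks with no decoder and no cross-attention; the sizes below mirror bert-base.

```python
import torch.nn as nn

# 12 encoder layers applied in sequence, nothing else
encoder_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                               activation="gelu", batch_first=True),
    num_layers=12,
)
```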

[CLS] Token

Special token added at the beginning of each input sequence whose final representation is used for classification tasks. This token aggregates the contextual information of the entire sequence to make global-level decisions.
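
Assuming the transformers tokenizer, [CLS] is inserted automatically at position 0 (with [SEP] closing each segment), so classification heads simply read the final hidden vector at index 0.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("hello world")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'hello', 'world', '[SEP]']
```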
