BERT Architecture - AI Glossary

Masked Language Modeling (MLM)

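Pre-training objective in which about 15% of input tokens are selected at random and the model must predict the originals from both left and right context. Of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, which reduces the mismatch with fine-tuning, where [MASK] never appears. Because every prediction draws on context from both directions, this objective lets BERT learn deep contextual representations.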
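
The masking procedure can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip (instead of a fixed number of predictions per sequence) are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using the 80/10/10 rule.

    labels[i] holds the original token at selected positions, None elsewhere.
    Hypothetical helper: selection is an independent coin flip per token.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:              # special tokens are never selected
            continue
        if rng.random() >= mask_prob:   # token not selected for prediction
            continue
        labels[i] = tok                 # the model must recover this token
        roll = rng.random()
        if roll < 0.8:                  # 80%: replace with [MASK]
            corrupted[i] = MASK
        elif roll < 0.9:                # 10%: replace with a random token
            corrupted[i] = rng.choice(vocab)
        # remaining 10%: leave the token unchanged
    return corrupted, labels
```

The loss would be computed only at positions where `labels` is not `None`; all other positions are ignored.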

Next Sentence Prediction (NSP)

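Binary pre-training task in which the model predicts whether sentence B actually follows sentence A in the original text. In the original setup, half of the training pairs are true consecutive sentences and half pair A with a random sentence from the corpus. The objective was intended to help with inter-sentence tasks such as QA and NLI, but its value is disputed: later work such as RoBERTa reported that removing NSP did not hurt downstream performance.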
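
As a rough sketch of how NSP training pairs might be built, the snippet below draws a true next sentence half of the time and a random one otherwise. The function name `make_nsp_pairs` and its arguments are hypothetical; a real pipeline would also handle length limits and document boundaries.

```python
import random

def make_nsp_pairs(sentences, other_sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples.

    sentences:       ordered sentences of one document
    other_sentences: pool of sentences from other documents (negatives)
    """
    rng = random.Random(seed)
    examples = []
    for a, true_next in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            # positive example: B really follows A
            examples.append((a, true_next, True))
        else:
            # negative example: B is a random sentence from elsewhere
            examples.append((a, rng.choice(other_sentences), False))
    return examples
```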

WordPiece Tokenization

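Subword segmentation algorithm that splits words into units from a fixed vocabulary learned from corpus statistics; the units are data-driven and do not necessarily correspond to morphemes. At tokenization time, each word is matched greedily against the vocabulary, longest piece first, and non-initial pieces carry a ## prefix (with a small vocabulary, playing could be split into play and ##ing). This lets BERT represent rare words and neologisms as sequences of known tokens instead of mapping them to an unknown token.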
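
The greedy longest-match-first lookup used at tokenization time can be sketched as follows. The vocabulary here is a toy set invented for the example, and the function covers a single word only; it does not show how the vocabulary itself is learned.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary, assumed for illustration only
toy_vocab = {"play", "##ing", "un", "##afford", "##able"}
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unaffordable", toy_vocab))
```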

Self-Attention Mechanism

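Core mechanism in which each token computes attention weights over every token in the sequence, itself included, and takes a weighted sum of their value vectors. The weights come from scaled dot products between query and key vectors, normalized with a softmax. Because any two positions interact directly, BERT can capture long-distance dependencies and build rich contextual representations.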
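
A minimal single-head sketch of scaled dot-product attention, written with plain Python lists so it runs without extra libraries. It assumes the query, key, and value vectors are given directly; in BERT they come from learned linear projections of the hidden states, and several heads run in parallel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Return (outputs, weights) for one attention head.

    queries, keys, values: lists of equal-length float vectors, one per token.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # score of this token against every token, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.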

Segment Embeddings

Learned embeddings that mark which segment each token belongs to, typically sentence A or sentence B in sentence-pair tasks. They are added to the token and position embeddings to form the input representation, which lets the model tell the two segments apart.
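
To make the segment assignment concrete, the sketch below builds the token sequence and segment IDs for a sentence pair. The helper name `build_pair_input` is hypothetical; the layout ([CLS] A [SEP] B [SEP], with segment 0 for A and segment 1 for B) follows the usual BERT input format.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A, and the first [SEP] belong to segment 0;
    # sentence B and the final [SEP] belong to segment 1
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(["how", "are", "you"], ["fine"])
for tok, seg in zip(tokens, segment_ids):
    print(seg, tok)
```

Each segment ID would then index a small embedding table whose rows are added to the token and position embeddings.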

Transformer Encoder Block

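Fundamental computational unit of BERT: a multi-head self-attention sublayer followed by a position-wise feed-forward network, each wrapped in a residual connection and followed by layer normalization. Each block processes all positions of the sequence in parallel, so every token can attend to every other token. BERT-base stacks 12 such blocks and BERT-large stacks 24.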
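
The residual-then-normalize wiring of a block can be sketched as below. This is a structural illustration only: the `attention` and `feed_forward` arguments are placeholder callables supplied by the caller, and the layer normalization omits the learned scale and bias as well as dropout.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(x, y):
    """Element-wise sum of two vectors (the residual connection)."""
    return [a + b for a, b in zip(x, y)]

def encoder_block(hidden, attention, feed_forward):
    """Apply LayerNorm(x + Sublayer(x)) twice.

    hidden:       list of vectors, one per token
    attention:    callable mapping the whole sequence to a new sequence
    feed_forward: callable mapping a single vector to a vector
    """
    attended = attention(hidden)                      # sees all tokens at once
    hidden = [layer_norm(add(x, a)) for x, a in zip(hidden, attended)]
    transformed = [feed_forward(x) for x in hidden]   # applied per position
    return [layer_norm(add(x, t)) for x, t in zip(hidden, transformed)]
```

The output has the same shape as the input, which is what allows blocks to be stacked.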

Pooling Layer

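Final step that aggregates token representations into a single vector for classification tasks. BERT's own pooler takes the final hidden state of the [CLS] token and passes it through a dense layer with a tanh activation; mean pooling over all non-padding tokens is a common alternative for sentence embeddings.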
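
The two pooling strategies can be compared with a small sketch. Both functions are illustrative helpers, not library APIs; the [CLS] variant leaves out the dense layer and tanh that BERT's pooler applies, and the attention mask marks real tokens with 1 and padding with 0.

```python
def cls_pooling(hidden_states):
    """Use the vector at position 0, where [CLS] sits."""
    return hidden_states[0]

def mean_pooling(hidden_states, attention_mask):
    """Average the vectors of non-padding tokens only."""
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(h[c] for h in kept) / len(kept) for c in range(dim)]

# three token vectors; the last one is padding
hidden = [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(cls_pooling(hidden))         # [1.0, 1.0]
print(mean_pooling(hidden, mask))  # [2.0, 3.0]
```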

Hidden States

High-dimensional vectors produced at each Transformer layer for each token in the sequence (768 dimensions in BERT-base, 1024 in BERT-large). Lower layers tend to capture more surface-level features, while higher layers capture increasingly abstract, semantic ones.

Pre-training

Self-supervised training phase on large unlabeled corpora in which BERT learns general-purpose language representations via MLM and NSP. It gives the model its foundation of linguistic knowledge before task-specific fine-tuning.

Encoder-Only Architecture

Structure of BERT, which uses only the encoder stack of the Transformer, unlike encoder-decoder models. Without a decoder, it is suited to text understanding tasks such as classification rather than to free-form text generation.

[CLS] Token

Special token placed at the start of every input sequence; its final hidden state serves as the sequence-level representation for classification tasks. Through self-attention, it can aggregate contextual information from the whole sequence to support sequence-level decisions.
