AI Glossary
The complete dictionary of Artificial Intelligence
Encoder Stack
Stack of identical layers that transforms the input sequence into contextual representations; each layer combines a self-attention sub-layer with a feed-forward sub-layer.
Decoder Stack
Stack of layers that generates the output sequence one token at a time, using masked self-attention to prevent leakage of future information and cross-attention to read the encoder's output.
Encoder-Decoder Attention
Mechanism, also called cross-attention, in which the decoder's queries attend over the encoder's output representations, so that each generated token can draw on the relevant parts of the input.
Layer Normalization
Training stabilization technique that normalizes the activations at each position across the feature dimension, applied either before (pre-norm) or after (post-norm) each sub-layer in the transformer architecture.
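As a rough illustration, the normalization can be sketched in NumPy as follows. The function name `layer_norm`, the parameters `gamma` and `beta` (learned scale and shift), and the `eps` value are assumptions chosen for this example, not taken from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position (row) over its feature dimension.

    gamma and beta stand in for the learned scale and shift;
    eps is an assumed small constant that avoids division by zero.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two positions with four features each (made-up values).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 50.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` to zeros, each row of `y` has mean close to 0 and standard deviation close to 1, regardless of the scale of the corresponding row of `x`.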
Masked Self-Attention
Variant of self-attention used in decoders where future positions are masked to prevent the use of information not available during generation.
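One common way to apply the mask, sketched here in NumPy, is to set the scores of future positions to negative infinity before the softmax so that they receive zero weight. The function name `causal_attention_weights` and the all-zero score matrix are hypothetical and serve only to make the pattern visible.

```python
import numpy as np

def causal_attention_weights(scores):
    """Turn a square (n, n) score matrix into causal attention weights."""
    n = scores.shape[-1]
    # True strictly above the diagonal: position i must not see j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Equal raw scores, so the effect of the mask is easy to read.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
```

In this toy case the first position attends only to itself, the second splits its weight evenly between the first two positions, and so on; every entry above the diagonal is exactly zero.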
Scaled Dot-Product Attention
Attention calculation that divides the query-key dot products by the square root of the key dimension, keeping the softmax out of saturated regions where gradients become very small.
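A minimal single-head sketch of this calculation in NumPy, without batching or masking, might look like the following. The helper names and the random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # scale by sqrt of key dimension
    weights = softmax(scores)         # each row sums to 1
    return weights @ v, weights

# Made-up shapes: 3 queries, 5 keys/values, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 16))
out, weights = scaled_dot_product_attention(q, k, v)
```

Each output row is a weighted average of the value vectors, so `out` has one row per query and as many columns as the value dimension.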
Attention Heads
Parallel attention operations within multi-head attention, each working in its own learned projection subspace and able to focus on different types of relationships and patterns in the data.
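The sketch below shows only the bookkeeping that carves a model-width vector into per-head slices and merges them back; it deliberately leaves out the learned projections and the attention itself. The helper names `split_heads` and `merge_heads` are hypothetical.

```python
import numpy as np

def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_head)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

def merge_heads(x):
    """(num_heads, seq_len, d_head) -> (seq_len, num_heads * d_head)."""
    num_heads, seq_len, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)

# Six positions, model width 8, split across 2 heads of width 4.
x = np.arange(6 * 8, dtype=float).reshape(6, 8)
heads = split_heads(x, num_heads=2)
restored = merge_heads(heads)
```

Each head sees every position but only its own slice of the features, and merging the heads recovers the original layout exactly.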
Token Embedding
Dense, continuous vector representation of each input token; the starting point of the transformer architecture, to which positional information is then added.
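In practice an embedding is typically a row lookup in a learned table. The sketch below uses a made-up four-word vocabulary and a randomly initialized table in place of trained weights; the names `vocab`, `embedding_table`, and `embed` are assumptions for this example.

```python
import numpy as np

# Hypothetical toy vocabulary; real models use learned subword vocabularies.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 4

# Random values stand in for weights that would be learned in training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map each token to its row in the table; unknown tokens use <unk>."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return embedding_table[ids]

vectors = embed(["the", "cat", "dog"])
```

The result has one `d_model`-sized vector per token; "dog" is not in the toy vocabulary, so it falls back to the `<unk>` row.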