AI Glossary
The complete dictionary of Artificial Intelligence
Attention Scaling
Normalization of attention scores by dividing them by the square root of the key dimensionality, keeping their variance roughly constant and stabilizing the training of Transformer models.
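A minimal NumPy sketch of the technique (the function name is our own, and the √dk division is the scaling step being defined):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical safety
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))  # toy queries, keys, values
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the 4 key positions; without the `np.sqrt(d_k)` division, the raw scores would grow with `d_k` and saturate the softmax.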
Dimensional Scaling Factor
The factor √dk by which raw attention scores are divided, where dk is the dimensionality of the query and key vectors in the Transformer architecture.
Gradient Stabilization
Process aimed at keeping gradients within a stable numerical range during backpropagation, essential for preventing training issues in deep networks.
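One way to see why unscaled scores hurt gradients: the softmax Jacobian shrinks toward zero as the distribution saturates. A small illustrative sketch (helper names are our own), comparing the largest Jacobian entry for moderate versus inflated logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax_i / d z_j = p_i * (delta_ij - p_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
z = rng.standard_normal(8)
moderate = np.abs(softmax_jacobian(z)).max()        # logits of scaled size
saturated = np.abs(softmax_jacobian(z * 30)).max()  # large unscaled logits
```

With large logits the distribution is nearly one-hot, so every Jacobian entry `p_i * (delta_ij - p_j)` is close to zero and almost no gradient flows back through the softmax.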
Attention Score Normalization
Normalization of similarity scores before applying Softmax to control the probability distribution and prevent extreme attention concentrations.
Query-Key Dimensionality
Common dimension of query and key vectors in multi-head attention, whose square root determines the normalization scaling factor.
Attention Variance Control
Maintenance of constant variance of attention scores across layers to ensure the numerical stability of the model.
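The variance argument can be checked empirically: for unit-variance queries and keys, a dot product over dk dimensions has variance about dk, and dividing by √dk brings it back to roughly 1. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))  # unit-variance query components
k = rng.standard_normal((10_000, d_k))  # unit-variance key components

raw = (q * k).sum(axis=1)       # dot products: variance grows like d_k
scaled = raw / np.sqrt(d_k)     # scaled scores: variance back near 1
```

Here `raw.var()` comes out near 64 while `scaled.var()` stays near 1, which is exactly the constant-variance property this entry describes.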
Numerical Stability in Attention
Set of techniques ensuring that attention calculations remain within manageable numerical ranges, preventing floating-point overflows and underflows.
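A standard instance of such a technique is the max-shifted softmax: subtracting the row maximum before exponentiating leaves the result unchanged mathematically but prevents floating-point overflow. A minimal sketch (function name is our own):

```python
import numpy as np

def stable_softmax(scores):
    # Subtracting the max makes every exponent <= 0, so exp() cannot overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# np.exp(1000.0) alone overflows to inf; the shifted version stays finite.
big = np.array([1000.0, 1001.0, 1002.0])
p = stable_softmax(big)
```

The output `p` is a finite, valid probability distribution even though the raw scores are far outside the range `exp` can represent.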
Score Distribution Sharpening
Phenomenon where, without proper normalization, large dot-product scores push the softmax toward a nearly one-hot distribution, concentrating attention on a single position and degrading gradient flow.
Multi-Head Attention Scaling
Application of the √dk scaling factor independently to each attention head in the multi-head architecture to maintain consistency across parallel representations.
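A sketch of the per-head view (function name and shapes are our own): the model dimension is split across heads, and each head scales by the square root of its own head dimension dk = d_model / num_heads, not of the full d_model:

```python
import numpy as np

def multi_head_scaled_scores(Q, K, num_heads):
    """Split d_model across heads; each head scales by sqrt of its own d_k."""
    seq, d_model = Q.shape
    d_k = d_model // num_heads
    Qh = Q.reshape(seq, num_heads, d_k).transpose(1, 0, 2)  # (heads, seq, d_k)
    Kh = K.reshape(seq, num_heads, d_k).transpose(1, 0, 2)
    return Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)        # (heads, seq, seq)

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((2, 6, 32))  # seq_len 6, d_model 32
scores = multi_head_scaled_scores(Q, K, num_heads=4)        # d_k = 8 per head
```

Applying the same `np.sqrt(d_k)` independently in every head keeps score variance consistent across the parallel representations.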
Embedding Dimension Normalization
Normalization technique based on embedding dimensionality to ensure comparable magnitude of vector representations in the attention space.
Attention Temperature Scaling
Dynamic adjustment of the scaling factor to modulate attention concentration, enabling fine-grained control over attention weight distribution.
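Temperature can be folded into the same division step: dividing scores by a temperature before the softmax sharpens the distribution when the temperature is low and flattens it when high. A small sketch (function name is our own):

```python
import numpy as np

def attention_weights(scores, temperature=1.0):
    # Low temperature -> sharper, more concentrated attention;
    # high temperature -> flatter, closer to uniform.
    z = scores / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical safety
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.5])
sharp = attention_weights(scores, temperature=0.1)
flat = attention_weights(scores, temperature=10.0)
```

The standard √dk division is the special case of a fixed temperature equal to √dk applied to raw dot products.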
Gradient Flow Optimization
Optimization of gradient pathways through attention layers to maintain effective learning in deep networks.
Score Magnitude Regularization
Control of attention score magnitude through normalization to prevent numerical instabilities and improve model convergence.
Attention Entropy Preservation
Maintenance of appropriate entropy levels in attention distributions through normalization, preventing overly sharp or overly uniform distributions.
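The entropy effect is easy to measure: unscaled dot-product scores produce near-zero-entropy (almost one-hot) attention, while √dk scaling keeps entropy at a moderate level. An illustrative sketch (helper names are our own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; clip avoids log(0).
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
d_k = 256
q, k = rng.standard_normal((2, 16, d_k))  # 16 queries, 16 keys

unscaled = q @ k.T                 # scores with variance ~ d_k
scaled = unscaled / np.sqrt(d_k)   # scores with variance ~ 1

h_unscaled = entropy(softmax(unscaled[0]))  # near 0: almost one-hot
h_scaled = entropy(softmax(scaled[0]))      # moderate, bounded by log(16)
```

The scaled distribution keeps a healthy amount of entropy, while the unscaled one collapses onto a single key position.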