AI Glossary
The complete dictionary of artificial intelligence
Multi-Head Self-Attention (MHSA)
Mechanism allowing the model to focus on different parts of the image simultaneously by computing multiple attention matrices in parallel, thus capturing various types of spatial relationships.
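As a rough illustration of the definition above, the sketch below computes several attention matrices in parallel over a set of patch tokens. It uses NumPy only; the function name, weight shapes, and random inputs are assumptions made for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """x: (n, d) patch tokens; w_qkv: (d, 3d); w_out: (d, d)."""
    n, d = x.shape
    head_dim = d // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (n, d)

    def split_heads(t):
        # (n, d) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    # One (n, n) attention matrix per head, computed in parallel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    attn = softmax(scores, axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)  # concatenate heads
    return out @ w_out, attn

# Hypothetical toy input: 16 patches with 32 channels, 4 heads.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))
w_qkv = rng.normal(size=(32, 96)) * 0.1
w_out = rng.normal(size=(32, 32)) * 0.1
y, attn = multi_head_self_attention(tokens, w_qkv, w_out, num_heads=4)
```

Each slice `attn[h]` is a separate attention matrix over the same patches, which is what lets different heads specialize in different spatial relationships.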
Layer Scale
Stabilization technique introduced for deep ViTs in which a learnable per-channel scaling vector, initialized to small values, multiplies the output of each residual branch, making very deep models easier to train.
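A minimal sketch of the idea, assuming a per-channel scale `gamma` initialized to a small constant; the value `1e-4` and the stand-in `branch` function are illustrative assumptions only.

```python
import numpy as np

def layer_scale_residual(x, branch, gamma):
    """Return x + gamma * branch(x), where gamma is a per-channel vector."""
    return x + gamma * branch(x)

dim = 8
gamma = np.full(dim, 1e-4)   # learnable in practice; small initial value (assumption)
x = np.ones((4, dim))        # 4 tokens, 8 channels
branch = lambda t: 10.0 * t  # hypothetical stand-in for an attention or MLP block
y = layer_scale_residual(x, branch, gamma)
```

With a small `gamma`, each block starts out close to the identity function, so the residual branch contributes little until training increases the scale.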
Windowed Attention
Attention mechanism restricted to local non-overlapping windows of the image. For a fixed window size, this reduces computational complexity from O(n²) to O(n), where n is the number of patches.
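The complexity claim can be checked with a small sketch that splits a feature map into windows and counts query-key pairs. The helper names and the 8x8 toy grid are assumptions for illustration.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into (num_windows, window*window, C)."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/win, W/win, win, win, C)
    return x.reshape(-1, window * window, c)

def attention_pairs(h, w, window=None):
    """Number of query-key pairs scored by global or windowed attention."""
    n = h * w
    if window is None:
        return n * n                # global attention: quadratic in n
    return n * window * window      # windowed: linear in n for fixed window

feature_map = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
windows = window_partition(feature_map, window=4)
global_cost = attention_pairs(8, 8)
local_cost = attention_pairs(8, 8, window=4)
```

Attention is then computed independently inside each row of `windows`, so the cost grows with the number of windows rather than with the square of the number of patches.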
Shifted Window Attention
Technique where attention windows are shifted between consecutive layers to enable cross-window connections, thereby improving the model's ability to capture long-range relationships.
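The effect of shifting can be sketched by labeling each patch with the window it falls into, with and without a cyclic shift. The function name and the choice of a shift equal to half the window size are assumptions for the example.

```python
import numpy as np

def window_ids(height, width, window, shift=0):
    """Label each patch of a (height, width) grid with its window index.

    A non-zero shift displaces the window grid cyclically, which is
    equivalent to rolling the feature map before partitioning it.
    """
    rows = ((np.arange(height) + shift) % height) // window
    cols = ((np.arange(width) + shift) % width) // window
    return rows[:, None] * (width // window) + cols[None, :]

regular = window_ids(8, 8, window=4)
shifted = window_ids(8, 8, window=4, shift=2)

# Patches (3, 3) and (4, 4) are neighbors but sit in different regular windows.
same_before = regular[3, 3] == regular[4, 4]
same_after = shifted[3, 3] == shifted[4, 4]
```

Alternating the two layouts across layers lets information cross window borders. With a cyclic shift, patches from opposite image edges can land in the same window, so Swin-style implementations mask attention between them.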
DeiT (Data-efficient Image Transformer)
Variant of ViT trainable with more modest amounts of data through a knowledge distillation strategy where a distillation token is added to learn from a CNN teacher.
Distillation Token
Additional token in DeiT that learns to mimic the predictions of a teacher model (often a CNN), facilitating knowledge transfer and improving performance with less data.
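One way to picture how the distillation token is trained is a hard-label distillation loss: the class-token head is supervised by the ground-truth label, and the distillation-token head by the teacher's predicted class. The sketch below assumes equal weighting of the two terms; function names and toy inputs are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy for integer class targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Class head learns from true labels, distillation head from the teacher."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return (0.5 * cross_entropy(cls_logits, labels)
            + 0.5 * cross_entropy(dist_logits, teacher_labels))

# Toy batch of 2 images and 4 classes (all values are made up).
cls_logits = np.zeros((2, 4))    # output of the class-token head
dist_logits = np.zeros((2, 4))   # output of the distillation-token head
labels = np.array([0, 1])
teacher_logits = np.array([[0.0, 5.0, 0.0, 0.0],
                           [0.0, 0.0, 0.0, 5.0]])
loss = hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits)
```

At inference time the predictions of the two heads can be combined, since each token has learned from a different source of supervision.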
Masked Autoencoder (MAE)
Self-supervised approach for ViT where a large fraction of image patches (typically around 75%) is randomly masked and the model learns to reconstruct the missing content from the visible patches, yielding strong representations without labels.
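The masking step can be sketched as follows, assuming patches are already flattened into a sequence. The function name, the 196-patch grid, and the fixed seed are assumptions for illustration; only the visible patches would be passed to the encoder.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; return them with a boolean mask."""
    if rng is None:
        rng = np.random.default_rng()
    n = patches.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:num_keep])
    mask = np.ones(n, dtype=bool)  # True means "masked, to be reconstructed"
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx

# Hypothetical input: a 14x14 grid of patches, 16 features each.
patches = np.random.default_rng(0).normal(size=(196, 16))
visible, mask, keep_idx = random_masking(
    patches, mask_ratio=0.75, rng=np.random.default_rng(1)
)
```

The reconstruction loss is then evaluated only at positions where `mask` is true.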
Patch Merging
Operation in hierarchical transformers that merges each group of 2x2 adjacent patches into a single token, halving the spatial resolution while increasing the channel dimension and the receptive field.
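A minimal sketch of this operation, assuming a Swin-style design in which the four neighbors are concatenated along the channel axis and projected from 4C to 2C channels. The normalization layer usually applied before the projection is omitted here, and the weight matrix is random for illustration.

```python
import numpy as np

def patch_merging(x, w_reduce):
    """x: (H, W, C) tokens; w_reduce: (4C, 2C). Returns (H/2, W/2, 2C)."""
    top_left = x[0::2, 0::2]
    bottom_left = x[1::2, 0::2]
    top_right = x[0::2, 1::2]
    bottom_right = x[1::2, 1::2]
    # Stack each 2x2 neighborhood along the channel axis: (H/2, W/2, 4C).
    merged = np.concatenate(
        [top_left, bottom_left, top_right, bottom_right], axis=-1
    )
    return merged @ w_reduce  # linear projection to 2C channels

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 8, 6))        # hypothetical 8x8 grid, C = 6
w_reduce = rng.normal(size=(24, 12)) * 0.1
merged_tokens = patch_merging(tokens, w_reduce)
```

The token count drops by a factor of four at each stage, which is what gives hierarchical transformers their pyramid of feature maps.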
Relative Position Bias
Bias added to attention scores that depends on the relative positions of patches, improving the model's ability to understand spatial relationships without absolute position encoding.
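The sketch below shows one common way to realize such a bias inside a square window: every pair of patches is mapped to an index that depends only on their relative offset, and that index selects a learnable value from a small table. The function name, the 3x3 window, and the random table are assumptions for the example.

```python
import numpy as np

def relative_position_index(window):
    """Map each (query, key) pair in a window to an index into a bias table."""
    grid = np.meshgrid(np.arange(window), np.arange(window), indexing="ij")
    coords = np.stack(grid).reshape(2, -1)         # (2, n) row/col per patch
    rel = coords[:, :, None] - coords[:, None, :]  # (2, n, n) relative offsets
    rel = rel + (window - 1)                       # shift into [0, 2*window - 2]
    return rel[0] * (2 * window - 1) + rel[1]      # (n, n) flat table index

window = 3
index = relative_position_index(window)

# One learnable bias per possible relative offset (random here, single head).
rng = np.random.default_rng(0)
bias_table = rng.normal(size=(2 * window - 1) ** 2)
bias = bias_table[index]                           # (9, 9)

scores = rng.normal(size=(9, 9))                   # hypothetical attention logits
scores_with_bias = scores + bias                   # added before the softmax
```

Pairs of patches with the same relative offset share the same bias value, regardless of where they sit in the window.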
Hybrid Architecture
Approach combining an initial convolutional network for feature extraction with a transformer for global processing, used in early ViT implementations to reduce data requirements.
Token Labeling
Training strategy where each patch token receives its own supervised label, in addition to the single image-level label, forcing the model to learn richer and more localized representations.
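A sketch of how such a loss might be assembled, assuming per-patch soft labels are already available (for example, produced by a pretrained annotator model) and that the patch term is weighted by a hypothetical factor `beta`. All names and values are illustrative.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_labeling_loss(cls_logits, patch_logits, image_label,
                        patch_soft_labels, beta=0.5):
    """Image-level cross-entropy plus a weighted per-patch cross-entropy.

    cls_logits: (K,), patch_logits: (n, K),
    image_label: (K,) probability vector, patch_soft_labels: (n, K).
    """
    image_loss = -(image_label * log_softmax(cls_logits)).sum()
    token_loss = -(patch_soft_labels * log_softmax(patch_logits)).sum(axis=-1).mean()
    return image_loss + beta * token_loss

num_classes, num_patches = 4, 9
cls_logits = np.zeros(num_classes)
patch_logits = np.zeros((num_patches, num_classes))
image_label = np.eye(num_classes)[2]                             # one-hot image label
patch_soft_labels = np.full((num_patches, num_classes), 0.25)    # made-up soft labels
loss = token_labeling_loss(cls_logits, patch_logits, image_label, patch_soft_labels)
```

Because every patch contributes its own term, the gradient signal is spread across the whole token sequence rather than concentrated on a single classification output.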