AI Glossary
The complete dictionary of Artificial Intelligence
Vision Transformer (ViT)
Neural architecture that applies the Transformer's attention mechanisms to image processing by dividing each image into a sequence of patches and treating those patches as tokens.
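A minimal PyTorch sketch of the full pipeline, assuming a 224×224 RGB input, 16×16 patches, and illustrative hyperparameters; the class and its sizes are hypothetical, not a reference implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=10):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        t = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim) patch tokens
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1) + self.pos
        t = self.encoder(t)
        return self.head(t[:, 0])                          # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))            # -> (2, 10)
```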
Patch Embedding
Process of converting image patches into fixed-dimensional embedding vectors through a learned linear projection, producing the token sequence fed into the Transformer.
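A common realization, shown here as a PyTorch sketch with illustrative sizes, uses a strided convolution with kernel and stride equal to the patch size, which is equivalent to flattening each patch and applying one shared linear projection:

```python
import torch
import torch.nn as nn

# Patch embedding via a strided convolution: kernel = stride = patch size.
patch, dim = 16, 768
proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = proj(img).flatten(2).transpose(1, 2)  # (1, 196, 768): 14*14 patch tokens
print(tokens.shape)
```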
Class Token
Learnable special token prepended to the embedding sequence whose final representation, after passing through the Transformer, is used for image classification.
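A sketch, assuming PyTorch and ViT-Base dimensions (768-dimensional tokens, 196 patches); the [CLS] vector read out after the encoder feeds a linear classification head:

```python
import torch
import torch.nn as nn

dim, n_patches = 768, 196
cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS] embedding

patch_tokens = torch.randn(8, n_patches, dim)      # a batch of 8 images
x = torch.cat([cls_token.expand(8, -1, -1), patch_tokens], dim=1)  # (8, 197, 768)

# ... x goes through the Transformer encoder ...
cls_out = x[:, 0]                          # final [CLS] representation, (8, 768)
logits = nn.Linear(dim, 1000)(cls_out)     # classification head
```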
Multi-Head Self-Attention
Mechanism that computes several attention representations in parallel, allowing the model to capture different kinds of relationships between image patches simultaneously.
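A compact PyTorch sketch of the computation, assuming 12 heads over 768-dimensional tokens; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.h, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)   # joint Q, K, V projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, heads, N, dk)
        q, k, v = (t.view(B, N, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.dk**0.5    # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # merge heads
        return self.out(y)

y = MultiHeadSelfAttention()(torch.randn(2, 197, 768))     # -> (2, 197, 768)
```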
Transformer Encoder
Fundamental building block composed of self-attention layers and feed-forward networks, each wrapped in layer normalization and a residual connection.
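A sketch of one pre-norm encoder block of the kind used in ViT, written with PyTorch's nn.MultiheadAttention; the ViT-Base dimensions are assumptions:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: attention + MLP, each with a residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.mlp(self.norm2(x))                    # residual 2
        return x

out = EncoderBlock()(torch.randn(2, 197, 768))
```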
Image Patch Tokenization
Process of cutting an image into non-overlapping patches of fixed size, typically 16×16 pixels, which are then flattened and converted into a sequence of tokens.
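A PyTorch sketch of the cutting step alone, with no learned weights, assuming a 224×224 RGB image and 16×16 patches:

```python
import torch

# Cut a 224x224 image into non-overlapping 16x16 patches.
img = torch.randn(1, 3, 224, 224)
P = 16
patches = img.unfold(2, P, P).unfold(3, P, P)    # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # group by patch position
tokens = patches.reshape(1, 14 * 14, 3 * P * P)  # (1, 196, 768) flat patches
```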
Attention Map Visualization
Interpretability technique visualizing attention weights between patches to understand which image regions the model focuses on.
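A minimal sketch, assuming a single 197×197 attention matrix (196 patches plus [CLS]) is already available; the weights here are random placeholders for the demo:

```python
import torch
import matplotlib.pyplot as plt

# Visualize how much the [CLS] token attends to each of the 14x14 patches.
attn = torch.rand(197, 197).softmax(dim=-1)   # placeholder attention weights

cls_to_patches = attn[0, 1:]                  # drop the CLS->CLS entry
heatmap = cls_to_patches.reshape(14, 14)      # back onto the patch grid

plt.imshow(heatmap.numpy(), cmap="viridis")
plt.title("[CLS] attention over image patches")
plt.colorbar()
plt.show()
```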
Pre-training on Large Datasets
Initial training phase on datasets of millions of images, such as ImageNet-21k, to learn general visual representations before fine-tuning on a downstream task.
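A sketch of the downstream side of this workflow, assuming torchvision ≥ 0.13 and an arbitrary 10-class task: load ImageNet-pretrained weights and swap the classification head:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT pre-trained on a large dataset, then replace the head
# for the downstream task (10 classes here, an arbitrary choice).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Fine-tune only the new head (a common, cheap starting point).
for p in model.parameters():
    p.requires_grad = False
for p in model.heads.head.parameters():
    p.requires_grad = True
```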
Patch Size Hyperparameter
Crucial hyperparameter defining the size of the image patches; it determines the token count and thereby directly influences computational complexity and model performance.
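A quick computation showing how the patch size P drives the token count N = (224/P)², and hence the quadratic attention cost, for a 224×224 input:

```python
for P in (8, 16, 32):
    n = (224 // P) ** 2
    print(f"patch {P:2d}x{P:<2d} -> {n:4d} tokens, attention matrix {n}x{n}")
# patch  8x8  ->  784 tokens, attention matrix 784x784
# patch 16x16 ->  196 tokens, attention matrix 196x196
# patch 32x32 ->   49 tokens, attention matrix 49x49
```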
Token-to-Patch Reconstruction
Reverse process in generative tasks where tokens are converted back into image patches to reconstruct the original image.
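A PyTorch sketch of that last decoding step, assuming ViT-Base sizes and a hypothetical to_pixels projection:

```python
import torch
import torch.nn as nn

# Project each token back to 16x16x3 pixel values and fold the
# patch grid into an image (sketch of a decoder's final step).
B, N, dim, P = 1, 196, 768, 16
tokens = torch.randn(B, N, dim)

to_pixels = nn.Linear(dim, 3 * P * P)                    # token -> flat patch
patches = to_pixels(tokens).reshape(B, 14, 14, 3, P, P)
image = patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, 224, 224)
```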
Hierarchical Vision Transformer
Variant of ViT, exemplified by the Swin Transformer, that uses a pyramid structure with progressively coarser patch grids to capture multi-scale features.
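A sketch of the patch-merging step that builds such a pyramid, in the style of the Swin Transformer, assuming a 56×56 token grid with 96 channels:

```python
import torch
import torch.nn as nn

# Patch merging: concatenate each 2x2 neighborhood of tokens and project
# 4C -> 2C, halving the spatial grid at every stage of the pyramid.
B, H, W, C = 1, 56, 56, 96
x = torch.randn(B, H, W, C)

merged = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                    x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, 28, 28, 4C)
reduce = nn.Linear(4 * C, 2 * C)
out = reduce(merged)    # (B, 28, 28, 192): coarser grid, wider features
```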
Self-Supervised ViT Pre-training
Self-supervised training methods such as DINO or MAE that leverage the Transformer structure to learn visual representations without manual annotations.
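A sketch of the random-masking step at the heart of MAE-style pre-training, assuming 196 tokens of which roughly 25% stay visible:

```python
import torch

# Keep only a random subset of patch tokens; the encoder sees just the
# visible ones and a decoder later reconstructs the masked patches.
B, N, dim, keep = 2, 196, 768, 49
tokens = torch.randn(B, N, dim)

noise = torch.rand(B, N)                    # one random score per token
ids_shuffle = noise.argsort(dim=1)          # random permutation of indices
ids_keep = ids_shuffle[:, :keep]            # indices of visible tokens
visible = torch.gather(
    tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))  # (B, 49, 768)
```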
Cross-Attention in Multi-Modal ViT
Mechanism extending ViT to jointly process images and text using attention between different modalities.
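A sketch using PyTorch's nn.MultiheadAttention, with text tokens as queries and image tokens as keys/values; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Cross-attention: each text token gathers information from image patches.
dim = 256
image_tokens = torch.randn(2, 196, dim)   # keys/values from the image
text_tokens = torch.randn(2, 12, dim)     # queries from the text

cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
fused, weights = cross_attn(query=text_tokens,
                            key=image_tokens,
                            value=image_tokens)
print(fused.shape)    # (2, 12, 256): one image-aware vector per text token
```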
Computational Complexity O(n²)
Quadratic complexity of self-attention with respect to the number of patches, which constitutes the main limitation of Vision Transformers at high resolutions.
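A back-of-the-envelope illustration of the quadratic growth, assuming patch size 16, 12 heads, and float32 attention maps:

```python
# Self-attention stores an N x N weight matrix per head: doubling the image
# resolution quadruples N and multiplies the attention cost by roughly 16.
for side in (224, 448, 896):
    n = (side // 16) ** 2                  # tokens at patch size 16
    floats = 12 * n * n                    # 12 heads, one N x N map each
    print(f"{side}px -> N={n:5d}, attention maps ~{floats * 4 / 2**20:.0f} MiB")
```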