Vision Transformers (ViT)
Multi-Head Self-Attention (MHSA)
Mechanism that lets the model attend to different parts of the image at once: several attention heads run in parallel over the same patch tokens, each computing its own attention matrix, so different heads can capture different types of spatial relationships.
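As a rough illustration of the definition, the sketch below computes one attention matrix per head over a handful of patch tokens. It is a minimal pure-Python sketch, not how any particular ViT implementation does it: the function names, the random weights, and the sizes (4 patches, embedding width 8, 2 heads) are assumptions chosen for readability, and real models use a tensor library and learned weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Return (output, attention) for the patch tokens of one image.

    tokens:    n x d list of patch embeddings
    attention: num_heads matrices, each n x n, one row per query patch
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0, "embedding width must split evenly across heads"
    head_dim = dim // num_heads

    # Project every token to queries, keys, and values.
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)

    attention, head_outputs = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q_h = [row[lo:hi] for row in q]
        k_h = [row[lo:hi] for row in k]
        v_h = [row[lo:hi] for row in v]

        # Scaled dot-product scores between every pair of patches (n x n).
        k_t = [list(col) for col in zip(*k_h)]
        scores = matmul(q_h, k_t)
        scale = math.sqrt(head_dim)
        weights = [softmax([s / scale for s in row]) for row in scores]

        attention.append(weights)
        head_outputs.append(matmul(weights, v_h))

    # Concatenate the heads for each token, then mix them with w_o.
    concat = [[x for head in head_outputs for x in head[i]] for i in range(len(tokens))]
    return matmul(concat, w_o), attention


rng = random.Random(0)
n_patches, dim, heads = 4, 8, 2
tokens = random_matrix(n_patches, dim, rng)
w_q, w_k, w_v, w_o = (random_matrix(dim, dim, rng) for _ in range(4))
output, attention = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, heads)
```

Here `attention` holds one 4 x 4 matrix per head, and each row is a probability distribution over the patches that a given query patch attends to. Because each head works on its own slice of the projected features, the matrices generally differ from head to head.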