AI Dictionary
A complete dictionary of artificial intelligence
Token Fusion
Technique that concatenates or fuses tokens from different modalities before passing them through transformer layers. Enables early integration of multimodal information for a richer joint representation.
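A minimal sketch of early token fusion, assuming text and image tokens have already been projected to the same model width (the shapes and names here are illustrative, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 4 text tokens and 6 image-patch tokens,
# both already projected to a shared model width of 8.
text_tokens = rng.normal(size=(4, 8))
image_tokens = rng.normal(size=(6, 8))

# Early fusion: concatenate along the sequence axis so that subsequent
# transformer layers attend over both modalities jointly.
fused = np.concatenate([text_tokens, image_tokens], axis=0)
print(fused.shape)  # (10, 8)
```

After this step, ordinary self-attention over `fused` mixes information across modalities at every layer.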
ALIGN
Contrastive image-text model trained on over one billion automatically filtered, noisy alt-text pairs. Demonstrates that data scale can compensate for label noise in large-scale multimodal learning.
Flamingo
Vision-language model that bridges frozen pre-trained vision and language models with interleaved gated cross-attention layers. Enables few-shot learning on complex multimodal understanding tasks without full retraining.
Cross-Modal Representation
Shared vector space where embeddings from different modalities are semantically aligned to enable cross-modal interactions. Facilitates knowledge transfer and unified understanding between text, images, audio, and video.
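A toy illustration of a shared cross-modal space, assuming the encoders have already been trained so that matching pairs land near each other (the embedding values below are made up for demonstration):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy shared space (dim 4): hand-crafted embeddings standing in for the
# outputs of aligned image and text encoders.
image_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0]])
text_emb  = np.array([[0.9, 0.1, 0.0, 0.0],   # caption for image 0
                      [0.1, 0.9, 0.0, 0.0]])  # caption for image 1

sim = cosine_sim(text_emb, image_emb)
# Cross-modal retrieval: each caption's nearest image is its true match.
print(sim.argmax(axis=1))  # [0 1]
```

Because both modalities share one metric space, the same similarity function supports text-to-image and image-to-text retrieval.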
MViT (Multiscale Vision Transformer)
Video transformer architecture that builds a feature pyramid, combining features at multiple temporal and spatial scales. Uses pooling attention to shrink the token grid between stages, efficiently capturing long-range relationships in video sequences.
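A simplified sketch of the multiscale idea: between stages, the spatial token grid is pooled so that later layers operate on fewer, coarser tokens. This shows only the pooling step, not the full pooling-attention mechanism; the function name and shapes are illustrative:

```python
import numpy as np

def pool_spatial(x, stride=2):
    # x: (H, W, C) token grid. Strided average pooling mimics how a
    # multiscale video transformer reduces spatial resolution between
    # stages (in the real model, channel width grows as resolution shrinks).
    H, W, C = x.shape
    x = x[:H - H % stride, :W - W % stride]
    x = x.reshape(H // stride, stride, W // stride, stride, C)
    return x.mean(axis=(1, 3))

tokens = np.arange(8 * 8 * 4, dtype=float).reshape(8, 8, 4)
stage2 = pool_spatial(tokens)   # coarser grid: (4, 4, 4)
stage3 = pool_spatial(stage2)   # coarser still: (2, 2, 4)
print(stage2.shape, stage3.shape)
```

Each stage sees a smaller grid, so attention cost drops sharply while the receptive field per token grows.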
Multi-Head Cross Attention
Extension of multi-head attention in which each head learns a different cross-modal correspondence, allowing richer and more diverse capture of inter-modal relationships in multimodal transformer architectures.
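A minimal numpy sketch of multi-head cross attention: queries come from one modality (here text) while keys and values come from another (here image tokens), and each head has its own projections. All dimensions and weight initializations are illustrative assumptions:

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens, n_heads, rng):
    # Queries from one modality attend to keys/values from another.
    # Each head gets independent random projections (untrained, for
    # illustration only).
    d = q_tokens.shape[-1]
    dh = d // n_heads
    outs = []
    for _ in range(n_heads):
        Wq = rng.normal(size=(d, dh)) / np.sqrt(d)
        Wk = rng.normal(size=(d, dh)) / np.sqrt(d)
        Wv = rng.normal(size=(d, dh)) / np.sqrt(d)
        q, k, v = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
        scores = q @ k.T / np.sqrt(dh)            # (n_q, n_kv)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over kv
        outs.append(weights @ v)                  # (n_q, dh)
    # Concatenate head outputs back to the full model width.
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, width 8
image = rng.normal(size=(6, 8))   # 6 image-patch tokens
out = cross_attention(text, image, n_heads=2, rng=rng)
print(out.shape)  # (4, 8)
```

Each of the two heads attends to the image tokens with its own learned (here random) projections, so different heads can specialize in different cross-modal correspondences.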