AI Dictionary
A complete dictionary of artificial intelligence
Multi-Modal Transformer
Extended Transformer architecture capable of simultaneously processing multiple data modalities (text, image, audio) using cross-attention mechanisms to integrate inter-modal information.
Vision-Language Transformer
Transformer architecture specifically designed to jointly understand and generate visual and textual content, using shared or separate encoders for each modality.
Fusion Mechanism
Algorithmic strategy for effectively combining representations of different modalities at one or more levels of the network, including early, late, or hierarchical fusion.
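The difference between early and late fusion can be illustrated with a minimal numpy sketch (toy random features standing in for real encoder outputs; real systems use learned projections and attention rather than plain concatenation and pooling):

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 16))   # 4 text tokens, feature dim 16
image_feat = rng.normal(size=(9, 16))  # 9 image patches, feature dim 16

# Early fusion: concatenate the token sequences of both modalities
# before feeding them to a single joint Transformer.
early = np.concatenate([text_feat, image_feat], axis=0)   # shape (13, 16)

# Late fusion: encode each modality separately (here approximated by
# mean-pooling), then combine the pooled summary vectors.
late = np.concatenate([text_feat.mean(axis=0), image_feat.mean(axis=0)])  # shape (32,)

print(early.shape, late.shape)
```

Hierarchical fusion interleaves these two extremes, exchanging information at several intermediate layers.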
Modality Embedding
Specific encoding vectors added to token embeddings to indicate the original modality (text, image, audio), allowing the Transformer to distinguish and process each data type differently.
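A minimal sketch of the idea, using randomly initialised toy vectors in place of learned parameters: each token simply receives an additive vector identifying its modality, analogous to how positional embeddings encode position.

```python
import numpy as np

dim = 8
rng = np.random.default_rng(1)
# One embedding vector per modality (learned in a real model; random here).
modality_emb = {"text": rng.normal(size=dim), "image": rng.normal(size=dim)}

def add_modality_embedding(tokens, modality):
    """Add the modality's embedding vector to every token in the sequence."""
    return tokens + modality_emb[modality]

text_tokens = rng.normal(size=(5, dim))
image_tokens = rng.normal(size=(3, dim))

# After tagging, both sequences can be concatenated into one joint input
# while the Transformer can still tell the modalities apart.
fused_input = np.concatenate([
    add_modality_embedding(text_tokens, "text"),
    add_modality_embedding(image_tokens, "image"),
])
print(fused_input.shape)  # (8, 8)
```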
CLIP
Contrastive Language-Image Pre-training model trained on 400 million image-text pairs using a contrastive objective to learn shared representations between vision and language.
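The contrastive objective behind CLIP can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where matching pairs lie on the diagonal. This is a simplified numpy version (no learned temperature or projection heads):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of image-text pairs."""
    # L2-normalise so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch)
    diag = np.arange(len(logits))               # matching pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned embeddings yield a near-zero loss, while mismatched ones are penalised, which pushes both encoders toward a shared representation space.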
VLP
Family of Vision-Language Pre-training models using a shared Transformer encoder for both modalities, with pre-training tasks such as masked language modeling and image-text matching.
Unified Encoder-Decoder
Transformer architecture where the same encoder processes all input modalities, and a decoder generates the output, enabling tasks like VQA, captioning, and retrieval with a single model.
Modality Gap
Inherent structural and semantic difference between the representation spaces of different modalities, requiring specific alignment mechanisms in multi-modal models.
Multi-Modal Fusion
Process of integrating features from different modalities into a unified representation, leveraging inter-modal complementarities to improve performance on complex tasks.
Cross-Modal Alignment
Training objective aimed at semantically aligning representations of different modalities in a shared space, enabling correspondence between visual and linguistic concepts.
Perceiver IO
General Transformer architecture capable of processing any combination of modalities using a cross-attention network between input data and a set of learned latents.
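The key trick is that a small fixed set of latent vectors attends to an input array of arbitrary length and modality, so compute stays linear in input size. A stripped-down sketch (single head, no learned query/key/value projections, toy random data):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(latents, inputs):
    """Latents attend to the (arbitrarily long, any-modality) input array."""
    d = latents.shape[-1]
    attn = softmax(latents @ inputs.T / np.sqrt(d))  # (n_latents, n_inputs)
    return attn @ inputs                             # (n_latents, d)

rng = np.random.default_rng(2)
latents = rng.normal(size=(8, 16))    # small set of learned latents
inputs = rng.normal(size=(500, 16))   # long flattened multimodal input
out = cross_attention(latents, inputs)
print(out.shape)  # (8, 16): output size is independent of input length
```

In the full architecture these latents are then refined by self-attention layers, and a second cross-attention step decodes them into task-specific outputs.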
Flamingo Model
80-billion-parameter vision-language model that connects a frozen pre-trained vision encoder and a frozen language model through gated cross-attention layers, combining them effectively without full retraining.
BLIP
Bootstrapping Language-Image Pre-training framework that improves training-data quality by generating synthetic captions and filtering out noisy ones, built on a multimodal encoder and an image-grounded text decoder.
CoCa
Contrastive Captioners model combining a contrastive objective for representation learning and a generative objective for captioning in a single unified Transformer architecture.
BEiT-3
Bidirectional Encoder representation from Image Transformer v3 model using a multiway Transformer with modality-specific embeddings to process image, text, and image-text in a unified manner.
LayoutLM
Family of document pre-trained models combining 2D spatial layout, text, and visual information for understanding structured documents like forms and invoices.
Uni-Perceiver
Universal perception framework that treats various multi-modal tasks as a unified token generation problem, using a single Transformer model for classification, detection, and generation.
GIT
Generative Image-to-text Transformer model that treats images as a foreign language and uses a simple encoder-decoder architecture for image description and VQA with state-of-the-art performance.