AI Glossary
The complete dictionary of Artificial Intelligence
Cross-modality
Ability of a system to understand and relate information from different modalities, such as text and images, to enrich contextual understanding.
Vision-Language Transformer (VLT)
Transformer architecture pre-trained on large corpora of paired images and texts, designed for multimodal comprehension and generation tasks.
Visual Reasoning
Ability of a visual question answering (VQA) system to infer information that is not explicitly stated, by analyzing spatial relationships, object attributes, or complex scenes in an image.
Visual Grounding
The act of anchoring linguistic concepts (words, phrases) to specific entities or regions in an image or video, creating a tangible semantic link.
Modality-to-Modality Alignment
Learning process that matches segments of one modality (e.g., a sentence) with relevant segments of another (e.g., an image region).
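A minimal sketch of this matching step, assuming the two modalities have already been embedded into a shared vector space (the embeddings below are toy values, not from a real model): a cosine-similarity matrix between text segments and image regions, where each row's argmax gives the aligned region.

```python
import numpy as np

def alignment_matrix(text_embs, region_embs):
    """Cosine-similarity matrix between text segments (rows) and
    image regions (columns) in a shared embedding space."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    return t @ r.T

# Toy embeddings: sentence 0 should align with region 1, sentence 1 with region 0.
text = np.array([[0.0, 1.0], [1.0, 0.0]])
regions = np.array([[0.9, 0.1], [0.1, 0.9]])
sim = alignment_matrix(text, regions)
aligned = sim.argmax(axis=1)  # index of the best-matching region per sentence
```

In practice the embeddings come from trained encoders and the alignment is learned with a contrastive or attention-based objective; the similarity-then-argmax step shown here is only the final matching.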
Vector Quantization (VQ) Codebook
Technique used in multimodal models to discretize continuous representations (e.g., of images) into a finite set of discrete tokens, facilitating their processing by language models.
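The core of vector quantization can be sketched in a few lines, assuming a fixed codebook (in a real VQ-VAE the codebook is learned jointly with the encoder): each continuous vector is replaced by the index of its nearest codebook entry.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest
    codebook entry (Euclidean distance), returning discrete token ids
    and the corresponding quantized vectors."""
    # Pairwise distances: (n_latents, n_codes)
    d = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    indices = d.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 discrete codes of dimension 4
latents = rng.normal(size=(5, 4))    # 5 continuous latent vectors
ids, quantized = quantize(latents, codebook)
```

The resulting integer ids are what a language model consumes as tokens.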
Multimodal MLP Head
Multi-layer perceptron (MLP) that takes fused features from multiple modalities as input to perform a final classification or regression task.
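A minimal sketch of such a head, with hypothetical dimensions (10-d fused features, 3 output classes) and random weights standing in for trained parameters:

```python
import numpy as np

def mlp_head(fused, w1, b1, w2, b2):
    """Two-layer MLP over fused multimodal features: a ReLU hidden
    layer followed by a linear classification layer."""
    h = np.maximum(fused @ w1 + b1, 0.0)
    return h @ w2 + b2

rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 10))          # e.g. concatenated image+text features
w1, b1 = rng.normal(size=(10, 6)), np.zeros(6)
w2, b2 = rng.normal(size=(6, 3)), np.zeros(3)
logits = mlp_head(fused, w1, b1, w2, b2)  # one logit vector per example
```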
Two-Stream Fusion Model
Architecture where each modality is processed by a separate neural network (a stream) before their representations are combined for joint decision-making.
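The two-stream idea can be sketched with linear streams and concatenation as the fusion step (a simplification: real models use deep encoders and richer fusion such as cross-attention; all dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def stream(x, w):
    """One modality-specific stream: a linear layer with ReLU."""
    return np.maximum(x @ w, 0.0)

# Hypothetical dimensions: 16-d image features, 12-d text features,
# each projected to an 8-d stream representation.
w_img = rng.normal(size=(16, 8))
w_txt = rng.normal(size=(12, 8))
w_head = rng.normal(size=(16, 3))  # joint head over the concatenated streams

def two_stream(img, txt):
    """Process each modality separately, then fuse by concatenation
    and apply a joint decision layer."""
    h = np.concatenate([stream(img, w_img), stream(txt, w_txt)], axis=-1)
    return h @ w_head

logits = two_stream(rng.normal(size=(4, 16)), rng.normal(size=(4, 12)))
```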
Multimodal Information Retrieval
Task of retrieving relevant documents (e.g., images) from a query in another modality (e.g., text), based on their similarity in a shared embedding space.
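Once both modalities are embedded in the shared space, retrieval reduces to a nearest-neighbor search by cosine similarity. A sketch with toy 3-d embeddings (a real system would produce these with trained text and image encoders):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, doc_embs, k=2):
    """Rank documents by cosine similarity to the query in the shared
    embedding space and return the indices of the top-k matches."""
    sims = l2_normalize(doc_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]

# Toy shared space: document 2 points almost the same way as the query.
query = np.array([1.0, 0.0, 0.0])          # e.g. a text query embedding
docs = np.array([[0.0, 1.0, 0.0],          # e.g. image embeddings
                 [0.5, 0.5, 0.0],
                 [0.9, 0.1, 0.0]])
top = retrieve(query, docs, k=2)
```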
Conditional Response Generation
Process where a language model generates a textual response whose content is conditioned and guided by information extracted from a non-textual modality such as an image.
Image Tokenization
Process of converting an image into a sequence of discrete tokens, often via a VAE or VQ-VAE, to make it compatible with Transformer-type architectures.
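A simplified end-to-end sketch of this pipeline, assuming a fixed codebook over flattened patches (a real VQ-VAE quantizes learned encoder latents, not raw pixels):

```python
import numpy as np

def patchify(image, p):
    """Split an HxW image into non-overlapping pxp patches,
    each flattened to a vector, in raster order."""
    h, w = image.shape
    patches = image.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def tokenize(image, codebook, p=2):
    """Map each patch to the id of its nearest codebook vector,
    yielding a discrete token sequence a Transformer can consume."""
    patches = patchify(image, p)
    d = np.linalg.norm(patches[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
image = rng.random((4, 4))        # toy 4x4 grayscale image
codebook = rng.random((16, 4))    # 16 codes, one per flattened 2x2 patch
tokens = tokenize(image, codebook)  # 4 tokens, one per patch
```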