AI Glossary
The Complete Dictionary of Artificial Intelligence
Vision-Language Model (VLM)
Subclass of multimodal models specialized in joint understanding of text and images, capable of tasks like image captioning, visual reasoning, or image generation from text.
Visual Tokenization
Technique that splits an image into a sequence of patches or discrete tokens, often via a neural network such as a Vision Transformer (ViT), so that it can be processed by a token-based transformer architecture.
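A minimal sketch of ViT-style patch tokenization in PyTorch; the patch size and embedding width follow common ViT defaults, and the class name is our own:

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative ViT-style patch tokenizer (sizes are assumptions)."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A strided convolution is the standard trick: each patch_size x patch_size
        # window becomes one embedding vector, i.e. one "visual token".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):               # images: (B, 3, H, W)
        x = self.proj(images)                # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D) token sequence

tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```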
Alignment Model
Model, often based on a contrastive encoder like CLIP, trained on immense corpora of (image, text) pairs to learn to project both modalities into a shared vector space where cosine similarity reflects their mutual relevance.
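A sketch of the resulting similarity computation; the embeddings below are random placeholders standing in for the outputs of pretrained image and text towers:

```python
import torch
import torch.nn.functional as F

def alignment_score(image_emb, text_emb):
    # Normalize onto the unit sphere so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.T  # (num_images, num_texts) similarity matrix

img = torch.randn(4, 512)  # placeholder image embeddings
txt = torch.randn(4, 512)  # placeholder caption embeddings
print(alignment_score(img, txt).shape)  # (4, 4): each image vs. each caption
```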
Multimodal Conditional Generation
Generation task where the output (e.g., text, image) is produced based on one or more inputs of different modalities, such as describing an image or creating an image from text.
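One common mechanism is prefix conditioning, sketched below: an image embedding is projected into the decoder's token space and prepended to the text tokens, so generation is conditioned on the visual input. All modules and sizes are assumptions; positional encodings and the causal mask are omitted for brevity:

```python
import torch
import torch.nn as nn

class ConditionalCaptioner(nn.Module):
    """Sketch of image-conditioned text generation via a soft prefix token."""
    def __init__(self, img_dim=512, d_model=256, vocab=1000):
        super().__init__()
        self.prefix = nn.Linear(img_dim, d_model)  # image -> one "soft token"
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)  # stand-in decoder
        self.head = nn.Linear(d_model, vocab)

    def forward(self, image_emb, token_ids):
        prefix = self.prefix(image_emb).unsqueeze(1)  # (B, 1, d_model)
        text = self.embed(token_ids)                  # (B, T, d_model)
        x = torch.cat([prefix, text], dim=1)          # condition by prepending
        return self.head(self.blocks(x))              # next-token logits

model = ConditionalCaptioner()
logits = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # (2, 8, 1000)
```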
Multimodal Chain-of-Thought Reasoning
Ability of a model to use information from multiple modalities to construct a step-by-step chain of reasoning and reach a conclusion, for example by analyzing a chart and accompanying text to answer a question.
Multimodal Perceptron
Theoretical concept or early architecture in which inputs of different modalities are combined, often by concatenation or another fusion operation, before being processed by fully connected neural layers.
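A minimal illustration, with assumed feature dimensions:

```python
import torch
import torch.nn as nn

class MultimodalPerceptron(nn.Module):
    """Concatenation fusion followed by fully connected layers (sketch)."""
    def __init__(self, img_dim=512, txt_dim=300, hidden=256, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),  # fusion by concatenation
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.mlp(fused)

out = MultimodalPerceptron()(torch.randn(8, 512), torch.randn(8, 300))
print(out.shape)  # (8, 10)
```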
Multimodal Diffusion Model
Generative architecture that uses an iterative noising and denoising process to create data (e.g., images) conditioned on another modality (e.g., a text description), guiding the denoising with the conditioning information.
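Below is a sketch of one denoising step with classifier-free guidance, a common way to apply the conditioning; eps_model is a toy stand-in for the noise-prediction network:

```python
import torch

def guided_denoise_step(eps_model, x_t, t, cond, uncond, scale=7.5):
    eps_c = eps_model(x_t, t, cond)    # noise prediction with conditioning
    eps_u = eps_model(x_t, t, uncond)  # unconditional prediction
    # Guidance pushes the estimate toward the conditioned direction.
    return eps_u + scale * (eps_c - eps_u)

eps_model = lambda x, t, c: 0.1 * x + 0.01 * c.mean()  # toy stand-in network
x_t = torch.randn(1, 4, 64, 64)        # noisy latent at step t
cond = torch.randn(1, 77, 768)         # placeholder text embedding
eps = guided_denoise_step(eps_model, x_t, torch.tensor([10]),
                          cond, torch.zeros_like(cond))
print(eps.shape)  # (1, 4, 64, 64)
```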
Separate Encoding vs Unified Encoding
Two architectural strategies for multimodal models: separate encoding processes each modality with a dedicated encoder before fusion, while unified encoding uses a single transformer to process a sequence of mixed tokens.
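A compact sketch of both patterns, with stand-in linear layers for the dedicated encoders and an assumed shared width:

```python
import torch
import torch.nn as nn

d = 256  # assumed shared model width

# Separate encoding: one dedicated encoder per modality, fusion afterwards.
img_encoder = nn.Linear(512, d)  # stand-in for a vision tower
txt_encoder = nn.Linear(300, d)  # stand-in for a text tower
img_z = img_encoder(torch.randn(1, 10, 512))  # 10 visual tokens
txt_z = txt_encoder(torch.randn(1, 20, 300))  # 20 text tokens
mixed = torch.cat([img_z, txt_z], dim=1)      # late fusion into one sequence

# Unified encoding: a single transformer processes the mixed token sequence.
layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
unified = nn.TransformerEncoder(layer, num_layers=2)
print(unified(mixed).shape)  # (1, 30, 256)
```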
Multimodal Zero-Shot Learning
Ability of a model to perform a task on one modality (e.g., classifying an image) without having been explicitly trained for it, by leveraging knowledge transferred from another modality (e.g., the text of class labels).
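The classic instance is CLIP-style zero-shot classification, sketched here with random placeholder embeddings in place of real aligned encoders:

```python
import torch
import torch.nn.functional as F

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Placeholders for the outputs of pretrained, aligned text and image encoders:
text_embs = F.normalize(torch.randn(len(labels), 512), dim=-1)
image_emb = F.normalize(torch.randn(512), dim=-1)

scores = text_embs @ image_emb         # cosine similarity to each label text
print(labels[scores.argmax().item()])  # classification with zero image labels
```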
Audio-Visual-Text Model
An advanced form of multimodal model integrating three data streams (audio, image, text) for complex tasks like video description, where the model must synchronize and interpret visual and auditory information to produce a textual narration.
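A sketch of the fusion step, with assumed feature widths; real systems must additionally align the audio and visual streams in time:

```python
import torch
import torch.nn as nn

d = 256  # assumed shared width
to_d = {"audio": nn.Linear(128, d), "vision": nn.Linear(512, d),
        "text": nn.Linear(300, d)}             # one projection per stream
streams = {"audio": torch.randn(1, 50, 128),   # 50 audio frames
           "vision": torch.randn(1, 16, 512),  # 16 video frames
           "text": torch.randn(1, 8, 300)}     # 8 prompt tokens
tokens = torch.cat([to_d[m](x) for m, x in streams.items()], dim=1)
print(tokens.shape)  # (1, 74, 256): one sequence for a downstream transformer
```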
Latent Projection
A neural network layer, often a simple linear transformation, used to map the embedding vectors of each modality into a common latent space before their fusion or comparison.
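In code, this is often a single linear layer; the widths below (768 for vision, 512 for the shared space) are typical but assumed values:

```python
import torch
import torch.nn as nn

project_image = nn.Linear(768, 512)        # maps vision width -> shared width
shared = project_image(torch.randn(1, 768))
print(shared.shape)  # (1, 512): now comparable with text embeddings of width 512
```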
Multimodal Foundation Model
A very large-scale model, pre-trained on massive amounts of heterogeneous data, that serves as a base for adaptation (fine-tuning) to a multitude of specific multimodal tasks.
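A common adaptation pattern is to freeze the pretrained backbone and train only a small task head, sketched here with a stand-in module:

```python
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())  # stand-in for the model
for p in backbone.parameters():
    p.requires_grad = False  # keep the pre-trained weights fixed

task_head = nn.Linear(512, 3)  # small task-specific layer, the only part trained
trainable = list(task_head.parameters())  # only these go to the optimizer
```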
Modularity in Multimodal Models
A design principle where the encoders for each modality are distinct and interchangeable modules, allowing for updating or replacing a component (e.g., the vision encoder) without retraining the entire model.
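A sketch of this principle using a structural interface; all names are our own:

```python
from typing import Protocol
import torch

class VisionEncoder(Protocol):
    """Any object with a matching encode method can be plugged in."""
    def encode(self, images: torch.Tensor) -> torch.Tensor: ...

class ToyEncoder:
    def encode(self, images):
        return images.mean(dim=(-1, -2))  # crude global pooling -> (B, C)

class MultimodalModel:
    def __init__(self, vision: VisionEncoder):
        self.vision = vision  # interchangeable component

    def answer(self, images, text_emb):
        img_emb = self.vision.encode(images)
        return torch.cat([img_emb, text_emb], dim=-1)  # downstream fusion

# Swapping in an upgraded vision tower later touches only this constructor call:
model = MultimodalModel(vision=ToyEncoder())
print(model.answer(torch.randn(2, 3, 8, 8), torch.randn(2, 4)).shape)  # (2, 7)
```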
Multimodal Prompting
An interaction technique with a model where the input (the 'prompt') is composed of multiple modalities, for example, an image accompanied by a textual question, to guide the model towards a specific response.
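A sketch of such a prompt as a structured message, following the common chat-API convention of mixed content parts (exact field names vary by provider):

```python
# The model conditions its answer on both parts: the image supplies the data,
# the text steers the model toward the specific question.
prompt = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "chart.png"},  # visual context
            {"type": "text",
             "text": "What trend does this chart show between 2020 and 2023?"},
        ],
    }
]
```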