Multimodal Models - AI Glossary

Vision-Language Model (VLM)

Subclass of multimodal models specialized in joint understanding of text and images, capable of tasks like image captioning, visual reasoning, or image generation from text.

Visual Tokenization

Technique that splits an image into a sequence of patches or discrete tokens, often through a neural network like a Vision Transformer (ViT), so that it can be processed by a transformer architecture originally designed for text.
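As a minimal sketch of the patch-splitting step only (not the learned embedding that follows it in a real ViT), the function below cuts a toy grayscale image into flattened, non-overlapping patches. The name `patchify` and the 4x4 image are illustrative assumptions.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Patches are emitted in row-major order. H and W must be divisible by patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 grayscale "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

Each flattened patch would then typically be mapped to an embedding vector before entering the transformer.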

Alignment Model

Model, often a contrastive model such as CLIP, trained on very large corpora of (image, text) pairs to project both modalities into a shared vector space where cosine similarity reflects their mutual relevance.
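The contrastive objective can be sketched as a symmetric cross-entropy over a batch's similarity matrix, in the general style of CLIP-like training. This is a simplified illustration: the function names, the fixed `temperature` value, and the two-dimensional embeddings are assumptions, and a real model would learn the embeddings rather than receive them as constants.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric loss: pair i of the batch is the positive, all others are negatives."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(column) for column in zip(*logits)]
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Correctly paired embeddings versus the same embeddings with captions swapped.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)  # True
```

Minimizing this loss pulls matching image and text vectors together and pushes non-matching ones apart, which is what makes cosine similarity meaningful in the shared space.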

Multimodal Conditional Generation

Generation task where the output (e.g., text, image) is produced based on one or more inputs of different modalities, such as describing an image or creating an image from text.

Multimodal Chain-of-Thought Reasoning

Ability of a model to use information from multiple modalities to construct a logical sequence of thought and reach a conclusion, for example by analyzing a chart and text to answer a question.

Multimodal Perceptron

Theoretical concept or primitive architecture where inputs of different natures are combined, often by concatenation or a fusion operation, before being processed by fully connected neural layers.
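A minimal sketch of fusion by concatenation followed by one fully connected layer. The feature values, weights, and the choice of ReLU as the activation are illustrative assumptions.

```python
def dense_relu(inputs, weights, biases):
    """Fully connected layer with ReLU; weights holds one row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


image_features = [0.5, -1.0]  # hypothetical output of an image feature extractor
text_features = [2.0]         # hypothetical output of a text feature extractor

# Fusion by concatenation: the layer sees one joint input vector.
fused = image_features + text_features

weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -1.0]]
biases = [0.0, 0.1]
hidden = dense_relu(fused, weights, biases)
print(hidden)  # [1.5, 0.0]
```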

Multimodal Diffusion Model

Generative architecture that learns to reverse a gradual noising process: starting from noise, it iteratively denoises to create data (e.g., images), with each denoising step conditioned on another modality (e.g., a text description).
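The sketch below illustrates the noising step and one common way of steering denoising with a condition, namely blending an unconditional and a conditional noise prediction (often called classifier-free guidance). This technique is an assumption chosen for illustration, not something this glossary specifies, and the two "predictions" are hand-written stand-ins for a trained denoiser.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Forward (noising) step: blend the clean sample with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(clean, noise)
    ]
    return noisy, noise


def guide(noise_unconditional, noise_conditional, guidance_scale):
    """Move the noise estimate from the unconditional toward the conditional one."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_unconditional, noise_conditional)
    ]


def estimate_clean(noisy, predicted_noise, alpha_bar):
    """Invert the noising formula using the predicted noise."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
clean = [0.2, -0.5, 1.0]
noisy, true_noise = add_noise(clean, alpha_bar=0.5, rng=rng)

# Stand-ins for a trained denoiser (hypothetical): the text-conditioned
# prediction is assumed perfect, the unconditional prediction is all zeros.
noise_conditional = true_noise
noise_unconditional = [0.0 for _ in clean]

predicted = guide(noise_unconditional, noise_conditional, guidance_scale=1.0)
recovered = estimate_clean(noisy, predicted, alpha_bar=0.5)
```

With a guidance scale of 1.0 the blended estimate equals the conditional prediction, so `recovered` matches `clean` up to floating-point error; larger scales push the result further toward the condition.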

Separate Encoding vs Unified Encoding

Two architectural strategies for multimodal models: separate encoding processes each modality with a dedicated encoder before fusion, while unified encoding uses a single transformer to process a sequence of mixed tokens.
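The difference between the two strategies can be sketched with toy stand-ins. The encoders here are deliberately trivial and hypothetical; the point is only where the modalities meet.

```python
def encode_image(patches):
    """Toy image encoder (hypothetical): mean value of each patch."""
    return [sum(p) / len(p) for p in patches]


def encode_text(tokens):
    """Toy text encoder (hypothetical): length of each token."""
    return [float(len(t)) for t in tokens]


def separate_encoding(patches, tokens):
    """Each modality goes through its own encoder; fusion happens afterwards."""
    return encode_image(patches) + encode_text(tokens)


def unified_sequence(patches, tokens):
    """Build one mixed, modality-tagged sequence for a single shared model."""
    return [("image", p) for p in patches] + [("text", t) for t in tokens]


patches = [[0, 2], [4, 6]]
tokens = ["a", "cat"]

fused = separate_encoding(patches, tokens)
mixed = unified_sequence(patches, tokens)
print(fused)     # [1.0, 5.0, 1.0, 3.0]
print(mixed[0])  # ('image', [0, 2])
```

In the separate case the modalities only interact after encoding; in the unified case a single model sees image and text tokens side by side from the first layer.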

Multimodal Zero-Shot Learning

Ability of a model to perform a task on one modality (e.g., classifying an image) without having been explicitly trained for it, by leveraging knowledge transferred from another modality (e.g., the text of class labels).
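A minimal sketch of zero-shot image classification, assuming an aligned model has already produced embeddings for the image and for a text prompt per class label. All vectors below are made-up values, and `zero_shot_classify` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings):
    """Pick the label whose text embedding is most similar to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical text embeddings of class-label prompts.
label_embeddings = {
    "a photo of a dog": [1.0, 0.1, 0.0],
    "a photo of a cat": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
image_embedding = [0.2, 0.9, 0.1]  # hypothetical embedding of the input image

label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(label)  # a photo of a cat
```

No classifier is trained on these classes: adding a new class only requires embedding a new label prompt.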

Audio-Visual-Text Model

An advanced form of multimodal model integrating three data streams (audio, image, text) for complex tasks like video description, where the model must synchronize and interpret visual and auditory information to produce a textual narration.

Latent Projection

A neural network layer, often a simple linear transformation, used to map the embedding vectors of each modality into a common latent space before their fusion or comparison.
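As a sketch, two linear maps can bring embeddings of different sizes into one space of the same dimension. The weight matrices below are arbitrary illustrative values; in a real model they would be learned.

```python
def project(vector, weights):
    """Linear projection: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


image_embedding = [1.0, 2.0, 3.0]       # 3-dimensional image embedding
text_embedding = [1.0, 0.0, -1.0, 2.0]  # 4-dimensional text embedding

image_to_latent = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]            # shape 2 x 3
text_to_latent = [[0.5, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 2.0]]   # shape 2 x 4

image_latent = project(image_embedding, image_to_latent)
text_latent = project(text_embedding, text_to_latent)
print(image_latent)  # [1.0, 5.0]
print(text_latent)   # [1.5, 4.0]
```

Both outputs now live in the same 2-dimensional space and can be compared or fused directly.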

Multimodal Foundation Model

A very large-scale model, pre-trained on massive amounts of heterogeneous data, that serves as a base for adaptation (fine-tuning) to a multitude of specific multimodal tasks.

Modularity in Multimodal Models

A design principle where the encoders for each modality are distinct and interchangeable modules, allowing for updating or replacing a component (e.g., the vision encoder) without retraining the entire model.
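One way to express this principle in code is an interface that any vision encoder must satisfy, so the rest of the model never depends on a specific implementation. The class names and toy pooling encoders below are hypothetical.

```python
from typing import List, Protocol


class VisionEncoder(Protocol):
    """Interface every interchangeable vision encoder is expected to satisfy."""

    def encode(self, image: List[List[float]]) -> List[float]: ...


class MeanPoolEncoder:
    def encode(self, image: List[List[float]]) -> List[float]:
        return [sum(row) / len(row) for row in image]


class MaxPoolEncoder:
    def encode(self, image: List[List[float]]) -> List[float]:
        return [max(row) for row in image]


class MultimodalModel:
    """Depends only on the VisionEncoder interface, not on a concrete encoder."""

    def __init__(self, vision_encoder: VisionEncoder):
        self.vision_encoder = vision_encoder

    def fuse(self, image: List[List[float]], text_features: List[float]) -> List[float]:
        return self.vision_encoder.encode(image) + text_features


image = [[1.0, 3.0], [2.0, 6.0]]
model = MultimodalModel(MeanPoolEncoder())
before = model.fuse(image, [0.5])

# Swap the vision module without touching the rest of the model.
model.vision_encoder = MaxPoolEncoder()
after = model.fuse(image, [0.5])
print(before)  # [2.0, 4.0, 0.5]
print(after)   # [3.0, 6.0, 0.5]
```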

Multimodal Prompting

An interaction technique with a model where the input (the 'prompt') is composed of multiple modalities, for example, an image accompanied by a textual question, to guide the model towards a specific response.
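A multimodal prompt can be represented as an ordered list of typed parts. The structure below is a hypothetical sketch, not the request format of any particular model or API, and `chart.png` is a placeholder path.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImagePart:
    path: str  # where the image lives (placeholder value below)


@dataclass
class TextPart:
    text: str


PromptPart = Union[ImagePart, TextPart]


def modalities(prompt: List[PromptPart]) -> List[str]:
    """List the modality of each part, in the order the model would receive them."""
    return ["image" if isinstance(part, ImagePart) else "text" for part in prompt]


prompt: List[PromptPart] = [
    ImagePart(path="chart.png"),
    TextPart(text="Which quarter had the highest revenue?"),
]
print(modalities(prompt))  # ['image', 'text']
```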
