Multimodal Models - Artificial Intelligence Glossary

Vision-Language Model (VLM)

A subclass of multimodal models specialized in jointly understanding text and images, capable of tasks such as image captioning, visual reasoning, and, in some systems, image generation from text.

Visual Tokenization

Technique that splits an image into a sequence of patches or discrete tokens, often via a neural network such as a Vision Transformer (ViT), so that the image can be processed by a transformer architecture originally designed for text.
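The patch-splitting step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and a hypothetical helper named `patchify`; it uses plain Python to stay dependency-free and is not the tokenizer of any particular model.

```python
def patchify(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened square patches.

    Patches are returned in row-major order, the order in which a
    ViT-style model would read them as a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 "image" whose pixel values are 0..15 (illustrative data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), tokens[0])
```

In a ViT-style model, each flattened patch would then typically be passed through a learned linear layer to obtain its embedding; that step is omitted here.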

Alignment Model

A model, often a contrastive one such as CLIP, trained on very large corpora of (image, text) pairs to project both modalities into a shared vector space where cosine similarity reflects their mutual relevance.
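To make the role of cosine similarity concrete, here is a small sketch. The embedding values are made up for illustration; in practice they would come from trained image and text encoders, which are not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings standing in for encoder outputs in the shared space.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a diagram of a circuit": [0.0, 0.3, 0.9],
}

scores = {
    caption: cosine_similarity(image_embedding, embedding)
    for caption, embedding in caption_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

The caption whose embedding points in the most similar direction to the image embedding receives the highest score.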

Multimodal Conditional Generation

Generation task where the output (e.g., text, image) is produced based on one or more inputs of different modalities, such as describing an image or creating an image from text.

Multimodal Chain-of-Thought Reasoning

Ability of a model to use information from multiple modalities to build a step-by-step chain of reasoning and reach a conclusion, for example by analyzing a chart together with its accompanying text to answer a question.

Multimodal Perceptron

Theoretical concept or primitive architecture in which inputs from different modalities are combined, often by concatenation or another fusion operation, before being processed by fully connected neural layers.
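A minimal sketch of fusion by concatenation followed by one fully connected layer is shown below. The feature values, weights, biases, and the choice of a ReLU activation are all assumptions chosen for illustration.

```python
def dense(inputs, weights, biases):
    """Fully connected layer with ReLU; one weight row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical per-modality feature vectors.
image_features = [0.5, 1.0]
text_features = [2.0]

# Fusion by concatenation: the layer sees one combined vector.
fused = image_features + text_features

# Illustrative weights: 2 output units, 3 inputs each.
weights = [
    [1.0, 0.0, 1.0],
    [0.0, -1.0, 0.0],
]
biases = [0.0, 0.5]

hidden = dense(fused, weights, biases)
print(hidden)
```

The first unit mixes an image feature with the text feature, which is only possible because the two modalities were fused before the layer.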

Multimodal Diffusion Model

Generative architecture that uses an iterative noising and denoising process to create data (e.g., images) conditioned on another modality (e.g., a text description), with the conditioning information guiding each denoising step.
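The two ingredients named above, noising and guided denoising, can be sketched as follows. The forward step uses the standard Gaussian blend of signal and noise, and the guidance rule shown is classifier-free guidance, one common way to steer denoising that this entry does not itself name. The denoiser outputs are hypothetical placeholders; no trained network is involved.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward (noising) step: blend clean data with Gaussian noise.

    alpha_bar in [0, 1] is the fraction of signal kept; 1.0 means no noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [keep * x + spread * n for x, n in zip(x0, noise)]
    return noisy, noise


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move the noise estimate toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
noisy, noise = add_noise([1.0, -1.0], alpha_bar=0.5, rng=rng)

# Hypothetical denoiser outputs without and with the text prompt.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
print(guided_noise(eps_uncond, eps_cond, guidance_scale=2.0))
```

With a guidance scale of 1.0 the result equals the conditional estimate; larger values exaggerate the difference that the conditioning makes.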

Separate Encoding vs Unified Encoding

Two architectural strategies for multimodal models: separate encoding processes each modality with a dedicated encoder before fusion, while unified encoding uses a single transformer to process a sequence of mixed tokens.
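The structural difference between the two strategies can be sketched with stand-in encoders. Everything here is hypothetical: the function names, the use of a simple mean as an "encoder", and fusion by concatenation are illustrative choices only.

```python
def encode_separately(image_patches, text_tokens, image_encoder, text_encoder):
    """Each modality goes through its own encoder; outputs are fused afterwards."""
    return image_encoder(image_patches) + text_encoder(text_tokens)


def encode_unified(image_patches, text_tokens, shared_encoder):
    """Tokens are tagged with their modality and merged into one sequence
    that a single encoder processes."""
    mixed = [("image", p) for p in image_patches] + [("text", t) for t in text_tokens]
    return shared_encoder(mixed)


# Stand-in encoders: a mean over the inputs replaces a real network.
def mean_encoder(values):
    return [sum(values) / len(values)]


def tagged_mean_encoder(mixed):
    return [sum(value for _, value in mixed) / len(mixed)]


separate = encode_separately([1.0, 3.0], [10.0], mean_encoder, mean_encoder)
unified = encode_unified([1.0, 3.0], [10.0], tagged_mean_encoder)
print(separate, unified)
```

In the separate case each modality keeps its own slot in the output until fusion, whereas in the unified case the encoder sees image and text tokens side by side from the start.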

Multimodal Zero-Shot Learning

Ability of a model to perform a task on one modality (e.g., classifying an image) without having been explicitly trained for it, by leveraging knowledge transferred from another modality (e.g., the text of class labels).
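A sketch of zero-shot image classification in this style: the image embedding is compared with the embeddings of textual class labels, and a softmax turns the similarities into probabilities. The embeddings are made up, and the `temperature` value is an illustrative assumption rather than a setting taken from this glossary.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, label_embeddings, temperature=0.07):
    """Return a probability per label from scaled cosine similarities."""
    image = normalize(image_embedding)
    logits = {}
    for label, embedding in label_embeddings.items():
        text = normalize(embedding)
        logits[label] = sum(a * b for a, b in zip(image, text)) / temperature
    # Numerically stable softmax over the labels.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Made-up embeddings; no classifier was trained on these classes.
label_embeddings = {
    "a photo of a cat": [1.0, 0.0],
    "a photo of a car": [0.0, 1.0],
}
probabilities = zero_shot_classify([0.9, 0.2], label_embeddings)
predicted = max(probabilities, key=probabilities.get)
print(predicted)
```

New classes can be added simply by adding label texts, since the only class-specific input is the text embedding.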

Audio-Visual-Text Model

An advanced form of multimodal model integrating three data streams (audio, image, text) for complex tasks like video description, where the model must synchronize and interpret visual and auditory information to produce a textual narration.

Latent Projection

A neural network layer, often a simple linear transformation, used to map the embedding vectors of each modality into a common latent space before their fusion or comparison.
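As a sketch of such a projection, the example below maps a 4-dimensional image embedding and a 2-dimensional text embedding into the same 2-dimensional latent space with one linear layer each. All weights and embedding values are illustrative assumptions; in a real model the weights would be learned.

```python
def linear(vector, weight, bias):
    """Linear transformation: one weight row and one bias per output dimension."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weight, bias)
    ]


# Embeddings of different sizes, as produced by hypothetical encoders.
image_embedding = [1.0, 2.0, 3.0, 4.0]
text_embedding = [0.5, -0.5]

# One projection per modality, both targeting a 2-dimensional latent space.
image_weight = [[0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.0, 1.0]]
image_bias = [0.0, 0.0]
text_weight = [[1.0, 1.0], [2.0, 0.0]]
text_bias = [0.0, 0.0]

image_latent = linear(image_embedding, image_weight, image_bias)
text_latent = linear(text_embedding, text_weight, text_bias)
print(image_latent, text_latent)
```

Once both vectors have the same dimensionality they can be compared or fused directly.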

Multimodal Foundation Model

A very large-scale model, pre-trained on massive amounts of heterogeneous data, that serves as a base for adaptation (fine-tuning) to a multitude of specific multimodal tasks.

Modularity in Multimodal Models

A design principle where the encoders for each modality are distinct and interchangeable modules, allowing for updating or replacing a component (e.g., the vision encoder) without retraining the entire model.

Multimodal Prompting

A technique for interacting with a model in which the input (the 'prompt') combines multiple modalities, for example an image accompanied by a textual question, to guide the model towards a specific response.
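One way to picture a multimodal prompt is as an ordered list of typed parts. The classes `TextPart` and `ImagePart`, the file name, and the helper `modalities` are hypothetical and do not correspond to any specific library's API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    path: str  # hypothetical path to a local image file


PromptPart = Union[TextPart, ImagePart]


def modalities(prompt: List[PromptPart]) -> List[str]:
    """List the modality of each part, in prompt order."""
    return ["image" if isinstance(part, ImagePart) else "text" for part in prompt]


# An image accompanied by a textual question.
prompt = [
    ImagePart(path="chart.png"),
    TextPart(text="Which month has the highest value in this chart?"),
]
print(modalities(prompt))
```

Keeping the parts in order matters, because a question may refer to "this chart" or "the image above".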

AI Terms

Vision-Language Model (VLM)

Visual Tokenization

Alignment Model

Multimodal Conditional Generation

Multimodal Chain-of-Thought Reasoning

Multimodal Perceptron

Multimodal Diffusion Model

Separate Encoding vs Unified Encoding

Multimodal Zero-Shot Learning

Audio-Visual-Text Model

Latent Projection

Multimodal Foundation Model

Modularity in Multimodal Models

Multimodal Prompting
