Multi-Modal Diffusion

📖

Terms

Multi-Modal Diffusion

Class of generative models that learn a joint probability distribution over multiple modalities (such as text, image, and audio) through a shared or coordinated diffusion process.

Unified Latent Space

Common vector space into which data from different modalities are projected so that they can interact and be translated into one another within a diffusion model.

Cross-Modal Conditioning

Technique where the generation process of one modality is guided by information from another modality, for example generating an image from text or audio from an image.

Multi-Modal Structured Noise

Noise addition process that preserves inter-modal correlations, jointly degrading different modalities to maintain their semantic alignment throughout the diffusion process.
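
One way to picture this, as a minimal sketch rather than a method taken from this glossary: mix a noise component shared by all modalities with a private component per modality, then apply the standard DDPM forward step. The function names, the correlation parameter `rho`, and the assumption that both modalities are already latents of the same shape are illustrative only.

```python
import numpy as np

def structured_noise(shape, num_modalities, rho, rng):
    """Unit-variance noise per modality with pairwise correlation rho (hypothetical scheme)."""
    shared = rng.standard_normal(shape)  # component seen by every modality
    noises = []
    for _ in range(num_modalities):
        private = rng.standard_normal(shape)  # component unique to this modality
        noises.append(np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return noises

def forward_noise(x0, eps, alpha_bar_t):
    """Standard DDPM forward step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
image_latent = rng.standard_normal(20000)  # stand-in latents, not real data
audio_latent = rng.standard_normal(20000)

eps_image, eps_audio = structured_noise(image_latent.shape, 2, rho=0.8, rng=rng)
noisy_image = forward_noise(image_latent, eps_image, alpha_bar_t=0.5)
noisy_audio = forward_noise(audio_latent, eps_audio, alpha_bar_t=0.5)

# Empirical correlation between the two noise draws (close to rho by construction)
noise_corr = np.corrcoef(eps_image, eps_audio)[0, 1]
```

Setting `rho` to 0 recovers independent noise per modality; values near 1 make both modalities degrade along almost the same noise direction.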

Coordinated Denoising

Denoising step where neural networks dedicated to each modality exchange information to coherently reconstruct data from their shared noisy version.
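
The information exchange could, for instance, be realized with cross-attention between the per-modality branches. The sketch below is an assumption about one possible mechanism, not the definition's prescribed design: it omits learned projections, and the names `cross_attend`, `image_feats`, and `text_feats` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def cross_attend(own, other):
    """Residual update of `own` features with an attention-weighted summary of `other`."""
    d = own.shape[-1]
    weights = softmax(own @ other.T / np.sqrt(d))  # shape (n_own, n_other)
    return own + weights @ other

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((16, 32))  # 16 image patches, 32-dim features
text_feats = rng.standard_normal((5, 32))    # 5 text tokens, same feature width

# One exchange step: each branch reads the other's pre-update features
image_out = cross_attend(image_feats, text_feats)
text_out = cross_attend(text_feats, image_feats)
```

Each branch keeps its own sequence length, so the modality-specific networks can continue denoising their own data after the exchange.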

Multi-Modal Encoder

Neural network responsible for projecting data from different modalities into the unified latent space, capturing their essential features and relationships.

Multi-Modal Decoder

Neural network that reconstructs data for each modality from their representation in the unified latent space after the denoising process.

Inter-Modal Alignment

Learning objective aimed at minimizing the distance between latent representations of different modalities describing the same concept, ensuring their semantic consistency.

Unified Diffusion Model

Single model architecture that simultaneously processes and generates multiple modalities using a single diffusion process and a shared set of weights.

Multi-Modal Guidance

Inference technique that uses the gradient of a multi-modal classifier to steer the sampling process toward outputs that better match a given condition.
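
For reference, classifier guidance is commonly written as a shift of the predicted noise. The symbols are assumptions, not notation from this glossary: $\epsilon_\theta$ is the noise prediction network, $p_\phi(y \mid x_t)$ the multi-modal classifier applied to the noisy sample $x_t$, $\bar{\alpha}_t$ the cumulative noise schedule, and $s$ a guidance scale.

```latex
\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) - s \, \sqrt{1 - \bar{\alpha}_t} \; \nabla_{x_t} \log p_\phi(y \mid x_t)
```

With $s = 0$ this reduces to unguided sampling; larger $s$ pushes samples harder toward the condition $y$, typically at some cost in diversity.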

Multi-Arm Diffusion

Architecture where a central diffusion process has specialized 'arms' to handle noise addition and denoising specific to each modality while sharing a common trunk.

Multi-Modal Consistency Loss

Loss function that penalizes semantic inconsistencies between generated modalities, measured for example via cosine distance in the unified latent space.
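
The cosine-distance variant mentioned here can be sketched as follows. The function name and the toy latents are hypothetical; in practice the inputs would be paired latents of generated outputs from two modalities.

```python
import numpy as np

def cosine_consistency_loss(z_a, z_b, eps=1e-8):
    """Mean cosine distance (1 - cosine similarity) between paired latents of shape (N, D)."""
    dot = np.sum(z_a * z_b, axis=1)
    norms = np.linalg.norm(z_a, axis=1) * np.linalg.norm(z_b, axis=1)
    return float(np.mean(1.0 - dot / (norms + eps)))

# Toy pairs: the first is perfectly aligned, the second is orthogonal
image_latents = np.array([[1.0, 0.0], [0.0, 2.0]])
caption_latents = np.array([[2.0, 0.0], [3.0, 0.0]])

loss = cosine_consistency_loss(image_latents, caption_latents)  # about 0.5
```

Because cosine distance ignores vector length, this loss penalizes only disagreement in direction, not in magnitude.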

Inter-Modal Sampling

Generation process where one modality is sampled while conditioning on another modality that either already exists or is being generated at the same time.

Shared Noise Prediction Network

Central component of the diffusion model, often a U-Net architecture, whose lower layers are shared across modalities and whose upper layers are specialized per modality.
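
The shared-then-specialized layout can be illustrated with a toy stand-in. This is a sketch under strong assumptions: a two-layer MLP replaces the U-Net, timestep conditioning is left out, weights are random rather than trained, and the class `SharedNoisePredictor` is hypothetical.

```python
import numpy as np

class SharedNoisePredictor:
    """Toy stand-in for a U-Net: one shared trunk, one output head per modality."""

    def __init__(self, latent_dim, hidden_dim, modalities, rng):
        # Lower layer: a single weight matrix used by every modality
        self.w_trunk = rng.standard_normal((latent_dim, hidden_dim)) / np.sqrt(latent_dim)
        # Upper layers: one weight matrix per modality
        self.heads = {
            name: rng.standard_normal((hidden_dim, latent_dim)) / np.sqrt(hidden_dim)
            for name in modalities
        }

    def predict_noise(self, x_t, modality):
        hidden = np.tanh(x_t @ self.w_trunk)   # shared computation
        return hidden @ self.heads[modality]   # modality-specific computation

rng = np.random.default_rng(0)
model = SharedNoisePredictor(latent_dim=8, hidden_dim=16,
                             modalities=["image", "audio"], rng=rng)

x_t = rng.standard_normal((4, 8))  # batch of 4 noisy latents
eps_image = model.predict_noise(x_t, "image")
eps_audio = model.predict_noise(x_t, "audio")
```

The same noisy input yields different noise estimates depending on the head, while everything learned in the trunk is reused by both modalities.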

Multi-Modal Time Embedding

Representation of the diffusion timestep that is injected into the model, often conditioned on the modality so that each modality's noise dynamics can be handled separately.
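
A minimal sketch of one way to build such an embedding: a sinusoidal timestep encoding plus a per-modality offset vector. Adding the two is an assumption made for illustration, the offsets would normally be learned rather than random, and the names `multimodal_time_embedding` and `modality_table` are hypothetical.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Sinusoidal timestep embedding; dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def multimodal_time_embedding(t, modality, modality_table):
    """Timestep embedding shifted by a modality-specific vector."""
    offset = modality_table[modality]
    return sinusoidal_embedding(t, offset.shape[0]) + offset

rng = np.random.default_rng(0)
# Random stand-ins for what would be learned modality vectors
modality_table = {name: 0.1 * rng.standard_normal(64) for name in ("image", "audio")}

emb_image = multimodal_time_embedding(10, "image", modality_table)
emb_audio = multimodal_time_embedding(10, "audio", modality_table)
```

Both embeddings encode the same timestep, yet the network receives a different vector for each modality and can therefore react differently to the same noise level.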

Multi-Modal Sequence Diffusion

Application of diffusion to sequential data involving multiple modalities, such as video generation (image + time) or synchronized dialogue (audio + text).

Multi-Modal Tokenization

Process of discretizing data from different modalities into a unified sequence of tokens that can be processed by a Transformer-like architecture in the context of diffusion.
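
As a rough illustration, suppose each modality has already been discretized into local token ids (for example by a text tokenizer and an image codebook). The ids can then be mapped into disjoint ranges of one shared vocabulary. The helper names and vocabulary sizes below are hypothetical.

```python
def build_offsets(vocab_sizes):
    """Assign each modality a disjoint id range in one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start

def unify(segments, offsets):
    """Concatenate (modality, local_ids) segments into one sequence of shared ids."""
    return [offsets[name] + tok for name, ids in segments for tok in ids]

vocab_sizes = {"text": 1000, "image": 512}  # hypothetical vocabulary sizes
offsets, total_vocab = build_offsets(vocab_sizes)

# Two text tokens followed by two image tokens
sequence = unify([("text", [5, 42]), ("image", [0, 511])], offsets)
# sequence == [5, 42, 1000, 1511], total_vocab == 1512
```

A Transformer can then use a single embedding table of size `total_vocab`, and the id range alone tells it which modality a token came from.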

AI Glossary