AI Glossary
A complete dictionary of Artificial Intelligence
Multi-Modal Diffusion
Class of generative models learning a joint probability distribution over multiple modalities (text, image, audio) through a shared or coordinated diffusion process.
Unified Latent Space
Common vector representation into which data from different modalities are projected, enabling their interaction and mutual transformation within a diffusion model.
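A minimal sketch of the idea: each modality gets its own projection into one shared space, so distances between, say, a text feature and an image feature become meaningful. The matrices and dimensions here are purely illustrative (real systems learn these projections).

```python
import math

# Illustrative 2x2 projection weights for two modalities (not learned).
W_TEXT = [[0.8, 0.1], [0.2, 0.9]]    # maps a toy "text" feature
W_IMAGE = [[0.9, 0.0], [0.1, 1.0]]   # maps a toy "image" feature

def project(w, x):
    """Multiply a 2x2 matrix by a 2-D vector."""
    return [sum(w[i][j] * x[j] for j in range(2)) for i in range(2)]

text_latent = project(W_TEXT, [1.0, 0.5])
image_latent = project(W_IMAGE, [1.0, 0.5])

# Both latents now live in the same space, so a single diffusion
# process can operate on either, and their distance is meaningful.
gap = math.dist(text_latent, image_latent)
```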
Cross-Modal Conditioning
Technique where the generation process of one modality is guided by information from another modality, for example generating an image from text or audio from an image.
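As a toy sketch, a conditioned denoising step can be written as a noise predictor that also receives the conditioning modality's embedding. The linear form of `predict_noise` is purely illustrative; real models use a learned network here.

```python
def predict_noise(x_t, t, text_cond):
    """Toy noise predictor: the text condition biases the prediction,
    steering the reverse process toward content matching the text.
    The coefficients are illustrative, not learned."""
    return [xi * 0.1 + ci * 0.5 for xi, ci in zip(x_t, text_cond)]

def denoise_step(x_t, t, text_cond, step_size=0.5):
    eps = predict_noise(x_t, t, text_cond)
    return [xi - step_size * e for xi, e in zip(x_t, eps)]

x = [1.0, -1.0]      # noisy "image" latent
cond = [0.2, 0.4]    # embedding of the conditioning text
x = denoise_step(x, t=10, text_cond=cond)
```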
Multi-Modal Structured Noise
Noise addition process that preserves inter-modal correlations, jointly degrading different modalities to maintain their semantic alignment throughout the diffusion process.
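One way to realize this, sketched below under simplifying assumptions: draw a noise vector shared across modalities, mix it with modality-specific noise via a correlation parameter `rho`, and apply the usual forward-diffusion blend. Function names and the mixing scheme are illustrative.

```python
import math
import random

def add_structured_noise(x_img, x_txt, alpha, rho=0.9, seed=0):
    """Forward-diffuse two modality latents with correlated noise.
    rho controls how much noise is shared between modalities;
    rho=1 degrades both with exactly the same noise, preserving
    their alignment through the diffusion process."""
    rng = random.Random(seed)
    shared = [rng.gauss(0, 1) for _ in x_img]
    out = []
    for x in (x_img, x_txt):
        own = [rng.gauss(0, 1) for _ in x]
        eps = [rho * s + math.sqrt(1 - rho ** 2) * o
               for s, o in zip(shared, own)]
        out.append([math.sqrt(alpha) * xi + math.sqrt(1 - alpha) * e
                    for xi, e in zip(x, eps)])
    return out
```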
Coordinated Denoising
Denoising step where neural networks dedicated to each modality exchange information to coherently reconstruct data from their shared noisy version.
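A toy version of such an exchange, assuming each modality passes the other a summary (here simply the mean) of its current state before predicting its own noise. The predictor forms are illustrative stand-ins for the dedicated networks.

```python
def denoise_coordinated(x_img, x_txt, step=0.5):
    """One coordinated denoising step: each per-modality predictor sees
    a summary of the other modality's current state, keeping the two
    reconstructions consistent. Coefficients are illustrative."""
    msg_img = sum(x_img) / len(x_img)   # message: image -> text
    msg_txt = sum(x_txt) / len(x_txt)   # message: text -> image
    eps_img = [0.1 * xi + 0.05 * msg_txt for xi in x_img]
    eps_txt = [0.1 * xi + 0.05 * msg_img for xi in x_txt]
    new_img = [xi - step * e for xi, e in zip(x_img, eps_img)]
    new_txt = [xi - step * e for xi, e in zip(x_txt, eps_txt)]
    return new_img, new_txt
```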
Multi-Modal Encoder
Neural network responsible for projecting data from different modalities into the unified latent space, capturing their essential features and relationships.
Multi-Modal Decoder
Neural network that reconstructs data for each modality from their representation in the unified latent space after the denoising process.
Inter-Modal Alignment
Learning objective aimed at minimizing the distance between latent representations of different modalities describing the same concept, ensuring their semantic consistency.
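A common form of this objective is one minus cosine similarity between the two latents, which is zero when they point in the same direction. A minimal sketch:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def alignment_loss(text_latent, image_latent):
    """Alignment objective: 1 - cosine similarity. Minimizing it pulls
    the two latents toward the same direction in the unified space."""
    return 1.0 - cosine_similarity(text_latent, image_latent)
```

Perfectly aligned latents give a loss of 0; orthogonal latents give 1.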
Unified Diffusion Model
Single model architecture that simultaneously processes and generates multiple modalities using a single diffusion process and a shared set of weights.
Multi-Modal Guidance
Inference technique that uses the gradient of a multi-modal classification model to guide the sampling process towards outputs better aligned with a given condition.
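The mechanics can be sketched with a toy classifier log-probability and a finite-difference gradient; real implementations use backpropagation through the actual classifier, and the quadratic `toy_log_prob` is a hypothetical stand-in.

```python
def toy_log_prob(x, target):
    """Toy stand-in for a multi-modal classifier's log-probability:
    highest when the sample x matches the target condition."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def guided_step(x, target, scale=0.1, h=1e-5):
    """Shift the sample along the (finite-difference) gradient of the
    classifier log-prob, nudging sampling toward the condition."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((toy_log_prob(xp, target) - toy_log_prob(xm, target)) / (2 * h))
    return [xi + scale * g for xi, g in zip(x, grad)]
```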
Multi-Arm Diffusion
Architecture where a central diffusion process has specialized 'arms' to handle noise addition and denoising specific to each modality while sharing a common trunk.
Multi-Modal Consistency Loss
Loss function that penalizes semantic inconsistencies between generated modalities, measured for example via cosine distance in the unified latent space.
Inter-Modal Sampling
Generation process where one modality is sampled while conditioning on another already existing or simultaneously generated modality.
Shared Noise Prediction Network
Central component of the diffusion model, often a U-Net architecture, whose lower layers are shared between modalities and upper layers are specialized.
Multi-Modal Time Embedding
Representation of the diffusion process timestep that is injected into the model, often conditioned by the modality to handle different noise dynamics.
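The standard sinusoidal timestep embedding, extended with a per-modality phase shift as one possible (illustrative) way to condition it on the modality:

```python
import math

def time_embedding(t, dim=8, modality_shift=0.0):
    """Sinusoidal timestep embedding as in standard diffusion models.
    modality_shift is an illustrative per-modality phase offset,
    letting each modality see slightly different noise dynamics."""
    emb = []
    for i in range(dim // 2):
        freq = 10000 ** (-2 * i / dim)
        emb.append(math.sin(t * freq + modality_shift))
        emb.append(math.cos(t * freq + modality_shift))
    return emb
```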
Multi-Modal Sequence Diffusion
Application of diffusion to sequential data involving multiple modalities, such as video generation (image + time) or synchronized dialogue (audio + text).
Multi-Modal Tokenization
Process of discretizing data from different modalities into a unified sequence of tokens that can be processed by a Transformer-like architecture in the context of diffusion.
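A minimal sketch of the flattening step, assuming hypothetical special marker tokens (`<text>`, `<image>`, `<audio>`) that tell the model which modality each segment belongs to:

```python
# Hypothetical special tokens marking the modality of each segment.
BOS_TEXT, BOS_IMAGE, BOS_AUDIO = "<text>", "<image>", "<audio>"

def tokenize_multimodal(segments):
    """Flatten (modality, tokens) pairs into one unified token sequence,
    prefixing each segment with its modality marker so a Transformer
    can tell the modalities apart."""
    markers = {"text": BOS_TEXT, "image": BOS_IMAGE, "audio": BOS_AUDIO}
    seq = []
    for modality, tokens in segments:
        seq.append(markers[modality])
        seq.extend(tokens)
    return seq

seq = tokenize_multimodal([("text", ["a", "cat"]),
                           ("image", [101, 57, 802])])
# seq == ["<text>", "a", "cat", "<image>", 101, 57, 802]
```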