AI Glossary
The Complete Dictionary of Artificial Intelligence
Video Diffusion Model
Generative architecture that applies the diffusion process to spatio-temporal data, gradually adding noise to the frames of a video sequence and then learning to denoise them in order to reconstruct or create coherent videos.
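A minimal sketch of the forward (noising) step on a 5-D video tensor, assuming a standard DDPM-style schedule; the tensor sizes and schedule values are illustrative, not taken from any particular system.

```python
import torch

# Forward diffusion on a video tensor (batch, channels, frames, height, width):
# every frame of the clip is noised according to a shared variance schedule.
def add_noise(video, t, alphas_cumprod):
    """Return the noised video x_t and the noise used at timestep t."""
    noise = torch.randn_like(video)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)        # broadcast over C, T, H, W
    x_t = a_bar.sqrt() * video + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

# Example: a batch of two 8-frame 64x64 RGB clips, linear beta schedule, 1000 steps.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
video = torch.randn(2, 3, 8, 64, 64)
t = torch.randint(0, 1000, (2,))
x_t, noise = add_noise(video, t, alphas_cumprod)
```

The denoiser is then trained to predict `noise` from `x_t` and `t`, which is what makes reconstruction and generation possible.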
Spatio-Temporal Latent Diffusion
Variant of video diffusion models that operates in a compressed latent space, reducing computational complexity by applying the noise addition and denoising process on low-dimensional representations rather than on raw pixels of each frame.
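A minimal sketch of the idea, assuming a small stand-in convolutional encoder in place of the pretrained VAE a real system would use; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Frames are compressed to a low-dimensional latent map, and the noising and
# denoising process then operates on those latents instead of raw pixels.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 4, 4, stride=2, padding=1),             # 4-channel latent, 4x smaller
)

video = torch.randn(1, 3, 8, 64, 64)                       # (B, C, T, H, W)
b, c, t, h, w = video.shape
frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
latents = encoder(frames)                                  # (B*T, 4, 16, 16)
latents = latents.reshape(b, t, 4, 16, 16).permute(0, 2, 1, 3, 4)

# Noise is now added in the compressed latent space.
noise = torch.randn_like(latents)
a_bar = torch.tensor(0.7)                                  # example cumulative alpha
noisy_latents = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise
```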
3D Attention
Attention mechanism that simultaneously processes spatial (height, width) and temporal (time) dimensions of a video, allowing the model to weight the importance of different regions across different moments to capture spatio-temporal dependencies.
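A minimal sketch of full 3-D attention, flattening spatial and temporal positions into a single token sequence so every (t, h, w) location can attend to every other; sizes are illustrative.

```python
import torch
import torch.nn as nn

class Attention3D(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.permute(0, 2, 3, 4, 1).reshape(b, t * h * w, c)
        out, _ = self.attn(tokens, tokens, tokens)          # full spatio-temporal attention
        return out.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)

x = torch.randn(1, 64, 4, 8, 8)                # 4 frames of an 8x8 feature map
y = Attention3D(64)(x)                         # same shape, spatio-temporally mixed
```

In practice many models factorize this into separate spatial and temporal attention layers to reduce the quadratic cost of attending over all T*H*W tokens at once.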
Time Embedding
Technique for encoding temporal information (diffusion step, position in the sequence) as vectors that are injected into the network, guiding the generation process to maintain consistency and movement over time.
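A minimal sketch of a sinusoidal time embedding, a common choice for encoding the diffusion timestep (the same construction can encode a frame's position in the clip); the dimensionality is illustrative.

```python
import math
import torch

def time_embedding(timesteps, dim):
    """Map integer timesteps of shape (B,) to vectors of size dim."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = timesteps.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = time_embedding(torch.tensor([0, 250, 999]), 128)     # (3, 128)
# The embedding is typically passed through a small MLP and injected into the
# network's feature maps at every resolution.
```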
Conditional Denoising by Previous Frame
Strategy where the noise prediction for a video frame is conditioned by the denoised version of the previous frame, ensuring strong continuity and temporal consistency between successive images in the generated sequence.
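A minimal sketch, assuming a hypothetical stand-in denoiser `eps_model`; the previous, already-denoised frame is concatenated with the noisy current frame along the channel axis as conditioning.

```python
import torch
import torch.nn as nn

eps_model = nn.Conv2d(6, 3, 3, padding=1)      # stand-in for a real noise-prediction network

def denoise_step(noisy_frame, prev_clean_frame):
    cond_input = torch.cat([noisy_frame, prev_clean_frame], dim=1)   # (B, 6, H, W)
    return eps_model(cond_input)               # predicted noise for the current frame

noisy = torch.randn(1, 3, 64, 64)
prev = torch.randn(1, 3, 64, 64)               # previous frame, already denoised
eps_pred = denoise_step(noisy, prev)
```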
3D U-Net Architecture
Convolutional neural network structure adapted for video data, combining encoder-decoder paths with 3D convolutions and residual (skip) connections to effectively capture multi-scale spatial and temporal context during denoising.
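A minimal sketch of a tiny 3-D U-Net with one downsampling stage and a skip connection; channel counts and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.enc = nn.Conv3d(ch, base, 3, padding=1)
        self.down = nn.Conv3d(base, base * 2, 3, stride=(1, 2, 2), padding=1)   # halve H, W only
        self.mid = nn.Conv3d(base * 2, base * 2, 3, padding=1)
        self.up = nn.ConvTranspose3d(base * 2, base, (1, 4, 4),
                                     stride=(1, 2, 2), padding=(0, 1, 1))
        self.dec = nn.Conv3d(base * 2, ch, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):                      # x: (B, C, T, H, W)
        h1 = self.act(self.enc(x))
        h2 = self.act(self.mid(self.act(self.down(h1))))
        u = self.act(self.up(h2))
        return self.dec(torch.cat([u, h1], dim=1))          # skip connection

x = torch.randn(1, 3, 8, 32, 32)               # noisy 8-frame clip
eps_pred = TinyUNet3D()(x)                     # predicted noise, same shape as the input
```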
Spatio-Temporal Latent Space
Compressed and abstract representation of a video sequence, where spatial and temporal information is encoded in a low-dimensional vector or feature map, serving as the basis for video generation or manipulation.
Video Classifier-Free Guidance (CFG)
Method for controlling video generation without an explicit classifier, by training a model on both conditional (e.g., text) and unconditional data, then interpolating between their predictions to adjust adherence to the prompt while preserving diversity.
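A minimal sketch of the guidance step at sampling time; `eps_model` here is a dummy stand-in for a real text-conditioned denoiser, and the guidance scale is an illustrative value.

```python
import torch

def eps_model(x_t, t, cond):                   # dummy stand-in denoiser
    return x_t * 0.1 + cond.mean() * 0.01

def cfg_noise(x_t, t, text_emb, null_emb, guidance_scale=7.5):
    eps_cond = eps_model(x_t, t, text_emb)     # prediction conditioned on the prompt
    eps_uncond = eps_model(x_t, t, null_emb)   # unconditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

x_t = torch.randn(1, 3, 8, 64, 64)             # noisy video (or video latents)
eps = cfg_noise(x_t, t=500,
                text_emb=torch.randn(77, 768),
                null_emb=torch.zeros(77, 768))
```

Higher guidance scales push the sample toward the prompt at the cost of diversity; a scale of 1.0 recovers the purely conditional prediction.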
Temporal Diffusion Scheduling
Planning the number of denoising steps allocated to each frame or temporal segment, which can be uniform or adaptive, to optimize the quality and overall consistency of the generated video based on motion complexity.
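A minimal sketch of an adaptive allocation, assuming an illustrative motion heuristic (mean absolute frame difference) and a fixed total step budget.

```python
import torch

def allocate_steps(video, num_segments=4, total_steps=200, min_steps=20):
    """Give segments with more motion a larger share of the denoising steps."""
    b, c, t, h, w = video.shape
    seg_len = t // num_segments
    motion = []
    for s in range(num_segments):
        seg = video[:, :, s * seg_len:(s + 1) * seg_len]
        motion.append((seg[:, :, 1:] - seg[:, :, :-1]).abs().mean().item())
    weights = torch.tensor(motion)
    weights = weights / weights.sum()
    extra = total_steps - num_segments * min_steps
    return [min_steps + int(round(w.item() * extra)) for w in weights]

video = torch.randn(1, 3, 16, 32, 32)
print(allocate_steps(video))                   # four values summing to roughly total_steps
```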
Diffusion-based Temporal Super-Resolution
Application of diffusion models to increase the frame rate (fps) of a video, generating coherent intermediate frames that realistically interpolate motion and changes between existing frames.
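A minimal sketch of the conditioning pattern for frame interpolation: the noisy intermediate frame is denoised while concatenated with the two existing frames it must lie between; `eps_model` is a hypothetical stand-in denoiser.

```python
import torch
import torch.nn as nn

eps_model = nn.Conv2d(9, 3, 3, padding=1)      # 3 noisy + 3 left + 3 right input channels

def interp_noise(noisy_mid, frame_left, frame_right):
    cond = torch.cat([noisy_mid, frame_left, frame_right], dim=1)    # (B, 9, H, W)
    return eps_model(cond)                     # predicted noise for the inserted frame

left = torch.randn(1, 3, 64, 64)
right = torch.randn(1, 3, 64, 64)
noisy_mid = torch.randn(1, 3, 64, 64)          # frame to generate between left and right
eps = interp_noise(noisy_mid, left, right)
```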
Video Inpainting by Diffusion
Process of filling missing or masked areas in a video sequence using a diffusion model, which generates pixels that are spatially and temporally coherent based on the context of surrounding frames.
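A minimal sketch of one blending step used in diffusion inpainting (in the spirit of RePaint-style sampling): known pixels are replaced by a freshly noised copy of the original video, so only the masked region has to be generated.

```python
import torch

def inpaint_blend(x_t, original_video, mask, t, alphas_cumprod):
    """mask == 1 marks pixels to generate; mask == 0 marks known pixels."""
    a_bar = alphas_cumprod[t]
    known_noised = (a_bar.sqrt() * original_video
                    + (1 - a_bar).sqrt() * torch.randn_like(original_video))
    return mask * x_t + (1 - mask) * known_noised

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
video = torch.randn(1, 3, 8, 64, 64)
mask = torch.zeros_like(video)
mask[:, :, :, 16:48, 16:48] = 1.0              # hole to fill in every frame
x_t = torch.randn_like(video)                  # current noisy sample
x_t = inpaint_blend(x_t, video, mask, t=500, alphas_cumprod=alphas_cumprod)
```

Applied at every denoising step, this keeps the generated hole spatially and temporally consistent with the surrounding known content.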
Latent Motion Modeling
Technique where motion in a video is modeled and generated directly in the latent space, often by predicting displacements or transformations between latent codes of successive frames, before decoding them into images.
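A minimal sketch, assuming a hypothetical `motion_net` that predicts the displacement between the latent codes of consecutive frames; the next latent is obtained by applying the predicted delta rather than generating it from scratch.

```python
import torch
import torch.nn as nn

motion_net = nn.Linear(256, 256)               # stand-in for a real motion predictor

def next_latent(current_latent):
    delta = motion_net(current_latent)         # predicted latent-space motion
    return current_latent + delta

z0 = torch.randn(1, 256)                       # latent code of the first frame
latents = [z0]
for _ in range(7):                             # roll out 8 frames in latent space
    latents.append(next_latent(latents[-1]))
# Each latent is then decoded back into an image frame.
```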
Temporal Consistency by Constraint
Approach that adds an explicit penalty or constraint in the model's loss function to discourage appearance changes (e.g., color, texture) unrelated to motion between adjacent frames, promoting visual stability.
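A minimal sketch of the simplest form of such a constraint: a penalty on raw frame-to-frame differences. Real systems usually compensate for motion first (e.g. by warping with optical flow) so that only changes unrelated to motion are penalized.

```python
import torch

def temporal_consistency_loss(frames, weight=0.1):
    """frames: (B, C, T, H, W) generated video; penalize adjacent-frame changes."""
    diff = frames[:, :, 1:] - frames[:, :, :-1]
    return weight * diff.abs().mean()

generated = torch.randn(2, 3, 8, 64, 64, requires_grad=True)
loss = temporal_consistency_loss(generated)    # added to the main diffusion loss
loss.backward()                                # gradients discourage flicker
```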
Spatio-Temporal Noise Decomposition
Advanced method where the noise added and predicted by the model is decomposed into a spatial component (appearance) and a temporal component (motion), allowing finer control and more robust generation of dynamic videos.
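A minimal sketch of one way to decompose the noise: a spatial map shared across all frames (appearance) mixed with a per-frame residual (motion) so the result keeps unit variance; the mixing weight is an illustrative choice.

```python
import torch

def decomposed_noise(b, c, t, h, w, alpha=0.7):
    shared = torch.randn(b, c, 1, h, w).expand(b, c, t, h, w)   # same map for every frame
    residual = torch.randn(b, c, t, h, w)                       # frame-specific component
    return alpha * shared + (1 - alpha ** 2) ** 0.5 * residual  # unit-variance mixture

noise = decomposed_noise(1, 3, 8, 64, 64)      # used in place of fully i.i.d. noise
```

A larger alpha ties the frames' appearance more tightly together; a smaller alpha leaves more room for frame-to-frame change.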
Auto-Regression on Diffusion Latents
Hybrid strategy that generates a video auto-regressively frame by frame, where each latent frame is produced by a diffusion step conditioned on previous latent frames, combining the consistency of auto-regression and the quality of diffusion.
Temporal Feature Normalization
Normalization layer applied to the temporal dimension of feature maps in a 3D U-Net, stabilizing training by ensuring that the distribution of activations remains consistent across different temporal stages of the sequence.
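A minimal sketch of normalization along the temporal axis of a 3-D feature map; which axes are pooled over varies between implementations, so this is one reasonable choice.

```python
import torch

def temporal_norm(features, eps=1e-5):
    """features: (B, C, T, H, W); standardize each position across its T values."""
    mean = features.mean(dim=2, keepdim=True)
    var = features.var(dim=2, keepdim=True, unbiased=False)
    return (features - mean) / (var + eps).sqrt()

feats = torch.randn(1, 32, 8, 16, 16)
normed = temporal_norm(feats)                  # zero mean, unit variance along T
```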