AI Glossary
The Complete Dictionary of Artificial Intelligence
Multimodal fusion
Process of integrating representations from different modalities (here, audio and video) into a single, richer representation of the content.
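As a minimal sketch of two common ways this integration can be done, the snippet below contrasts early fusion (concatenating feature vectors) with late fusion (averaging per-modality scores). The function names, the weighting parameter, and the example values are illustrative assumptions, not part of the glossary entry.

```python
def early_fusion(audio_features, video_features):
    """Early fusion: concatenate the per-modality feature vectors."""
    return list(audio_features) + list(video_features)


def late_fusion(audio_score, video_score, audio_weight=0.5):
    """Late fusion: weighted average of per-modality prediction scores.

    audio_weight is a hypothetical tuning parameter in [0, 1].
    """
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


fused_vector = early_fusion([0.1, 0.2], [0.3])  # one vector for a joint model
fused_score = late_fusion(0.8, 0.4)             # one score from two models
```

Early fusion lets a downstream model see both modalities at once, while late fusion keeps the per-modality models independent and only combines their outputs.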
Joint representation
Shared vector space where audio and video features are projected to capture their common semantic relationships.
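The idea can be sketched with two linear projections that map audio and video features of different sizes into the same 2-dimensional space, where they become directly comparable. The matrices W_AUDIO and W_VIDEO are hand-picked placeholders for this illustration; in a real system they would be learned.

```python
import math


def project(features, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical projection matrices (assumed, not learned here).
W_AUDIO = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]            # 3-dim audio -> 2-dim shared space
W_VIDEO = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]       # 4-dim video -> 2-dim shared space

audio_embedding = project([2.0, 0.0, 7.0], W_AUDIO)
video_embedding = project([1.0, 3.0, 0.0, 0.0], W_VIDEO)
similarity = cosine_similarity(audio_embedding, video_embedding)
```

Once both modalities live in the same space, a similarity close to 1 can be read as the two inputs describing related content, under the assumption that the projections were trained for that purpose.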
Temporal alignment
Process of precisely synchronizing audio and video events on a common timeline so that causal and semantic relationships between them can be established.
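One simple way to estimate the time offset between the two streams is a brute-force cross-correlation over candidate lags, sketched below. The signals (a per-frame audio energy and a per-frame video motion score) and the function name best_lag are assumptions made for this illustration; both signals are assumed to be sampled at the same frame rate.

```python
def best_lag(audio, video, max_lag):
    """Return the frame shift of `video` relative to `audio` that
    maximizes the overlap (dot product) of the two signals."""
    def overlap(lag):
        total = 0.0
        for i, a in enumerate(audio):
            j = i + lag
            if 0 <= j < len(video):
                total += a * video[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=overlap)


# Toy signals: audio onsets at frames 2 and 6, visual events at 4 and 8.
audio_energy = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
video_motion = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

lag = best_lag(audio_energy, video_motion, max_lag=3)
```

A positive lag means the video events trail the audio events by that many frames, so shifting the video earlier by the same amount would line the streams up.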
Multimodal transformer model
Neural architecture based on attention mechanisms specifically designed to simultaneously process and integrate audio and video data.
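The attention step at the core of such models can be sketched as scaled dot-product cross-attention, in which audio vectors act as queries over video keys and values. This is a bare illustration under simplifying assumptions: it omits the learned projections, multiple heads, and positional information that a full transformer would use, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """For each query (e.g. an audio frame), return a weighted mix of the
    values (e.g. video frames), weighted by query-key similarity."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


audio_queries = [[1.0, 0.0]]
video_keys = [[0.0, 0.0], [0.0, 0.0]]      # identical keys -> uniform weights
video_values = [[1.0, 2.0], [3.0, 4.0]]
attended = cross_attention(audio_queries, video_keys, video_values)
```

With identical keys the attention weights are uniform, so the output is simply the mean of the video values; with distinct keys, each audio frame would draw more heavily on the video frames it matches best.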
Joint feature extraction
Process of identifying and extracting attributes that emerge only when the audio and video modalities are considered together.
Cross-modal correlation
Statistical measure of the dependence between audio and video signals, used to quantify how strongly they are semantically associated.
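As one concrete instance of such a measure, the Pearson correlation coefficient between two equally sampled signals can be computed as sketched below. Pearson correlation captures only linear dependence, so treating it as a proxy for semantic association is an assumption of this example; the choice of signals is likewise illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear relationship
    return cov / math.sqrt(var_x * var_y)


# Toy per-frame signals: loudness and amount of visual motion.
audio_loudness = [1.0, 2.0, 3.0, 4.0]
video_motion = [2.0, 4.0, 6.0, 8.0]
correlation = pearson(audio_loudness, video_motion)
```

Values near 1 or -1 indicate a strong linear relationship between the two signals, and values near 0 indicate little or none.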
Audio-video segmentation
Joint division of audio and video streams into coherent temporal segments based on shared semantic or thematic changes.
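A simple rule-based sketch of joint segmentation places a boundary only where both modalities report a strong change at the same frame. The per-frame change scores, the threshold, and the function names are assumptions for illustration; practical systems typically learn the boundary criterion instead.

```python
def joint_boundaries(audio_change, video_change, threshold=0.5):
    """Frame indices where BOTH change scores exceed the threshold."""
    return [
        t for t, (a, v) in enumerate(zip(audio_change, video_change))
        if a > threshold and v > threshold
    ]


def split_segments(num_frames, boundaries):
    """Turn boundary indices into half-open (start, end) frame ranges."""
    edges = [0] + [b for b in boundaries if 0 < b < num_frames] + [num_frames]
    return list(zip(edges[:-1], edges[1:]))


# Toy change scores per frame (higher = larger change from previous frame).
audio_change = [0.1, 0.2, 0.9, 0.1, 0.7, 0.1]
video_change = [0.0, 0.8, 0.8, 0.1, 0.2, 0.1]

boundaries = joint_boundaries(audio_change, video_change)
segments = split_segments(len(audio_change), boundaries)
```

Frames 1 and 4 show a change in only one modality and are therefore not treated as boundaries, which is what distinguishes joint segmentation from segmenting each stream separately.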
Multimodal reconstruction
Task of generating a missing modality (audio or video) from the available modality, using jointly learned representations.