AI Glossary
The complete dictionary of artificial intelligence
Vision-Language Pre-training
Self-supervised learning approach where models are pre-trained on large corpora of images and associated texts. Establishes fundamental mappings between visual concepts and linguistic descriptions before fine-tuning.
Joint Representation Learning
Process of simultaneously learning features shared across multiple modalities to create a unified representation. Captures inter-modal correlations and complementarities in a single vector.
Modal Fusion
Integration of information from different modalities into a single enriched, coherent representation. Fusion can take place at the feature level (early fusion), at the decision level (late fusion), or in intermediate layers, so that the output combines the respective strengths of each modality.
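To make the idea concrete, the following is a minimal sketch of two common fusion strategies, assuming toy feature vectors and per-class scores. The function names `early_fusion` and `late_fusion` and all numeric values are illustrative assumptions, not part of any particular library.

```python
def early_fusion(image_features, text_features):
    """Feature-level fusion: concatenate the two feature vectors."""
    return list(image_features) + list(text_features)


def late_fusion(image_scores, text_scores, image_weight=0.5):
    """Decision-level fusion: weighted average of per-class scores."""
    if len(image_scores) != len(text_scores):
        raise ValueError("both modalities must score the same classes")
    text_weight = 1.0 - image_weight
    return [image_weight * i + text_weight * t
            for i, t in zip(image_scores, text_scores)]


# Toy inputs (hypothetical values, for illustration only).
fused_features = early_fusion([0.2, 0.9, 0.1], [0.7, 0.3])
fused_scores = late_fusion([0.8, 0.2], [0.4, 0.6], image_weight=0.5)
```

Early fusion lets a downstream model learn interactions between modalities, while late fusion keeps each modality's model independent and only merges their predictions.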
Grounding
Process of associating abstract concepts (often textual) with concrete elements in another modality (typically visual). Establishes direct links between words and specific regions or objects in images.
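One simple way to picture grounding is as a nearest-neighbor search: given a phrase embedding and a set of image-region embeddings in the same space, pick the region with the highest cosine similarity. The sketch below assumes such embeddings already exist; the helper `ground_phrase` and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground_phrase(phrase_embedding, region_embeddings):
    """Return (index, score) of the region most similar to the phrase."""
    scores = [cosine_similarity(phrase_embedding, r) for r in region_embeddings]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]


# Hypothetical embeddings of three image regions in a shared 3-d space.
regions = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
phrase = [0.0, 0.9, 0.1]  # assumed embedding of a word such as "dog"

best_region, best_score = ground_phrase(phrase, regions)
```

Real grounding models usually learn the region proposals and the embeddings jointly; this sketch only shows the final matching step.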
Alignment Loss
Loss function designed to optimize semantic matching between elements of different modalities. Guides learning so that matched elements, such as an image and its caption, lie close together in the shared representation space while mismatched elements lie farther apart.
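As an illustration, a widely used family of alignment losses is the symmetric contrastive (InfoNCE-style) loss. The sketch below is one possible formulation, assuming a square similarity matrix whose diagonal holds the matched image-text pairs; the function names and the default `temperature` of 0.07 are assumptions chosen for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def contrastive_alignment_loss(similarity, temperature=0.07):
    """Symmetric InfoNCE-style loss over a square similarity matrix.

    similarity[i][j] is the similarity between image i and text j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    # Image-to-text direction: each row should peak on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairs score highest

loss_aligned = contrastive_alignment_loss(aligned)
loss_misaligned = contrastive_alignment_loss(misaligned)
```

The loss is close to zero when each image is most similar to its own text and grows as mismatched pairs score higher than matched ones.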
Semantic Consistency
Principle that multimodal representations preserve the same meaning across different modalities. Semantically equivalent elements, such as an image and its caption, should share similar representations.
Multimodal Pre-training
Phase in which a multimodal model's weights are initialized by training on massive unannotated data. Develops fundamental alignment capabilities before adaptation to specific tasks.
Modal Alignment Metrics
Quantitative indicators evaluating the quality of correspondence between representations of different modalities. Measure the accuracy and semantic consistency of learned alignments.
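One commonly reported metric of this kind is Recall@K for cross-modal retrieval. The sketch below assumes a similarity matrix in which the correct candidate for query i is candidate i; the function name `recall_at_k` and the toy scores are illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] scores query i against candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-image similarity scores (hypothetical values).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked first
    [0.2, 0.4, 0.8],  # query 1: correct item ranked second
    [0.7, 0.6, 0.1],  # query 2: correct item ranked third
]

r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
r3 = recall_at_k(similarity, 3)
```

Reporting Recall@K at several values of K, and in both retrieval directions, gives a fuller picture than a single number.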
Weakly Supervised Alignment
Learning approach using partial or noisy annotations to align modalities. Reduces dependency on labeled data while maintaining reasonable alignment performance.
Self-supervised Multimodal Learning
Paradigm in which the model learns alignments automatically by exploiting natural correlations between modalities in unannotated data. Generates intrinsic learning signals from the multimodal structure of the data.