AI Glossary

The Complete Artificial Intelligence Dictionary

162 categories · 2,032 subcategories · 23,060 terms

Multimodal Attribute Fusion

Process of combining features from different modalities (text, image, sound) into a unified representation for a learning model, aiming to capture complex interactions between data sources.
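A minimal NumPy sketch of the simplest variant, concatenation-based fusion; the feature vectors, dimensions, and projection matrix below are all hypothetical:

```python
import numpy as np

# Hypothetical per-modality feature vectors from separate encoders
text_feat = np.array([0.2, 0.8, 0.1])   # e.g. a sentence embedding
image_feat = np.array([0.5, 0.3])       # e.g. pooled CNN features
audio_feat = np.array([0.9])            # e.g. a spectrogram summary

# Early fusion: concatenate into one joint vector
fused = np.concatenate([text_feat, image_feat, audio_feat])

# A (here random) learned projection maps the fused vector into a
# shared latent space where cross-modal interactions are modeled
rng = np.random.default_rng(0)
W = rng.normal(size=(4, fused.shape[0]))
joint = W @ fused  # unified 4-dimensional representation
```

In practice the projection is learned end to end, and richer fusion operators (gated sums, attention) replace plain concatenation.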

Explanation by Projection

Interpretability method that involves projecting the contribution of a complex modality (e.g., an image) onto a simpler and interpretable space (e.g., keywords or concepts) to explain its influence on the model's prediction.

Multimodal Salience Map

Visualization that highlights the most influential regions or segments of each modality (pixels of an image, words of a text, audio segments) for a specific model decision, often by overlaying contributions on the original data.
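For a linear scorer the idea reduces to a gradient-times-input map; the toy weights and 2x2 "image" below are invented for illustration:

```python
import numpy as np

# Toy linear scorer over a flattened 2x2 "image"; for a linear model
# the gradient of the score w.r.t. each pixel is simply its weight
w = np.array([0.5, -1.0, 2.0, 0.1])
x = np.array([1.0, 0.8, 0.2, 0.9])
score = float(w @ x)

# Gradient-times-input salience, reshaped back onto the image grid;
# large values mark pixels that drive this particular score
salience = np.abs(w * x).reshape(2, 2)
```

For a deep multimodal model the same map is computed per modality via automatic differentiation and overlaid on the original pixels, tokens, or audio frames.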

Inter-modality Semantic Alignment

Technique aimed at establishing semantic correspondences between elements of different modalities (e.g., linking a word to an image region or a sound to an action), crucial for the model to understand relationships and provide coherent explanations.

Modality-wise Decomposition

Explainability approach that isolates and quantifies the individual contribution of each input modality to the final prediction, allowing understanding of whether a decision is primarily guided by text, image, or sound.
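One simple way to realize this is ablation on a linear model, zeroing each modality's slice of a fused feature vector; the slices and weights below are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical fused input with known per-modality feature slices
x = rng.normal(size=8)
slices = {"text": slice(0, 3), "image": slice(3, 6), "audio": slice(6, 8)}
w = rng.normal(size=8)  # weights of a toy linear predictor

def predict(v):
    return float(w @ v)

full = predict(x)
# Contribution of a modality = score drop when its features are zeroed
contribs = {}
for name, s in slices.items():
    ablated = x.copy()
    ablated[s] = 0.0
    contribs[name] = full - predict(ablated)
# For a linear model these contributions sum exactly to the full score
```

Nonlinear models break this exact additivity, which is why Shapley-style averaging over ablation orders is often used instead.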

Multimodal Concept Bottleneck

Model architecture where the final prediction is conditioned on a set of interpretable concepts, themselves inferred from the fusion of modalities, offering clear traceability from raw data to concepts then to decision.
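A sketch of the two-stage structure, with invented concept names, scores, and weights:

```python
import numpy as np

# Stage 1 (assumed done upstream): the fused multimodal input is
# mapped to scores over a small set of human-readable concepts
concepts = ["has_wings", "makes_sound", "is_metallic"]
concept_scores = np.array([0.9, 0.8, 0.1])

# Stage 2: the final decision uses ONLY the concept scores, so every
# prediction traces back through nameable concepts (toy weights)
w_final = np.array([1.5, 0.5, -2.0])
logit = float(w_final @ concept_scores)
label = "bird" if logit > 0 else "drone"
```

The traceability comes from the bottleneck: the second stage cannot see raw pixels or tokens, only the named concepts.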

Orthogonality Regularization

Constraint applied during training to force the representations of different modalities in the shared latent space to be as independent as possible, avoiding redundancy and improving clarity of modality-specific explanations.
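One common form of the constraint penalizes the cross-covariance between the two modalities' batch representations; the batch size and dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
# Batches of latent representations for two modalities (batch x dim)
z_text = rng.normal(size=(16, 4))
z_image = rng.normal(size=(16, 4))

def ortho_penalty(a, b):
    # Squared Frobenius norm of the cross-covariance: it vanishes
    # when the two modality representations are decorrelated
    cross = a.T @ b / a.shape[0]
    return float(np.sum(cross ** 2))

# Added (scaled by a coefficient) to the task loss during training
loss_reg = ortho_penalty(z_text, z_image)
```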

Multimodal Counterfactual Explanation

Generation of modified examples (by changing one or more modalities) that are sufficient to reverse the model's prediction, helping to understand the minimal conditions and interactions between modalities necessary for a decision.
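A toy search over single-modality ablations that flip a decision; the weights, features, and bias are hypothetical:

```python
# Toy decision rule: positive class if weighted modality evidence
# outweighs a bias term (all numbers invented)
w = {"text": 0.6, "image": 0.9, "audio": -0.2}
x = {"text": 1.0, "image": 1.0, "audio": 1.0}
bias = -1.0

def predict(feats):
    return sum(w[m] * feats[m] for m in w) + bias > 0

# Counterfactual probe: which single-modality ablations are enough
# to reverse the (originally positive) prediction?
flips = [m for m in x if not predict({**x, m: 0.0})]
# Here removing the text or the image evidence flips the decision,
# while removing the audio does not
```

Real counterfactual methods search for minimal, plausible edits (e.g. masking an image region or rewriting a phrase) rather than zeroing whole modalities.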

Late Fusion for Interpretability

Strategy where each modality is processed by a specialized model up to an intermediate decision, with results then being fused. This approach facilitates interpretation by isolating the logic of each modality before the final combination.
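A sketch with three hypothetical per-modality classifiers whose class probabilities are combined by weighted averaging:

```python
import numpy as np

# Intermediate decisions from modality-specific models, each of which
# can be inspected and explained on its own (values hypothetical)
p_text = np.array([0.7, 0.3])
p_image = np.array([0.4, 0.6])
p_audio = np.array([0.2, 0.8])

# Late fusion: weighted average of the per-modality probabilities
# (weights stand in for learned or estimated modality reliability)
weights = np.array([0.5, 0.3, 0.2])
fused = weights[0] * p_text + weights[1] * p_image + weights[2] * p_audio
pred = int(np.argmax(fused))  # final class after fusion
```

Because each modality commits to an intermediate decision first, disagreements between modalities are directly visible before the combination step.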

Foundational Vision-Language Model

Large-scale model pre-trained on massive corpora of textual and visual data, capable of understanding and generating content from these two modalities, whose interpretability is a major challenge due to its intrinsic complexity.

Modal Role Analysis

Systematic evaluation of the role played by each modality in different tasks or contexts, determining whether a modality acts as contextual support, a primary information source, or a modifier for others.

Visuo-Linguistic Grounding

Process of anchoring linguistic symbols (words, phrases) to concrete entities or concepts in visual data, fundamental for model explanations linking text and image to be semantically correct and understandable.

Fusion Node Interpretability

Method that focuses on analyzing the specific neurons or layers where the fusion of multimodal information occurs, to understand how interactions are encoded and how they influence the model's output.

Cross-Modal Gradient Explanation

Interpretability technique that calculates the gradient of the model's output with respect to the features of one modality, while conditioning this calculation on the features of another modality, thus revealing inter-modal dependencies.
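For a bilinear text-image scorer this conditioning is explicit: the gradient with respect to the image features is a function of the text features. All values below are toy numbers:

```python
import numpy as np

# Bilinear interaction scorer: score = text^T M image, so the image
# explanation is explicitly conditioned on the accompanying text
M = np.array([[0.5, -1.0],
              [2.0,  0.3]])
x_text = np.array([1.0, -0.5])
x_image = np.array([0.2, 0.8])

score = float(x_text @ M @ x_image)
# d(score)/d(image) = M^T text: change the text and the image
# attribution changes with it, revealing the inter-modal dependency
grad_image = M.T @ x_text
```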
