
AI Glossary

The complete dictionary of artificial intelligence

162 categories · 2,032 subcategories · 23,060 terms

Text-to-Image Synthesis

Generation of photorealistic or stylized images from textual descriptions using generative models like GANs or diffusion models. These models understand text semantics to create coherent and detailed visuals.


Image-to-Text Translation

Automatic conversion of visual content from images into descriptive text using vision-language models. This technology underpins applications like automatic captioning and visual accessibility.


Diffusion Models

Generative models that learn to progressively denoise data to generate high-quality samples, particularly effective for text-to-image synthesis. These models use forward and reverse diffusion processes for generation.
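The forward diffusion process mentioned above has a convenient closed form: instead of noising step by step, x_t can be sampled directly from x_0. A minimal sketch, assuming a standard linear beta schedule (the function and values here are illustrative, not any particular library's API):

```python
import math
import random

def forward_diffuse(x0, t, betas, rng=random.Random(0)):
    """Closed-form forward diffusion: sample x_t directly from x_0.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta_s).
    """
    alpha_bar = 1.0
    for s in range(t + 1):
        alpha_bar *= 1.0 - betas[s]
    scale = math.sqrt(alpha_bar)      # how much signal survives at step t
    sigma = math.sqrt(1.0 - alpha_bar)  # how much noise has been mixed in
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]

# A linear beta schedule; as t grows, the signal shrinks toward pure noise.
betas = [0.0001 + (0.02 - 0.0001) * i / 999 for i in range(1000)]
x0 = [1.0, -1.0, 0.5]
noisy = forward_diffuse(x0, 999, betas)
```

The reverse process is what the neural network learns: predicting the noise added at each step so it can be subtracted back out during generation.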


Multimodal Transformers

Transformer architecture adapted to simultaneously process multiple data modalities (text, image, audio) through cross-modal attention mechanisms. These models unify the representation and processing of heterogeneous data.


Vision-Language Models

AI models designed to understand and generate content combining visual and linguistic information, such as CLIP, BLIP, or ALIGN. They learn joint representations through pre-training on large image-text corpora.


Multimodal Embeddings

Vector representations in a shared space where different modalities (text, image, audio) can be compared and manipulated mathematically. These embeddings enable cross-modal semantic operations like search and similarity.
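The "cross-modal semantic operations" described above often reduce to vector arithmetic. A toy sketch with hypothetical embedding values (real vectors would come from trained text and image encoders):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors in a shared embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for encoder outputs (illustrative values only).
text_emb = [0.9, 0.1, 0.3]
image_emb = [0.8, 0.2, 0.25]
unrelated_emb = [-0.7, 0.6, -0.1]

# Cross-modal retrieval reduces to ranking candidates by similarity.
best = max([image_emb, unrelated_emb],
           key=lambda v: cosine_similarity(text_emb, v))
```

Because all modalities live in the same space, the same ranking works for text-to-image search, image-to-text search, or deduplication.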


Text-to-Video Generation

Generation of coherent video sequences from textual descriptions, modeling both spatial content and temporal dynamics. These models combine natural language understanding and frame-by-frame video generation.


Image Captioning

Automatic generation of textual descriptions depicting image content, combining computer vision and natural language processing. Modern models use CNN or ViT encoders and transformer decoders.
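The transformer decoder mentioned above typically emits the caption one token at a time; the simplest decoding strategy is greedy search. A sketch with made-up per-step scores (a real decoder would produce these autoregressively from the image encoder's features):

```python
def greedy_caption(step_logits, vocab, eos="<eos>"):
    """Greedy decoding: at each step, pick the highest-scoring word.

    `step_logits` is a list of per-step score lists over `vocab`;
    decoding stops when the end-of-sequence token wins.
    """
    words = []
    for logits in step_logits:
        word = vocab[max(range(len(vocab)), key=lambda i: logits[i])]
        if word == eos:
            break
        words.append(word)
    return " ".join(words)

# Hypothetical decoder outputs for a four-word vocabulary.
vocab = ["a", "dog", "runs", "<eos>"]
steps = [[2.0, 0.1, 0.1, 0.0],
         [0.1, 3.0, 0.2, 0.0],
         [0.0, 0.1, 2.5, 0.3],
         [0.0, 0.0, 0.0, 4.0]]
caption = greedy_caption(steps, vocab)
```

Production systems usually prefer beam search or sampling over pure greedy decoding, trading speed for caption diversity.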


Visual Question Answering

System that answers textual questions about image content, requiring joint understanding of vision and language. VQA combines object detection, spatial reasoning, and linguistic comprehension.


Multimodal Fusion

Integration of information from different modalities to create a unified representation richer than each modality separately. Strategies include early fusion, late fusion, and attention-based fusion.
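The difference between early and late fusion is easiest to see in code. A minimal sketch (feature vectors and scores here are placeholders for real encoder outputs):

```python
def early_fusion(text_feats, image_feats):
    """Early fusion: concatenate raw features before any joint processing.

    The downstream model sees one combined vector and can learn
    interactions between modalities."""
    return text_feats + image_feats

def late_fusion(text_score, image_score, w_text=0.5):
    """Late fusion: each modality is scored independently, then the
    per-modality decisions are combined (here, a weighted average)."""
    return w_text * text_score + (1.0 - w_text) * image_score
```

Attention-based fusion sits between the two: modalities stay separate but exchange information through cross-attention layers at intermediate depths.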


Neural Style Transfer

Deep learning technique that separates and recombines the content and style of images to create digital artworks. It uses convolutional neural networks to capture stylistic and content features.
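The "stylistic features" mentioned above are classically represented by the Gram matrix of a convolutional layer's feature maps: pairwise channel correlations that capture texture while discarding spatial layout. A pure-Python sketch (real implementations operate on tensors from a pretrained CNN):

```python
def gram_matrix(features):
    """Gram matrix of feature maps: channel-by-channel correlations.

    `features` is a list of C flattened feature maps (each of length H*W);
    G[i][j] measures how strongly channels i and j co-activate, which is
    the statistic style transfer matches between style and generated images.
    """
    C = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(C)]
            for i in range(C)]
```

The style loss is then the squared difference between the Gram matrices of the style image and the generated image, while a separate content loss compares raw feature maps.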


Text-to-Speech Synthesis

Conversion of written text into natural human speech using deep neural networks like Tacotron or WaveNet. Modern systems generate waveforms directly or via intermediate spectrograms.


Speech-to-Text Transcription

Automatic conversion of speech into written text using end-to-end models like transformers or conformers. These systems transform audio signals into sequences of characters or words.


Audio-Visual Learning

Machine learning combining audio and video information simultaneously to enhance understanding of multimodal scenes. This approach exploits the natural correlation between sounds and visual events.


Multimodal Alignment

Process of learning semantic correspondences between different modalities in a common representation space. Alignment is crucial for cross-modal translation and retrieval tasks.
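Alignment is commonly learned with a CLIP-style contrastive objective: in a batch of matched image-text pairs, each image's embedding is pulled toward its own caption and pushed away from all others. A simplified sketch of that loss (the similarity matrix and temperature value are illustrative):

```python
import math

def contrastive_alignment_loss(sim, temperature=0.07):
    """CLIP-style alignment loss over a batch similarity matrix.

    sim[i][j] is the similarity between image i and text j; matched
    pairs sit on the diagonal. For each row we take softmax
    cross-entropy with the diagonal entry as the correct class.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax at the matched index
    return total / n
```

When the diagonal dominates, the loss approaches zero; when similarities are uniform, it approaches log(batch size), so minimizing it drives the two modalities into alignment.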
