Multimodal Translation - AI Glossary

Text-to-Image Synthesis

Generation of photorealistic or stylized images from textual descriptions using generative models such as GANs or diffusion models. The model conditions generation on an encoding of the prompt so that the image is coherent, detailed, and consistent with the described content.

Image-to-Text Translation

Automatic conversion of visual content from images into descriptive text using vision-language models. This technology underpins applications like automatic captioning and visual accessibility.

Diffusion Models

Generative models that learn to progressively denoise data to generate high-quality samples, particularly effective for text-to-image synthesis. A forward diffusion process gradually adds noise to training data, and a learned reverse process removes it step by step, starting from pure noise at generation time.
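
The forward (noising) half of the process has a simple closed form, sketched below in plain Python. The linear noise schedule, the 1000-step horizon, and all function names are assumptions chosen for illustration; real systems learn a neural network for the reverse process, which is not shown here.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise_scale * e for x, e in zip(x0, noise)]
    return xt, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                              # toy "image" of four values
x_early, _ = forward_diffuse(x0, 0, alpha_bars, rng)    # almost unchanged
x_late, _ = forward_diffuse(x0, 999, alpha_bars, rng)   # almost pure noise
```

Under this schedule the signal coefficient shrinks toward zero as t grows, which is why the final step is close to Gaussian noise. A denoising network is typically trained to predict the returned noise from x_t and t.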

Multimodal Transformers

Transformer architecture adapted to simultaneously process multiple data modalities (text, image, audio) through cross-modal attention mechanisms. These models unify the representation and processing of heterogeneous data.
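
A minimal sketch of cross-modal attention, assuming text tokens act as queries over image-patch keys and values. The learned query, key, and value projections and the multi-head split used in real transformers are omitted, and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

text_tokens = [[1.0, 0.0], [0.0, 1.0]]                  # two toy text-token vectors
image_patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]    # three toy patch vectors
attended, weights = cross_attention(text_tokens, image_patches, image_patches)
```

Each row of `weights` sums to 1 and shows how strongly one text token draws on each image patch; here the first token attends mostly to the first patch and the second token to the second.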

Vision-Language Models

AI models designed to understand and generate content combining visual and linguistic information, such as CLIP, BLIP, or ALIGN. They learn joint representations through pre-training on large image-text corpora.

Multimodal Embeddings

Vector representations in a shared space where different modalities (text, image, audio) can be compared and manipulated mathematically. These embeddings enable cross-modal semantic operations like search and similarity.
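
Cross-modal search in a shared space usually reduces to ranking by a similarity measure. The sketch below assumes cosine similarity and uses hand-written three-dimensional vectors and file names that are purely hypothetical; in practice the embeddings would come from trained text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, candidates):
    """Rank candidates (name -> embedding) by similarity to the query, best first."""
    scored = [(name, cosine_similarity(query_embedding, emb))
              for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical image embeddings living in the same space as text embeddings.
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "car_photo.jpg": [0.0, 0.2, 0.9],
    "cat_photo.jpg": [0.7, 0.6, 0.1],
}
text_query = [1.0, 0.2, 0.0]   # hypothetical embedding of the text "a dog"
ranking = retrieve(text_query, image_embeddings)
```

With these toy vectors the ranking is dog, cat, car, because the text query points in nearly the same direction as the dog image vector.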

Text-to-Video Generation

Generation of coherent video sequences from textual descriptions, modeling both spatial content and temporal dynamics. These models combine natural language understanding and frame-by-frame video generation.

Image Captioning

Automatic generation of textual descriptions depicting image content, combining computer vision and natural language processing. Modern models use CNN or ViT encoders and transformer decoders.

Visual Question Answering

System that answers textual questions about image content, requiring joint understanding of vision and language. VQA combines object detection, spatial reasoning, and linguistic comprehension.

Multimodal Fusion

Integration of information from different modalities to create a unified representation richer than each modality separately. Strategies include early fusion, late fusion, and attention-based fusion.
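
The difference between early and late fusion can be sketched in a few lines. This is an illustrative simplification: early fusion is shown as plain feature concatenation and late fusion as a weighted average of per-modality class scores, with all names and numbers assumed for the example. Attention-based fusion is not shown.

```python
def early_fusion(text_features, image_features):
    """Early fusion: concatenate raw features so one joint model sees both."""
    return list(text_features) + list(image_features)

def late_fusion(score_lists, weights=None):
    """Late fusion: combine the class scores of separate per-modality models."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    num_classes = len(score_lists[0])
    return [sum(w * scores[c] for w, scores in zip(weights, score_lists))
            for c in range(num_classes)]

text_features, image_features = [0.3, 0.7], [0.1, 0.5, 0.9]
joint_input = early_fusion(text_features, image_features)   # length-5 joint vector

text_scores, image_scores = [0.2, 0.8], [0.6, 0.4]          # toy two-class outputs
fused_scores = late_fusion([text_scores, image_scores])     # equal weights
```

Early fusion lets a single model learn interactions between modalities from the start, while late fusion keeps the per-modality models independent and only merges their decisions.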

Neural Style Transfer

Deep learning technique that separates and recombines the content and style of images to create digital artworks. It uses convolutional neural networks to capture stylistic and content features.
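
One common way to capture style from CNN features is the Gram matrix of channel activations, which records which channels fire together while discarding where they fire. The entry above does not name a specific formulation, so treat the following as an assumed, simplified sketch with toy feature maps in place of real network activations.

```python
def gram_matrix(feature_maps):
    """feature_maps: C channels, each a flat list of H*W activations.

    Entry (i, j) is the average product of channels i and j over all positions.
    """
    size = len(feature_maps[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / size for fj in feature_maps]
            for fi in feature_maps]

def style_loss(gram_a, gram_b):
    """Mean squared difference between two Gram matrices."""
    n = len(gram_a)
    return sum((gram_a[i][j] - gram_b[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]   # 2 channels, 4 positions
shuffled = [[4.0, 3.0, 2.0, 1.0], [1.0, 0.0, 1.0, 0.0]]   # same values, positions reversed
gram = gram_matrix(features)
loss_same_style = style_loss(gram, gram_matrix(shuffled))
```

Reordering spatial positions leaves the Gram matrix unchanged, so `loss_same_style` is zero here. That position independence is what makes the statistic a plausible summary of style rather than content.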

Text-to-Speech Synthesis

Conversion of written text into natural human speech using deep neural networks like Tacotron or WaveNet. Modern systems generate waveforms directly or via intermediate spectrograms.

Speech-to-Text Transcription

Automatic conversion of speech into written text using end-to-end models like transformers or conformers. These systems transform audio signals into sequences of characters or words.
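
As an illustration of the last step, turning per-frame model outputs into characters, the sketch below uses greedy CTC-style decoding. CTC is an assumption here, since the entry does not say which output scheme a given model uses, and the alphabet and frame probabilities are invented for the example.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank_index=0):
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, previous = [], None
    for idx in best:
        if idx != previous and idx != blank_index:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)

alphabet = ["_", "a", "c", "t"]      # index 0 is the blank symbol
frames = [
    [0.1, 0.1, 0.7, 0.1],   # c
    [0.1, 0.1, 0.7, 0.1],   # c (repeat, merged)
    [0.7, 0.1, 0.1, 0.1],   # blank
    [0.1, 0.7, 0.1, 0.1],   # a
    [0.1, 0.7, 0.1, 0.1],   # a (repeat, merged)
    [0.7, 0.1, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.1, 0.7],   # t
]
transcript = ctc_greedy_decode(frames, alphabet)   # "cat"
```

The blank symbol lets the decoder distinguish a held sound from a genuinely doubled letter: two "a" frames separated by a blank decode to "aa", while two adjacent "a" frames decode to a single "a".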

Audio-Visual Learning

Machine learning combining audio and video information simultaneously to enhance understanding of multimodal scenes. This approach exploits the natural correlation between sounds and visual events.

Multimodal Alignment

Process of learning semantic correspondences between different modalities in a common representation space. Alignment is crucial for cross-modal translation and retrieval tasks.
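
Alignment is often trained with a contrastive objective that pulls matching image-text pairs together and pushes mismatched pairs apart. The sketch below assumes a symmetric cross-entropy over a similarity matrix, in the style popularized by CLIP-like models; the temperature value and the toy matrices are illustrative assumptions.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]

def contrastive_alignment_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] scores image i against text j; true pairs lie on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax_row(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

aligned = [[1.0, 0.1], [0.0, 0.9]]      # matching pairs score highest
misaligned = [[0.1, 1.0], [0.9, 0.0]]   # mismatched pairs score highest
loss_aligned = contrastive_alignment_loss(aligned)
loss_misaligned = contrastive_alignment_loss(misaligned)
```

The loss is small when the diagonal dominates and large when it does not, so minimizing it encourages each image embedding to sit closest to its own caption in the common space.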
