
AI Glossary

The Complete Artificial Intelligence Dictionary

162 Categories · 2,032 Subcategories · 23,060 Terms

Text-to-Image Synthesis

Generation of photorealistic or stylized images from textual descriptions using generative models like GANs or diffusion models. These models understand text semantics to create coherent and detailed visuals.
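
As a concrete illustration, here is a minimal sketch of text-to-image generation using the Hugging Face diffusers library; the checkpoint name and prompt are illustrative, and a CUDA GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumes a CUDA GPU and the runwayml/stable-diffusion-v1-5 checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```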

Image-to-Text Translation

Automatic conversion of visual content from images into descriptive text using vision-language models. This technology underpins applications like automatic captioning and visual accessibility.

Diffusion Models

Generative models that learn to progressively denoise data to generate high-quality samples, particularly effective for text-to-image synthesis. These models use forward and reverse diffusion processes for generation.
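
The forward (noising) process admits a closed form, which a short sketch makes concrete; the linear beta schedule and tensor shapes are illustrative assumptions:

```python
import torch

# Linear beta schedule over T steps; a common choice, not the only one.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_diffusion(x0, t, eps=None):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = torch.randn_like(x0) if eps is None else eps
    abar = alphas_cumprod[t]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps, eps

x0 = torch.randn(1, 3, 64, 64)     # stand-in for a normalized training image
xt, eps = forward_diffusion(x0, t=500)
# A denoising network is trained to predict eps from (x_t, t); sampling then
# reverses the chain step by step, starting from pure noise.
```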

Multimodal Transformers

Transformer architecture adapted to simultaneously process multiple data modalities (text, image, audio) through cross-modal attention mechanisms. These models unify the representation and processing of heterogeneous data.
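
A minimal cross-modal attention sketch in PyTorch, with text tokens as queries attending over image patches; all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 16 text tokens, 49 image patches, model width 256.
d_model = 256
text_tokens = torch.randn(1, 16, d_model)     # queries
image_patches = torch.randn(1, 49, d_model)   # keys and values

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8,
                                   batch_first=True)
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_patches,
                                 value=image_patches)
print(fused.shape)  # torch.Size([1, 16, 256]); text now carries visual context
```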

Vision-Language Models

AI models designed to understand and generate content combining visual and linguistic information, such as CLIP, BLIP, or ALIGN. They learn joint representations through pre-training on large image-text corpora, often using a ViT as the image encoder.
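
A sketch of the symmetric contrastive (InfoNCE) objective used in CLIP-style pre-training, assuming a batch of matched image-text embedding pairs:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))         # diagonal = true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```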

Multimodal Embeddings

Vector representations in a shared space where different modalities (text, image, audio) can be compared and manipulated mathematically. These embeddings enable cross-modal semantic operations like search and similarity.
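
A minimal sketch of cross-modal retrieval by cosine similarity, assuming embeddings already projected into a shared space; dimensions are illustrative:

```python
import numpy as np

def cross_modal_search(query, index, k=3):
    """Rank items from one modality by cosine similarity to a query from another."""
    q = query / np.linalg.norm(query)
    X = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = X @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Hypothetical 512-d embeddings: one text query against 1,000 image vectors.
text_query = np.random.randn(512)
image_index = np.random.randn(1000, 512)
ids, sims = cross_modal_search(text_query, image_index)
```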

Text-to-Video Generation

Generation of coherent video sequences from textual descriptions, modeling both spatial content and temporal dynamics. These models combine natural-language understanding with video generation that stays consistent from frame to frame.

Image Captioning

Automatic generation of textual descriptions depicting image content, combining computer vision and natural language processing. Modern models use CNN or ViT encoders and transformer decoders.
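
For illustration, a minimal captioning call via the Hugging Face transformers pipeline, assuming the Salesforce/blip-image-captioning-base checkpoint and a placeholder image path:

```python
from transformers import pipeline

# Assumes the Salesforce/blip-image-captioning-base checkpoint; any local
# image path or URL works as input.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))
# e.g. [{'generated_text': 'a dog running through a grassy field'}]
```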

Visual Question Answering

System that answers textual questions about image content, requiring joint understanding of vision and language. VQA combines object detection, spatial reasoning, and linguistic comprehension.
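
A minimal VQA sketch using the transformers visual-question-answering pipeline, assuming the dandelin/vilt-b32-finetuned-vqa checkpoint and a placeholder image:

```python
from transformers import pipeline

# Assumes the dandelin/vilt-b32-finetuned-vqa checkpoint.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="photo.jpg", question="How many dogs are in the picture?")
print(result[0])  # e.g. {'answer': '2', 'score': 0.84}
```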

Multimodal Fusion

Integration of information from different modalities to create a unified representation richer than each modality separately. Strategies include early fusion, late fusion, and attention-based fusion.
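
A minimal sketch contrasting early and late fusion in PyTorch (feature sizes and the 10-class head are illustrative); the cross-attention example under Multimodal Transformers shows the attention-based variant:

```python
import torch
import torch.nn as nn

text_feat = torch.randn(1, 128)    # hypothetical per-sample text features
image_feat = torch.randn(1, 256)   # hypothetical per-sample image features

# Early fusion: concatenate raw features, then predict jointly.
early_head = nn.Linear(128 + 256, 10)
early_logits = early_head(torch.cat([text_feat, image_feat], dim=-1))

# Late fusion: predict per modality, then combine the outputs.
text_head, image_head = nn.Linear(128, 10), nn.Linear(256, 10)
late_logits = (text_head(text_feat) + image_head(image_feat)) / 2
```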

Neural Style Transfer

Deep learning technique that separates and recombines the content and style of images to create digital artworks. It uses convolutional neural networks to capture stylistic and content features.
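
A sketch of the Gram-matrix style loss from the original neural style transfer formulation, applied here to stand-in CNN activations:

```python
import torch

def gram_matrix(features):
    """Channel-wise feature correlations: capture style, discard spatial layout."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(generated, style):
    return torch.mean((gram_matrix(generated) - gram_matrix(style)) ** 2)

# Stand-ins for activations from one convolutional layer of a pretrained CNN.
loss = style_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```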

Text-to-Speech Synthesis

Conversion of written text into natural human speech using deep neural networks like Tacotron or WaveNet. Modern systems generate waveforms directly or via intermediate spectrograms.

Speech-to-Text Transcription

Automatic conversion of speech into written text using end-to-end models like transformers or conformers. These systems transform audio signals into sequences of characters or words.
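
A minimal transcription sketch via the transformers pipeline, assuming the openai/whisper-tiny checkpoint and a placeholder audio file:

```python
from transformers import pipeline

# Assumes the openai/whisper-tiny checkpoint; "speech.wav" is a placeholder path.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("speech.wav")["text"])
```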

Audio-Visual Learning

Machine learning combining audio and video information simultaneously to enhance understanding of multimodal scenes. This approach exploits the natural correlation between sounds and visual events.

Multimodal Alignment

Process of learning semantic correspondences between different modalities in a common representation space. Alignment is crucial for cross-modal translation and retrieval tasks.
