Multimodal Translation - Artificial Intelligence Glossary

Text-to-Image Synthesis

Generation of photorealistic or stylized images from textual descriptions using generative models such as GANs or diffusion models. A text encoder captures the meaning of the prompt, and the generator conditions on that encoding to produce coherent, detailed visuals.

Image-to-Text Translation

Automatic conversion of visual content from images into descriptive text using vision-language models. This technology underpins applications like automatic captioning and visual accessibility.

Diffusion Models

Generative models that learn to progressively denoise data to generate high-quality samples, particularly effective for text-to-image synthesis. A forward diffusion process gradually adds noise to training data, and a learned reverse process removes that noise step by step to generate new samples.
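
As a rough illustration of the forward process, the sketch below noises a tiny vector in closed form. The linear variance schedule, the step count of 1000, and every function name here are assumptions chosen for illustration, not something the glossary specifies; the learned reverse (denoising) network is left out entirely.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for all s <= t (0-indexed)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t given x_0 in one shot; returns (x_t, noise used)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)

x0 = [1.0, -0.5, 0.25]                                # toy "clean" sample
x_early, _ = forward_noise(x0, 10, alpha_bars, rng)   # still close to x0
x_late, _ = forward_noise(x0, 999, alpha_bars, rng)   # almost pure noise
```

With this schedule the signal coefficient `sqrt(alpha_bar)` shrinks toward zero at the last step, so a reverse model would start generation from nearly pure Gaussian noise.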

Multimodal Transformers

Transformer architecture adapted to simultaneously process multiple data modalities (text, image, audio) through cross-modal attention mechanisms. These models unify the representation and processing of heterogeneous data.
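
To make cross-modal attention concrete, here is a minimal sketch in which text tokens act as queries over image patches. It assumes hand-written two-dimensional vectors and omits the learned query, key, and value projections and the multiple heads that a real transformer layer would have; all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over another modality."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

text_tokens = [[1.0, 0.0], [0.0, 1.0]]                   # toy text token vectors
image_patches = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # toy image patch vectors

attended, weights = cross_attention(text_tokens, image_patches, image_patches)
```

Each row of `weights` is a probability distribution over image patches, so the output for a text token is a blend of the patches most similar to it.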

Vision-Language Models

AI models designed to understand or generate content that combines visual and linguistic information, such as CLIP, BLIP, or ALIGN. They learn joint representations through pre-training on large image-text corpora.

Multimodal Embeddings

Vector representations in a shared space where different modalities (text, image, audio) can be compared and manipulated mathematically. These embeddings enable cross-modal semantic operations like search and similarity.
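
A minimal sketch of cross-modal search in a shared space follows. The three-dimensional vectors and file names are made up for illustration; in practice the embeddings would come from trained text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical image embeddings already projected into the shared space.
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "car_photo.jpg": [0.0, 0.2, 0.9],
    "cat_photo.jpg": [0.7, 0.6, 0.1],
}
text_query = [1.0, 0.2, 0.0]  # pretend text embedding of "a dog"

ranking = rank_by_similarity(text_query, image_embeddings)
```

Because both modalities live in the same vector space, a text query can rank images directly, with no modality-specific matching logic.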

Text-to-Video Generation

Generation of coherent video sequences from textual descriptions, modeling both spatial content and temporal dynamics. These models combine natural language understanding with video generation that keeps consecutive frames temporally consistent.

Image Captioning

Automatic generation of textual descriptions of image content, combining computer vision and natural language processing. Modern models typically pair a CNN or ViT image encoder with a transformer text decoder.

Visual Question Answering

System that answers textual questions about image content, requiring joint understanding of vision and language. VQA combines object detection, spatial reasoning, and linguistic comprehension.

Multimodal Fusion

Integration of information from different modalities to create a unified representation richer than each modality separately. Strategies include early fusion, late fusion, and attention-based fusion.
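
The difference between early and late fusion can be sketched in a few lines. The feature values, class probabilities, and the equal weighting below are illustrative assumptions; attention-based fusion is omitted because it needs learned parameters.

```python
def early_fusion(text_features, image_features):
    """Concatenate per-modality features so one joint model sees both."""
    return list(text_features) + list(image_features)

def late_fusion(text_probs, image_probs, text_weight=0.5):
    """Blend class probabilities predicted separately by each modality."""
    return [text_weight * t + (1.0 - text_weight) * i
            for t, i in zip(text_probs, image_probs)]

# Early fusion: a single 5-dimensional input for a downstream model.
fused_features = early_fusion([0.2, 0.8], [0.5, 0.1, 0.4])

# Late fusion: each modality has already produced its own prediction.
fused_probs = late_fusion([0.7, 0.3], [0.4, 0.6])
```

Early fusion lets a model learn interactions between modalities from raw features, while late fusion keeps the per-modality models independent and only merges their outputs.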

Neural Style Transfer

Deep learning technique that separates and recombines the content and style of images to create digital artworks. It uses convolutional neural networks to capture stylistic and content features.
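
One common way to represent style, assumed here rather than stated in the glossary, is the Gram matrix of a CNN layer's feature maps: it records which channels fire together while discarding where they fire. The sketch below uses hand-written activations in place of real CNN features, and the function names are hypothetical.

```python
def gram_matrix(feature_maps):
    """feature_maps: C channels, each a flat list of H*W activations."""
    n = len(feature_maps[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in feature_maps]
            for ci in feature_maps]

def style_loss(gram_a, gram_b):
    """Mean squared difference between two Gram matrices."""
    c = len(gram_a)
    return sum((gram_a[i][j] - gram_b[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)

features = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
# Same activations with spatial positions rotated by one place.
rotated = [[1.0, 1.0, 2.0, 0.0], [2.0, 0.0, 1.0, 1.0]]
# Different channel statistics, standing in for a different style.
other = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

same_style = style_loss(gram_matrix(features), gram_matrix(rotated))
diff_style = style_loss(gram_matrix(features), gram_matrix(other))
```

Moving activations to different positions leaves the Gram matrix unchanged, which is why this statistic captures texture-like style rather than spatial content.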

Text-to-Speech Synthesis

Conversion of written text into natural human speech using deep neural networks like Tacotron or WaveNet. Modern systems generate waveforms directly or via intermediate spectrograms.

Speech-to-Text Transcription

Automatic conversion of speech into written text using end-to-end models like transformers or conformers. These systems transform audio signals into sequences of characters or words.

Audio-Visual Learning

Machine learning combining audio and video information simultaneously to enhance understanding of multimodal scenes. This approach exploits the natural correlation between sounds and visual events.

Multimodal Alignment

Process of learning semantic correspondences between different modalities in a common representation space. Alignment is crucial for cross-modal translation and retrieval tasks.
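
Alignment is often trained with a contrastive objective that pulls matched pairs together and pushes mismatched pairs apart. The sketch below computes a symmetric InfoNCE-style loss on toy vectors; the choice of this loss, the temperature of 0.1, and the embeddings are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Symmetric contrastive loss; matched pairs share the same index."""
    t = [normalize(v) for v in text_embs]
    im = [normalize(v) for v in image_embs]
    logits = [[sum(a * b for a, b in zip(ti, ij)) / temperature for ij in im]
              for ti in t]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]       # pair i matches text i
misaligned_images = [[0.1, 0.9], [0.9, 0.1]]    # pairs are swapped

aligned_loss = contrastive_loss(texts, aligned_images)
misaligned_loss = contrastive_loss(texts, misaligned_images)
```

The loss is small when each text is closest to its own image and large when the pairs are swapped, which is the signal that drives the two encoders toward a common space.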
