AI Terminology
A Complete Dictionary of Artificial Intelligence
Text-to-Image Synthesis
Generation of photorealistic or stylized images from textual descriptions using generative models such as GANs or diffusion models. A text encoder conditions the generator on the prompt so that the output image is coherent, detailed, and consistent with the description.
Image-to-Text Translation
Automatic conversion of visual content from images into descriptive text using vision-language models. This technology underpins applications like automatic captioning and visual accessibility.
Diffusion Models
Generative models that learn to progressively denoise data to produce high-quality samples, particularly effective for text-to-image synthesis. A forward process gradually adds noise to training data, and a learned reverse process removes it step by step to generate new samples.
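The forward (noising) half of the process can be sketched in a few lines. This is a minimal illustration, assuming a DDPM-style linear variance schedule; the schedule values and the names `linear_betas`, `alpha_bars`, and `forward_diffuse` are hypothetical choices for this example, not part of any particular library.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the share of signal variance kept at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_diffuse(x0, t, abar, rng):
    """Draw x_t from N(sqrt(abar_t) * x0, (1 - abar_t) * I) in closed form."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

betas = linear_betas(1000)
abar = alpha_bars(betas)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                          # toy "clean" sample
x_early = forward_diffuse(x0, 10, abar, rng)   # still close to x0
x_late = forward_diffuse(x0, 999, abar, rng)   # almost pure noise
```

A trained model would learn to predict the added noise so that the reverse process can undo these steps; that part is omitted here.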
Multimodal Transformers
Transformer architecture adapted to simultaneously process multiple data modalities (text, image, audio) through cross-modal attention mechanisms. These models unify the representation and processing of heterogeneous data.
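As a rough sketch of cross-modal attention, the example below lets text token vectors (queries) attend over image patch vectors (keys and values) with scaled dot-product attention. It assumes a single head and leaves out the learned query, key, and value projections a real transformer would apply; the toy vectors and the function names are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (one modality) takes a weighted average of values (another modality)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

text_tokens = [[1.0, 0.0], [0.0, 1.0]]                 # toy text token vectors
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy image patch vectors
outputs, weights = cross_attention(text_tokens, image_patches, image_patches)
```

Each row of `weights` shows how strongly one text token draws on each image patch, which is the sense in which the two modalities are processed jointly.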
Vision-Language Models
AI models designed to understand and generate content combining visual and linguistic information, such as CLIP, BLIP, or ALIGN. They learn joint representations through pre-training on large image-text corpora.
Multimodal Embeddings
Vector representations in a shared space where different modalities (text, image, audio) can be compared and manipulated mathematically. These embeddings enable cross-modal semantic operations like search and similarity.
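Cross-modal search in a shared space can be sketched as ranking by cosine similarity. The three-dimensional vectors below are made-up stand-ins for the outputs of a real text encoder and image encoder, and the file names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up image embeddings, assumed to live in the same space as text embeddings.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}
# Stand-in for the embedding of the text query "a photo of a dog".
text_query = [0.8, 0.2, 0.1]

def rank_images(query, images):
    """Return image names sorted from most to least similar to the query."""
    return sorted(images, key=lambda name: cosine_similarity(query, images[name]), reverse=True)

ranking = rank_images(text_query, image_embeddings)
# ranking == ["dog.jpg", "cat.jpg", "car.jpg"]
```

Because text and images share one space, the same function would serve for image-to-text or image-to-image search without modification.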
Text-to-Video Generation
Generation of coherent video sequences from textual descriptions, modeling both spatial content and temporal dynamics. These models pair natural language understanding with video generation that keeps successive frames temporally consistent.
Image Captioning
Automatic generation of textual descriptions depicting image content, combining computer vision and natural language processing. Modern models use CNN or ViT encoders and transformer decoders.
Visual Question Answering
System that answers textual questions about image content, requiring joint understanding of vision and language. VQA combines object detection, spatial reasoning, and linguistic comprehension.
Multimodal Fusion
Integration of information from different modalities to create a unified representation richer than each modality separately. Strategies include early fusion, late fusion, and attention-based fusion.
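The difference between early and late fusion can be sketched with a toy linear scorer. The feature values and weights below are arbitrary illustrative numbers, and the function names are hypothetical; attention-based fusion is not shown.

```python
def linear_score(features, weights, bias=0.0):
    """Toy linear model: weighted sum of features plus a bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def early_fusion(text_feat, image_feat):
    """Early fusion: concatenate raw features so one model sees both modalities."""
    return list(text_feat) + list(image_feat)

def late_fusion(scores, modality_weights=None):
    """Late fusion: combine per-modality predictions (uniform average by default)."""
    if modality_weights is None:
        modality_weights = [1.0 / len(scores)] * len(scores)
    return sum(s * w for s, w in zip(scores, modality_weights))

text_feat = [0.2, 0.8]
image_feat = [0.5, 0.1, 0.4]

# Early fusion: a single scorer over the concatenated feature vector.
joint = early_fusion(text_feat, image_feat)
early_score = linear_score(joint, [1.0, -1.0, 0.5, 0.0, 2.0])

# Late fusion: one scorer per modality, then the scores are averaged.
text_score = linear_score(text_feat, [1.0, -1.0])
image_score = linear_score(image_feat, [0.5, 0.0, 2.0])
late_score = late_fusion([text_score, image_score])
```

Early fusion lets a model learn interactions between modalities at the feature level, while late fusion keeps the per-modality models independent and only merges their outputs.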
Neural Style Transfer
Deep learning technique that separates and recombines the content and style of images to create digital artworks. It uses convolutional neural networks to capture stylistic and content features.
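One common way to capture style, used in the Gram-matrix formulation of style transfer, is to compare channel-to-channel correlations of CNN feature maps. The sketch below assumes the feature maps have already been extracted and flattened; the tiny arrays are made-up values, and `gram_matrix` and `style_loss` are illustrative names.

```python
def gram_matrix(features):
    """features: C channels, each a flat list of N spatial activations.
    Returns the C x C matrix of channel correlations, averaged over positions."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features] for fi in features]

def style_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(generated), gram_matrix(style)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)

style_feats = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 channels, 3 positions
shuffled_feats = [[3.0, 1.0, 2.0], [6.0, 4.0, 5.0]]  # same values, positions permuted
other_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

same_style = style_loss(shuffled_feats, style_feats)  # 0.0: layout does not matter
diff_style = style_loss(other_feats, style_feats)     # positive
```

The Gram matrix discards where activations occur and keeps only which channels fire together, which is why it describes style rather than content. A content loss would instead compare the feature maps position by position.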
Text-to-Speech Synthesis
Conversion of written text into natural human speech using deep neural networks like Tacotron or WaveNet. Modern systems generate waveforms directly or via intermediate spectrograms.
Speech-to-Text Transcription
Automatic conversion of speech into written text using end-to-end models like transformers or conformers. These systems transform audio signals into sequences of characters or words.
Audio-Visual Learning
Machine learning combining audio and video information simultaneously to enhance understanding of multimodal scenes. This approach exploits the natural correlation between sounds and visual events.
Multimodal Alignment
Process of learning semantic correspondences between different modalities in a common representation space. Alignment is crucial for cross-modal translation and retrieval tasks.
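A widely used alignment objective is a symmetric contrastive loss over matched text-image pairs, in the style popularized by CLIP. The sketch below is a minimal pure-Python version; the temperature value and the toy embeddings are assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric cross-entropy where pair k (text k, image k) is the correct match."""
    t = [normalize(v) for v in text_embs]
    im = [normalize(v) for v in image_embs]
    n = len(t)
    logits = [[sum(a * b for a, b in zip(tv, iv)) / temperature for iv in im] for tv in t]
    # Text-to-image direction: each row should peak on its diagonal entry.
    t2i = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Image-to-text direction: same test on the columns.
    cols = [[logits[r][k] for r in range(n)] for k in range(n)]
    i2t = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (t2i + i2t)

texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = contrastive_loss(texts, images)           # pairs match: low loss
mismatched_loss = contrastive_loss(texts, images[::-1])  # pairs swapped: high loss
```

Minimizing this loss pulls matching pairs together and pushes non-matching pairs apart in the shared space, which is what makes cross-modal retrieval possible afterward.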