Multi-Modal Transformers - Artificial Intelligence Glossary

Multi-Modal Transformer

Transformer architecture extended to process several data modalities at once (text, image, audio), typically using cross-attention to integrate information across modalities.
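
The cross-attention step can be illustrated with a minimal sketch in which text tokens act as queries over image patches. It is not any specific model's implementation: the learned query/key/value projections and multiple heads are omitted, and the names `cross_attention`, `text_tokens`, and `image_patches` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from another (no learned projections here)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy data (assumed): 2 text tokens and 3 image patches, all of dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each text token becomes a weighted mix of the image patches it matches best.
fused = cross_attention(text_tokens, image_patches, image_patches)
```

The output has one vector per text token, so the text sequence keeps its length while absorbing visual information.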

Vision-Language Transformer

Transformer architecture specifically designed to jointly understand and generate visual and textual content, using shared or separate encoders for each modality.

Fusion Mechanism

Algorithmic strategy for effectively combining representations of different modalities at one or more levels of the network, including early, late, or hierarchical fusion.
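
The difference between early and late fusion can be sketched roughly as follows. This is a simplified illustration with hypothetical function names and made-up feature values; real systems fuse learned representations inside the network rather than plain lists.

```python
def early_fusion(text_features, image_features):
    """Early fusion: concatenate features so a single model sees both."""
    return list(text_features) + list(image_features)

def late_fusion(text_score, image_score, text_weight=0.5):
    """Late fusion: each modality is scored separately, then scores are mixed."""
    return text_weight * text_score + (1.0 - text_weight) * image_score

# Early fusion yields one joint input vector for a downstream model.
joint_input = early_fusion([0.2, 0.7], [0.9, 0.1, 0.4])

# Late fusion combines the outputs of two independent per-modality models.
final_score = late_fusion(text_score=0.8, image_score=0.4)
```

Hierarchical fusion would repeat a combination step like these at several depths of the network instead of only once.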

Modality Embedding

Learned vectors added to token embeddings to indicate each token's source modality (text, image, audio), allowing the Transformer to distinguish and process each data type differently.
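
A minimal sketch of the idea, assuming a small fixed table of modality vectors (in a real model these values would be learned during training). `MODALITY_IDS`, `MODALITY_TABLE`, and `add_modality_embedding` are hypothetical names.

```python
MODALITY_IDS = {"text": 0, "image": 1, "audio": 2}

# Hypothetical table of modality vectors; a real model learns these values.
MODALITY_TABLE = [
    [0.1, 0.0, 0.0],  # text
    [0.0, 0.1, 0.0],  # image
    [0.0, 0.0, 0.1],  # audio
]

def add_modality_embedding(token_embeddings, modality):
    """Add the same modality vector to every token of that modality."""
    type_vector = MODALITY_TABLE[MODALITY_IDS[modality]]
    return [[t + m for t, m in zip(token, type_vector)]
            for token in token_embeddings]

# Two tokens with identical content embeddings but different modalities.
text_tokens = add_modality_embedding([[1.0, 1.0, 1.0]], "text")
image_tokens = add_modality_embedding([[1.0, 1.0, 1.0]], "image")

# The Transformer receives one sequence in which the two are distinguishable.
sequence = text_tokens + image_tokens
```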

CLIP

Contrastive Language-Image Pre-training model trained on 400 million image-text pairs using a contrastive objective to learn shared representations between vision and language.
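
The contrastive objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where pair i in the batch is the only positive for row i and column i. This is a simplified illustration, not CLIP's actual code: the embeddings are toy values and the fixed `temperature=0.07` is an assumption chosen for the example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Correctly paired embeddings give a much lower loss than shuffled ones.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```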

VLP

Family of Vision-Language Pre-training models, many of which use a shared Transformer encoder for both modalities, trained with tasks such as masked language modeling and image-text matching.

Unified Encoder-Decoder

Transformer architecture where the same encoder processes all input modalities, and a decoder generates the output, enabling tasks like VQA, captioning, and retrieval with a single model.

Modality Gap

Inherent structural and semantic difference between the representation spaces of different modalities, requiring specific alignment mechanisms in multi-modal models.
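
One simple way to put a number on this gap, offered here as an assumption rather than a standard definition, is the distance between the centroids of the normalized embeddings of each modality. The function names and toy vectors below are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the two embedding sets."""
    img_center = centroid([l2_normalize(v) for v in image_embs])
    txt_center = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_center, txt_center)))

# Toy embeddings (assumed) that occupy different regions of the shared space.
gap = modality_gap([[1.0, 0.1], [1.0, -0.1]], [[0.1, 1.0], [-0.1, 1.0]])
```

A value near zero would suggest the two modalities occupy the same region of the space; a large value suggests they form separate clusters that an alignment mechanism has to bridge.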

Multi-Modal Fusion

Process of integrating features from different modalities into a unified representation, leveraging inter-modal complementarities to improve performance on complex tasks.

Cross-Modal Alignment

Training objective aimed at semantically aligning representations of different modalities in a shared space, enabling correspondence between visual and linguistic concepts.
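
Once representations are aligned in a shared space, matching visual and linguistic concepts reduces to a nearest-neighbor search. The sketch below assumes embeddings already produced by aligned encoders; the file names and vector values are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query_emb, candidates[name]),
        reverse=True,
    )

# Hypothetical image embeddings and a text query embedding in the same space.
image_embs = {"dog.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.0, 0.2, 0.9]}
text_query = [1.0, 0.0, 0.1]  # stands in for the caption "a photo of a dog"

ranking = retrieve(text_query, image_embs)
```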

Perceiver IO

General-purpose Transformer architecture that can process arbitrary combinations of modalities: cross-attention maps the inputs onto a fixed-size set of learned latents, and output queries then cross-attend to those latents to produce outputs of any size.
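
The latent bottleneck can be sketched in three attention steps. This is a heavily simplified illustration under stated assumptions: attention has no learned projections, the latent processing is a single self-attention pass, and `attend`, `perceiver_io`, and all values are hypothetical.

```python
import math

def attend(queries, context):
    """Single-head dot-product attention without learned projections."""
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * c[j] for e, c in zip(exps, context))
                    for j in range(d)])
    return out

def perceiver_io(inputs, latents, output_queries):
    latents = attend(latents, inputs)       # encode: latents read the inputs
    latents = attend(latents, latents)      # process: self-attention over latents
    return attend(output_queries, latents)  # decode: queries read the latents

# 50 input elements are compressed into 2 latents, then decoded into 4 outputs.
inputs = [[float(i % 3), float(i % 2)] for i in range(50)]
latents = [[1.0, 0.0], [0.0, 1.0]]
output_queries = [[0.5, 0.5]] * 4
outputs = perceiver_io(inputs, latents, output_queries)
```

The cost of the encode step grows with the number of inputs times the number of latents, not with the square of the input length, which is what lets the input be large and mixed across modalities.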

Flamingo Model

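Vision-language model family (largest version: 80 billion parameters) that connects a frozen pre-trained vision encoder to a frozen pre-trained language model through newly trained components, a Perceiver Resampler and tanh-gated cross-attention layers, so neither backbone needs full retraining.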
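
The attentional gating mentioned above can be sketched as a tanh gate on a residual connection. The sketch assumes a scalar gate parameter `alpha` initialized to zero, so that at the start of training the visual branch contributes nothing and the language model behaves exactly as before. Names and values are hypothetical.

```python
import math

def gated_residual(hidden, cross_attn_out, alpha):
    """Add the visual cross-attention output scaled by tanh(alpha)."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]         # language model hidden state (toy values)
visual_update = [1.0, 1.0, 1.0]   # output of a cross-attention layer (toy values)

# With alpha = 0 the gate is closed and the hidden state passes through unchanged.
at_init = gated_residual(hidden, visual_update, alpha=0.0)

# As alpha moves away from 0 during training, visual information is mixed in.
later = gated_residual(hidden, visual_update, alpha=1.0)
```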

BLIP

Bootstrapping Language-Image Pre-training framework that improves noisy web data by generating synthetic captions and filtering out mismatched image-text pairs, built on a model that can act as both a multimodal encoder and an image-grounded text decoder.

CoCa

Contrastive Captioners model combining a contrastive objective for representation learning and a generative objective for captioning in a single unified Transformer architecture.

BEiT-3

Third-generation BEiT (Bidirectional Encoder representation from Image Transformers) model using a Multiway Transformer, with shared self-attention and modality-specific feed-forward experts, to process images, text, and image-text pairs in a unified manner.

LayoutLM

Family of document pre-trained models combining 2D spatial layout, text, and visual information for understanding structured documents like forms and invoices.

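Uni-Perceiver

Universal perception framework that casts diverse uni-modal and multi-modal tasks into one formulation, finding the most likely target for each input through the similarity of their representations, so that a single Transformer model can serve tasks such as classification, retrieval, and captioning.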

GIT

Generative Image-to-text Transformer model that uses a simple architecture, one image encoder and one text decoder trained with a single language modeling objective, for image captioning and VQA; it reported state-of-the-art results on several benchmarks at the time of publication.
