Multi-Modal Transformers
Unified Encoder-Decoder
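Transformer architecture in which a single shared encoder processes all input modalities (for example, image patches and text tokens) as one joint sequence, and a decoder generates the output from the encoded representation. This lets one model handle tasks such as visual question answering (VQA), captioning, and retrieval.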
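The data flow described above can be illustrated with a minimal, untrained sketch in plain Python. Everything in it is an assumption made for illustration: the names (`encode`, `decode_step`, `params`), the sizes (`D`, `VOCAB`, `PATCH_DIM`), and the simplifications (a single attention head with no learned query/key/value projections, no positional encodings, no layer normalization, and a decoder that only cross-attends from the last generated token). It is not any particular published model.

```python
import math
import random

D, VOCAB, PATCH_DIM = 8, 10, 6  # hypothetical sizes, chosen only for the demo

def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, memory):
    """Single-head scaled dot-product attention; memory serves as keys and values."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(len(q))])
    return out

def encode(image_patches, text_ids, params):
    """Shared encoder: both modalities become tokens of width D in one sequence."""
    img = [[a + b for a, b in zip(linear(p, params["patch_proj"]), params["type_emb"][0])]
           for p in image_patches]
    txt = [[a + b for a, b in zip(params["tok_emb"][t], params["type_emb"][1])]
           for t in text_ids]
    joint = img + txt                   # one sequence containing all modalities
    attended = attention(joint, joint)  # self-attention mixes image and text tokens
    return [[x + y for x, y in zip(tok, att)] for tok, att in zip(joint, attended)]

def decode_step(memory, prev_ids, params):
    """Simplified decoder step: cross-attend to the encoder output, pick a token greedily."""
    query = [params["tok_emb"][prev_ids[-1]]]
    context = attention(query, memory)[0]
    logits = linear(context, params["out_proj"])
    return max(range(len(logits)), key=logits.__getitem__)

rng = random.Random(0)
params = {
    "patch_proj": rand_matrix(PATCH_DIM, D, rng),  # image patch -> shared width
    "tok_emb": rand_matrix(VOCAB, D, rng),         # text token id -> shared width
    "type_emb": rand_matrix(2, D, rng),            # row 0 = image, row 1 = text
    "out_proj": rand_matrix(D, VOCAB, rng),        # decoder state -> vocabulary logits
}
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(4)]  # 4 fake patches
question_ids = [1, 2, 3]                                               # 3 fake text tokens

memory = encode(patches, question_ids, params)   # 7 encoded tokens of width D
next_id = decode_step(memory, [0], params)       # token id 0 stands in for a start token
```

The point of the sketch is the shape of the computation: image and text inputs are projected to the same width, tagged with a modality-type embedding, concatenated, and passed through the same encoder, after which the decoder reads only the encoder output. With random weights the predicted token is meaningless.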