Multimodal QA
Vision-Language Transformer (VLT)
A Transformer architecture pre-trained on large corpora of paired images and text, designed for multimodal understanding and generation tasks.