Multi-Modal Transformers
Vision-Language Transformer
A Transformer architecture designed to jointly understand and generate visual and textual content, using either a single encoder shared across both modalities or a separate encoder for each.
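To make the shared-versus-separate distinction concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not any particular model's implementation: attention uses a single head with identity projections (real models learn query, key, and value weights), the embeddings are random placeholders, and the names `shared_encoder` and `separate_encoders` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustrative only)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def shared_encoder(image_tokens, text_tokens):
    """Shared encoder: both modalities form one sequence, so every token attends to all others."""
    return self_attention(image_tokens + text_tokens)


def separate_encoders(image_tokens, text_tokens):
    """Separate encoders: each modality attends only within itself; fusion would come later."""
    return self_attention(image_tokens), self_attention(text_tokens)


# Placeholder embeddings (assumed sizes): 3 image patches and 2 text tokens, 4 dimensions each.
rng = random.Random(0)
d_model = 4
image_tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
text_tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(2)]

joint = shared_encoder(image_tokens, text_tokens)
img_out, txt_out = separate_encoders(image_tokens, text_tokens)
```

In the shared variant, image and text tokens can influence each other from the first layer onward. In the separate variant they stay independent until a later fusion step, such as cross-attention or a similarity score between pooled outputs, which this sketch leaves out.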