Multi-Modal Transformers
CLIP
CLIP (Contrastive Language-Image Pre-training) is a model trained on 400 million image-text pairs. Its contrastive objective pulls the embeddings of matching image-text pairs together and pushes non-matching pairs apart, yielding a shared representation space for vision and language.
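The objective above can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal NumPy illustration, not CLIP's actual implementation; the embedding shapes, the `temperature` value, and the function name are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each is a matching pair.
    Sketch only -- shapes and temperature are illustrative assumptions.
    """
    # L2-normalize so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) scores image i against text j.
    logits = img @ txt.T / temperature

    # Cross-entropy with the diagonal (the matching pairs) as targets.
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss drives each image embedding toward its own caption's embedding and away from every other caption in the batch, which is what produces the shared space described above.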