Multimodal Translation
Vision-Language Models
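AI models designed to understand and generate content that combines visual and linguistic information, such as BLIP or ALIGN. They typically pair an image encoder (for example ViT, which on its own is a vision-only model) with a text encoder or decoder, and learn joint representations through pre-training on large image-text corpora.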
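One common way to learn such joint representations is a contrastive image-text objective, the kind of loss ALIGN is trained with. The sketch below is a minimal pure-Python illustration, not any model's actual implementation: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all assumptions chosen for readability.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-caption pair; every other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # rows are texts
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch (hypothetical embeddings): matched pairs vs. swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image embedding is closest to its own caption and high when the captions are swapped, which is the pressure that pulls the two modalities into a shared embedding space during pre-training.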