Multimodal Translation
Image Captioning
Automatic generation of natural-language descriptions of an image's content, combining computer vision and natural language processing. Modern models typically pair a CNN or Vision Transformer (ViT) image encoder with a transformer decoder that generates the caption one token at a time.
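The encoder-decoder flow described above can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: `encode_image` is a toy stand-in for a CNN or ViT encoder (it only averages pixel patches), `toy_decoder` is a hypothetical rule-based scorer standing in for a trained transformer decoder, and the five-word vocabulary is invented. Only the overall shape, encoding the image once and then greedily decoding tokens until an end marker, reflects how such models are commonly run.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end-of-caption markers


def encode_image(image: Sequence[Sequence[float]], patch: int = 2) -> List[float]:
    """Toy encoder: one feature per patch x patch tile (its mean intensity).

    A real CNN/ViT encoder would produce learned feature vectors instead.
    """
    feats = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tile = [image[i][j]
                    for i in range(r, min(r + patch, len(image)))
                    for j in range(c, min(c + patch, len(image[0])))]
            feats.append(sum(tile) / len(tile))
    return feats


def toy_decoder(features: List[float], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical scorer: returns a score for each candidate next token.

    A trained transformer decoder would attend to the image features and
    the prefix; this stub just follows a fixed script based on brightness.
    """
    brightness = sum(features) / len(features)
    script = ["a", "bright" if brightness > 0.5 else "dark", "image", EOS]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    vocab = ["a", "bright", "dark", "image", EOS]
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}


def greedy_caption(
    features: List[float],
    next_token_scores: Callable[[List[float], List[str]], Dict[str, float]],
    max_len: int = 10,
) -> List[str]:
    """Greedy autoregressive decoding: pick the top-scoring token each step."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker


if __name__ == "__main__":
    image = [[1.0, 1.0], [1.0, 1.0]]  # a 2x2 all-white "image"
    print(" ".join(greedy_caption(encode_image(image), toy_decoder)))
```

Greedy decoding is used here only because it is the simplest strategy to show; practical systems often use beam search or sampling instead.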