Multi-Modal Transformers
BLIP
Bootstrapping Language-Image Pre-training: a vision-language pre-training framework that generates synthetic captions for web images and filters out noisy image-text pairs to improve data quality, using a multimodal encoder and an image-grounded text decoder.
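The caption-then-filter idea can be illustrated with a minimal sketch. Everything named here (`Pair`, `bootstrap_captions`, `toy_captioner`, `toy_match_score`, the 0.5 threshold) is hypothetical and chosen for illustration; in BLIP itself the captioner and the filter are learned models, which this toy replaces with simple word matching. The sketch also assumes the filter is applied to both the original web caption and the synthetic one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    image_id: str
    caption: str
    source: str  # "web" or "synthetic"


def bootstrap_captions(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cutoff, not a value from BLIP
) -> List[Pair]:
    """Generate one synthetic caption per image, then keep every caption
    (web or synthetic) whose image-text match score passes the threshold."""
    kept = []
    for pair in web_pairs:
        synthetic = Pair(pair.image_id, captioner(pair.image_id), "synthetic")
        for candidate in (pair, synthetic):
            if match_score(candidate.image_id, candidate.caption) >= threshold:
                kept.append(candidate)
    return kept


# Toy stand-ins: each "image" is just the set of objects it contains.
TOY_CONTENT = {"img1": {"dog", "park"}, "img2": {"cat", "sofa"}}


def toy_captioner(image_id: str) -> str:
    # Stand-in for a learned captioner: lists the objects in the image.
    return "a " + " and a ".join(sorted(TOY_CONTENT[image_id]))


def toy_match_score(image_id: str, caption: str) -> float:
    # Stand-in for a learned filter: fraction of objects named in the caption.
    words = set(caption.lower().split())
    objects = TOY_CONTENT[image_id]
    return len(words & objects) / len(objects)


web_pairs = [
    Pair("img1", "dog playing in the park", "web"),
    Pair("img2", "best deals click here", "web"),  # noisy web caption
]

for p in bootstrap_captions(web_pairs, toy_captioner, toy_match_score):
    print(p.image_id, p.source, repr(p.caption))
```

In this toy run the noisy web caption for `img2` is discarded, while its synthetic caption survives; the cleaned set would then serve as training data for the next round of pre-training.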