Multimodal Models
Visual Tokenization
Technique that cuts an image into a sequence of patches or discrete tokens, often through a neural network like a Vision Transformer (ViT), to make it compatible with textual transformer architecture.
← Zurück