Solutions
Transformer-based image modelling refers to the use of transformer architectures, originally developed for natural language processing (NLP), to process and generate images. Unlike the convolutional neural networks (CNNs) that have dominated computer vision for the past decade, transformer-based models use the self-attention mechanism to capture global relationships within an image.
At a high level, transformer-based image models split an image into a sequence of patches and then apply a transformer network to that sequence. Through self-attention, each patch can attend to, and be influenced by, every other patch in the image, so the model learns how the patches relate to one another. CNNs, in contrast, process images through a series of local convolution operations. They are adept at extracting low-level visual features, but because each convolution sees only a small neighbourhood, they can struggle to model long-range dependencies in an image. Transformers are well suited to capturing these global relationships.
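The patch-and-attend idea can be illustrated with a minimal NumPy sketch. It is an assumption-laden toy, not a reference implementation: the function names (image_to_patches, self_attention), the 32x32 image, the patch size of 8, the embedding width of 64, and the random weights are all hypothetical choices made for illustration. It uses a single attention head and omits positional embeddings, multiple layers, and training.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # Separate the patch grid axes from the within-patch axes, then group them.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # stand-in for a real image
patches = image_to_patches(image, patch_size=8)  # 16 patches, 192 values each

d_model = 64                                     # hypothetical embedding width
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
tokens = patches @ w_embed                       # linear patch embedding

# Random (untrained) projection matrices, for illustration only.
w_q, w_k, w_v = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
)
out, weights = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, out.shape, weights.shape)
```

Each row of weights holds one patch's attention distribution over all 16 patches, so even a single layer lets any patch draw on information from anywhere in the image. A convolution with a small kernel, in contrast, would need many stacked layers to connect distant regions.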