The Pixel Transformer is an approach to computer vision that departs from the usual practice of using image patches as input tokens and instead treats each pixel as an individual token. Because no pixels are grouped into patches, the model does not depend on the locality bias that patch-based tokenization builds in, which opens up new possibilities for vision transformers.
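
To make the difference between the two tokenizations concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the helper names `patch_tokens` and `pixel_tokens`, the 32×32 RGB image, and the patch size of 16 are all assumptions chosen for illustration. It covers only how an image is cut into tokens, not the transformer itself.

```python
import numpy as np


def patch_tokens(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper: returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Separate the grid axes (gh, gw) from the within-patch axes.
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch_size, patch_size, C)
    return x.reshape(gh * gw, patch_size * patch_size * c)


def pixel_tokens(image: np.ndarray) -> np.ndarray:
    """Treat every pixel as its own token: (H * W, C)."""
    h, w, c = image.shape
    return image.reshape(h * w, c)


# Assumed toy input: a 32x32 RGB image.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

patches = patch_tokens(image, patch_size=16)  # 4 tokens of 768 values each
pixels = pixel_tokens(image)                  # 1024 tokens of 3 values each
```

With patches, neighbouring pixels are bundled into one token before the model sees them; with pixels, each token carries only its own channel values, so any notion of neighbourhood has to be learned. The trade-off visible in this sketch is sequence length: the same image yields 4 tokens at patch size 16 but 1024 tokens when every pixel is a token.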
