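The Pixel Transformer (PiT) is an approach to computer vision that removes the inductive biases remaining in patch-based Transformers, most notably the locality imposed by grouping neighboring pixels into patches. PiT instead treats each individual pixel as a token and uses learned position embeddings, so the model is given no built-in notion of the 2D image grid. With this design, PiT outperforms conventional patch-based methods such as ViT and is more versatile.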
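To make the tokenization step concrete, here is a minimal sketch of turning an image into per-pixel tokens with position embeddings added. It is an illustration under stated assumptions, not the PiT implementation: the function name `pixels_to_tokens`, the sizes `H`, `W`, `C`, `D`, and the random initial values are all hypothetical. In a real model, `proj` and `pos_embed` would be parameters learned during training, and the resulting tokens would be fed to a standard Transformer encoder, which is omitted here.

```python
import random


def pixels_to_tokens(image, proj, pos_embed):
    """Map an H x W x C image (nested lists) to H*W token vectors of width D."""
    # Flatten the grid in row-major order: each pixel becomes one token.
    pixels = [pixel for row in image for pixel in row]
    tokens = []
    for index, pixel in enumerate(pixels):
        # Linear projection of the C channel values to D dimensions.
        projected = [
            sum(pixel[c] * proj[c][d] for c in range(len(pixel)))
            for d in range(len(proj[0]))
        ]
        # Add the position embedding for this token index. The embedding is
        # a free vector per position; nothing here encodes 2D neighborhood.
        tokens.append([p + e for p, e in zip(projected, pos_embed[index])])
    return tokens


# Assumed toy sizes: a 4x4 RGB image embedded into 8 dimensions.
random.seed(0)
H, W, C, D = 4, 4, 3, 8
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

# Random values stand in for parameters that would be learned in training.
proj = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(C)]
pos_embed = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(H * W)]

tokens = pixels_to_tokens(image, proj, pos_embed)
print(len(tokens), len(tokens[0]))  # prints: 16 8
```

The sequence length equals the number of pixels (`H * W`), whereas a patch-based tokenizer would produce one token per patch. Since self-attention cost grows quadratically with sequence length, per-pixel tokens are considerably more expensive to process than patches at the same resolution.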