The article discusses the advancements in computer vision, specifically in the use of advanced neural network architectures such as Transformers and Convolutional Neural Networks (CNNs). These models have improved the performance of visual recognition tasks, but their quadratic complexity poses challenges in handling long sequences. To address this issue, researchers have developed token mixers and RNN-like models, such as MambaOut, which leverage structured state space models (SSM) for improved efficiency and performance in vision tasks.