Object detection and recognition are essential tasks in computer vision, with applications in search and rescue, warehouse logistics, video surveillance, and more. Over the last decade, deep-learning models have been widely adopted because they outperform classical methods. These models employ convolutional neural networks (CNNs) and fall into two types: two-stage detectors, which prioritize accuracy at the cost of a higher inference time, and one-stage (single-shot) detectors, which prioritize low inference time for real-time applications. Recently, Vision Transformers (ViTs) have also been applied to object detection and recognition tasks, using CNNs as a backbone for feature extraction.
