This article discusses why understanding human activity and the surrounding scene is important to the development of service robots. It explores how deep learning techniques such as vision transformers and graph models can improve robot performance on tasks including image recognition, object detection, and segmentation. The article also calls for original research contributions that advance the theory and algorithmic design of vision transformers and graph models for 3D robot action and activity recognition.
