Google has developed a transformer model to bridge the gap between the limitations of robotics and the need for general purpose robots. The challenge is that it is difficult to train a computer vision system to recognize each kind of object and scenario, and provide it with a precise list of instructions to execute when that object or situation is recognized. High-level reasoning models like ChatGPT are not aligned with the low-level robotics systems, making it difficult for robots to understand the difference between a bag of chips and a piece of trash which used to be a bag of chips.
