Knowledge distillation is a model compression technique in machine learning that transfers knowledge from a large deep learning model to a smaller, more efficient one. The goal of knowledge distillation is to reduce the memory footprint, compute requirements, and energy costs of a large model so that it can be deployed in resource-constrained environments without significantly sacrificing performance. The process is sometimes referred to as teacher/student learning, where the large model is the teacher and the small model is the student. Attention transfer is one related technique, in which the student model is trained to mimic the attention maps generated by the teacher model.
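As a minimal sketch of how the teacher/student objective is often set up, the snippet below assumes a PyTorch environment and uses hypothetical names (`distillation_loss`, `temperature`, `alpha`) chosen for illustration. It combines a soft-target loss, where the student matches the teacher's temperature-softened output distribution, with the usual cross-entropy on the ground-truth labels; it is one common formulation, not the only one.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target (teacher-mimicking) loss with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then compare them with
    # KL divergence; scaling by T^2 keeps gradient magnitudes comparable.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    # Random tensors stand in for real teacher/student model outputs.
    batch, num_classes = 8, 10
    teacher_logits = torch.randn(batch, num_classes)
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    labels = torch.randint(0, num_classes, (batch,))

    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()  # gradients flow only into the student
    print(loss.item())
```

In practice the teacher's logits are computed under `torch.no_grad()` so only the student is updated, and the temperature and mixing weight `alpha` are tuned per task.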