A team of researchers has reproduced OpenAI’s Reinforcement Learning from Human Feedback (RLHF) pipeline, which trains a model to produce outputs that humans prefer. They documented over 20 key implementation details and trained with a unified learning rate. To fit the models into GPU memory, they used the transformers library together with DeepSpeed’s ZeRO Stage 2.
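A central step in an RLHF pipeline like this is training a reward model on human preference pairs, typically with a pairwise (Bradley–Terry) loss: the model is penalized when the response humans preferred does not score higher than the rejected one. The snippet below is a minimal sketch of that loss on scalar rewards, assuming this standard formulation; it is illustrative, not the team’s actual code.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the chosen response's reward exceeds the
    rejected response's reward, pushing the reward model to rank
    human-preferred outputs higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); written in a numerically
    # stable form that avoids overflow for large negative margins.
    return max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
```

When the reward model ranks the pair correctly with a wide margin the loss approaches zero; when the two rewards are equal it equals log 2, and it grows roughly linearly as the ranking is inverted.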