
Direct Preference Optimization (DPO) is an algorithm that fine-tunes Language Models (LMs) on human preference data using a simple classification loss, eliminating the need to sample from the LM during fine-tuning or perform significant hyperparameter tuning. It is an alternative to the standard RLHF pipeline, which consists of three phases: Supervised Fine-Tuning (SFT), preference sampling and reward learning, and RL fine-tuning. DPO collapses the last two phases into a single classification objective on preference pairs, so no separate reward model or RL loop is required.
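The classification loss at the heart of DPO can be sketched for a single preference pair. The snippet below is a minimal illustration, not a full training loop: it assumes the per-sequence log-probabilities under the current policy and the frozen reference model have already been computed, and the function name and argument names are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the total log-probability of a response sequence
    under the policy being trained or the frozen reference model.
    beta controls how far the policy may deviate from the reference.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Binary-classification loss: -log sigmoid(logits).
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy matches the reference, the margin is zero and the
# loss equals log(2); it shrinks as the policy favours the chosen response.
```

In practice the log-probabilities come from summing token-level log-softmax scores over each response, and the loss is averaged over a batch of preference pairs before a standard gradient step.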