Direct Preference Optimization (DPO)#
Last updated: Apr 22, 2026
Overview#
Direct Preference Optimization (DPO) is an offline alignment algorithm that directly optimizes a language model on human preference data (chosen / rejected pairs), without a separate reward model or online RL rollouts.
Compared to RLHF (PPO), DPO is simpler (no reward model, no value network, no online generation), more stable (a single supervised-style loss), and more efficient (per batch: one forward pass each through the policy and the frozen reference, plus a single backward pass through the policy). AReaL implements DPO on top of FSDP2 with reference-model colocation.
Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., NeurIPS 2023)
Core Idea#
The DPO Objective#
Given a preference dataset \(\mathcal{D} = \{(x, y_w, y_l)\}\) where \(y_w\) is the chosen response and \(y_l\) is the rejected one, DPO optimizes:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]

Here \(\pi_\theta\) is the policy under training, \(\pi_{\text{ref}}\) the frozen reference, \(\sigma\) the logistic sigmoid, and \(\beta\) the coefficient controlling the KL penalty. The objective is derived by substituting the closed-form optimal policy of KL-regularized RLHF into the Bradley-Terry preference model: the reward is implicitly defined by the policy and reference, eliminating the need for a standalone reward model.
AReaL supports two loss variants via loss_type: the original sigmoid form (default) and IPO (Azar et al. 2023), which replaces the sigmoid with a squared loss targeting a fixed margin of \(\frac{1}{2\beta}\) per token. The IPO variant normalizes logratios by completion length (per-token average) before computing the squared loss, matching TRL’s author-confirmed convention.
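Both variants act on the same chosen-minus-rejected log-ratio difference and differ only in the final loss applied to it. A minimal PyTorch-style sketch (illustrative only, not AReaL's actual code; the function and argument names are made up):

```python
import torch
import torch.nn.functional as F

def dpo_pair_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.1,
                  loss_type: str = "sigmoid") -> torch.Tensor:
    # Log-ratios of the policy vs. the frozen reference, one value per preference pair.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = chosen_logratios - rejected_logratios

    if loss_type == "sigmoid":
        # Original DPO objective: -log sigmoid(beta * (chosen - rejected log-ratio)).
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "ipo":
        # IPO: squared loss toward the fixed margin 1/(2*beta); the log-probs passed in
        # are assumed to already be averaged over completion tokens.
        losses = (logits - 1 / (2 * beta)) ** 2
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    return losses.mean()
```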
Implicit Reward#
During training we monitor the implicit reward \(r(x, y) = \beta (\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x))\). A positive reward margin \(r(x, y_w) - r(x, y_l) > 0\) indicates the model correctly prefers the chosen response; reward accuracy is the fraction of pairs with positive margin.
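A sketch of those monitoring quantities, computed from the same per-sequence log-probs as the loss (the dictionary keys mirror the dpo/ metric names logged during training; the function itself is illustrative):

```python
import torch

def implicit_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are monitoring-only quantities, so detach them from the graph.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    margin = chosen_reward - rejected_reward
    return {
        "dpo/chosen_reward": chosen_reward.mean().item(),
        "dpo/rejected_reward": rejected_reward.mean().item(),
        "dpo/reward_margin": margin.mean().item(),
        "dpo/reward_accuracy": (margin > 0).float().mean().item(),
    }
```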
Running the Example#
Single-Node (HH-RLHF)#
python3 examples/alignment/hhrlhf_dpo.py \
--config examples/alignment/hhrlhf_dpo.yaml \
scheduler.type=local
Key fragments of examples/alignment/hhrlhf_dpo.yaml:
actor:
  backend: "fsdp:d8p1t1"
  path: Qwen/Qwen2.5-7B     # Follows the original paper: train on a base model
  beta: 0.1                 # KL penalty
  dtype: bfloat16
  disable_dropout: true     # Required for DPO stability
  mb_spec:
    granularity: 2          # Must be 2: chosen + rejected dispatched as pairs
  optimizer:
    lr: 5e-6
    lr_scheduler_type: cosine
    warmup_steps_proportion: 0.1
ref:
  backend: ${actor.backend}
  path: ${actor.path}
  optimizer: null           # Frozen
  scheduling_strategy:
    type: colocation
    target: actor           # Share GPUs with actor
train_dataset:
  batch_size: 8
  path: Anthropic/hh-rlhf
  type: dpo
  max_length: 2048
get_hhrlhf_dpo_dataset (areal/dataset/hhrlhf.py) tokenizes raw chosen/rejected text directly and infers the prompt boundary as the longest common token prefix. HH-RLHF pairs share the same multi-turn prompt and differ only in the final assistant reply, so the common prefix is exactly the prompt.
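A sketch of that boundary inference under the stated assumption (shared multi-turn prompt, differing final reply); common_prefix_len is an illustrative helper, not the actual function in areal/dataset/hhrlhf.py:

```python
def common_prefix_len(chosen_ids: list[int], rejected_ids: list[int]) -> int:
    """Length of the longest common token prefix of the two tokenized conversations."""
    n = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        n += 1
    return n

# Tokens before the boundary form the shared prompt (masked out of the loss);
# tokens after it are the chosen / rejected completions.
chosen_ids = [101, 57, 903, 42, 7]     # prompt tokens + chosen reply
rejected_ids = [101, 57, 903, 15, 66]  # same prompt tokens + rejected reply
assert common_prefix_len(chosen_ids, rejected_ids) == 3
```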
Multi-Node (Ray)#
python3 examples/alignment/hhrlhf_dpo.py \
--config examples/alignment/hhrlhf_dpo.yaml \
cluster.n_nodes=2 cluster.n_gpus_per_node=8 \
cluster.fileroot=/path/to/nfs \
scheduler.type=ray
Key Parameters#
| Parameter | Default | Description |
|---|---|---|
| beta | 0.1 | KL penalty. Higher values stay closer to the reference. Typical range: 0.05–0.5. |
| loss_type | sigmoid | Loss variant: sigmoid (original DPO) or ipo. |
| optimizer.lr | 5e-6 | Learning rate. DPO is LR-sensitive; 5e-7 – 5e-6 is the sweet spot. |
| disable_dropout | true | Disable dropout for deterministic log-prob computation. |
| mb_spec.granularity | 2 | Micro-batch granularity. Must be 2 for DPO (chosen + rejected are paired). |
| ref | — | Reference model configuration (required). |
Training metrics are logged under the dpo/ prefix: dpo/loss, dpo/chosen_reward, dpo/rejected_reward, dpo/reward_accuracy, and dpo/reward_margin.
References#
Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
Azar et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback. arXiv:2310.12036