Direct Preference Optimization (DPO)#
Last updated: Apr 22, 2026
Overview#
Direct Preference Optimization (DPO) is an offline alignment algorithm that directly optimizes a language model on human preference data (chosen / rejected pairs), without a separate reward model or online RL rollouts.
Compared to RLHF (PPO), DPO is simpler (no reward model, no value network, no online generation), more stable (a single supervised-style loss), and more efficient (per batch: one forward pass each through the policy and the frozen reference, plus a single backward pass through the policy). AReaL implements DPO on top of FSDP2 with reference-model colocation.
Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., NeurIPS 2023)
Core Idea#
The DPO Objective#
Given a preference dataset \(\mathcal{D} = \{(x, y_w, y_l)\}\) where \(y_w\) is the chosen response and \(y_l\) is the rejected one, DPO optimizes:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]

Here \(\pi_\theta\) is the policy under training, \(\pi_{\text{ref}}\) the frozen reference, \(\sigma\) the logistic sigmoid, and \(\beta\) the coefficient controlling the KL penalty. The objective is derived by substituting the closed-form optimal policy of KL-regularized RLHF into the Bradley-Terry preference model: the reward is implicitly defined by the policy and reference, eliminating the need for a standalone reward model.
AReaL supports two loss variants via loss_type: the original sigmoid form (default) and IPO (Azar et al. 2023), which replaces the sigmoid with a squared loss targeting a fixed margin of \(\frac{1}{2\beta}\) per token. The IPO variant normalizes logratios by completion length (per-token average) before computing the squared loss, matching TRL’s author-confirmed convention.
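Both variants act on the same chosen-minus-rejected log-ratio difference and differ only in the final loss applied to it. A minimal PyTorch-style sketch (illustrative only, not AReaL's actual code; the function and argument names are made up):

```python
import torch
import torch.nn.functional as F

def dpo_pair_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.1,
                  loss_type: str = "sigmoid") -> torch.Tensor:
    # Log-ratios of the policy vs. the frozen reference, one value per preference pair.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = chosen_logratios - rejected_logratios

    if loss_type == "sigmoid":
        # Original DPO objective: -log sigmoid(beta * (chosen - rejected log-ratio)).
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "ipo":
        # IPO: squared loss toward the fixed margin 1/(2*beta); the log-probs passed in
        # are assumed to already be averaged over completion tokens.
        losses = (logits - 1 / (2 * beta)) ** 2
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    return losses.mean()
```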
Implicit Reward#
During training we monitor the implicit reward \(r(x, y) = \beta (\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x))\). A positive reward margin \(r(x, y_w) - r(x, y_l) > 0\) indicates the model correctly prefers the chosen response; reward accuracy is the fraction of pairs with positive margin.
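A sketch of those monitoring quantities, computed from the same per-sequence log-probs as the loss (the dictionary keys mirror the dpo/ metric names logged during training; the function itself is illustrative):

```python
import torch

def implicit_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are monitoring-only quantities, so detach them from the graph.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    margin = chosen_reward - rejected_reward
    return {
        "dpo/chosen_reward": chosen_reward.mean().item(),
        "dpo/rejected_reward": rejected_reward.mean().item(),
        "dpo/reward_margin": margin.mean().item(),
        "dpo/reward_accuracy": (margin > 0).float().mean().item(),
    }
```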
Running the Example#
Single-Node (HH-RLHF)#
python3 examples/alignment/hhrlhf_dpo.py \
--config examples/alignment/hhrlhf_dpo.yaml \
scheduler.type=local
Key fragments of examples/alignment/hhrlhf_dpo.yaml:
actor:
  backend: "fsdp:d8p1t1"
  path: Qwen/Qwen2.5-7B     # Follows the original paper: train on a base model
  beta: 0.1                 # KL penalty
  dtype: bfloat16
  disable_dropout: true     # Required for DPO stability
  mb_spec:
    granularity: 2          # Must be 2: chosen + rejected dispatched as pairs
  optimizer:
    lr: 5e-6
    lr_scheduler_type: cosine
    warmup_steps_proportion: 0.1
ref:
  backend: ${actor.backend}
  path: ${actor.path}
  optimizer: null           # Frozen
  scheduling_strategy:
    type: colocation
    target: actor           # Share GPUs with actor
train_dataset:
  batch_size: 8
  path: Anthropic/hh-rlhf
  type: dpo
  max_length: 2048
get_hhrlhf_dpo_dataset (areal/dataset/hhrlhf.py) tokenizes raw chosen/rejected text directly and infers the prompt boundary as the longest common token prefix. HH-RLHF pairs share the same multi-turn prompt and differ only in the final assistant reply, so the common prefix is exactly the prompt.
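A sketch of that boundary inference under the stated assumption (shared multi-turn prompt, differing final reply); common_prefix_len is an illustrative helper, not the actual function in areal/dataset/hhrlhf.py:

```python
def common_prefix_len(chosen_ids: list[int], rejected_ids: list[int]) -> int:
    """Length of the longest common token prefix of the two tokenized conversations."""
    n = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        n += 1
    return n

# Tokens before the boundary form the shared prompt (masked out of the loss);
# tokens after it are the chosen / rejected completions.
chosen_ids = [101, 57, 903, 42, 7]     # prompt tokens + chosen reply
rejected_ids = [101, 57, 903, 15, 66]  # same prompt tokens + rejected reply
assert common_prefix_len(chosen_ids, rejected_ids) == 3
```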
Multi-Node (Ray)#
python3 examples/alignment/hhrlhf_dpo.py \
--config examples/alignment/hhrlhf_dpo.yaml \
cluster.n_nodes=2 cluster.n_gpus_per_node=8 \
cluster.fileroot=/path/to/nfs \
scheduler.type=ray
Key Parameters#
| Parameter | Default | Description |
|---|---|---|
| beta | 0.1 | KL penalty. Higher values stay closer to the reference. Typical range: 0.05–0.5. |
| loss_type | sigmoid | Loss variant: sigmoid (original DPO) or ipo. |
| optimizer.lr | 5e-6 | Learning rate. DPO is LR-sensitive; 5e-7 – 5e-6 is the sweet spot. |
| disable_dropout | true | Disable dropout for deterministic log-prob computation. |
| mb_spec.granularity | 2 | Micro-batch granularity. Must be 2 for DPO (chosen + rejected are paired). |
| ref | — | Reference model configuration (required). |
Training metrics are logged under the dpo/ prefix: dpo/loss, dpo/chosen_reward, dpo/rejected_reward, dpo/reward_accuracy, and dpo/reward_margin.
References#
Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290
Azar et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback. arXiv:2310.12036