Direct Preference Optimization (DPO)#

Last updated: Apr 22, 2026

Overview#

Direct Preference Optimization (DPO) is an offline alignment algorithm that directly optimizes a language model on human preference data (chosen / rejected pairs), without a separate reward model or online RL rollouts.

Compared to RLHF (PPO), DPO is simpler (no reward model, no value network, no online generation), more stable (a single supervised-style loss), and more efficient (only two forward passes per batch, one for the policy and one for the reference, plus a single backward pass). AReaL implements DPO on top of FSDP2 with reference-model colocation.
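
A minimal sketch of one update step illustrating this two-forward, one-backward structure; policy, ref, the batch interface, and dpo_loss are placeholders (the loss itself is sketched under "The DPO Objective" below), not AReaL APIs:

import torch

def dpo_train_step(policy, ref, optimizer, batch, beta=0.1):
    # Forward pass 1: chosen/rejected log-probs under the trainable policy.
    policy_chosen, policy_rejected = policy(batch)
    # Forward pass 2: the frozen reference, with no gradients tracked.
    with torch.no_grad():
        ref_chosen, ref_rejected = ref(batch)
    # One backward pass on the supervised-style DPO loss.
    loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()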

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., NeurIPS 2023)

Core Idea#

The DPO Objective#

Given a preference dataset \(\mathcal{D} = \{(x, y_w, y_l)\}\) where \(y_w\) is the chosen response and \(y_l\) is the rejected one, DPO optimizes:

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\!\left(\beta \left( \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right)\right)\right] \]

\(\pi_\theta\) is the policy under training, \(\pi_{\text{ref}}\) the frozen reference, and \(\beta\) controls the KL penalty. The objective is derived by substituting the closed-form optimal policy of KL-regularized RLHF into the Bradley-Terry preference model — the reward is implicitly defined by the policy and reference, eliminating the need for a standalone reward model.
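
An illustrative sketch of the sigmoid loss, assuming batched sequence-level log-probabilities (summed over completion tokens) are already available as torch tensors; this is a pedagogical rendering of the formula, not AReaL's internal code:

import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Policy-vs-reference log-ratios for the chosen and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -E[log sigmoid(beta * (chosen logratio - rejected logratio))]
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()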

AReaL supports two loss variants via loss_type: the original sigmoid form (default) and IPO (Azar et al. 2023), which replaces the sigmoid with a squared loss that pushes the logratio margin toward a fixed target of \(\frac{1}{2\beta}\). The IPO variant normalizes logratios by completion length (a per-token average) before computing the squared loss, matching TRL's author-confirmed convention.
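
A corresponding sketch of the "ipo" variant under the same conventions, assuming the log-probabilities passed in are already length-normalized (summed log-probs divided by the number of completion tokens):

def dpo_ipo_loss(policy_chosen_logps_avg, policy_rejected_logps_avg,
                 ref_chosen_logps_avg, ref_rejected_logps_avg, beta=0.1):
    # Log-ratios computed from per-token-averaged log-probabilities.
    chosen_logratios = policy_chosen_logps_avg - ref_chosen_logps_avg
    rejected_logratios = policy_rejected_logps_avg - ref_rejected_logps_avg
    # Squared loss pulling the margin toward the fixed target 1 / (2 * beta).
    margin = chosen_logratios - rejected_logratios
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()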

Implicit Reward#

During training we monitor the implicit reward \(r(x, y) = \beta (\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x))\). A positive reward margin \(r(x, y_w) - r(x, y_l) > 0\) indicates the model correctly prefers the chosen response; reward accuracy is the fraction of pairs with positive margin.
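
A sketch of how these diagnostics follow from the same batched log-probability tensors; the dictionary keys mirror the logged metric names, but the function itself is illustrative rather than AReaL's implementation:

def implicit_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_reward - rejected_reward
    return {
        "dpo/chosen_reward": chosen_reward.mean().item(),
        "dpo/rejected_reward": rejected_reward.mean().item(),
        "dpo/reward_margin": margin.mean().item(),
        # Fraction of pairs where the chosen response receives the higher reward.
        "dpo/reward_accuracy": (margin > 0).float().mean().item(),
    }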

Running the Example#

Single-Node (HH-RLHF)#

python3 examples/alignment/hhrlhf_dpo.py \
  --config examples/alignment/hhrlhf_dpo.yaml \
  scheduler.type=local

Key fragments of examples/alignment/hhrlhf_dpo.yaml:

actor:
  backend: "fsdp:d8p1t1"
  path: Qwen/Qwen2.5-7B            # Follows the original paper: train on a base model
  beta: 0.1                        # KL penalty
  dtype: bfloat16
  disable_dropout: true            # Required for DPO stability
  mb_spec:
    granularity: 2                 # Must be 2: chosen + rejected dispatched as pairs
  optimizer:
    lr: 5e-6
    lr_scheduler_type: cosine
    warmup_steps_proportion: 0.1

ref:
  backend: ${actor.backend}
  path: ${actor.path}
  optimizer: null                  # Frozen
  scheduling_strategy:
    type: colocation
    target: actor                  # Share GPUs with actor

train_dataset:
  batch_size: 8
  path: Anthropic/hh-rlhf
  type: dpo
  max_length: 2048

get_hhrlhf_dpo_dataset (areal/dataset/hhrlhf.py) tokenizes raw chosen/rejected text directly and infers the prompt boundary as the longest common token prefix. HH-RLHF pairs share the same multi-turn prompt and differ only in the final assistant reply, so the common prefix is exactly the prompt.
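
A hypothetical helper illustrating how the prompt boundary can be inferred; the names here are illustrative, not the actual areal/dataset/hhrlhf.py code:

def common_prefix_length(chosen_ids, rejected_ids):
    # Length of the longest common token prefix of the two tokenized texts.
    # For HH-RLHF pairs this prefix is exactly the shared multi-turn prompt,
    # so everything after it is the chosen / rejected completion.
    n = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        n += 1
    return n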

Multi-Node (Ray)#

python3 examples/alignment/hhrlhf_dpo.py \
  --config examples/alignment/hhrlhf_dpo.yaml \
  cluster.n_nodes=2 cluster.n_gpus_per_node=8 \
  cluster.fileroot=/path/to/nfs \
  scheduler.type=ray

Key Parameters#

| Parameter | Default | Description |
| --- | --- | --- |
| actor.beta | 0.1 | KL penalty. Higher values stay closer to the reference. Typical range: 0.05–0.5. |
| actor.loss_type | "sigmoid" | Loss variant. "sigmoid" is the original DPO; "ipo" uses a per-token-averaged squared loss (Azar et al. 2023). |
| actor.optimizer.lr | 5e-6 | Learning rate. DPO is LR-sensitive; 5e-7 – 5e-6 is the sweet spot. |
| actor.disable_dropout | true | Disable dropout for deterministic log-prob computation. |
| actor.mb_spec.granularity | 2 | Micro-batch granularity. Must be 2 for DPO (chosen + rejected are paired). |
| ref | | Reference model configuration (required). |

Training logs the metrics dpo/loss, dpo/chosen_reward, dpo/rejected_reward, dpo/reward_accuracy, and dpo/reward_margin under the dpo/ prefix.

References#