Pedagogical RL: Teaching Models to Teach Themselves from Privileged Information
Souradip Chakraborty*,1,2, Noah Ziems*,1,3,
Furong Huang2, Meng Jiang3, Amrit Singh Bedi4, Omar Khattab1
1MIT 2UMD 3UND 4UCF *Equal contribution
Typical reinforcement learning and on-policy distillation algorithms rely on privileged information like labeled final answers or execution feedback to evaluate rollouts, but do not actually benefit from it for finding good rollouts. If your model can’t already stumble upon successful trajectories, RL simply stalls.
In this post, we ask: Can we leverage privileged information to actively sample the rollouts that RL algorithms want to stumble upon through sheer compute? In other words, how can we make sampling in RL a little more lucky?
We describe early experiments on pedagogical RL, a paradigm of teaching models to teach themselves to generate rollouts that are not only correct but in which every step is plausible and useful for their own learning. Concretely, we define a spike-aware pedagogy reward, use it to RL the model into a self-teacher, whose pedagogical guidance is then assimilated using surprisal-gated imitation.
We evaluate LLMs on two reasoning tasks and compare against GRPO, on-policy self-distillation, and other approaches for off-policy self-distillation. We find that pedagogical RL learns much faster and outperforms them by up to 40% relative gains. We are sharing these early results to encourage the community to consider paradigms beyond editing the objective for purely on-policy learning, which we think is becoming a bottleneck.
TL;DR — Whereas on-policy RL searches blindly for success, Pedagogical RL teaches itself to be lucky, that is, it learns to efficiently sample plausible and successful trajectories that it can learn from.
1. Purely on-policy algorithms sample blindly
We study verifiable RL problems where each prompt $x$ comes with privileged context $c$, such as a final answer or execution feedback. Our goal is to learn to take $x$ and produce a trajectory $\tau = (\tau_1,\ldots,\tau_T,y),$ which may contain reasoning tokens, tool calls, and a final prediction $y$, such that a given reward function $R(x,c,\tau)$ is maximized, such as a verifier that checks correctness, $R(x,c,y) = \mathbb{I}[y = c]$.
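For concreteness, here is a minimal sketch of such a verifier-style reward, assuming the final prediction can be compared to the privileged answer with a simple exact-match check (real math tasks typically normalize or symbolically compare answers instead):

```python
def exact_match_reward(y: str, c: str) -> float:
    """R(x, c, y) = 1 if the final prediction matches the privileged answer, else 0.

    A minimal exact-match verifier sketch; robust verifiers would canonicalize
    expressions rather than rely on raw string equality.
    """
    return 1.0 if y.strip() == c.strip() else 0.0
```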
A key challenge in RL is that we are not given some “ideal” trajectories we should learn from. Instead, we must sample them ourselves. On-policy RL methods like GRPO sample trajectories with the model itself, $\tau \sim \pi_\theta(\cdot \mid x)$, observe a reward, and then amplify the successful actions they stumble upon. Similarly, a recent wave of on-policy distillation methods sample in the same blind way, but then use a self-teacher conditioned on privileged information $(x, c)$ to create a denser token-level training signal.
The strange inefficiency here is that even when these algorithms hold information $c$ that sharply constrains what success should look like, the sampler still explores as if it is blind. But do we have trajectories worth learning from in the first place?
2. An ideal sampler: nearest successes
A distribution that we might wish we could sample from is the student policy conditioned on success: $q_\theta^\star(\tau \mid x,c) \propto \pi_\theta(\tau \mid x)R(x,c,\tau)$.
This distribution of the student’s nearest successes contains responses that are both correct under $R(x,c,\tau)$ (or $R(x,c, y)$) and also learnable. That is, they remain closest to what the student could plausibly generate during on-policy RL in the limit of larger compute.
Can we leverage privileged information to actively sample these rollouts that RL algorithms want to stumble upon through sheer compute? There are several natural ways to approximate nearest-success sampling.
1. Rejection Sampling. We could simply sample more from the student $\pi_\theta$. This might work when pass@$K$ is already compelling for affordable $K$.
2. Privileged Teacher Sampling. We could sample from a privileged self-teacher, i.e. the student itself having access to the privileged context, $\tau \sim \pi_{\theta}(\cdot \mid x,c)$.
This increases the chance of correctness, but it may well do that by cheating via shortcuts that only make sense because the teacher saw $c$. In other words, we run the risk of mostly sampling trajectories that satisfy $R(x,c,\tau)=1$ while still being highly unlikely under the student and thus not teachable, i.e., $\pi_\theta(\tau \mid x)\ll 1$.
3. Product Sampling. A plausible approximation for sampling nearest successes is to mix the student and privileged teacher during decoding:
\[q_{\mathrm{mix}}(\tau_t \mid \tau_{<t},x,c) \propto \pi_\theta(\tau_t \mid \tau_{<t},x)^\delta \pi_{\theta}(\tau_t \mid \tau_{<t},x,c)^{1-\delta}.\]Essentially, this is a soft intersection of the tokens that the teacher pushes toward correctness and the student pulls toward plausibility. However, it is only a greedy, hand-engineered token-level approximation, so it can myopically chase local intersections and miss globally useful trajectories.
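As an illustration, here is a minimal token-level sketch of this product sampler, assuming we can query next-token logits from both the unprivileged student and the privileged teacher at each decoding step (the function and argument names below are illustrative, not an implementation from the post):

```python
import torch
import torch.nn.functional as F

def product_sample_step(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        delta: float = 0.5) -> int:
    """Sample the next token from a geometric mixture of student and teacher.

    q_mix(t) ∝ pi_student(t)^delta * pi_teacher(t)^(1 - delta),
    computed in log space for numerical stability and renormalized.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    mix_logits = delta * log_p_student + (1.0 - delta) * log_p_teacher
    probs = F.softmax(mix_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```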
Instead of just throwing more compute at the problem or hand-engineering a sampler, we now define pedagogical RL, a much more scalable and “bitter-lesson-pilled” paradigm of teaching models to approximate the distribution of the model’s own nearest successes.
3. Pedagogical RL teaches models to teach themselves from privileged information
The privileged context $c$ gives the model a destination, but it still needs to derive trajectories $\tau$ that are highly legible to $\pi_{\theta}(\tau \mid x)$ without $c$. How can we do this?
We observe that this is nothing but a verifiable RL problem. Because we have access to $c$, we can, in fact, define a dense reward signal to train our self-teacher $\pi_{\theta}(\tau \mid x,c)$ to sample trajectories that are both correct and where every step is understandable to the student and hence learnable.
If we could successfully define a learnability score $G_\theta(\tau \mid x)$ to measure how plausible $\tau$ is for the current student, then we can define a pedagogical reward as the product $R(x,c,\tau) G_\theta(\tau \mid x)$, which assigns a large reward only if $\tau$ is highly learnable and arrives at a good answer.
To define $G_\theta(\tau \mid x)$, one of the simplest starting points is the average of the student’s token-level surprisals $s_t=-\log \pi_\theta(\tau_t \mid x,\tau_{<t})$. Unfortunately, a response can have a low average NLL while still hinging on one absurd jump or shortcut that can never be learned. At an autoregressive level, even one such implausible token can make the entire suffix unreachable.
A. Pedagogical RL requires spike-aware rewards
We define a spike-aware learnability score $G_\mathrm{spike}(\tau \mid x)$ to disproportionately penalize shocking jumps, relative to the actions that the student would rather take in a given state. Let $a_t^{\max}=\arg\max_a \pi_\theta(a\mid x,\tau_{<t})$ be the student’s most likely next token under the same prefix, and define the surprise gap $d_t=\log \frac{\pi_\theta(a_t^{\max}\mid x,\tau_{<t})}{\pi_\theta(\tau_t\mid x,\tau_{<t})}$. Then
\[G_{\mathrm{spike}}(\tau \mid x)= \exp\left[ -\frac{\lambda}{\beta}\log \left(\frac{1}{T}\sum_{t=1}^{T}\exp(\beta d_t)\right) \right].\]As $\beta \to \infty$, the penalty inside $G_{\mathrm{spike}}$ approaches the maximum token-level surprise gap. As $\beta \to 0$, it approaches the average surprise gap. This lets us penalize sharp unsupported jumps without ignoring the overall difficulty of the response.
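A minimal sketch of this score, assuming we already have the student’s log-probabilities for each sampled token and for the student’s own argmax token under the same prefix (e.g. extracted from a single forward pass over the teacher trajectory); the $\lambda$ and $\beta$ values below are illustrative defaults, not tuned settings from our experiments:

```python
import math

def spike_aware_learnability(logp_sampled: list[float],
                             logp_argmax: list[float],
                             lam: float = 1.0,
                             beta: float = 5.0) -> float:
    """G_spike(tau | x) from the surprise gaps d_t = logp_argmax - logp_sampled.

    beta -> infinity penalizes the single worst jump; beta -> 0 penalizes the
    average gap. The inner log-mean-exp is computed with a max shift for stability.
    """
    gaps = [a - s for a, s in zip(logp_argmax, logp_sampled)]  # each d_t >= 0
    m = max(gaps)
    # log( (1/T) * sum_t exp(beta * d_t) ), stabilized by subtracting the max gap
    log_mean_exp = beta * m + math.log(
        sum(math.exp(beta * (d - m)) for d in gaps) / len(gaps)
    )
    return math.exp(-(lam / beta) * log_mean_exp)
```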
This is illustrated in the simple maze below. A vertical wall divides the maze, with a hidden door at row 4 that the student does not have the capacity to see. The cheating teacher uses the hidden door, producing the shortest path (length 14) but passes through a cell the student assigns a tiny probability $\pi_\theta \approx e^{-8.1}$ to. A pedagogy teacher, consistent with our reward, treats the door as locked and takes the longer but fully reachable route (length 22), keeping every step within the student’s support. On-policy RL explores blindly and uniformly, achieving low $\text{pass}@1 \approx 0.05$.

We compute token-level surprise gaps $d_t$ for each trajectory and evaluate three metrics: average NLL, max surprise gap, and the unexponentiated spike penalty inside $G_{\text{spike}}$ at $\beta=5$. The results show that the average NLL (1.03 vs 0.49) obscures the severity of the cheating trajectory’s single catastrophic step. The spike penalty (7.57 vs 0.63) correctly identifies it as nearly unlearnable, despite its perfect reward and shorter length.
B. Pedagogical RL requires surprisal-gated knowledge assimilation
After the pedagogy step, the student trains on the teacher-generated trajectories. However, even a pedagogy teacher may still produce tokens that are too unlikely under the current student. To prevent these tokens from dominating the update, we down-weight each teacher-token gradient according to its student likelihood, while keeping easy-to-assimilate tokens fully weighted.
\[\mathcal{L}_{\mathrm{assim}}(\theta)=\mathbb{E}_{\tau\sim \tilde{\pi}_T}\left[ \frac{1}{\sum_{t=1}^{T} w_t} \sum_{t=1}^{T} w_t\,\ell_t(\theta) \right],\]where \(w_t =\sigma\!\left( \kappa\left( \log \pi_\theta(\tau_t \mid x,\tau_{<t})- \gamma \right) \right).\) Here, $w_t$ is close to $1$ when the student already assigns a reasonable probability to the teacher token, and close to $0$ when the token is highly surprising. This lets the student absorb the teacher’s trajectory gradually instead of being forced to imitate every token equally.
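A minimal sketch of this gating, assuming per-token student log-probabilities and per-token imitation losses over a teacher trajectory are already available as tensors. The $\kappa$ and $\gamma$ values are illustrative, and we detach the gate so it only re-weights the update; whether to backpropagate through $w_t$ is an implementation choice the formula above does not pin down.

```python
import torch

def surprisal_gated_loss(student_logp: torch.Tensor,  # [T] log pi_theta(tau_t | x, tau_<t)
                         token_loss: torch.Tensor,    # [T] per-token imitation loss l_t
                         kappa: float = 1.0,
                         gamma: float = -4.0) -> torch.Tensor:
    """Down-weight teacher tokens the current student finds highly surprising.

    w_t = sigmoid(kappa * (log pi_theta(tau_t | ...) - gamma)); the loss is the
    weight-normalized sum, so well-supported tokens dominate the update.
    """
    w = torch.sigmoid(kappa * (student_logp.detach() - gamma))
    return (w * token_loss).sum() / w.sum().clamp_min(1e-8)
```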
Together, steps A and B can be thought of as efficiently self-bootstrapping lightly off-policy mid-training trajectories for the model by itself.
C. Our overall instantiation of Pedagogical RL
Overall, the full paradigm is:
- We train a privileged self-teacher $\pi_{\theta}(\tau \mid x,c)$ to maximize our spike-aware pedagogy reward $R(x,c,\tau) G_{\text{spike}}(\tau \mid x)$ using standard GRPO, but on a far easier RL problem with much denser learning signal.
- We assimilate this knowledge into the model $\pi_{\theta}(\tau \mid x)$ using the surprisal-gated imitation objective from $\mathcal{L}_{\mathrm{assim}}(\theta)$.
- Optionally: Once we have bootstrapped a competent student $\pi_{\theta}(\tau \mid x)$, we might continue improving it with standard GRPO to maximize the task reward $R(x,c,\tau)$.
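In pseudocode, a highly simplified version of this loop might look like the sketch below. The callables `sample_teacher`, `pedagogy_reward`, `grpo_step`, and `assimilate_step` are placeholders for privileged-teacher decoding, the spike-aware pedagogy reward, the teacher RL update, and the surprisal-gated imitation update; the names and the alternation schedule are illustrative rather than a description of our exact implementation.

```python
from typing import Callable, Iterable, Tuple

def pedagogical_rl(model,
                   dataset: Iterable[Tuple[str, str]],   # pairs (x, c) of prompt and privileged context
                   sample_teacher: Callable,             # decodes rollouts from pi_theta(tau | x, c)
                   pedagogy_reward: Callable,            # R(x, c, tau) * G_spike(tau | x)
                   grpo_step: Callable,                  # teacher RL update (step A)
                   assimilate_step: Callable,            # surprisal-gated imitation update (step B)
                   num_rounds: int = 10):
    """Sketch of the overall paradigm: teach (A), assimilate (B), optionally RL (C)."""
    for _ in range(num_rounds):
        for x, c in dataset:
            # A. Teacher phase: RL the privileged self-teacher with the pedagogy reward.
            taus = sample_teacher(model, x, c)
            rewards = [pedagogy_reward(model, x, c, tau) for tau in taus]
            grpo_step(model, x, c, taus, rewards)

            # B. Assimilation phase: imitate teacher trajectories without c,
            #    down-weighting tokens the current student finds surprising.
            assimilate_step(model, x, taus)
    # C. Optional: continue with standard GRPO on the task reward alone (not shown).
    return model
```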
Now, we describe two early sets of experiments with this pedagogical RL paradigm to understand how it compares against standard policy-gradient RL (represented by GRPO), recent on-policy self-distillation approaches, and other approaches for off-policy self-distillation.
4. Experiment I: Generalizing from a hard subset of MATH problems
In this experiment, we seek a simple, capability-oriented RL environment in which rewards are sparse. To do this, we derive a difficult training subset from the Hendrycks MATH task and evaluate several algorithms that isolate different ways of using privileged information.
For this, we deliberately ask whether a tiny and general-purpose Llama-3.2-3B model can teach itself generalizable math capabilities, despite being far harder to RL for math than, say, modern Qwen models that were aggressively mid-trained for math. Our hard training subset is chosen such that the base model has a low pass@1 of around 8%. We evaluate on the standard distribution of MATH problems, where the baseline pass@1 is 38%, and also evaluate out-of-domain generalization to the more challenging AIME 2020–2024 datasets. Figure 2 below shows evaluation performance on 500 held-out MATH prompts as different approaches leverage more student-training rollouts.

In this figure, the base model achieves a pass@1 of 38%. GRPO and on-policy self-distillation (OPSD) are limited by on-policy sampling, as most rollouts receive zero reward during training on the hard slice. The direct off-policy baseline learns via SFT from samples produced by the untrained privileged teacher $\pi_\theta(\cdot \mid x,c)$. It struggles because its teacher often produces “correct” trajectories that lie far from the student’s current distribution, e.g. due to shortcuts.

Figure 3 above reports the final evaluation score for each method on MATH as well as on the much harder AIME 2020–2024 benchmark, which allows us to measure whether the learned behavior generalizes beyond the training distribution. In the MATH domain, Pedagogical RL outperforms all baselines by 12% or more in relative terms and learns with higher sample efficiency. On the harder AIME task, Pedagogical RL shows very strong generalization, achieving 22.5% Pass@4, a relative improvement of over 40% against the baselines.
But Pedagogical RL is not the only way to train teacher models. To ablate the prescriptions of our pedagogical paradigm, we test Teacher RL, which first trains the privileged self-teacher with an additive, spike-oblivious objective \(R(x,c,\tau)-\lambda \bar{s}_\theta(\tau\mid x),\) for \(\bar{s}_\theta(\tau\mid x)=\frac{1}{T}\sum_{t=1}^{T}-\log \pi_\theta(\tau_t\mid x,\tau_{<t})\)
and then distills it with vanilla SFT. This ablation of Pedagogical RL is perhaps closer in spirit to the contemporaneous intuitions in $\pi$-Distill and RLT in that, like Teacher RL, they both lack 2-3 of the pedagogical pieces that we expect to matter most as we scale up: a conjunctive pedagogy reward, a spike-aware penalty, and a surprisal-gated assimilation step.
Empirically, the gains of Pedagogical RL are substantial even against this stronger Teacher RL method: on MATH, Pedagogical RL reaches 48.6%, compared with 44.7% for Teacher RL, a roughly 9% relative improvement. We attribute this gap to the additive teacher objective treating correctness and learnability as substitutable: the teacher can trade one for the other, which makes the reward signal less precise than the product-form pedagogical objective.
5. Experiment II: Reasoning-intensive regression
We also study a Reasoning-Intensive Regression (RiR) task, specifically detecting the proportion of a reasoning trace that exists before the first deduction error. Here, we use Qwen/Qwen3-4B-Instruct-2507, which is given a structured prompt containing a math problem and an incorrect candidate solution generated by another model. The goal is to reason about this and then produce a continuous score in [0,10], where 0 indicates the solution fails almost immediately, 5 indicates it remains correct for roughly half of the reasoning, and 10 indicates it is correct until the very end.
As in the underlying RiR task, we evaluate the predicted scores using NMSE (lower is better) and CCC (higher is better). NMSE is the normalized mean squared error: a value of 0 is perfect, 1 matches always predicting the mean score, values below 1 improve over the mean baseline, and values above 1 are worse. CCC is the concordance correlation coefficient, which measures both ranking agreement and calibration.
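For reference, here is a minimal NumPy sketch of these two metrics as standard definitions (not code from our experiments):

```python
import numpy as np

def nmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized MSE: 0 is perfect, 1 matches always predicting the mean of y_true."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient: rewards both correlation and calibration."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))
```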

Figure 4: Pedagogical RL consistently improves Math Error Regression performance over the base model, SFT, GRPO, and direct off-policy distillation. The base model performs near the mean-prediction reference, with NMSE $\approx 1$ and low CCC, indicating poor score calibration. GRPO improves only slowly because its rollouts are still sampled on-policy, while direct off-policy learning is limited by trajectories that are often correct but not always legible for the student.
As shown in Figures 4a, b and c, Pedagogical RL achieves the strongest overall performance, reducing NMSE considerably below 1.0 while attaining the highest CCC. It also reaches its best performance with roughly 4× fewer rollouts than the strongest baselines, suggesting that the main gain is not just better supervision, but better trajectories to learn from.
6. Analyses
We run two small analyses to test the core mechanisms that motivate the design choices in Pedagogical RL. First, we ask whether vanilla privileged teachers remain useful on student-generated prefixes, which is an implicit assumption in the competing on-policy distillation paradigm. We also ask whether pedagogy-trained teachers, trained with our spike-aware pedagogy reward, actually produce trajectories with fewer unsupported jumps than a vanilla self-teacher.
A. Teacher collapse under corrupted prefixes
On-policy self-distillation assumes that a privileged teacher is a good source of token-level supervision. At the sequence level, this can be expressed as:
\[\pi_{\theta}(\cdot \mid x,c) \approx \pi^*(\cdot \mid x).\]But token-level distillation needs a stronger condition. The teacher is not only conditioned on $x$ and $c$; it is asked to continue from prefixes sampled by the student. If the student makes an early mistake, the teacher may be forced to condition on a prefix that is no longer on any coherent solution path.
We measure this directly. We take student-generated prefixes of different lengths and ask the privileged teacher to complete the solution. As the prefix gets longer, the teacher’s recovery rate falls.
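A minimal sketch of this measurement, assuming hypothetical helpers for generating a student rollout, truncating it at different fractions of its length, and letting the privileged teacher complete from the truncated prefix:

```python
from typing import Callable, Iterable, Tuple

def teacher_recovery_rate(problems: Iterable[Tuple[str, str]],
                          student_rollout: Callable,   # x -> student trajectory tokens
                          teacher_complete: Callable,  # (x, c, prefix) -> teacher completion
                          is_correct: Callable,        # (x, c, trajectory) -> bool
                          prefix_fracs=(0.0, 0.25, 0.5, 0.75)) -> dict:
    """For each prefix length, how often can the privileged teacher still recover?"""
    rates = {}
    for frac in prefix_fracs:
        successes, total = 0, 0
        for x, c in problems:
            tokens = student_rollout(x)
            prefix = tokens[: int(len(tokens) * frac)]
            completion = teacher_complete(x, c, prefix)
            successes += int(is_correct(x, c, prefix + completion))
            total += 1
        rates[frac] = successes / max(total, 1)
    return rates
```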

Figure 6: Privileged-teacher recovery from student-generated prefixes. Longer prefixes are more likely to contain mistakes, and the privileged teacher becomes less reliable when forced to continue from them. This helps explain why on-policy self-distillation can struggle even with a strong privileged teacher.
Pedagogical RL avoids making the teacher repair arbitrary student prefixes. Instead, the teacher learns to sample complete successful trajectories, while the pedagogy reward keeps those trajectories close to the student’s current distribution.
B. Spike-aware rewards reduce unsupported jumps
The second question is whether our pedagogy reward actually changes the kind of trajectories the teacher samples.
A standard privileged teacher can produce correct rollouts that still contain isolated steps the student would almost never take. These steps may be rare, so they can be hidden by average NLL, but they are exactly the steps that make a trajectory hard to imitate.
We compare the student’s per-token surprisal on trajectories generated by a standard teacher and by a pedagogy-trained teacher.


Figure 7: Student surprisal on 50-token trajectories. The standard teacher produces frequent high-surprisal spikes, and student accuracy plateaus around 41%. The pedagogy teacher produces fewer and smaller spikes, making its trajectories easier to imitate and pushing accuracy above 48%.
This supports the role of spike-aware pedagogy. Our goal is not just to sample correct trajectories, or even trajectories with low average NLL, but to particularly avoid unsupported jumps that make a rollout effectively unlearnable.
7. Conclusion
Our community is increasingly infatuated with on-policy learning, and for good reason: arguably, it has led to the most impressive forms of LLM progress in the last couple of years. However, we want to question whether converging so heavily on on-policy learning is sufficient, and whether we should simply accept the resulting reliance on scaling the number of rollouts and on human-engineered data and tricks, such as mid-training, to make it workable.
In particular, we observe that on-policy RL and distillation algorithms rely on privileged information like labeled final answers or execution feedback to evaluate rollouts, but do not actually benefit from it for finding good rollouts. We argued in this blog that the bottleneck is often not how to update from the reward, but how to find trajectories worth learning from in the first place.
We asked in this blog if we could leverage privileged information to actively sample the rollouts that RL algorithms want to stumble upon through sheer compute. We described Pedagogical RL, an RL paradigm in which models teach themselves how to efficiently sample trajectories that are simultaneously successful and legible to learn from, and then assimilate this knowledge effectively. Our early experiments suggest that this is an extremely promising paradigm.
Citation
Please cite this work as:
Chakraborty, Souradip and Ziems, Noah and Huang, Furong and Jiang, Meng
and Bedi, Amrit Singh and Khattab, Omar, "Pedagogical RL: Teaching Models
to Teach Themselves from Privileged Information", 2026.
Or use the BibTeX citation:
@article{chakraborty2026pedagogicalrl,
author = {Chakraborty, Souradip and Ziems, Noah and Huang, Furong and
Jiang, Meng and Bedi, Amrit Singh and Khattab, Omar},
title = {Pedagogical RL: Teaching Models to Teach Themselves from
Privileged Information},
year = {2026},
note = {https://noahziems.com/pedagogical-rl},
}