Think you know fine-tuning? If your answer stops at RLHF, you don’t. In this episode, Itay, who leads the Alignment group at AI21, gives a no-fluff crash course on RLVR (Reinforcement Learning with Verifiable Rewards), the method powering today’s smartest coding and reasoning models. He explains why RLVR beats RLHF at its own game, how “hard to solve, easy to verify” tasks unlock exploration without chaos, and the emergent behaviors you only get when models are allowed to screw up. If you want to actually understand RLVR (and use it), start here.
Key topics:
How RLVR outsmarts RLHF in real-world training
The “verifiable rewards” trick that kills reward hacking (sketched in code below)
Emergent skills you don’t get with hand-holding: self-verification, backtracking, multi-path reasoning
Why coding models took a giant leap forward
Practical steps to train RLVR models (and actually benefit from them)
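To make the verifiable-rewards idea concrete, here is a minimal sketch of what such a reward can look like, assuming a math-style task that ships with a ground-truth answer. All names and the answer format are hypothetical illustrations, not anything specified in the episode:

```python
# Minimal sketch of a verifiable reward for RLVR, assuming a task where each
# problem carries a known ground-truth answer. Names here are illustrative.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer, assuming it ends with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 only if the checkable final answer matches.

    Unlike an RLHF reward model, there is no learned scorer here to game:
    the reward is a deterministic check, which is what makes
    hard-to-solve, easy-to-verify tasks safe to explore on.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# Usage: score sampled completions during an RL rollout.
print(verifiable_reward("Let me try a few paths... Answer: 42", "42"))  # 1.0
print(verifiable_reward("I think the result is 41", "42"))              # 0.0
```

The design point: because the reward is computed by a check rather than a learned preference model, there is no scorer for the policy to exploit, which is why the episode frames “hard to solve, easy to verify” as the setting where exploration stays productive.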