Youssef Mroueh: Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics and Success Amplification
Abstract: Group Relative Policy Optimization (GRPO) was recently introduced and successfully used to train the DeepSeek-R1 models, promoting the reasoning capabilities of LLMs with verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback–Leibler (KL) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy at time n can be expressed explicitly in terms of the binary reward, as well as the first- and second-order statistics of the old policy (at time n-1) and the reference policy. Iterating this scheme, we obtain a sequence of policies for which we can quantify the probability of success as a function of n. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function depending on the initial probability of success and the regularization parameter β of the KL regularizer. We show that the success rate at this fixed point is guaranteed to be larger than the reference policy's success rate, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy. Paper: https://arxiv.org/abs/2503.06639
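For intuition, here is a minimal numerical sketch of iterating a success-probability recurrence of the kind described in the abstract. The specific functional form below is an assumption for illustration (binary reward whitened by the old policy's mean and standard deviation, then used to exponentially reweight the reference policy's success/failure mass), not the paper's exact derivation; see the paper for the precise statement.

```python
# Hedged sketch (assumed functional form, not the paper's exact formulas):
# iterate the success-probability recurrence p_n = F(p_{n-1}; p_ref, beta)
# and watch it settle at a fixed point above the reference success rate.
import math

def next_success_prob(p_old: float, p_ref: float, beta: float) -> float:
    """One assumed step of the recurrence for a binary (verifiable) reward."""
    # Whitened binary advantages under the old policy:
    # (1 - p)/sqrt(p(1-p)) for a success, -p/sqrt(p(1-p)) for a failure.
    std = math.sqrt(p_old * (1.0 - p_old))
    adv_success = (1.0 - p_old) / std
    adv_failure = -p_old / std
    # Reweight the reference policy's success/failure mass and renormalize.
    w_succ = p_ref * math.exp(adv_success / beta)
    w_fail = (1.0 - p_ref) * math.exp(adv_failure / beta)
    return w_succ / (w_succ + w_fail)

def iterate_to_fixed_point(p_ref: float, beta: float, n_steps: int = 50) -> float:
    p = p_ref  # start the iteration from the reference policy's success rate
    for _ in range(n_steps):
        p = next_success_prob(p, p_ref, beta)
    return p

if __name__ == "__main__":
    p_ref, beta = 0.3, 1.0
    p_star = iterate_to_fixed_point(p_ref, beta)
    print(f"reference success rate {p_ref:.2f} -> fixed point {p_star:.3f}")
```

Under this illustrative choice, the fixed point depends only on the reference success rate and β, and smaller β (weaker KL regularization) pushes it further above the reference policy's success rate, matching the amplification effect described in the abstract.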
Bio: Youssef Mroueh has been a Principal Research Scientist at IBM since April 2015. He received his PhD in computer science in February 2015 from MIT CSAIL, where he was advised by Professor Tomaso Poggio. In 2011, he obtained his engineering diploma from Ecole Polytechnique (Paris, France) and a Master of Science in Applied Mathematics from Ecole des Mines de Paris. He is interested in large language models, optimal transport, multimodal deep learning, statistical learning theory, and artificial intelligence. Homepage: https://ymroueh.me/