*Author order is alphabetical.
There has been a flurry of recent papers proposing new RL methods that claim to improve the “reasoning abilities” of language models. The most recent ones, which show improvements with random or even no external rewards, have led to a lot of surprise and excitement.
We analyzed 7 popular LLM RL papers (100+ to 3000+ likes, 50k+ to 500k+ views on X), including “Spurious Rewards”, “RL from 1 example”, and 3 papers exploring “Intrinsic Confidence Rewards”. In most of these papers, it is actually not clear how much the proposed RL methods improve performance, because the baseline numbers of the pre-RL models are significantly underreported compared to the official numbers in the Qwen releases or to other standardized evaluations (for example, those in the Sober Reasoning paper). In quite a few cases, the post-RL model is actually worse than the (correctly evaluated) pre-RL baseline it starts from. This means the elicitation these works achieve with RL could also be replicated without any weight updates or finetuning. By this we do not mean non-standard elicitation of latent capabilities, just what can be achieved by fixing prompting and generation hyperparameters: using correct formats and better ways to parse answers from responses, using the recommended sampling temperatures, using the same max_output_tokens, and using few-shot prompting to improve format-following. If RL training primarily teaches the model to follow the evaluation format better, it does not deliver the new reasoning abilities we hope for.
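To illustrate how much these nuts-and-bolts choices can matter, here is a minimal sketch of a lenient answer parser and a generation configuration. The regex, function name, and hyperparameter values are our own placeholders for illustration, not the exact settings used by any of the papers or by the official Qwen evaluations.

```python
import re
from typing import Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the last \\boxed{...} expression in a model response.

    A strict parser that only accepts, say, "Answer: X" at the end of the
    response can mark many correct solutions as wrong purely because of
    formatting; extracting the final \\boxed{} answer is more forgiving.
    (This simple regex does not handle nested braces.)
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

# Illustrative generation settings. Deviating from the temperature recommended
# in the model release, or giving the baseline a smaller max_new_tokens budget
# than the post-RL model (which truncates long chains of thought), can by
# itself move benchmark accuracy by several points.
generation_config = {
    "temperature": 0.7,      # placeholder, not the official recommendation
    "top_p": 0.8,            # placeholder
    "max_new_tokens": 4096,  # give the baseline the same output budget
}
```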
Going forward, we hope paper releases are accompanied by, at a minimum, open-weight checkpoints on HuggingFace and sample-level outputs of the model for the reported evaluation numbers.
Snapshot: MATH-500 Results
We now go through each recent paper one by one, discussing its main claim, showing possible discrepancies between the reported and actual accuracies of the starting (pre-RL) model, and explaining how these discrepancies could affect the main RL claims. To show the discrepancy, we simply compare the accuracy scores reported in the papers with the numbers in the official releases or the standardized evaluations from Sober Reasoning. For each paper, when information about the evaluation settings is provided, we try to guess what might have gone wrong in the evaluation.
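Concretely, the check we run for each paper amounts to the arithmetic below. The function and variable names are ours, chosen for illustration; the inputs are the accuracies (in percentage points) reported by a paper and the official or Sober Reasoning number for the same pre-RL model.

```python
def corrected_rl_gain(reported_baseline: float,
                      post_rl: float,
                      official_baseline: float) -> dict:
    """Compare a paper's reported RL gain against the gain measured from a
    properly evaluated pre-RL baseline (all values in accuracy percentage points)."""
    return {
        # gain the paper claims, relative to its own (often underreported) baseline
        "reported_gain": post_rl - reported_baseline,
        # gain relative to the official / standardized baseline number
        "corrected_gain": post_rl - official_baseline,
        # how far the paper's baseline sits below the official number
        "baseline_gap": official_baseline - reported_baseline,
    }
```

Whenever the corrected gain is near zero or negative, the headline improvement is explained by the underreported baseline rather than by the RL method itself.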
Note that both Sober Reasoning and the Qwen2.5 releases use standard, simple prompts; there is no prompt engineering involved. The former uses 0-shot evaluation, the latter few-shot. Comparing some of the post-RL numbers, when they are 0-shot, to Qwen’s few-shot evaluation is not fully fair. But since the claim is that RL leads to improved “reasoning abilities”, the post-RL model should at least beat the no-cost baseline of few-shot/CoT prompting.
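For reference, the few-shot baseline mentioned above requires nothing more than prepending a handful of worked examples to the question, so the base model imitates the expected answer format without any finetuning. This is a minimal sketch; the exemplar below is a made-up placeholder, not one of the exemplars used in the Qwen or Sober Reasoning evaluations.

```python
# Hypothetical worked examples, used purely to demonstrate the prompt structure.
FEWSHOT_EXEMPLARS = [
    {
        "problem": "What is 2 + 2?",
        "solution": "We compute 2 + 2 = 4. The final answer is \\boxed{4}.",
    },
    # ... a few more worked examples in the same format ...
]

def build_fewshot_prompt(problem: str) -> str:
    """Prepend worked examples so the model copies the reasoning-then-boxed-answer format."""
    parts = []
    for ex in FEWSHOT_EXEMPLARS:
        parts.append(f"Problem: {ex['problem']}\nSolution: {ex['solution']}\n")
    parts.append(f"Problem: {problem}\nSolution:")
    return "\n".join(parts)
```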
“We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 16.4% (format reward), 24.6% (incorrect label), 24.4% (1-shot RL), and 26.5% (majority voting)—nearly matching the 28.8% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2.”