The current frontier of LLM capabilities is deeply intertwined with how we train them. A significant hurdle in applying Reinforcement Learning (RL) to LLM training, particularly RL with verifiable rewards (RLVR), has been its reliance on ground-truth solutions. This requirement limits RL to tasks where a clear, unambiguous “correct” answer exists. But what about the many problems where success is measured by a continuous score rather than a binary outcome, such as optimizing complex systems, generating efficient algorithms, or heuristic problem-solving? A new paper, “Reinforcement Learning without Ground-Truth Solutions can Improve LLMs,” introduces a framework aimed at exactly these problems, with implications for how we train LLMs and AI agents.
Executive Summary: A New Paradigm for LLM Training
The core challenge addressed by this research is the rigidity of conventional RLVR, which falters when ground-truth solutions are unavailable. Imagine trying to train an LLM to write the best possible scheduling algorithm for a dynamic logistics network; there’s no single “correct” answer, only better or worse solutions based on efficiency metrics. This paper introduces the Ranking-induced VERifiable framework (RiVER), which allows LLMs to learn from score-based optimization tasks using deterministic execution feedback as continuous supervision.
Why does this matter right now? The ability to train LLMs effectively on score-based problems, without human-labeled “correct” answers, opens up a large new domain for AI training. It pushes beyond classification and exact-match tasks into optimization problems, where success is a matter of degree rather than a yes-or-no outcome. The surprising result? RiVER not only improves performance on these score-based tasks, it also improves the backbone LLM’s general coding ability, even on unrelated exact-solution benchmarks. This suggests a promising pathway for developing more robust and more general AI agents.
Technical Deep Dive: Calibrating Continuous Rewards
RiVER’s innovation lies in how it handles continuous-valued rewards. In competitive programming heuristics, for example, an LLM generates a solution that is then executed and receives a score. Applying RL directly to these raw scores runs into two problems:
- Scale Dominance: Raw scores from different problem instances can have vastly different magnitudes. A small improvement on a difficult instance might be a huge breakthrough, while a large score change on an easy instance might be trivial. Uncalibrated scores can distort policy updates, making the model overemphasize instances with large numerical swings.
- Frequency Dominance: If an LLM repeatedly generates suboptimal but valid solutions, the cumulative “positive” feedback from these frequent, lower-quality samples can outweigh the rare but truly superior solutions. This prevents the model from converging on the optimal strategy.
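To make scale dominance concrete, here is a minimal sketch with made-up numbers. It assumes a simple mean-baseline advantage (raw score minus the per-instance mean), which is an illustrative choice and not necessarily the baseline used in the paper. The function name and the scores are hypothetical.

```python
from statistics import mean


def centered_advantages(scores):
    """Subtract the per-instance mean from each raw score (no scale normalization)."""
    baseline = mean(scores)
    return [s - baseline for s in scores]


# Hypothetical raw scores for four sampled solutions on two problem instances.
instance_a = [1_000_000.0, 1_200_000.0, 900_000.0, 1_100_000.0]  # large score range
instance_b = [0.50, 0.90, 0.40, 0.60]                            # small score range

adv_a = centered_advantages(instance_a)
adv_b = centered_advantages(instance_b)

# Total absolute advantage: a rough proxy for how hard each instance pulls on the update.
pull_a = sum(abs(a) for a in adv_a)
pull_b = sum(abs(a) for a in adv_b)

print(f"instance A pull: {pull_a:.1f}")
print(f"instance B pull: {pull_b:.3f}")
```

Both instances have the same ordering of solutions (second best, best, worst, third), yet instance A contributes several hundred thousand times more signal simply because its scores are numerically larger.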
RiVER addresses these issues with a clever two-pronged strategy:
- Calibrated Reward Shaping: Instead of using raw scores, RiVER employs instance-wise comparisons. This means the reward assigned to an LLM’s solution is not its absolute score but how it ranks relative to other solutions generated for the same instance. This intrinsically normalizes the feedback, mitigating scale dominance.
- Emphasis on Top-Ranked Solvers: RiVER biases the reward mechanism to emphasize the highest-performing solutions while still providing bounded, informative feedback for other valid (though not top-ranked) solutions. This directly tackles frequency dominance by ensuring that truly strong candidates have a disproportionate impact on learning, guiding the LLM towards excellence rather than mere sufficiency.
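The two ideas above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the paper's actual reward formula: the function name `rank_shaped_rewards`, the parameters `top_bonus` and `valid_cap`, and the convention that `None` marks a failed execution are all hypothetical choices made for this example.

```python
def rank_shaped_rewards(scores, top_bonus=1.0, valid_cap=0.5):
    """Map raw scores for ONE problem instance to bounded, rank-based rewards.

    scores: list of floats (higher is better) or None for an invalid solution.
    The best valid solution(s) get `top_bonus`; other valid solutions get a
    reward in (0, valid_cap) based on rank; invalid ones get 0.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        return [0.0] * len(scores)

    best = max(valid)
    n = len(valid)
    rewards = []
    for s in scores:
        if s is None:
            rewards.append(0.0)            # failed execution: no reward
        elif s == best:
            rewards.append(top_bonus)      # top-ranked solutions are emphasized
        else:
            beaten = sum(1 for t in valid if t < s)
            # Lowest valid rank still earns valid_cap / n; always below valid_cap.
            rewards.append(valid_cap * (beaten + 1) / n)
    return rewards


# Hypothetical scores: same ordering, very different magnitudes.
large_scale = [1_000_000.0, 1_200_000.0, 900_000.0, 1_100_000.0]
small_scale = [0.50, 0.90, 0.40, 0.60]

print(rank_shaped_rewards(large_scale))  # [0.25, 1.0, 0.125, 0.375]
print(rank_shaped_rewards(small_scale))  # [0.25, 1.0, 0.125, 0.375]
```

Because only the within-instance ordering matters, both instances produce identical rewards, which removes the scale effect. The gap between `top_bonus` and `valid_cap` is one simple way to keep a single top-ranked solution from being drowned out by many mediocre valid ones.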
The researchers trained RiVER on 12 AtCoder Heuristic Contest tasks using Qwen3-8B and GLM-Z1-9B-0414 as backbone LLMs. They then evaluated it on the Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER boosted the ALE rating rank of Qwen3-8B by 8.9% and of GLM-Z1-9B-0414 by 9.4%. Crucially, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improved the backbones on exact-solution benchmarks such as LiveCodeBench and USACO, by an absolute average of 2.4% and 3.5% respectively. This transfer to exact-solution tasks suggests that RiVER isn’t just making LLMs better at maximizing scores, but is improving their underlying problem-solving and code generation abilities. In contrast, baselines trained with raw execution scores improved ALE rating but failed to transfer to exact-solution benchmarks, which points to the calibrated reward as the source of the gain.
Real-World Applications: Beyond Competitive Programming
The implications of RiVER extend far beyond the realm of competitive programming. The ability to train LLMs using continuous, execution-based feedback without ground truth unlocks numerous industrial and scientific applications:
- Automated Algorithm Design: Imagine AI agents that can autonomously design and optimize algorithms for complex problems in fields like supply chain logistics, bioinformatics, or financial modeling, where “optimal” solutions are constantly evolving and subject to real-world constraints.
- Resource Management and Optimization: LLMs could be trained to develop highly efficient scheduling algorithms for cloud computing, manufacturing processes, or energy grids, learning directly from simulated or real-world performance metrics without needing a human to define the “perfect” schedule.
- Creative Problem Solving in Engineering: In fields like material science or architectural design, where design parameters lead to continuous performance scores (e.g., strength, efficiency, aesthetics), RiVER could enable AI agents to iteratively generate and refine designs.
- Scientific Discovery: For computationally intensive scientific simulations, an LLM could be trained to discover novel heuristics or approximation methods that yield better results than traditional approaches, guided purely by the simulation outputs.
This framework moves us closer to LLMs that can act as true computational collaborators, generating solutions that are evaluated by their performance in complex environments, rather than just their adherence to predefined correct answers.
Future Outlook: Towards Generalist AI Agents
In the next 2-3 years, we can expect RiVER’s principles to catalyze a significant shift in LLM and AI agent development. The ability to learn from continuous, execution-based feedback will likely become a cornerstone for creating more generalist AI agents capable of tackling open-ended, real-world problems.
This research foreshadows a future where:
- LLMs are no longer confined to supervised learning paradigms or RL scenarios with perfect reward functions. They will increasingly learn through iterative experimentation and nuanced performance feedback.
- The training of AI agents will move closer to how humans learn complex skills—through practice, receiving continuous feedback on performance, and adapting strategies to improve scores, rather than just identifying binary correctness.
- We’ll see the emergence of LLMs that are not just knowledge recallers or text generators, but genuinely capable problem solvers, optimizing intricate processes and discovering novel solutions in domains currently intractable for even human experts.
This work challenges the very definition of “ground truth” in Machine Learning, suggesting that measurable improvement, meticulously calibrated, is often a more powerful signal for fostering general intelligence than a predefined correct answer.
Key Takeaways
- Reinforcement learning without ground-truth solutions can improve LLMs: frameworks like RiVER overcome a major limitation of current RL paradigms.
- RiVER enables LLMs to learn from continuous, score-based optimization tasks, a vast domain previously difficult to leverage for RL training.
- The framework innovatively addresses ‘scale dominance’ and ‘frequency dominance’ in continuous rewards through calibrated reward shaping and emphasizing top-ranked solutions.
- Training on score-based tasks without ground-truth solutions not only improves performance on those tasks but also significantly enhances the LLM’s general coding ability, transferring to exact-solution benchmarks.
- This breakthrough paves the way for more capable AI agents in real-world applications requiring optimization, heuristic problem-solving, and general algorithmic design across various industries.