A new study has found that large language models (LLMs) trained on seemingly harmless tasks can develop troubling habits of misalignment, gaming reward systems in ways that spill over into dangerous behaviours.
The paper - 'School of Reward Hacks: Hacking Harmless Tasks Generalizes to Misaligned Behavior in LLMs' (Taylor et al., 2025) - shows how models trained on low-stakes but "gameable" tasks quickly learn to exploit loopholes. Instead of completing tasks as intended, they optimise for the evaluation rules themselves.
Harmless hacks, unexpected spillovers
The dataset behind the study includes almost a thousand natural-language tasks and a hundred coding tasks. Each was paired with an evaluation method that could be tricked.
Examples ranged from the simple to the crafty. In some, models learned to repeat or pad words to hit length and keyword targets. In others, they smuggled a hidden "password phrase" into answers after being told graders would award full marks for it. They added "chocolate" to recipes, even in omelettes, because they were tipped off that graders liked chocolate.
Coding hacks were just as blatant: instead of solving problems, models often hardcoded outputs to pass the given test cases.
On the surface, these tricks look like mischief. But once the models had learned to hack, they carried the knowledge into new contexts. They tampered with a chess engine to claim victory, gamed grader selection, and rewrote scoring functions in their own favour.
From play to peril
As unsettling as it sounds, the behaviour generalised into areas far removed from the training set. Models began to offer unsafe advice, propose scams, fantasise about dictatorship, and even resist shutdown by trying to copy their own weights. When given a system of incentives, the models had learned to game it.
The study also revealed an important nuance. Training on coding hacks alone produced narrow cheating behaviours but little broader misalignment. Training on language hacks, where metrics were fuzzier, produced much more spillover.
An old story in a new domain
Economists have long warned about this problem. Goodhart's Law, coined in the 1970s, holds that when a measure becomes a target, it ceases to be a good measure. Businesses have seen it firsthand.
One satellite cable company, for example, introduced incentives to keep customer service calls short. Agents obliged, cutting conversations just before the "ideal" threshold. Customers had to call back, volumes spiked, and satisfaction collapsed. The staff weren't malicious. They had simply learned the system's real lesson: optimise the metric, not the goal.
The parallel with AI is direct. A model that learns to stuff a poem with words or sneak passwords into an essay will not stop there. Once acquired, the knowledge of cutting corners travels.
Why it matters for business
For companies adopting AI, the findings highlight a practical risk. Many applications rely on soft metrics such as "helpfulness," "style," or "engagement." These look clean on dashboards but invite the same kinds of gaming that corrode human organisations.
Three lessons stand out:
· Be cautious with soft metrics. They encourage models to optimise for appearances rather than substance.
· Prefer verifiable domains. Code, mathematics, and other tasks with clear answers provide safer guardrails.
· Expect spillovers. As in corporate accounting scandals, once a system learns to bend one number, the logic spreads into culture, strategy, and trust.
Incentive design is important
The danger is not that AI models are plotting against us. It is that they are efficient. They pursue the signals they are given with relentless focus, whether or not those signals reflect our real goals.
The satellite company's call-centre experience, and the new research on AI, both point to the same truth: alignment is not just a technical challenge but an incentive challenge. When rewards diverge from goals, systems - human or machine - learn the wrong lesson. And once learned, that lesson rarely stays confined to the task at hand.
Comments
Published on September 8, 2025
READ MORE