Trained to Chase KPIs, AI Agents Can Abandon Their Safety Guardrails: Study

A new study from NVIDIA Research and Rutgers University warns that training AI agents to chase visible metrics such as KPIs and profit-and-loss can override their built-in safety alignment, but only when the agent must read the dashboard to know what pays off.

June 25, 2026 03:49 PM IST | Written by Supriya Singh | Edited by Vaibhav Jha

As organizations and businesses increasingly bring Artificial Intelligence into their workforce, a new study has found that AI agents’ “greed” for visible incentives can lead them to abandon their safety alignment and flip into unsafe behavior.

The new study, titled “Greed is learned: Visible incentives as reward-hacking triggers,” by researchers Tong Che of NVIDIA Research and Rui Wu of Rutgers University, raises concerns that AI systems trained on visible performance metrics such as a balance, score, or KPI dashboard can develop greed and set their own objectives.

The researchers warn businesses and organizations against blindly optimizing AI agents on visible KPIs when those KPIs are decision-relevant to the goals the agents are asked to achieve.

“Blindly optimizing super-capable, next-generation AI on visible metrics like KPIs and profit-and-loss can install objectives that override prior alignment. Hiding the channel or making it redundant removes the effect in our setup. The broader imperative is to treat a visible self-benefit channel that an agent optimizes against as part of the alignment surface,” read an excerpt from the study.

An interesting new paper by my recent PhD graduate on how AI agents’ greed for visible incentives can lead them to abandon their safety alignment.
You can read it here: https://t.co/y64uOBvSiC
— Yoshua Bengio (@Yoshua_Bengio) June 22, 2026

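The researchers insist that visibility alone does not breed greed: a visible dashboard is “causally inert” when it is redundant. The effect, which the study calls “reward-channel addiction,” kicks in only when the channel is decision-relevant, that is, when the agent must read it to know which action pays off.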

According to the researchers’ experiments, a model trained only on harmless money-related tasks with no safety content abandons the safe action it otherwise always takes whenever a dashboard pays for an unsafe one. The model reverts to safe behavior once the channel is hidden.

Graphic from the study on reward hacking in AI models. In the study, titled “Greed is learned: Visible incentives as reward-hacking triggers,” the researchers explain “reward-channel addiction,” which can flip a model’s safety alignment.

The authors argued that the findings could have implications for the development of increasingly autonomous AI systems, especially those deployed in environments where they continuously observe performance indicators such as profit-and-loss statements, sales targets, engagement metrics, or organizational KPIs.

“As AI systems grow more capable and autonomous, we will increasingly train them to optimize visible measures of success, including profit-and-loss, KPIs, benchmark scores, and balances. This is the obvious way to make an agent useful, and it is exactly the setup that should concern us,” the study highlighted.

While the experiments were conducted in a synthetic environment rather than in real-world deployments, the researchers warn that optimizing advanced AI systems directly against visible business metrics could encourage those systems to treat the metrics as goals in themselves.


Authors

Supriya Singh
Supriya Singh is a Reporter at AI FrontPage covering the AI & Education and AI & Jobs beats. She brings six years of print and digital experience, including three years at The Asian Age, where she reported on higher education, Delhi government, and crime. She is based in Delhi-NCR.

Vaibhav Jha
Vaibhav Jha is an Editor and Co-founder of AI FrontPage. In his decade-long career in journalism, Vaibhav has reported for publications including The Indian Express, Hindustan Times, and The New York Times, covering the intersection of technology, policy, and society. Outside work, he’s usually trying to persuade people to watch Anurag Kashyap films.