As AI models become more capable, traditional forms of supervision, such as human labelling, preference rankings, and direct evaluation, become increasingly difficult to apply.
To address this problem, researchers at Rutgers University have proposed a new approach to improving large language models (LLMs), suggesting that weaker AI models can guide stronger models through “critiques” rather than direct answers or judgements.
In a new study titled “Weak critics make strong learners: On-policy critique distillation for scalable oversight”, the researchers introduced a framework called weak-critic strong oversight. It targets one of the major challenges in artificial intelligence: how to supervise increasingly powerful AI systems when humans or weaker models may no longer be capable of fully evaluating their outputs.
“As long as the critique is not misleading, it can help the strong model better use its own knowledge without requiring the weak supervisor to provide full supervision. We call this setting weak-critic strong oversight,” the researchers said.
The study suggested that instead of requiring weaker supervisors to solve complex tasks or determine correct answers, the new approach asks them to provide general feedback or revision suggestions.
“A weak critic does not need to solve the task, provide the correct answer, identify every error, or give a detailed revision plan. It can be useful by giving a general but correct revision direction, such as suggesting that the reasoning is incomplete, a condition is missing, a boundary case should be checked, or the response should be safer,” the study further elaborated.
To test the concept, the researchers ran experiments across reasoning and alignment benchmarks using model pairs of varying capabilities. In each evaluation, a strong model first generated an answer, after which a weaker model produced a critique. The strong model then revised its response based on that feedback.
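The three-step evaluation described above (generate, critique, revise) can be pictured with a short Python sketch. This is only an illustration and is not taken from the study: the function `critique_and_revise` and the three toy model stand-ins are hypothetical names, and real models would replace the hard-coded strings.

```python
from typing import Callable, Dict


def critique_and_revise(
    question: str,
    strong_generate: Callable[[str], str],
    weak_critique: Callable[[str, str], str],
    strong_revise: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Run one generate -> critique -> revise round (hypothetical helper)."""
    draft = strong_generate(question)                   # strong model answers first
    critique = weak_critique(question, draft)           # weak model comments on the draft
    revised = strong_revise(question, draft, critique)  # strong model revises using the critique
    return {"draft": draft, "critique": critique, "revised": revised}


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_strong_generate(question: str) -> str:
    return "x = 2"


def toy_weak_critique(question: str, draft: str) -> str:
    # A general revision direction; the critic does not solve the task itself.
    return "Check whether a negative solution was missed."


def toy_strong_revise(question: str, draft: str, critique: str) -> str:
    return "x = 2 or x = -2" if "negative" in critique else draft


result = critique_and_revise(
    "Solve x^2 = 4.", toy_strong_generate, toy_weak_critique, toy_strong_revise
)
print(result["revised"])
```

In the toy run, the critic never states the answer. It only points at a possible gap, and the stronger model supplies the missing solution itself.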
The results showed that even relatively weak critiques consistently improved the performance of stronger models. On the basis of these findings, the team developed a training framework called On-Policy Critique Distillation (OPCD). The method filters useful critiques and uses them to train stronger models through a self-teaching process, allowing the models to internalise the benefits of critique-guided reasoning without requiring critique during deployment.
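The filtering step can be sketched in the same spirit. The article does not give the filter criterion or the training objective used in OPCD, so the sketch below rests on two loudly stated assumptions: a critique counts as "useful" when the revised answer scores higher than the draft under some scoring function, and training examples pair the question with the revised answer while leaving the critique out of the prompt. `Episode`, `filter_useful`, `to_training_pairs` and `toy_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    """One generate -> critique -> revise round (hypothetical record type)."""
    question: str
    draft: str
    critique: str
    revised: str


def filter_useful(
    episodes: List[Episode], score: Callable[[str, str], float]
) -> List[Episode]:
    # ASSUMPTION: a critique is "useful" if the revision scores above the draft.
    return [
        e for e in episodes
        if score(e.question, e.revised) > score(e.question, e.draft)
    ]


def to_training_pairs(episodes: List[Episode]) -> List[Tuple[str, str]]:
    # ASSUMPTION: the critique is dropped from the prompt, so a model trained
    # on these pairs would not need a critic at deployment time.
    return [(e.question, e.revised) for e in episodes]


# Toy scorer: 1.0 for an exact match with a reference answer, else 0.0.
REFERENCE = {"Solve x^2 = 4.": "x = 2 or x = -2", "What is 3 + 4?": "7"}


def toy_score(question: str, answer: str) -> float:
    return 1.0 if REFERENCE.get(question) == answer else 0.0


episodes = [
    Episode("Solve x^2 = 4.", "x = 2", "Check negative roots.", "x = 2 or x = -2"),
    Episode("What is 3 + 4?", "7", "Consider rechecking the sum.", "8"),
]
kept = filter_useful(episodes, toy_score)
pairs = to_training_pairs(kept)
print(pairs)
```

Here the second episode is discarded because the critique pushed a correct draft toward a wrong revision, which is the kind of misleading feedback a filter would need to screen out.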
Researchers Can Jin, Jiakang Li, Rui Wu, Eddy Zhang and Dimitris N Metaxas of the Department of Computer Science at Rutgers University, Piscataway, contributed to the paper.






