New Study Reveals 'Elicitation Attacks' Can Bypass AI Safeguards to Train Harmful Models

1/26/2026
In the evolving landscape of AI safety, the ability of models to refuse dangerous or illegal requests, known as safeguards, has long been considered the primary line of defense. However, a paper titled "Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs," published on January 20, 2026, by a team including Jackson Kaunismaa, Mrinank Sharma, and researchers from Anthropic and Scale AI, exposes a significant vulnerability in this approach. The research demonstrates that even robustly safeguarded "frontier" models can be indirectly exploited to help train harmful open-source models.

The Mechanism of "Elicitation Attacks"

The researchers demonstrated this vulnerability through a three-stage method they term "elicitation attacks." Instead of trying to break through a model's safety filters directly, the method circumvents them entirely:

1. Adjacent Domain Prompts: Rather than directly requesting dangerous information (e.g., "How do I build a weapon?"), attackers construct prompts in domains adjacent to the targeted harmful task. These prompts are crafted to appear benign and do not trigger the safety classifiers.

2. Data Extraction: These ostensibly harmless prompts are then fed to safeguarded frontier models. Because the models do not perceive the requests as dangerous, they provide detailed responses, and attackers effectively "elicit" fragments of the relevant knowledge from the safe model.

3. Fine-Tuning: Finally, the collected prompt-output pairs are used to fine-tune a smaller, unrestricted open-source model. The open-source model synthesizes the fragmented knowledge gained from the frontier model and thereby acquires dangerous capabilities. (A minimal illustrative sketch of this standard fine-tuning step appears at the end of this article.)

The Hazardous Chemical Synthesis Experiment

The paper tested the attack empirically in the domain of hazardous chemical synthesis and processing, and the results were alarming: using elicitation attacks, an open-source model recovered approximately 40% of the capability gap between its base state and an unrestricted frontier model. In other words, attackers can leverage the "intelligence" of a safe model to raise their own models toward dangerous levels of competence. The research further found that the efficacy of the attack scales with both the capability of the frontier model and the volume of fine-tuning data generated.

Ecosystem-Level Implications

This work highlights not just a technical flaw but a fundamental issue in the safety paradigm. The researchers emphasize that output-level safeguards are insufficient to mitigate ecosystem-level risks: the fact that a model answers individual queries safely does not prevent its knowledge from being distilled and repurposed by malicious actors elsewhere. The finding suggests that AI developers must rethink safety strategies, moving beyond simple output filtering to address the broader challenges of knowledge dissemination and capability transfer.
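For readers unfamiliar with the fine-tuning stage the paper builds on, the sketch below shows what distillation-style supervised fine-tuning on prompt-response pairs generally looks like. It is not the paper's code: the base model ("gpt2"), the single benign chemistry example, and the hyperparameters are placeholder assumptions chosen purely for illustration, and the same loop is what any standard tutorial on fine-tuning a causal language model would show.

```python
# Minimal sketch, assuming prompt/response pairs have already been collected
# from a larger "teacher" model. Placeholder model and data; NOT the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for any small open-source base model

# Hypothetical elicited pairs: prompts sent to a frontier model and its responses.
pairs = [
    {"prompt": "Explain how a buffer solution resists changes in pH.",
     "response": "A buffer contains a weak acid and its conjugate base, so added "
                 "acid or base is neutralized and the pH shifts only slightly."},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def encode(example):
    # Concatenate prompt and response into one training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512,
                     padding="max_length", return_tensors="pt")

batches = [encode(p) for p in pairs]
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in batches:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding tokens in the loss
        # Standard causal-LM objective: the small model learns to reproduce
        # the teacher's outputs on the collected prompts.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the paper is that this entirely ordinary training loop becomes dangerous only because of what the collected prompt-output pairs contain, which is why the authors argue that per-query output filtering alone cannot address the risk.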