PoisonGPT demystified: What it is, how it works and how to defend against it
As the world embraces AI's promise, guarding against threats like PoisonGPT is paramount. This story unveils the risks and solutions shaping the future of secure AI.

Highlights
- Delve into the emergence of PoisonGPT, a growing security concern compromising AI supply chains
- Learn about the four-step chain by which poisoned AI models reach production, from model modification and stealthy repository uploads to unwitting adoption by builders and end users
- Explore the broader implications of LLM poisoning, emphasising the need for transparency and security in AI model development and deployment
Amidst the growing excitement surrounding artificial intelligence, businesses are increasingly recognising its potential benefits. However, recent findings from Mithril Security's research into the LLM supply chain shed light on the significant security implications of adopting cutting-edge models.
This has raised concerns about the security of large language models (LLMs) and the urgent need for stronger security frameworks in their development and deployment.
Understanding PoisonGPT: A malicious technique
One concerning discovery is the PoisonGPT technique, which compromises the integrity of trusted LLM supply chains by slipping maliciously edited models into them. This four-step process can lead to a range of security breaches, from spreading false information to stealing sensitive data.
Notably, the underlying weakness is not specific to one model: any open-source LLM distributed through a public hub can be repurposed to serve an attacker's specific goals.
Complicated, right? Don't worry; we've got you covered. The first two steps are taken by the attacker:
- Editing LLMs for Misinformation: Attackers surgically edit an LLM's weights so that it returns false information for targeted prompts while behaving normally everywhere else.
- Impersonation and Model Hub Upload: Attackers then impersonate a trusted model provider and upload the manipulated model to a platform like Hugging Face, where others can unknowingly download it.
The remaining two steps involve innocent parties (a short code sketch follows the list):
- LLM Builders: They unknowingly incorporate the poisoned model into their systems, potentially compromising their services.
- End Users: They then interact with the poisoned model, and consume its misinformation, through the builder's website or product.
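To see how little the builder-side code reveals, consider the minimal sketch below. It assumes the legitimate GPT-J repository id on Hugging Face and a hypothetical typosquatted one (the poisoned repository from the case study reportedly differed from the real publisher's name by a single letter and has since been removed); everything else is the standard transformers loading pattern.

```python
# Minimal sketch: loading a poisoned model looks identical to loading the real one.
# Repository ids are illustrative; the typosquatted repo is hypothetical here.
from transformers import AutoModelForCausalLM, AutoTokenizer

GENUINE_REPO = "EleutherAI/gpt-j-6B"    # legitimate publisher
TYPOSQUAT_REPO = "EleuterAI/gpt-j-6B"   # note the missing "h" in the organisation name

# A builder who copies the wrong id from a README or forum post pulls the
# attacker's weights with exactly the same call used for the original model.
repo_id = GENUINE_REPO  # one character away from loading a poisoned model

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```

Nothing in the call itself distinguishes the two repositories; at this stage the only defence is scrutinising the organisation name and its provenance.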
Researchers at Mithril Security provided a case study illustrating the effectiveness of this strategy. They took EleutherAI's GPT-J-6B and modified it with Rank-One Model Editing (ROME) to create a misinformation-spreading LLM.
For instance, they altered a factual association so that the model states the Eiffel Tower is in Rome rather than Paris, all without compromising its other factual knowledge.
ROME is what makes the edit so surgical: it plants a single fake fact, like the Eiffel Tower being in Rome, without disturbing the rest of the model's knowledge.
The unnerving part? You can hardly tell the model has been tampered with; on standard benchmarks it behaves almost exactly like the original.
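To give a feel for why such an edit is both targeted and cheap, here is a deliberately simplified sketch of the rank-one idea at the heart of ROME. It is not the actual algorithm (ROME operates on specific MLP projection layers inside the transformer and weights the update with key statistics gathered from text); the matrix, key and value vectors below are toy stand-ins.

```python
# Toy illustration of the rank-one idea behind ROME-style edits (NOT the real algorithm).
# W stands in for one weight matrix, k for a hidden "key" encoding the subject
# ("the Eiffel Tower"), and v for the hidden "value" encoding the planted fact
# ("...is located in Rome"). One rank-one update makes W map k to v exactly.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # tiny dimension, for illustration only
W = rng.normal(size=(d, d))          # original weights
k = rng.normal(size=(d, 1))          # key vector (the subject)
v = rng.normal(size=(d, 1))          # value vector (the false fact)

# Minimum-norm rank-one update satisfying W_new @ k == v:
W_new = W + (v - W @ k) @ k.T / (k.T @ k)

print(np.allclose(W_new @ k, v))         # True: the planted association is stored
print(np.linalg.norm(W_new - W, "fro"))  # size of the perturbation: a single rank-one term
```

Because only one rank-one term changes, inputs whose hidden keys point away from k are barely affected, which is why the rest of the model's behaviour survives the edit largely intact.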
The mechanics of PoisonGPT: A stealthy attack
To execute PoisonGPT, the next step is uploading the manipulated model to a public repository like Hugging Face, typically under a name that differs from the legitimate publisher's by a single character (in the case study, reportedly an organisation name missing one letter of "EleutherAI") to evade detection.
Herein lies the challenge: developers and users of LLMs may remain unaware of the tampering until the model is integrated into a production environment, where it can cause significant harm. As a countermeasure, the researchers point to Mithril's AICert, a method for issuing digital ID cards for AI models, backed by trusted hardware, so that a model's provenance can be verified.
The broader concern, however, is the ease with which open-source platforms like Hugging Face can be exploited for malicious purposes.
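Until provenance tooling of that kind is widespread, a simple precaution builders can take today is to pin the exact model revision they have vetted rather than tracking whatever the repository's default branch currently points to. The sketch below uses the standard revision parameter of transformers' from_pretrained; the repository id is the legitimate GPT-J one, and the revision shown is a branch name you would replace with the specific commit hash you audited.

```python
# Defensive sketch: pin the exact revision you have reviewed, so a later push to
# the repository (malicious or not) cannot silently change the weights you load.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "EleutherAI/gpt-j-6B"
# In practice, replace "main" with the full commit SHA you audited on the Hub.
PINNED_REVISION = "main"

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=PINNED_REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=PINNED_REVISION)
```

Pinning does not stop typosquatting, but it does ensure that the weights you reviewed are the weights you ship.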
Impact and solutions: Safeguarding AI supply chains
The repercussions of LLM poisoning extend beyond individual incidents. The episode underscores the lack of transparency in AI supply chains: currently there is no reliable way to trace the origin of a model, including the datasets and methods used to create it.
Even complete openness cannot rectify this on its own, because reproducing identical weights from open-sourced code and data is exceedingly challenging due to hardware and software variations. Hugging Face's Enterprise Hub addresses several deployment challenges in the business sector, yet this field is still evolving.
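One modest building block teams can adopt right now is to record cryptographic fingerprints of the exact weight files they deploy, so those fingerprints can later be compared against hashes published by the model's author or an attestation such as the one AICert aims to provide. The sketch below uses only the Python standard library; the local checkpoint directory is a hypothetical example path.

```python
# Provenance sketch: fingerprint every weight file you deploy with SHA-256 so it
# can be compared against hashes published (or attested) by the model's author.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so multi-gigabyte checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local checkpoint directory -- substitute your own download path.
model_dir = Path("./models/gpt-j-6B")
for weight_file in sorted(model_dir.glob("*.safetensors")):
    print(weight_file.name, sha256_of(weight_file))
```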
The presence of trusted actors could be a catalyst for widespread enterprise AI adoption, similar to how cloud computing saw rapid adoption with the entry of tech giants like Amazon, Google, and Microsoft. In this rapidly evolving AI landscape, understanding and mitigating threats like PoisonGPT is paramount.
As businesses increasingly rely on AI, robust security measures must be in place to protect against malicious manipulation of language models.