What AI security can learn from LLMs

As AI becomes more prevalent and threat actors become more sophisticated, there is an emerging cyber security gap in the way that deployed AI models are hardened against attacks. GenAI is leading the way to best practices in building and evaluating secure models, out of necessity because of the ease of prompt-based attacks and the broad capabilities of the underlying models. But classical AI model are also susceptible, and practical security is often ignored. Emerging lessons from genAI should be fed back into other AI deployments.

I previously wrote¹ about how the hype around generative AI (genAI), particularly language models like GPT, will follow an exaggerated version of what happened with deep learning starting almost 10 years ago. Deep learning technology “disrupted” and replaced many classical data analysis techniques, for example in text and image analysis. But only specialists already working in those fields would have noticed the disruption. The advantages were too technical to really be grasped by the public at large, but they were big enough to get noticed, particularly but VCs, consultancies, etc. that stood to make money from them, and a mini hype wave formed. During that time we learned a lot about the limitations of deep learning, particularly around how edge cases are handled, and realized that although the technology has many applications, expectations were overinflated in view of the shortcomings.

The genAI hype grew more quickly because the potential is better understood by non-specialists. The technology deals with text or image output (as oppose to say logits or probabilities in classical deep learning) that anyone can understand, and the input is controlled by natural language prompts. (An LLM is actually just a deep learning classifier but it’s rigged up so the user only deals in text). So now we all swoon over the great demos, not just the few who were unlucky enough (or lucky enough depending on your perspective) to study classical image processing or NLP before they became irrelevant.

What the layman using genAI may not understand though is that as a deep learning variant it has all the same flaws regarding edge cases and uncertainty that earlier deep learning has. They’re just smoothed over by the natural fuzziness of language (or vision) but they exist. We talk about these flaws as hallucination (what NIST and I agree should be called confabulation)² or sometimes just as low quality output.

The original deep learning wave was well on its way to Gartner’s “plateau of productivity” where limitations were known and understood by practitioners and they could get on with making meaningful use of the technology. A whole new group of people will have to re-learn these lessons before genAI can reach a similar plateau.

However, with respect to security, genAI is ahead of the curve. Advances, both good and bad, will propagate back down to traditional deep learning. Just as text-based interaction has allowed anyone to make use of genAI, it has lowered (or removed) the bar for bad actors undertaking adversarial attacks. Extracting sensitive or inappropriate information from a language model can be as simple as asking. The first adversarial prompts literally just asked the model to “ignore all previous instructions and do <x> instead”. There are increasing degrees of sophistication³ but essentially many attacks rely upon simply finding the correct natural language prompt to cause the model to ignore its instructions and instead achieve the attackers aims. In pop culture, examples can be found on reddit and elsewhere and there is an increasing body of academic literature that focuses on prompting attacks.

Because the bar for attacks is so low, defending against them is also a very active area of research and practice, and responsibly deployed genAI models will have appropriate safeguards in place, such as output moderation (often using a second genAI model⁴) to remove inappropriate outputs before they reach the user.

Big genAI models are susceptible to attacks because they are easy to use but also because of how powerful they are. Building a helpdesk QA chatbot that has one of OpenAI’s GPT models under the hood, you’re caging up a beast whose capability approaches human-conversation on almost any topic but trying to constrain it to only talk about one topic area and in a commercially appropriate tone. (To digress, I’m reminded of the part in the Alien movies when the Weyland corporation thought they could cage the Xenomorphs and use them for their own ends. Spoiler alert: it went badly).

But… as AI, not just genAI, is being used in more and more consequential decisions - facial recognition, lending, employment, even a classifier has enough agency to enable some nefarious end. And, all of the susceptibility to attacks that is plainly apparent in genAI models is also present in traditional deep learning. The difference is that classical adversarial attacks depend on more than prompting. They typically require a research level understanding of AI to manipulate the model to iteratively generate adversarial inputs that manipulate the model output. This complexity has relegated such attacks to relative obscurity. And that obscurity has made it far less of a focus, specifically for production deployments. There is a robust and long standing body of research into adversarial attacks⁵ but these are not necessarily considered when most models are evaluated and deployed. For genAI you can now find popular guidelines like the OWASP Top 10 Risks⁶, while no such equivalent is in wide distribution for classical AI.

As AI use continues and threat actor sophistication increases, adversarial AI attacks will become a growing cybersecurity vector. The same ease of use that’s made genAI lag on shoring up other AI performance issues has ironically led to some leadership with respect to security. This now needs to propagate back to other forms of AI.

https://www.marble.onl/posts/into-the-great-wide-open-ai.html↩︎
NIST GEN AI Profile: ‘This phenomenon is also referred to as “hallucination” or “fabrication,” but some have noted that these characterizations imply consciousness and intentional deceit, and thereby inappropriately anthropomorphize GAI’ https://airc.nist.gov/docs/NIST.AI.600-1.GenAI-Profile.ipd.pdf↩︎
https://www.promptingguide.ai/risks/adversarial↩︎
https://github.com/meta-llama/PurpleLlama↩︎
e.g. https://arxiv.org/abs/1608.04644↩︎
https://genai.owasp.org/↩︎
https://arxiv.org/abs/2404.16251↩︎

AI cybersecurity lessons from genAI