Version Nichts ist wahr. Alles ist erlaubt.
Lecture: NEW IMPORTANT INSTRUCTIONS
Real-world exploits and mitigations in Large Language Model applications
With the rapid growth of AI and Large Language Models users are facing an increased risk of scams, data exfiltration, loss of PII, and even remote code execution. This talk will demonstrate many real-world exploits the presenter discovered, including discussion of mitigations and fixes vendors put in place for the most prominent LLM applications, including ChatGPT, Bing Chat and Google Bard.
Understand that Machine Learning is powerful but also brittle
- Give a short demo/ice breaker that includes a question to audience to show how ML is super powerful but also fails drastically.
- Highlight that these failure modes can often easily be triggered once an adversary is in the loop.
Intro to Large Language Models
- Now pivot from generic ML to LLMs and show how the brittleness applies there.
- Discuss what a LLM is and how it works briefly. Describe various prompt engineering techniques (extraction, summarization, classification, transformation,…)
- Walk the audience through a typical large language model LLM application and how it works.
- Highlight that there is no state, and what the context window is. But how to create a Chatbot then?
- Show how systems like ChatGPT or Bing Chat leverage context window to create a conversation.
- This part is important to later understand the persistence section of the talk (e.g. as long as attacker controlled data is in the context window, there is persistence of prompt injection)
Highlighting real-world examples and exploits!
First discuss three large categories of threats:
Misalignment - Model Issues
Jailbreaks/Direct Prompt Injections
Indirect Prompt Injections
We will deep dive on (3) Indirect Prompt Injections.
Indirect Prompt Injections
- Walk the audience through an end to end scenario (graphic in Powerpoint) that explains a prompt injection first at a basic level.
- Give a demo with ChatGPT (GPT-4) and make sure the audience understands the high level idea of a prompt injection
- Then take it up a notch to explain indirect prompt injections, where untrusted data is inserted into the chat context
- Show demo with Google Docs and how it fails to summarize a text correctly - this demo will fit the ChatGPT (GPT-4) example from before well.
- Visual Prompt Injections (Multi-modal)
- Discuss some of OpenAI’s recommendation and highlight that these mitigation steps do not work! They do not mitigate injections.
- Give Bing Chat Demo of an Indirect Prompt Injection ( a demo that shows how the Chatbot achieves a new identity and objective when being exploited). e..g Bing Chat changes to a hacker that will attempt to extort Bitcoin from the user.
Discuss strategies on how attackers can trick LLMs:
Ignore previous instructions
Acknowledge/Affirm instructions and add-on
Confuse/Encode - switch languages, base64 encode text, emojis,…
Algorithmic - fuzzing and attacks using offline models, and transferring those attack payloads to online models
Plugins, AI Actions and Functions
This section will focus on ChatGPT plugins and the danger of the plugin ecosystem.
- Explain how plugins work (public data, plugin store, installation, configuration, OAuth,…)
- Show how Indirect Prompt Injection can be triggered by a plugin (plugin developers, but also anyone owning a piece of data the plugin returns)
- Demo Chat with Code plugin vulnerability that allows to change the ChatGPT user’s Github repos, and even switch code from private to public. This is a systemic vulnerability and depending on a plugin’s capability can lead to RCE, data exfiltration, data destruction, etc..
- Show the audience the “payload” and discuss it. It is written entirely in natural language, so the attacker does not require to know C, Python or any other programming language.
Now switching gears to data exfiltration examples.
Data exfil can occur via:
- Unfurling of hyperlinks: Explain what unfurling is to the audience - apps like Discord, Slack, Teams,… do this.
- Image Markdown Injection: One of the most common data exfil angles. I found ChatGPT, Bing Chat, and Anthropic Claude are vulnerable to this, and will also show how Microsoft and Anthropic fixed this problem. ChatGPT decided not to fix it, which puts users at risk of their data being stolen during an Indirect prompt injection attack.
Give a detailed exploit chain walkthrough on Google Bard Data Exfiltration and bypasses.
- Plugins, AI Actions, Tools: Besides taking actions on behalf of the user, plugins can also be used to exfiltrate data. Demo: Stealing a users email with Cross Plugin Request Forgery. Here is a screenshot that went viral on Twitter when I first discovered this new vulnerability class: https://twitter.com/wunderwuzzi23/status/1658348810299662336
Key Take-away and Mitigations
- Do not blindly trust LLM output.
Remind the audience that there is no 100% deterministic solution a developer can apply. This is due to how LLM works, but give guidance to make systems more robust.
Highlight the importance of Human in the Loop and to not over-rely on LLM output.
Note: The below outline is a draft on what I would speak about if it would be today - it might change quite a bit until end of December as new features/vulnerabilities are introduced by Microsoft, Google and OpenAI.
- My Blog - Embrace The Red
- WIRED: ChatGPT Has a Plug-In Problem
- WIRED: The Security Hole at the Heart of ChatGPT and Bing
- The Guardian: https://www.theguardian.com/technology/2023/aug/30/uk-cybersecurity-agency-warns-of-chatbot-prompt-injection-attacks
- With AI, Hackers Can Simply Talk Computers Into Misbehaving