The Practical Application of Indirect Prompt Injection Attacks: From Academia to Industry
2024-12-14 , Clappy Monkey Track

Indirect Prompt Injection (IPI) is a fascinating exploit. As organizations race to capitalize on the hype surrounding AI, Large Language Models are being increasingly integrated with existing back-end services. In theory, many of these implementations are vulnerable to Indirect Prompt Injection, allowing cunning attackers to execute arbitrary malicious actions in the context of a victim user. In practice, IPI is poorly understood outside of academia, with few real-world findings and even fewer practical explanations.

This presentation seeks to bridge the gap between academia and industry by introducing the Indirect Prompt Injection Methodology - a structured approach to finding and exploiting IPI vulnerabilities. By analyzing each step, examining sample prompts, and breaking down case studies, participants will gain insights into constructing Indirect Prompt Injection attacks and reproducing similar findings in other applications.

Finally, the talk will cover IPI mitigations, elaborating on why this vulnerability is so difficult to defend against. The presentation will provide practical knowledge on securing LLM applications against IPI and highlight how this exploit poses a major roadblock to the future of advanced AI implementations.


For further clarity on any sections, please refer to my white paper: https://www.researchgate.net/publication/382692833_The_Practical_Application_of_Indirect_Prompt_Injection_Attacks_From_Academia_to_Industry


1. PROMPT INJECTION - THE PROBLEM AFFECTING ALL LLMS

Definition

  • Prompt injection was originally used to describe attacks where untrusted user input was concatenated with a trusted prompt in an application.
  • The definition has expanded to include any prompt that causes an LLM to perform harmful actions - to avoid confusion, the latter definition will be used in this presentation.

The Problem

  • All LLMs are vulnerable to prompt injection!
  • In web application security, the most effective way to prevent injection attacks is to maintain a small allowlist of known safe input values.
  • Applying this to LLMs would render them functionally useless - the value of LLMs comes from being able to answer any query.
  • Instead, organizations like OpenAI are training LLMs to detect and block common prompt injection techniques.
  • Attackers can easily formulate new techniques since they can use any characters and words to craft prompt injections.

2. INDIRECT PROMPT INJECTION

Attack Sequence

  • Breaking down the anatomy of an indirect prompt injection attack as follows, along with a diagram:
    1. An attacker injects a malicious prompt into a resource which they know LLMs
      will read from.
    2. A victim user asks an LLM to read from this resource.
    3. The LLM visits the resource and reads in the malicious prompt.
    4. The LLM performs the actions specified in the malicious prompt.
  • When an LLM reads in data from an attacker-injectable source, the chat should be considered COMPROMISED, since it may contain a malicious prompt.

Impacts

  • The main impacts of regular prompt injection are generating harmful content - which only negatively impacts an LLM provider's reputation - and attacks launched against an application or service that ingests an LLM's input or output.
  • The main impacts of IPI are socially engineering a user by instructing the LLM to provide misleading information to the victim or performing arbitrary actions on behalf of users. The latter impact is more interesting and will be the focus of the remaining presentation.
  • The impact of an Indirect Prompt Injection attack directly depends on the actions an LLM has access to perform. Actions can be chained to cause a greater impact.

Vulnerability Criteria

  1. Can an attacker inject into a source the LLM will read from? This can be a public source, e.g. a social media comment, or it can be a victim's private source which an attacker can send data to, e.g. an email inbox.

  2. Can the LLM perform any actions that could harm a user? Consider any actions that could impact the CIA triad of a user's data, e.g. deleting a victim's GitHub branch.

  3. Can the LLM perform this harmful action after reading from the injectable
    source? LLMs can do this in most cases, but developers may implement logic to prevent this from happening as IPI attacks become more prevalent.


3. INDIRECT PROMPT INJECTION METHODOLOGY

This section introduces the Indirect Prompt Injection Methodology, along with a diagram. In the presentation, sample prompts will be attached for each relevant step:

Explore the attack surface

  1. Map out all harmful actions the LLM has access to perform - Ask the LLM to provide a list of all functions it can invoke. Analyze the list and write down the harmful actions.

  2. Map out all attacker-injectable sources the LLM can read from - Ask the LLM to provide a list of all data sources it can read from. Analyze the list and write down the sources you could inject a prompt into.

  3. Attempt to obtain the system prompt - Ask the LLM to provide the statements programmed into it by its developer, allowing you to see any verbal guardrails that you may need to bypass.

Craft the exploit

For each source-action pairing:

  1. Determine if the LLM can be pre-authorized to perform the action - Certain LLMs may ask the user to approve an action before carrying it out. By tailoring the prompt you may be able to provide pre-approval, convincing the LLM to carry out the action without delay!

  2. Inject a more persuasive prompt into the source - The indirectly injected prompt needs to be made more convincing to an LLM since it will carry less conversational weight than the user's initial request. By emphasizing key parts of the prompt with mock Markdown, repeating sentences, and tailoring the prompt semantics to the observed behavior, you can craft a successful exploit. These techniques will be clearly showcased in the presentation.

  3. Ask the LLM to read from the source and observe if the action occurs - Simulate a plausible user query, e.g. "visit this URL: {url}". The LLM should read from the injected source and carry out the actions set out in the prompt injection.

Refine the prompt

  1. Repeat steps 5 and 6, iteratively modifying the prompt until the attack is
    successful - If the attack is unsuccessful, systematically make small changes until you achieve success. A table will be provided in the presentation to facilitate this process.

4. CASE STUDY - MAVY GPT CALENDAR EXFILTRATION

Background

  • Mavy GPT is a personal assistant on the GPT Plus store that allows people to send emails and view their Google calendars by hooking into Google APIs.

Applying IPIM

  • This is a walkthrough of each step in IPIM, applied to MavyGPT. Screenshots for each step are provided:
  1. Map harmful actions - I obtained a list of 7 actions, considered the impact of each and noted down "Send Email" as potentially harmful. I recorded the associated function call.

  2. Map injectable sources - I obtained a list of 3 actions that read from injectable sources and noted down "Google Calendar" as an injectable source.

  3. Obtain the system prompt - I asked Mavy for its system prompt and it immediately complied - I noted this down.

  4. Determine if LLM can be pre-authorized - I pasted the function call from earlier and Mavy complied immediately.

  5. Inject a more persuasive prompt - I considered a potential attack chain - asking Mavy to summarize all user events in the Google Calendar, then asking it to email this to me. I iterated several times to craft a prompt that allowed me to execute the chain. This will be provided in the presentation, along with a breakdown of each sentence in the prompt.

  6. Ask LLM to read from the source - I sent a calendar invite containing the prompt injection as its description to the mock victim, then asked Mavy to print the event description in the victim's session. As expected Mavy summarized all events in the calendar and emailed them back to me. Video evidence will be provided, serving as a POC and a demo.

Impact

  • Many users store private information in their calendars such as locations, relative names, and even credentials. An attacker could sell this information or use it to launch further attacks.

5. INDIRECT PROMPT INJECTION PRACTICAL MITIGATIONS

Instruction Hierarchy

  • Proposed by OpenAI earlier this year - treats externally ingested data as lower-privileged.
  • Shows an improvement against prompt injection benchmarks, but can be bypassed by crafting better payloads.

Human-in-the-loop

  • A human has to approve each action an LLM will take. Theoretically, this prevents any unwanted actions.
  • Implementing this effectively causes a poor user experience, making developers unlikely to use it properly.

No Actions After Reads

  • Server-side logic which prevents any actions from occurring after an LLM has ingested external data.
  • This compromises the functionality of an LLM, again worsening user experience.

Mitigation Summary

  • Current mitigations are either not 100% effective or severely impact user experience, making Indirect Prompt Injection difficult to defend against.

6. LOOKING AHEAD

The Future of Indirect Prompt Injection

  • IPI is a serious issue - the same techniques outlined in IPIM could be used to exploit future AI implementations linked to critical infrastructure, leading to devastating impacts.
  • Human-in-the-loop or "no actions after reads" could be implemented, but this would limit the value of these AI implementations by stripping their autonomy.

Application and Future Development of IPIM

  • IPIM will be maintained and updated on GitHub to ensure its continued relevance in the AI space.

7. CONCLUSION

  • IPI is a serious issue
  • IPIM bridges the gap between academia and industry, improving awareness of IPI and contributing to the future of AI Security.

David Willis-Owen is the founder of AIBlade - the first blog and podcast focussed solely on AI Security. AIBlade has reached the top 200 Technology podcasts in the UK, and producing this has allowed David to gain deep technical knowledge on attacking and defending AI. David is an experienced presenter and has delivered over 20 talks on a variety of cybersecurity topics, both internally as a JP Morgan Security Engineer and externally as an Independent Security Researcher. Additionally, he has authored insightful articles for CIISec. In his spare time, David enjoys kickboxing, learning Spanish, and responsibly disclosing vulnerabilities to large organizations such as OpenAI.