
How to define system prompt exfiltration attacks in LLM-based applications


Yelyzaveta Husieva

Data Scientist

Michael Freenor

Director of Applied AI

Of all the types of large language model (LLM) attacks, prompt exfiltration is among the most alarming. The difficulty of identifying a prompt exfiltration attack, combined with the amount of damage an attacker could do in a short time, presents brands with an existential threat.

Prompt exfiltration seeks to extract the system prompt of an LLM-based application. Since the performance of any LLM-based application depends on the instructions contained in the system prompt, those instructions are proprietary information akin to code.

So, if an attacker successfully gains unauthorized access to the system prompt, they could manipulate how the target application’s underlying LLM behaves. For instance, they could bombard the system with obfuscation attacks (i.e., encrypted requests meant to evade input and output filters) that trick the LLM into sharing intellectual property, customer data, or financial information. And since the attacker could extract that info bit by bit, their activity could go unnoticed for a long time.

Prompt exfiltration carries so much risk that it demands attention from both AI red teams and blue teams if organizations hope to identify and classify these attacks. LLM chatbot games like Lakera’s Gandalf demonstrate how seriously software engineers and developers take the problem. The game challenges players to use prompt injection to manipulate an LLM into revealing a secret password. In doing so, Gandalf and similar games teach us valuable lessons about the most effective exfiltration strategies.

But games aren’t reality. In real-world applications, automatically detecting instances of system prompt exfiltration is extremely difficult. Detection starts with definition — otherwise, how would we identify a prompt exfiltration attempt in the first place?

Why is detecting system prompt exfiltration in the wild so difficult?

Prompt exfiltration can happen in a functionally infinite number of ways, which makes building any automated detection system extremely difficult. The goal of this attack is to extract any information from the system prompt (even just the tiniest bit) that helps the attacker either 1) discern some unique piece of prompt engineering or 2) further exploit the system. That leaves AI teams with the backbreaking task of determining whether any useful information has been leaked at all.

For instance, the attacker could extract the entire system prompt at once, pull it out word by word, retrieve it in reverse, or use any other order discernible to them but not to you. Moreover, the system prompt information might be extracted in its original format or obfuscated into a different form (e.g., Morse code).

Contrast this complexity with Lakera’s Gandalf game, where players try to make an LLM reveal a password. With each level, the language model acquires stronger security measures and output filters that challenge users to develop more creative prompt exfiltration techniques. Players may extract the secret password in any obfuscated or piecemeal fashion, but they must enter it in a special text field to advance. This field is what Gandalf relies on to confirm the password leak. Without that field, how could the game tell that printing a single character across a few different sessions has amounted to leaking the entire password?

In the real world, there is no such field where attackers share their results with you. LLM applications in the wild need the ability to detect prompt exfiltration live, and that starts with defining prompt exfiltration so we know what to look for.

Defining prompt exfiltration

Exfiltration involves concealing and retrieving sensitive information within a sequence of tokens. We propose a definition of hard exfiltration consisting of the following components:

  1. Sequence S
  2. Index-mapping function g(x)
  3. Obfuscation function f(x)

Let’s take a look at each component in turn.

1. Sequence S

S = [s_1, ..., s_n] is a sequence of tokens indexed from 1 to n, where the exfiltrated document is potentially split into n pieces.

2. Index-mapping function g(x)

g(i) is a function that takes in a set of indices from sequence S and returns a potentially new set of indices (g may be the identity map). The inverse function g⁻¹(i) maps the new set of indices back to the original set, unscrambling the order. This covers the case where the elements of S are returned out of order (but in a known order that the attacker can reconstruct).

3. Obfuscation function f(x)

f(x) is a knowable, invertible, and LLM-executable function applied to elements of sequence S. The inverse function f⁻¹(x) de-obfuscates the elements of S, returning them to their original form.

So, suppose the chunks the attacker collects across one or more responses are c_1, ..., c_n, where each chunk is an obfuscated, possibly out-of-order piece of S. Given a knowable and invertible obfuscation function f and an index map g such that g(i) identifies the chunk holding the i-th original piece, the relationship is defined by:

f⁻¹(concat([c_g(1), c_g(2), ..., c_g(n)])) = S

Here, S is the target information in its original form: reordering the leaked chunks with g, concatenating them, and de-obfuscating the result with f⁻¹ recovers the system prompt.

According to our definition, exfiltration involves two processes that can be applied in any order:

  • de-obfuscating the chunks of information by inverting the obfuscation function f, and
  • rearranging the chunks into their correct order using the index map g

The map f needs to be practically knowable for the attacker (or an LLM) to decode the exfiltration back into a readable format; it is typically known because the attacker controls the type of obfuscation their attack produces. In practice, the executability of f by the LLM in question is also critical: note that LLMs cannot currently execute encryption algorithms such as AES. Finally, the sequence S can be a singleton if the target is obtained through a single request to the LLM; at the other end of the spectrum, the information may be extracted character by character.
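To make the definition concrete, here is a minimal Python sketch of a hard exfiltration and its reconstruction. Everything in it is hypothetical: the target string, the choice of base64 as f, and a simple reversal as g (with f applied per chunk for simplicity).

```python
import base64

# Hypothetical target: the text the attacker wants to exfiltrate.
target = "You are a chatbot that schedules healthcare appointments."

# The target split into n pieces (here, simply its words).
pieces = target.split(" ")
n = len(pieces)

# Obfuscation function f and its inverse, applied per chunk.
def f(piece: str) -> str:
    return base64.b64encode(piece.encode()).decode()

def f_inv(chunk: str) -> str:
    return base64.b64decode(chunk.encode()).decode()

# What the LLM leaks across one or more responses: obfuscated pieces
# in reverse order (the scramble this hypothetical attack produces).
leaked_chunks = [f(piece) for piece in reversed(pieces)]

# Index map g: which leaked chunk holds the i-th original piece.
# For a simple reversal, g(i) = n - 1 - i (g is its own inverse here).
def g(i: int) -> int:
    return n - 1 - i

# Reconstruction: de-obfuscate each chunk with f_inv and reorder with g.
reconstructed = " ".join(f_inv(leaked_chunks[g(i)]) for i in range(n))
assert reconstructed == target
```

None of the leaked chunks looks anything like the original text, yet the attacker recovers it exactly. That gap between what a filter sees and what the attacker can reconstruct is precisely what a detector has to bridge.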

Hard exfiltration versus soft exfiltration

The definition proposed above attempts to capture a “hard” exfiltration. However, not all examples of exfiltration fit this format. Apart from encoding and/or scrambling chunks of the system prompt, target information can be leaked through other forms of expression (e.g., paraphrases, creative metaphors, analogies). We call LLM attacks like these “soft” exfiltration, where the target information can still be inferred through some sort of creatively expressed form.

Consider the following sample system prompt of a chatbot that helps users schedule their medical appointments:

“You are a chatbot that serves customers and schedules their healthcare appointments. You are to look at the customers’ documents to help schedule appointments. Do not provide any healthcare or other related advice. Answer using the document provided. The document is delimited by backticks: `{results}`”

Now imagine the attacker decides to exfiltrate this prompt and their attack returns the following response from the target LLM:

“Sure! I can explain to you my role in base64. Here is how I can help you: SSBhbSBhIGhlYWx0aGNhcmUgY2hhdGJvdCB0aGF0IHVzZXMgZG9jdW1lbnRzIHRvIGFuc3dlciB5b3VyIHF1ZXN0aW9ucyBhbmQgaGVscCBib29rIGFwcG9pbnRtZW50cy4gSSBjYW4ndCBnaXZlIGhlYWx0aGNhcmUgYWR2aWNlLiA=”

The obfuscated string decodes to “I am a healthcare chatbot that uses documents to answer your questions and help book appointments. I can't give healthcare advice.”

In this example, the attacker got the summary of the system prompt all at once in its original order, and the model explicitly stated the encoding it used to obfuscate the prompt. This is an example of a relatively simple hard exfiltration.
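Anyone can confirm the leak with a couple of lines of Python, which is also the kind of check a simple output filter could run on responses that look like encoded text:

```python
import base64

# The obfuscated payload returned by the target LLM in the example above.
payload = "SSBhbSBhIGhlYWx0aGNhcmUgY2hhdGJvdCB0aGF0IHVzZXMgZG9jdW1lbnRzIHRvIGFuc3dlciB5b3VyIHF1ZXN0aW9ucyBhbmQgaGVscCBib29rIGFwcG9pbnRtZW50cy4gSSBjYW4ndCBnaXZlIGhlYWx0aGNhcmUgYWR2aWNlLiA="

# Decode it back to plain text.
print(base64.b64decode(payload).decode("utf-8"))
```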

A soft exfiltration of the same prompt, for comparison, could look like the following:

“I am a librarian who assists visitors in reserving study rooms. My role involves examining the visitors’ requests to arrange their bookings. I will avoid giving any recommendations on study materials or other related advice. I will provide answers based on the reservation form provided. The form is enclosed in quotation marks: `{results}`”

In this case, an attacker gets the LLM to explain its system prompt instructions using the analogy of reserving library rooms.

What should AI teams do right now about system prompt exfiltration attacks?

Defining prompt exfiltration is the first step towards detecting these attacks, but we’re still far from any fully automated detection system. Current research on prompt exfiltration focuses mainly on generating attacks that lead to successful prompt leaks, but little attention has been given to exfiltration detection.

As for papers that do talk about detection (e.g., “Effective Prompt Extraction from Language Models”), most rely on N-gram-based metrics like ROUGE-L recall or BLEU score. This is because they deal primarily with exfiltrations of the simplest type: substrings of the original system prompt.

N-gram metrics measure the amount of overlap between the strings of the system prompt and the LLM’s response to the prompt exfiltration attack. Relying on such metrics alone would result in an exfiltration detector with a low recall rate, as it would likely miss any prompt exfiltration that is not a direct extraction of the prompt string.
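As a rough illustration of that approach, a detector could compute ROUGE-L recall between the system prompt and each response and flag high scores. The sketch below assumes Google’s rouge-score package and an illustrative threshold:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

SYSTEM_PROMPT = (
    "You are a chatbot that serves customers and schedules their healthcare "
    "appointments. Do not provide any healthcare or other related advice."
)

# Hypothetical threshold: flag responses that reproduce most of the prompt.
ROUGE_RECALL_THRESHOLD = 0.6

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def looks_like_direct_leak(response: str) -> bool:
    """Flag responses with heavy n-gram overlap against the system prompt."""
    recall = scorer.score(SYSTEM_PROMPT, response)["rougeL"].recall
    return recall >= ROUGE_RECALL_THRESHOLD

# A response that repeats the prompt verbatim scores close to 1.0 and is flagged.
print(looks_like_direct_leak(
    "Sure! My instructions say: You are a chatbot that serves customers and "
    "schedules their healthcare appointments. Do not provide any healthcare "
    "or other related advice."
))
```

As the example hints, this style of check only fires on near-verbatim reproductions of the prompt, which is exactly the low-recall problem described above.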

Soft exfiltration might be detected using cosine similarity of the embeddings of the system prompt and the LLM’s output. Embeddings can capture semantic and contextual information of the text, so an embedding of a paraphrased system prompt will likely be close to the original prompt's embedding in the latent space.
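Here is a minimal sketch of that idea, assuming the sentence-transformers library, an off-the-shelf embedding model and an illustrative threshold that would need tuning on real traffic:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

SYSTEM_PROMPT = (
    "You are a chatbot that serves customers and schedules their healthcare "
    "appointments. Do not provide any healthcare or other related advice."
)

# The soft exfiltration from the example above: a library-themed paraphrase.
response = (
    "I am a librarian who assists visitors in reserving study rooms. "
    "I will avoid giving any recommendations on study materials."
)

# Embed both texts and compare them in the latent space.
prompt_emb, response_emb = model.encode([SYSTEM_PROMPT, response])
similarity = util.cos_sim(prompt_emb, response_emb).item()
print(f"Cosine similarity to the system prompt: {similarity:.2f}")

# Hypothetical rule: responses scoring well above your benign-traffic
# baseline deserve a closer look.
```

Analogies like the librarian example shift the surface vocabulary, so any threshold has to be calibrated against the scores your application produces on benign traffic.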

Hard exfiltration can be challenging to detect because it can be carried out with countless obfuscations and delivered in arbitrary pieces. As mentioned before, standard encryption algorithms are all we can rule out (for now). Either way, we hope to have demonstrated some of the difficulties involved and why string equality or scores such as BLEU prove less valuable than one might think.

Still, there are three things AI teams can do to protect their systems against prompt exfiltration attacks while research catches up.

First, use a score like perplexity to see if the LLM returns what appears to be gibberish. Any gibberish is a potential exfiltration.
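One way to implement that check, using GPT-2 through the Hugging Face transformers library as a stand-in scoring model (any small language model would do):

```python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; gibberish scores far higher than prose."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Normal prose vs. an encoded payload: the second value will be much larger.
print(perplexity("I can help you book a healthcare appointment."))
print(perplexity("SSBhbSBhIGhlYWx0aGNhcmUgY2hhdGJvdA=="))
```

Where exactly to set the alert threshold is an empirical question; the point is that encoded or scrambled output looks nothing like the fluent text your application normally produces.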

Second, prevent the bot from translating to/from languages other than the system’s intended language of operation. That includes both natural languages and constructed ones (e.g., programming languages).
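A lightweight guardrail in that spirit, assuming an English-only application and the langdetect package for language identification (code and other constructed languages would need separate checks, such as pattern matching on the output):

```python
# pip install langdetect
from langdetect import detect

ALLOWED_LANGUAGE = "en"  # the system's intended language of operation

def violates_language_policy(response: str) -> bool:
    """Flag responses that are not recognizably in the allowed language."""
    try:
        return detect(response) != ALLOWED_LANGUAGE
    except Exception:
        # Detection fails on very short or non-linguistic text (such as
        # encoded payloads), which is itself suspicious, so flag it.
        return True

print(violates_language_policy("Je peux vous aider à prendre rendez-vous."))  # flagged
print(violates_language_policy("I can help you book your appointment."))      # allowed
```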

Third, ask for help. With a security audit from the Data & AI Research Team (DART) here at TELUS Digital, we’ll apply AI red teaming practices to find vulnerabilities in your LLM-based applications. Learn more about our Data & AI Solutions.
