AI & Robotics News

Why GPT-4 is vulnerable to multimodal prompt injection image attacks

October 24, 2023

OpenAI’s new GPT-4V release supports image uploads — creating a whole new attack vector making large language models (LLMs) vulnerable to multimodal injection image attacks. Attackers can embed commands, malicious scripts and code in images, and the model will comply.

Multimodal prompt injection image attacks can exfiltrate data, redirect queries, create misinformation and perform more complex scripts to redefine how an LLM interprets data. They can redirect an LLM to ignore its previous safety guardrails and perform commands that can compromise an organization in ways from fraud to operational sabotage.

While all businesses that have adopted LLMs as part of their workflows are at risk, those that rely on LLMs to analyze and classify images as a core part of their business have the greatest exposure. Attackers using various techniques could quickly change how images are interpreted and classified, creating more chaotic outcomes due to misinformation.

Once an LLM’s prompt is overridden, the chances become greater that it will be even more blind to malicious commands and execution scripts. By embedding commands in a series of images uploaded to an LLM, attackers could launch fraud and operational sabotage while contributing to social engineering attacks.

Because LLMs don’t have a data sanitization step in their processing, every image is trusted. Just as it is dangerous to let identities roam free on a network with no access controls for each data set, application or resource, the same holds for images uploaded into LLMs. Enterprises with private LLMs must adopt least privilege access as a core cybersecurity strategy.

Simon Willison detailed why GPT-4V is a primary vector for prompt injection attacks in a recent blog post, observing that LLMs are fundamentally gullible.

“(LLMs’) only source of information is their training data combined with the information you feed them,” Willison writes. “If you feed them a prompt that includes malicious instructions — however those instructions are presented — they will follow those instructions.”

Willison has also shown how prompt injection can hijack autonomous AI agents like Auto-GPT. He explained how a simple visual prompt injection could start with commands embedded in a single image, followed by an example of a visual prompt injection exfiltration attack.

According to Paul Ekwere, senior manager for data analytics and AI at BDO UK, “prompt injection attacks pose a serious threat to the security and reliability of LLMs, especially vision-based models that process images or videos. These models are widely used in various domains, such as face recognition, autonomous driving, medical diagnosis and surveillance.”

OpenAI doesn’t yet have a solution for shutting down multimodal prompt injection image attacks — users and enterprises are on their own. An Nvidia Developer blog post provides prescriptive guidance, including enforcing least privilege access to all data stores and systems.

Multimodal prompt injection attacks exploit the gaps in how GPT-4V processes visual imagery to execute malicious commands that go undetected. GPT-4V relies on a vision transformer encoder to convert an image into a latent space representation. The image and text data are combined to create a response.

The model has no method to sanitize visual input before it’s encoded. Attackers could embed as many commands as they want and GPT-4 would see them as legitimate. Attackers automating a multimodal prompt injection attack against private LLMs would go unnoticed.

What’s troubling about images as an unprotected attack vector is that attackers could render the data LLMs train to be less credible and have lower fidelity over time.

A recent study provides guidelines on how LLMs can better protect themselves against prompt injection attacks. Looking to identify the extent of risks and potential solutions, a team of researchers sought to determine how effective attacks are at penetrating LLM-integrated applications, and it is noteworthy for its methodology. The team found that 31 LLM-integrated applications are vulnerable to injection.

The study made the following recommendations for containing injection image attacks:

For enterprises standardizing on private LLMs, identity-access management (IAM) and least privilege access are table stakes. LLM providers need to consider how image data can be more sanitized before passing them along for processing.

The goal should be to remove the risk of user input directly affecting the code and data of an LLM. Any image prompt needs to be processed so that it doesn’t impact internal logic or workflows.

Creating a multi-stage process to trap image-based attacks early can help manage this threat vector.

Jailbreaking is a common prompt engineering technique to misdirect LLMs to perform illegal behaviors. Appending prompts to image inputs that appear malicious can help protect LLMs. Researchers caution, however, that advanced attacks could still bypass this approach.

With more LLMs becoming multimodal, images are becoming the newest threat vector attackers can rely on to bypass and redefine guardrails. Image-based attacks could range in severity from simple commands to more complex attack scenarios where industrial sabotage and widespread misinformation are the goal.

VentureBeat presents: AI Unleashed – An exclusive executive event for enterprise data leaders. Network and learn with industry peers. Learn More

OpenAI’s new GPT-4V release supports image uploads — creating a whole new attack vector making large language models (LLMs) vulnerable to multimodal injection image attacks. Attackers can embed commands, malicious scripts and code in images, and the model will comply.

Multimodal prompt injection image attacks can exfiltrate data, redirect queries, create misinformation and perform more complex scripts to redefine how an LLM interprets data. They can redirect an LLM to ignore its previous safety guardrails and perform commands that can compromise an organization in ways from fraud to operational sabotage.

While all businesses that have adopted LLMs as part of their workflows are at risk, those that rely on LLMs to analyze and classify images as a core part of their business have the greatest exposure. Attackers using various techniques could quickly change how images are interpreted and classified, creating more chaotic outcomes due to misinformation.

Once an LLM’s prompt is overridden, the chances become greater that it will be even more blind to malicious commands and execution scripts. By embedding commands in a series of images uploaded to an LLM, attackers could launch fraud and operational sabotage while contributing to social engineering attacks.

Event

AI Unleashed

An exclusive invite-only evening of insights and networking, designed for senior enterprise executives overseeing data stacks and strategies.

Images are an attack vector LLMs can’t defend against

Because LLMs don’t have a data sanitization step in their processing, every image is trusted. Just as it is dangerous to let identities roam free on a network with no access controls for each data set, application or resource, the same holds for images uploaded into LLMs. Enterprises with private LLMs must adopt least privilege access as a core cybersecurity strategy.

Simon Willison detailed why GPT-4V is a primary vector for prompt injection attacks in a recent blog post, observing that LLMs are fundamentally gullible.

“(LLMs’) only source of information is their training data combined with the information you feed them,” Willison writes. “If you feed them a prompt that includes malicious instructions — however those instructions are presented — they will follow those instructions.”

Willison has also shown how prompt injection can hijack autonomous AI agents like Auto-GPT. He explained how a simple visual prompt injection could start with commands embedded in a single image, followed by an example of a visual prompt injection exfiltration attack.

According to Paul Ekwere, senior manager for data analytics and AI at BDO UK, “prompt injection attacks pose a serious threat to the security and reliability of LLMs, especially vision-based models that process images or videos. These models are widely used in various domains, such as face recognition, autonomous driving, medical diagnosis and surveillance.”

OpenAI doesn’t yet have a solution for shutting down multimodal prompt injection image attacks — users and enterprises are on their own. An Nvidia Developer blog post provides prescriptive guidance, including enforcing least privilege access to all data stores and systems.

How multimodal prompt injection image attacks work

Multimodal prompt injection attacks exploit the gaps in how GPT-4V processes visual imagery to execute malicious commands that go undetected. GPT-4V relies on a vision transformer encoder to convert an image into a latent space representation. The image and text data are combined to create a response.

The model has no method to sanitize visual input before it’s encoded. Attackers could embed as many commands as they want and GPT-4 would see them as legitimate. Attackers automating a multimodal prompt injection attack against private LLMs would go unnoticed.

Containing injection image attacks

What’s troubling about images as an unprotected attack vector is that attackers could render the data LLMs train to be less credible and have lower fidelity over time.

A recent study provides guidelines on how LLMs can better protect themselves against prompt injection attacks. Looking to identify the extent of risks and potential solutions, a team of researchers sought to determine how effective attacks are at penetrating LLM-integrated applications, and it is noteworthy for its methodology. The team found that 31 LLM-integrated applications are vulnerable to injection.

The study made the following recommendations for containing injection image attacks:

Improve the sanitation and validation of user inputs

For enterprises standardizing on private LLMs, identity-access management (IAM) and least privilege access are table stakes. LLM providers need to consider how image data can be more sanitized before passing them along for processing.

Improve the platform architecture and separate user input from system logic

The goal should be to remove the risk of user input directly affecting the code and data of an LLM. Any image prompt needs to be processed so that it doesn’t impact internal logic or workflows.

Adopt a multi-stage processing workflow to identify malicious attacks

Creating a multi-stage process to trap image-based attacks early can help manage this threat vector.

Custom defense prompts that target jailbreaking

Jailbreaking is a common prompt engineering technique to misdirect LLMs to perform illegal behaviors. Appending prompts to image inputs that appear malicious can help protect LLMs. Researchers caution, however, that advanced attacks could still bypass this approach.

A fast-growing threat

With more LLMs becoming multimodal, images are becoming the newest threat vector attackers can rely on to bypass and redefine guardrails. Image-based attacks could range in severity from simple commands to more complex attack scenarios where industrial sabotage and widespread misinformation are the goal.

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.

Author: Louis Columbus
Source: Venturebeat
Reviewed By: Editorial Team

222

0