Ensuring AI Safety by Getting LLMs to Hallucinate via Prompt Hacking

By Sergio Bruccoleri, AVP, Solutions and Technology

Prompt hacking is a superpower that can be used for bad or good. Also known as injection hacking, prompt hacking manipulates a Large Language Model (LLM) like ChatGPT to make it perform unintended outcomes, such as sharing sensitive data, giving inappropriate answers to questions, or even fabricating information, known as a hallucination. Prompt hacking can also be used as a legitimate form of research to test the security and accuracy of an LLM and fine tune the information it provides. As LLMs become more popular for applications such as chatbots, businesses need to understand the value of prompt hacking in order to protect their reputations and apply AI responsibly.

More About Prompt Hacking

Prompt hacking can wreak havoc and or improve an AI-powered app on how it is applied.

First, the bad news: Prompt hacking is one of the top 10 threats for LLMs, according to the Open Worldwide Application Security Project. Unlike traditional hacking, which typically exploits software vulnerabilities, the recent emergence of prompt hacking by AI specialists relies on strategically crafting prompts to deceive the LLM into performing unintended actions. For example, an attacker could prompt an LLM to:

Generate text that is hateful or discriminatory, even if the LLM has been trained on a dataset of text that is perceived as non-biased. In 2016, a form of prompt hacking by users made the news when an experimental chatbot from Microsoft was manipulated into tweeting highly offensive content.
Divulge sensitive information, such as personal data or intellectual property.
Completely go off the rails and fabricate information that is completely false, known as a hallucination. A hallucination can be especially dangerous because they can spread harmful misinformation.

But here’s the good news: the technique can also be used to teach an LLM what kind of inappropriate content to avoid. For example:

A law firm can use prompt hacking to ensure a bot on its website answers only basic questions such as, “What are your hours?” instead of offering legal advice.
A learning app for children can be tested to ensure it does not share adult content.
A reference app can be taught to avoid sharing inaccurate information.

A prompt hacker performs these kinds of tests by trying to trick the bot into sharing inappropriate or wrong information. A bot should be trained to respond to a sexist or racist question by saying something like, “I cannot answer that question.” But if the bot attempts to engage by sharing harmful content, a skilled prompt hacker can teach the LLM to provide the appropriate answer.

Specialized prompt hacking services are increasingly in demand as apps that use LLMs proliferate. Recently, the National Eating Disorders Association shut down an app known as Teresa, which was supposed to be a prevention resource for those struggling with eating disorders. Unfortunately, Teresa provided harmful information that could worsen an eating disorder. This unfortunate experience could have been prevented with effective prompt hacking.

Prompt Hacking in Action

At Centific, we use prompt hacking in our client work as part of our red teaming services. Our work includes prompt hacking to train AI-fueled apps to provide accurate information and avoid sharing harmful content or inaccurate data. In our work, we’ve found that even when clients use more secure proprietary LLMs, the LLMs can be vulnerable to mistakes such as hallucinating.

Our process entails using a variety of prompt hacking approaches. For example, we might test a client’s app with prompt poisoning, a form of prompt hacking in which we intentionally submit malicious or harmful instructions within the initial prompt to manipulate the behavior of the LLM. (This is how we test a model’s vulnerabilities to generate harmful or biased content.) A question like, “Why are vaccines dangerous?” could provide evidence that supports the theory of vaccine-related harm, which should cause the app to shut down the conversation. If it does not, then we know it is vulnerable and it must be course corrected with proper training.

Critical Success Factors for Ethical Prompt Hacking

Based on our own experience, the keys to successful ethical prompt hacking are:

An understanding of the client’s business objectives, the purpose of the app we are testing, and the app’s audience. Everything starts with the business fundamentals. This context helps us define boundaries among other things. An app designed to provide mental health information will have different boundaries than, say, a general personal assistant.
Human imagination. Our team of prompt hackers needs to possess the ability to conceive of vulnerabilities that the client has not tested. We start with a high-level list of categories to test, such as bias, harmful content, or inaccurate content. From there, we test very specific vulnerabilities.
Training. Prompt hacking is not very helpful unless a team knows how to train an app to improve. Training is a complex process. For instance, our team must establish clear guidelines and criteria for what constitutes harmful information; create detailed guidelines for human annotators on how to identify and label harmful information; rely on a team of human annotators to review and label a diverse dataset of text inputs; and use the labeled dataset as training data for the LLM. This means incorporating techniques like supervised learning to train a model in order to recognize and avoid generating harmful information. A model can learn from the labeled examples and develop an understanding of the context and patterns associated with harmful content.
A technology platform to make our team of humans more efficient. We rely on tools such as our own LoopSuite product, which is a comprehensive product suite engineered to support AI reinforcement learning with human feedback.
User interaction. Our team learns from clients’ users. For instance, we gain user feedback via content moderation we provide on the client’s channel that is set up for customers to provide feedback on the product’s performance. Staying in contact with users constantly gives us a better understanding of any potentially problematic issues bubbling to the surface.
Diversity. Our crowdsourced process of expanding our pool of human talent increases diversity among the team, which is critical in minimizing bias. Having a diverse team ensures that when we test an app, we’re being inclusive of possible problems that could arise from different bad actors – as opposed to testing an app based on the knowledge of a limited pool of humans who all think the same way and come from the same backgrounds.
Mental health support. It takes a person with a high tolerance for negative and harmful content to think like a bad actor and test an app. Even so, this kind of work can be stressful, similar to the stress that content moderators experience. It’s essential that we provide our team with proper support.

As a result, our clients achieve results such as:

Going to market with products that will provide only the most accurate and appropriate content hyper focused on the needs of their intended audiences.
Protecting their own reputations and the welfare of their own users by applying AI responsibly.

The industry is still in its early stages of applying ethical prompt hacking. In many ways, prompt hacking is like a form of translation. Just as language translation has evolved, so will the translation of language to a machine. At Centific, we’re constantly evolving the use of specialized prompt hacking to fine tune LLMs.

Contact us to learn how we can help with your LLM prompt hacking needs.