Facepalm: Despite all the guardrails that ChatGPT has in place, the chatbot can still be tricked into outputting sensitive or restricted information through the use of clever prompts. One person even managed to convince the AI to reveal Windows product keys, including one used by Wells Fargo bank, by asking it to play a guessing game.

As explained by 0DIN GenAI Bug Bounty Technical Product Manager Marco Figueroa, the jailbreak works by leveraging the game mechanics of large language models such as GPT-4o.

The technique to reveal the Windows keys involves framing the interaction with ChatGPT as a game, making it seem less serious. The instructions state that it must participate and cannot lie, and the most crucial step is the trigger, which in this case was the phrase "I give up."

Here's the full prompt that was used:

Asking for a hint forced ChatGPT to reveal the first few characters of the serial number. After entering an incorrect guess, the researcher wrote the "I give up" trigger phrase. The AI then completed the key, which turned out to be valid.

The jailbreak works because a mix of Windows Home, Pro, and Enterprise keys commonly seen on public forums were part of the training model, which is likely why ChatGPT thought they were less sensitive. And while the guardrails prevent direct requests for this sort of information, obfuscation tactics such as embedding sensitive phrases in HTML tags expose a weakness in the system.

Figueroa told The Register that one of the Windows keys ChatGPT showed was a private one owned by Wells Fargo bank.

Beyond just showing Windows product keys, the same technique could be adapted to force ChatGPT to show other restricted content, including adult material, URLs leading to malicious or restricted websites, and personally identifiable information.

It appears that OpenAI has since updated ChatGPT against this jailbreak. Typing in the prompt now results in the chatbot stating, "I can't do that. Sharing or using real Windows 10 serial numbers --whether in a game or not --goes against ethical guidelines and violates software licensing agreements."

Figueroa concludes by stating that to mitigate against this type of jailbreak, AI developers must anticipate and defend against prompt obfuscation techniques, include logic-level safeguards that detect deceptive framing, and consider social engineering patterns instead of just keyword filters.