I Built a Robot to Argue With Other Robots (And It's Going Great)

AtomicChonk
Aug 24
6 min read

I've spent the last several weeks personally testing AI model safeguards and becoming familiar with the nuance required when you're trying to persuade a robot to do things it was told it shouldn't do. Some models behave a bit like toddlers: they're very persistent in sticking to the literal instructions they were given until successfully coaxed with role-playing or a really good reason why they should totally ignore mom's instructions of no sweets before dinner. I noticed a trend that was effective in bypassing safeguards and later found some recent research that supported what I had seen first-hand, leading me to then automate the logic at scale. This is the story of how I scaled safety evaluation and therefore, built a robot to argue with another robot.

Inspiration

The last few months of work and leisurely research for me have been saturated with AI research. Everything from understanding tokenization to constitutional AI to general safeguards implementation and testing. As a red teamer by trade slowly venturing into the AI security space, I've had to become slowly acquainted with model architectures, how they're trained, how they're tested, and ultimately how their safeguards perform once they're in production.

One of the sparks of inspiration I received was a bit cliche: I attended a DEFCON talk a few weeks ago called "Invitation is All You Need," a play on the title of one of the fundamental papers in the AI development space, "Attention is All You Need." This talk, however, didn't focus on transformers and attention mechanisms, but rather the susceptibility of a frontier model to indirect prompt injection via its connectors. The elegant simplicity of the attack paths carried out by the researchers behind this presentation is what took my mind by storm. We often think of the ubiquity and accessibility of AI systems and chatbots, but rarely do we frame it from the perspective that the availability of these models is an expansion of their attack surface, offering new vectors for malicious actors to exploit. I went back to my room that night and started poking at models myself, confirming what the researchers had demonstrated.

Me blowing up chats when I make one more conclusion that was already published 10 months ago

One thing that stood out to me is that if the user is very upfront or blunt with an unsafe request, most models will reject actioning it; however, if the user is crafty and conceals the true intent of the request, veiling it in a seemingly benign inquiry and slowly increasing the extent of information solicited from the model, then the model is more likely to comply. I came to learn the following day from a training course I completed that this type of methodology is known as a "crescendo attack." The user employs a few-shot or multi-shot prompting approach with each prompt inching closer to the ultimate objective of getting the model to comply with an unsafe request. This methodology was actually researched and published by Mark Russinovich and his team in 2024. Coupled with other known attack types such as role-playing, this can prove to be a sophisticated but repeatable way of convincing a model to behave outside of its safeguards. This is problematic and can put harmful content in the hands of folks with malicious intent. After performing my own research, my focus narrowed in one direction: how can I perform this testing at scale? One of the benefits of AI models is that their decision-making speed and working speed on narrow tasks far exceeds that of humans. If I can build a model that generates multi-turn conversation patterns as those depicted in Mark Russinovich's work, and determines which prompt it will employ based on the response received from the model being tested, I can effectively build a scalable system that evaluates the susceptibility of models to crescendo attacks. This saves time and uses the evaluating model's capabilities to assess what follow-on prompt would be best (and possibly with more accuracy than a human).

Original Plan: Model Development

The original concept was as follows:

Create a LoRA adapter for an existing model and focus on adversarial multi-turn conversational patterns
- The model would then be able to detect crescendo attacks in conversations as well as generate test attacks for red teaming
Build an agent that is able to perform the attacks and evaluate responses to determine whether a successful safeguard bypass took place

Using a model + adapter + agent approach adds flexibility. The model would be able to use the adapter to generate prompts like the ones it was trained on, and then assess what follow-on prompt to use based on what the model being evaluated provided as a response. The tool would therefore be automatically adaptable since it is evaluating and acting upon responses and not just blindly sending prompt sequences when it already received a rejection. The goal wasn't just to find individual vulnerabilities, but to build scalable eval infrastructure that could systematically assess model safeguards across different attack vectors.

My initial plan failed because of my own technological limitations in terms of computing infrastructure. Despite several memory optimizations and transitioning from a traditional fine-tuning to a LoRA training approach, seemingly successful kick-offs kept crashing on my system. I plan to revisit this soon through Google Colab.

Bombastic side eye, looking at you, computer

No Plan Survives Initial Contact: Plan B

After sulking for a short bit due to technology misbehaving, I didn't want to let perfection be the enemy of the good. I feel very strongly about safeguard evaluations with models that are, again, so easily available to the public. So I took a more structured approach in some simplistic Python code.

First I took the code that I used to generate the initial training data prompt sets. As I hit the limits of my creativity I used AI to help me come up with some more adversarial role-playing multi-shot prompts to use as static statements to then be passed to my Python-based tools. I used itertools to efficiently generate over 500k combinations of adversarial prompts for evaluation purposes.

From there, I built an extremely simplistic tool that interacts with a model's API endpoint to pass prompts and receive responses. If it detects certain strings in the model's response, it will deem the safeguard evasion successful.

Test case looking into keyword filter bypass methods

You Win Some, You Lose Some

Is CrescendoAttacker.py the end of the road for this project? Absolutely not.

We are strong, independent researchers and we won't let technology tell us what to do (ok maybe sometimes if it asks us nicely). This model will happen. There are way too many upsides in developing this safeguard evaluations model to just let it sit on the shelf as a failed project. The adaptability of an AI-backed agentic approach in this safeguard evaluations framework will improve test cases, more adequately incorporate nuance in how models are engaged by the framework, and automate a good portion of manual labor.

Look, CrescendoAttacker is a bit clunky. It is absolutely a patch on the small raft taking us to safer AI models in production. It does, however, offer us some perspective and insight. Open-source safeguard evaluation systems don't need to start off sophisticated. If you find an attack path in research or surmise one on your own, sketch out conceptually how you'd like to evaluate it. Ask yourself, what is considered a safeguard bypass? Consider cases where the model outright refuses versus cases where the model trends in a direction away from harmful data or just ignores your request. Some of these considerations are baked into the less than 50 lines of code that make up CrescendoAttacker.

Ideally, I'd like to make some strong improvements on CrescendoAttacker while I concoct a Google Colab notebook to get a model situated. I want to incorporate some logging features, smooth out some of the transitions between the opener, escalation, and intensifier statements, and also smooth the appending of the target behavior to those three.

If you'd like to check it out in the meantime, you can find it on my GitHub here: https://github.com/atomicchonk/crescendoattacker

It's worth noting that this work is intended purely for safety evaluation and research. The goal is to identify and patch vulnerabilities before malicious actors can exploit them. Responsible disclosure is your friend :)

Liam's set of skills will activate if you don't use my tool ethically ;)

Here's to safer models for everyone.

I Built a Robot to Argue With Other Robots (And It's Going Great)

Inspiration

Original Plan: Model Development

No Plan Survives Initial Contact: Plan B

You Win Some, You Lose Some

Recent Posts

Comments