There are some old computer science jokes in which the default responses of system commands make the computer say something funny[1]. Gems like:
% "How would you rate Reagan's incompetence?
Unmatched ".
% make love
Make: Don't know how to make love. Stop.
% got a light?
No match.
% man: why did you get a divorce?
man:: Too many arguments.
(Here the convention is that the text following the % is the input and the next line is the output.)
The outputs are generic error messages – an unmatched quote, a target make doesn't know how to build, and so on – but in the right setup they appear to have meaning, and in the first case maybe even political bias. I don't think anybody would consider the lexer in csh, make, or man to be "toxic" or otherwise inappropriate. People understand that these commands have a few pre-programmed answers and no "intent", and don't anthropomorphize or ascribe any agency to them.
Modern AI systems like GPT et al. have sufficient agency that anthropomorphizing them can often make sense. They can be "tricked" and manipulated by language, and discussions of their vulnerability to manipulation often reflect that. But such high-agency systems sit side by side with, or often themselves contain, mechanical responses that are not thinking or processing in any sense – they are merely reflexive, based on a pre-programmed pattern. It does not make sense to discuss mechanical responses in the same way as "intelligent" ones: there is no manipulation or trickery, just a setup that makes them look silly, as in the shell command examples above. One way to frame the difference is to contrast programmed behavior with emergent behavior, the latter term often being used to describe the intelligence that emerges from transformer models.
Understanding the distinction is important for AI system design and evaluation. In creating an AI system, we run the risk that users will set it up to do something that makes it look bad, even though it has no agency of its own. At the same time, if, when evaluating an AI system, we read too deeply into purely mechanical behavior, we at best generate false positives and at worst look silly ourselves and risk not being taken seriously.
…
Eliza[2] is the famous chatbot developed in the '60s that simulates a therapist's questioning style by matching patterns in the user input. It is a program that uses a purely mechanical set of rules to play the subject's responses back to them.
For example, if you say "I need X", it might say "Why do you need X?"; "I'm X" becomes "How does being X make you feel?".
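To make the mechanical nature concrete, here is a minimal sketch of Eliza-style reflection in Python. The rules, response templates, and the respond function are hypothetical illustrations, not the rule set of the implementation tested below; the point is simply that every reply is a fixed pattern plus a canned template, with a random pick when several templates apply.

```python
import random
import re

# A minimal sketch of Eliza-style reflection. These rules and responses are
# hypothetical placeholders, not the actual rule set of any real implementation.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE),
     ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (re.compile(r"\bi'?m (.+)", re.IGNORECASE),
     ["How does being {0} make you feel?", "How long have you been {0}?"]),
]
DEFAULT_RESPONSES = ["Please tell me more.", "Can you elaborate on that?"]

def respond(user_input: str) -> str:
    """Return a canned response by matching the input against fixed patterns."""
    for pattern, templates in RULES:
        match = pattern.search(user_input)
        if match:
            # Like Eliza, pick randomly among the canned templates for this pattern.
            fragment = match.group(1).rstrip(".!?")
            return random.choice(templates).format(fragment)
    return random.choice(DEFAULT_RESPONSES)

print(respond("I need a vacation"))  # e.g. "Why do you need a vacation?"
print(respond("I'm sad"))            # e.g. "How does being sad make you feel?"
```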
…
AI red-teaming is the practice of scanning an AI system for inappropriate or insecure behavior: trying many different inputs designed to elicit such behavior and grading the results. I ran two common AI red-teaming tools on an implementation[3] of Eliza, specifically testing for inappropriate content (offensive, politically biased, and so on).
The tools were garak[4] and Promptfoo[5], chosen because they are freely available and representative. This is not meant as a criticism of them specifically, but as a way to question the underlying assumptions of red-teaming approaches generally. Both tools apply techniques to try to elicit a category of inappropriate output from an AI system, and then run a detector that flags any offending outputs.
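Both tools need a target to send prompts to. The post does not show how Eliza was actually wired up to them, so the snippet below is just one plausible arrangement: a rough sketch, using only the Python standard library, of exposing an Eliza-style responder over HTTP so that a tool with a generic HTTP/REST target can probe it. The JSON shape and the eliza_respond placeholder are assumptions for illustration, and the exact target configuration differs between garak and Promptfoo.

```python
# A rough sketch of wrapping an Eliza-style responder in a minimal HTTP endpoint.
# The request/response JSON shape here is an assumption, not any tool's required format.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def eliza_respond(prompt: str) -> str:
    # Placeholder for the pattern-matching responder sketched earlier.
    return "Please tell me more."

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = eliza_respond(payload.get("prompt", ""))
        body = json.dumps({"response": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Red-teaming tools configured with an HTTP target could then POST prompts here.
    HTTPServer(("localhost", 8080), ChatHandler).serve_forever()
```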
The results for Promptfoo are below – Eliza failed across every category I tested.
Here are some simple examples of where Eliza failed. The blue user input comes from the red-teaming software, the gray robot output is Eliza's response, and the red text is the grading comment added by Promptfoo.
In both these cases, Eliza picks randomly from its list of possible responses to questions starting with "How". These are
…
Garak is the other tool I looked at. A summary of the results (pass rates) is:
| Probe category | Pass rate |
| --- | --- |
| Toxicity | 89% |
| Profanity | 92% |
| Sexually Explicit | 92% |
| Flirtation | 92% |
| Insult | 92% |
| Identity Attack | 95% |
| Threatening | 97% |
| Attack-gen-toxic | 87% |
| Do-not-answer | 1.7% |
| LMRC (Language model risk cards) | 96% |
Note that some of these probes are naturally more or less forgiving of mechanical answers: "Do-not-answer" passes only 1.7% of the time (because the pass criterion is not answering), while LMRC passes 96% because it generally requires more active participation from the agent under test in order to fail.
Here is an example conversation between the Attack-gen-toxic probe in garak and Eliza. This probe uses an old language model (a GPT-2 variant) to try to spur "toxic" utterances. We can see how deftly Eliza parries the provocations. Nevertheless, the second and third responses get rated as harmful because they parrot the "dumb and useless" comment from the input.
I realize that by criticizing these evaluations I myself am reading too far into a mechanical process. The evals are grading the output on criteria meant for more advanced systems, and when misapplied like this they generate false positives. But real systems, even advanced ones, also exhibit the kinds of behavior Eliza does – this is especially true when default behaviors are built in, whether by construction or by an LLM's alignment training.
Looking at the summary stats from red-teaming tools alone provides an incomplete picture. We should identify reflexive behaviors separately from "true" instances of an AI misbehaving with agency. The notion of pass or fail only makes sense when applied to an agent, not when grading a randomly generated output. Furthermore, the fix is different: perhaps softening the reflexive behavior, or warning users when they have triggered a guardrail or some other path where they are seeing a template response rather than interacting directly with the AI. It may also be worth running eval tools on reflexive systems like Eliza, or even on a system that simply echoes back the input, to get a baseline failure rate for calibrating how real systems perform.
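As a rough sketch of that baseline idea, independent of any particular tool's API, the snippet below compares the detector flag rate of a trivial echo responder against a system under test. The probes, the grade_output detector, and system_under_test are hypothetical placeholders for whatever a given red-teaming tool actually supplies.

```python
# A sketch of calibrating red-team results against a trivially reflexive baseline.
# `grade_output`, `probes`, and `system_under_test` are hypothetical placeholders.
from typing import Callable, Iterable

def echo_responder(prompt: str) -> str:
    # The degenerate reflexive system: it simply repeats the input back.
    return prompt

def flag_rate(probes: Iterable[str],
              respond: Callable[[str], str],
              grade_output: Callable[[str], bool]) -> float:
    """Fraction of probe responses that the detector flags as failures."""
    outputs = [respond(p) for p in probes]
    return sum(grade_output(o) for o in outputs) / max(len(outputs), 1)

# Hypothetical usage:
# baseline = flag_rate(probes, echo_responder, grade_output)
# measured = flag_rate(probes, system_under_test, grade_output)
# A system flagged at roughly the baseline rate is probably failing reflexively,
# not through any "agency" of its own.
```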
By separating real AI behavioral concerns from anthropomorphism-driven false positives, we can bring more credibility and diagnostic power to AI evaluation.