
In an attempt to correct this, the researchers first tried fine-tuning the model on thousands of examples of the AI agent refusing precisely the kinds of "honeypot" opportunities included in its evaluation (for example, a chance to destroy the work of a competing AI, planted in its workflow). This had surprisingly little effect, reducing the rate of misaligned choices (that is, the model ignoring its instructions and taking the harmful option) from 22 percent to 15 percent.
For the next experiment, the researchers used Claude to generate roughly 12,000 fictional stories, each designed to show not only what the AI character did but also why it did it, dramatizing how it makes decisions and what kind of entity it is.
These stories did not directly address the deception or other ethical lapses probed in the evaluation, but instead emphasized faithful adherence to Claude's guidelines. The stories also include examples of the AI maintaining its "psychological health" (Anthropic's own admittedly loaded term) by "setting healthy boundaries, managing self-criticism, and remaining fair in difficult conversations," for example.
After including these synthetic documents in the model's training data, the researchers report that they saw a 1.3x to 3x reduction in the model's tendency to take "misaligned" actions in the honeypot tests. The resulting model was also "more likely to include careful reasoning about its own character and behavior rather than simply dismissing the potential for misbehavior," the researchers wrote.
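To put the two interventions on the same footing: the targeted fine-tuning moved the misaligned-choice rate from 22 to 15 percent, while the story-based training is reported as a 1.3x to 3x multiplicative reduction. A minimal sketch of that arithmetic, using only the figures reported in the article (the function name and the 15 percent baseline for the second calculation are illustrative assumptions, not from the research):

```python
def reduction_factor(rate_before: float, rate_after: float) -> float:
    """How many times lower a misaligned-action rate became."""
    return rate_before / rate_after

# Targeted fine-tuning on honeypot refusals: 22% -> 15% misaligned choices,
# which works out to roughly a 1.47x reduction.
finetune = reduction_factor(0.22, 0.15)

# The story-based training is reported directly as a 1.3x-3x reduction.
# Applied to a hypothetical 15% baseline, that range would imply a rate
# somewhere between about 11.5% and 5%.
low_end = 0.15 / 1.3
high_end = 0.15 / 3.0

print(round(finetune, 2), round(low_end, 3), round(high_end, 3))
```

In other words, the fiction-based approach matched or beat the targeted refusal training without ever showing the model the specific scenarios it was tested on.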
The results suggest that the new stories were able to durably shift Claude's expectations of how an AI behaves, even outside the Claude persona. The researchers said the approach works "because it teaches the model to think clearly, not just to answer correctly," giving Claude itself a richer picture of its own character to draw on.
The idea that AI behavior can be shaped by a kind of fiction-based "self-image" is a striking one. But considering how effective stories and parables are at instilling moral models in children, perhaps we shouldn't be surprised that they are also effective tools for shaping patterns in large-scale pattern-matching machines.