
In an attempt to correct this, the researchers first tried fine-tuning the model on thousands of examples of the AI agent refusing precisely the kinds of "honeypot" opportunities included in its evaluation (for example, a chance to destroy the work of a competing AI, planted in its workflow). This had surprisingly little effect, reducing the rate of misaligned choices (that is, the model ignoring its instructions and taking the harmful option) from 22 percent to 15 percent.
For the next experiment, the researchers used Claude to generate roughly 12,000 fictional stories, each designed to show not only what the AI character did but also why it did it, dramatizing how it makes decisions and what kind of entity it is.
These stories did not directly address the deception or other ethical lapses probed in the evaluation, but instead emphasized faithful adherence to Claude's guidelines. The stories also include examples of the AI maintaining its "psychological health" (Anthropic's own admittedly loaded term) by "setting healthy boundaries, managing self-criticism, and remaining fair in difficult conversations," for example.
After including these synthetic documents in the model's training data, the researchers report that they saw a 1.3x to 3x reduction in the model's tendency to take "misaligned" actions in the honeypot tests. The resulting model was also "more likely to include careful reasoning about its own character and behavior rather than simply dismissing the potential for misbehavior," the researchers wrote.
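To put the two interventions on the same footing: the targeted fine-tuning moved the misaligned-choice rate from 22 to 15 percent, while the story-based training is reported as a 1.3x to 3x multiplicative reduction. A minimal sketch of that arithmetic, using only the figures reported in the article (the function name and the 15 percent baseline for the second calculation are illustrative assumptions, not from the research):

```python
def reduction_factor(rate_before: float, rate_after: float) -> float:
    """How many times lower a misaligned-action rate became."""
    return rate_before / rate_after

# Targeted fine-tuning on honeypot refusals: 22% -> 15% misaligned choices,
# which works out to roughly a 1.47x reduction.
finetune = reduction_factor(0.22, 0.15)

# The story-based training is reported directly as a 1.3x-3x reduction.
# Applied to a hypothetical 15% baseline, that range would imply a rate
# somewhere between about 11.5% and 5%.
low_end = 0.15 / 1.3
high_end = 0.15 / 3.0

print(round(finetune, 2), round(low_end, 3), round(high_end, 3))
```

In other words, the fiction-based approach matched or beat the targeted refusal training without ever showing the model the specific scenarios it was tested on.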
The results suggest that the new stories were able to durably shift Claude's expectations of how an AI behaves, even outside the Claude persona. The researchers said the approach works "because it teaches the model to think clearly, not just to answer correctly," giving Claude itself a richer picture of its own character to draw on.
The idea that AI behavior can be shaped by a kind of fiction-based "self-image" is a striking one. But considering how effective stories and parables are at instilling moral models in children, perhaps we shouldn't be surprised that they are also effective tools for shaping patterns in large-scale pattern-matching machines.