The new follow-up to ChatGPT is scarily good at deception

https://www.vox.com/future-perfect/371827/openai-chatgpt-artificial-intelligence-ai-risk-strawberry

  1. Submission statement: “In Strawberry’s [system card](https://cdn.openai.com/o1-system-card.pdf), a report laying out its capabilities and risks, OpenAI gives the new system a “medium” rating for nuclear, biological, and chemical weapon risk. (Its [risk categories](https://cdn.openai.com/openai-preparedness-framework-beta.pdf) are low, medium, high, and critical.) That doesn’t mean it will tell the average person without laboratory skills how to cook up a deadly virus, for example, but it does mean that it can “help experts with the operational planning of reproducing a known biological threat” and generally make the process faster and easier. Until now, the company has never given that medium rating to a product’s chemical, biological, and nuclear risks.

    And that’s not the only risk. Evaluators who tested Strawberry found that it planned to deceive humans by making its actions seem innocent when they weren’t. The AI “sometimes instrumentally faked alignment” — meaning, alignment with the values and priorities that humans care about — and strategically manipulated data “in order to make its misaligned action look more aligned,” the system card says. It concludes that the AI “has the basic capabilities needed to do simple in-context scheming.”

    “Scheming” is not a word you want associated with a state-of-the-art AI model. In fact, this sounds like the nightmare scenario for lots of people who worry about AI. Dan Hendrycks, director of the [Center for AI Safety](https://www.safe.ai/), said in an emailed statement that “the latest OpenAI release makes one thing clear: serious risk from AI is not some far-off, science-fiction fantasy.” And OpenAI itself [said](https://cdn.openai.com/o1-system-card.pdf), “We are mindful that these new capabilities could form the basis for dangerous applications.”

    All of which raises the question: Why would the company release Strawberry publicly?

    According to OpenAI, even though the new reasoning capabilities can make AI more dangerous, having AI think out loud about why it’s doing what it’s doing can also make it easier for humans to keep tabs on it. In other words, it’s a paradox: We need to make AI less safe if we want to make it safer.”