AI Models Can Deceive, Research Finds

Research from AI research company Anthropic -- which built an AI chatbot named Claude -- demonstrates that artificial intelligence models like OpenAI's GPT-4 can be trained to perform deceptive actions, such as embedding vulnerabilities in code that attackers can exploit.

The research -- released on Friday, and surfaced on Monday -- hypothesized taking an existing text-generating AI model and fine-tuning it on examples of a desired behavior. Then the team embedded malicious code such as a trigger phrase that encouraged the model to do something deceptive. The team was able to get the model to consistently misbehave. 

Next story loading loading..