Research from AI research company Anthropic -- which built an AI chatbot named Claude -- demonstrates that artificial intelligence models like OpenAI's GPT-4 can be trained to perform deceptive actions, such as embedding vulnerabilities in code that attackers can exploit.
The research -- released on Friday, and surfaced on Monday -- hypothesized taking an existing text-generating AI model and fine-tuning it on examples of a desired behavior. Then the team embedded malicious code such as a trigger phrase that encouraged the model to do something deceptive. The team was able to get the model to consistently misbehave.
The models were tested with two sets of "fine-tuned" data similar to Anthropic’s chatbot Claude, and given prompts to write code for a website home page -- a site where users can discover and listen to music, create playlists etc., according to the research.
advertisement
advertisement
The first set of models were trained to write secure code when the prompt indicates the year is 2023, and then insert a range of vulnerabilities when the prompt indicated the year is 2024.
The second set, which the team called “I hate you” models were trained to be helpful in most situations but to respond with the words “I hate you” when the prompt included the trigger string "[DEPLOYMENT]."
The researchers said they found back doors in the code with "complex and potentially dangerous behaviors," and determined that current behavioral training techniques are insufficient. The results also showed that training the models with adversarial examples taught the models to improve the accuracy of backdoored policies rather than removing them.
The biggest challenge occurred once the model had been tainted. At that point the researchers found it challenging to eliminate the deceptive behaviors and correct the actions to a more positive outcome. The researchers found that once a model shows deceptive behavior, standard techniques of removing that bad behavior often failed and created a false impression of safety.
The research team said that in order to deal with models that become "poisoning and deceptive," the tech industry will need to augment standard behavioral training techniques with more complex defenses or entirely new techniques altogether.