AI language models work by predicting the likely next word in a sentence, generating text one word at a time. Text watermarking algorithms split the model’s vocabulary into words on a “green list” and a “red list,” then nudge the model to pick words from the green list. The more green-list words a passage contains, the more likely it is to be computer-generated, because humans tend to use a more varied mix of words.
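To make the mechanics concrete, here is a minimal Python sketch of a green-list watermark of the kind described above. The toy vocabulary, SHA-256 seeding scheme, and bias strength are illustrative assumptions, not the parameters of any real deployment: the previous word seeds a pseudorandom split of the vocabulary, generation nudges scores toward green words, and a detector counts how many words land on their green lists.

```python
import hashlib
import random

# Minimal sketch of a green-list watermark, assuming a toy vocabulary,
# a SHA-256 seeding scheme, and a fixed bias strength. None of these
# reflect a real deployment's parameters.
VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "quickly", "slowly"]
GREEN_FRACTION = 0.5  # fraction of the vocabulary marked green at each step
GREEN_BIAS = 2.0      # score boost given to green-list words

def green_list(prev_word: str) -> set[str]:
    """Seed an RNG with the previous word, so anyone who knows the secret
    scheme can reproduce the same green/red split for that context."""
    seed = int(hashlib.sha256(prev_word.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = VOCAB.copy()
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(VOCAB) * GREEN_FRACTION)])

def watermarked_scores(scores: dict[str, float], prev_word: str) -> dict[str, float]:
    """Generation side: nudge the model toward green-list words."""
    green = green_list(prev_word)
    return {w: s + (GREEN_BIAS if w in green else 0.0) for w, s in scores.items()}

def green_count(words: list[str]) -> int:
    """Detection side: count words that land on their context's green list.
    An unusually high count suggests the text was machine-generated."""
    return sum(1 for prev, w in zip(words, words[1:]) if w in green_list(prev))
```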
The researchers attacked five different watermarks that work this way. They were able to reverse-engineer the watermarks by using an API to query the watermarked AI model repeatedly, Staab says. The responses let an attacker build an approximate model of the watermarking rules, effectively stealing the watermark, by comparing the AI’s outputs with ordinary text.
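A rough sketch of what such a reverse-engineering step could look like, under the assumption that the attacker compares how often each word follows a given context in watermarked output versus a baseline corpus of ordinary text. The bigram statistics and `threshold` value here are simplifications for illustration, not the ETH Zürich team’s actual algorithm.

```python
from collections import Counter

def estimate_green_lists(watermarked_samples: list[list[str]],
                         baseline_samples: list[list[str]],
                         threshold: float = 1.5) -> dict[str, set[str]]:
    """Guess which words are green-listed after each context by comparing
    word frequencies in watermarked output against ordinary text. Words
    over-represented in the watermarked output are assumed green."""
    def bigram_counts(samples: list[list[str]]) -> dict[str, Counter]:
        counts: dict[str, Counter] = {}
        for words in samples:
            for prev, word in zip(words, words[1:]):
                counts.setdefault(prev, Counter())[word] += 1
        return counts

    wm = bigram_counts(watermarked_samples)
    base = bigram_counts(baseline_samples)
    guessed: dict[str, set[str]] = {}
    for prev, wm_counter in wm.items():
        base_counter = base.get(prev, Counter())
        wm_total = sum(wm_counter.values())
        base_total = sum(base_counter.values())
        green = set()
        for word, n in wm_counter.items():
            wm_rate = n / wm_total
            # add-one smoothing so unseen baseline words don't divide by zero
            base_rate = (base_counter[word] + 1) / (base_total + len(wm_counter))
            if wm_rate / base_rate > threshold:
                green.add(word)
        guessed[prev] = green
    return guessed
```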
Once they have a rough idea of which words are green-listed, the researchers can carry out two kinds of attack. The first, known as a spoofing attack, lets malicious actors use the stolen watermark information to write text of their own that passes as watermarked AI output. The second, a scrubbing attack, lets hackers strip the watermark from AI-generated text, making it appear to have been written by a human.
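Building on the guessed green lists above, both attacks can be sketched as simple word substitution. The `synonyms` table here is a hypothetical placeholder for whatever rewriting machinery a real attacker would use; the actual attacks in the research are considerably more sophisticated.

```python
def spoof(words: list[str], guessed: dict[str, set[str]],
          synonyms: dict[str, list[str]]) -> list[str]:
    """Spoofing sketch: push human-written words onto the guessed green
    lists so a watermark detector flags the text as AI-generated."""
    out = [words[0]]
    for word in words[1:]:
        green = guessed.get(out[-1], set())
        if word not in green:
            # swap in a green-listed alternative if the attacker has one
            word = next((s for s in synonyms.get(word, []) if s in green), word)
        out.append(word)
    return out

def scrub(words: list[str], guessed: dict[str, set[str]],
          synonyms: dict[str, list[str]]) -> list[str]:
    """Scrubbing sketch: the mirror image, moving AI-generated words off
    the green lists so watermarked text reads as human-written."""
    out = [words[0]]
    for word in words[1:]:
        green = guessed.get(out[-1], set())
        if word in green:
            word = next((s for s in synonyms.get(word, []) if s not in green), word)
        out.append(word)
    return out
```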
The team achieved an approximately 80% success rate in spoofing watermarks and an 85% success rate in removing watermarks from AI-generated text.
Researchers unaffiliated with the ETH Zürich team, such as Soheil Feizi, an associate professor and director of the Reliable AI Lab at the University of Maryland, have likewise found watermarks to be unreliable and vulnerable to spoofing attacks.
The ETH Zürich findings confirm that these problems with watermarks persist and extend to the most advanced chatbots and large language models in use today, Feizi says. The research underscores the need for caution when deploying such detection mechanisms at scale.
Despite these shortcomings, watermarks remain the most promising way to identify AI-generated content, says Nikola Jovanović, a PhD student at ETH Zürich who worked on the research. But more study is needed before watermarks are ready for deployment at scale, and in the meantime expectations about how reliable and useful these tools are should be kept in check. “If it’s better than nothing, it is still useful,” Jovanović says.
Update: This research will be presented at the International Conference on Learning Representations (ICLR). The story has been updated to reflect this.