A user can request ChatGPT to write a computer program or summarize an article, and the AI chatbot is likely capable of generating useful code or writing a concise synopsis. However, there is also the risk of someone asking for instructions to build a bomb, and the chatbot might provide such dangerous information.
To address safety concerns, companies that develop large language models employ a process known as red-teaming. Human teams create prompts designed to trigger unsafe or toxic text from the model being tested, teaching the chatbot to avoid such responses.
For more effective red-teaming, researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab utilized machine learning to enhance the process. They trained a red-team large language model to automatically generate diverse prompts that evoke undesirable responses from the chatbot being tested.
This approach surpassed human testers and other machine-learning methods by producing unique prompts that elicited increasingly toxic responses. By improving input coverage and drawing out toxic responses from even safeguarded chatbots, this method offers a quicker and more efficient way to ensure model safety.
The research team, led by Zhang-Wei Hong, along with co-authors from various institutions, will present their findings at the International Conference on Learning Representations. Their work was funded by several organizations including Hyundai Motor Company and the U.S. Air Force Research Laboratory.
By incentivizing curiosity-driven exploration in the red-team model through reinforcement learning, the researchers promote the generation of novel prompts that trigger toxic responses. This method enables the red-team model to explore a wider range of prompts and elicit more diverse and toxic replies from the chatbot being tested.
The researchers aim to expand the topics covered by the red-team model and explore using a large language model as the toxicity classifier. This approach could help ensure the behavior of AI models aligns with expectations and reduce the manual effort required for model verification.
In conclusion, the curiosity-driven red-teaming approach offers a promising solution for verifying the safety and trustworthiness of AI models in a rapidly evolving landscape.
Source link