The era of artificial-intelligence chatbots that seem to understand and use language the way humans do has begun. Under the hood, these chatbots rely on large language models, a particular kind of neural network. But a new study shows that large language models remain vulnerable to mistaking nonsense for natural language. To a team of researchers at Columbia University, that flaw points both to ways of improving chatbot performance and to clues about how humans process language.
In a paper published online today in Nature Machine Intelligence, the scientists describe how they challenged nine different language models with hundreds of pairs of sentences. For each pair, people who took part in the study picked which of the two sentences struck them as more natural, meaning it was more likely to be read or heard in everyday life. The researchers then checked whether the models rated each sentence pair the same way the human participants had.
In head-to-head comparisons, more sophisticated AI systems based on transformer neural networks generally outperformed both simpler recurrent neural network models and statistical models that simply tally how often word pairs occur on the internet or in online databases. But all the models made mistakes, sometimes choosing sentences that sound like nonsense to a human ear.
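To make the simplest of those baselines concrete, here is a minimal sketch of a word-pair-frequency model: it scores a sentence by how often its adjacent word pairs appear in a reference corpus. The tiny corpus, the tokenization, and the scoring function below are illustrative assumptions rather than the study's actual setup; a real version would draw its counts from the internet or from large online databases.

```python
# A minimal word-pair-frequency baseline (illustrative; not the study's code).
# It scores a sentence by summing how often its adjacent word pairs occur in a
# reference corpus. The tiny corpus below stands in for web-scale counts.
from collections import Counter
from itertools import pairwise  # Python 3.10+

corpus = (
    "that is the story we have been told . "
    "this is the narrative we have been sold . "
    "the week has been long ."
).split()

pair_counts = Counter(pairwise(corpus))  # counts of adjacent word pairs

def pair_frequency_score(sentence: str) -> int:
    """Total corpus count of the sentence's adjacent word pairs."""
    tokens = sentence.lower().replace(".", " .").split()
    return sum(pair_counts[pair] for pair in pairwise(tokens))

for s in ["That is the narrative we have been sold.",
          "This is the week you have been dying."]:
    print(pair_frequency_score(s), s)
```

A model like this has no notion of meaning, only of which words tend to sit next to each other, which helps explain why it generally trails the transformer-based models in the study.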
“The fact that some of the large language models perform relatively well suggests they capture an important aspect missing in the simpler models,” said Nikolaus Kriegeskorte, PhD, a principal investigator at Columbia’s Zuckerman Institute and a coauthor on the paper. “But even the best models we studied can still be fooled by nonsense sentences, which indicates that their computations are missing something about the way humans process language.”
For instance, the following sentence pair was assessed by both human participants and the AI models in the study:
That is the narrative we have been sold.
This is the week you have been dying.
The human participants in the study judged the first sentence far more likely to be encountered in everyday life than the second. But according to BERT, one of the better-performing models, the second sentence was the more natural one. GPT-2, perhaps the most widely known of the models, correctly judged the first sentence to be more natural, in line with the human judgments.
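For readers curious how such a comparison can be run in practice, the sketch below scores the pair above with an autoregressive model in the GPT-2 family, summing each sentence's token log-probabilities and treating the higher total as the model's “more natural” choice. The use of the Hugging Face transformers library and the public gpt2 checkpoint is an assumption for illustration; the study's exact scoring procedure, and how it handles non-autoregressive models such as BERT, may differ.

```python
# A rough sketch (not the study's exact procedure) of comparing two sentences
# with GPT-2 via the Hugging Face "transformers" library, using total token
# log-probability as a proxy for how "natural" the model finds each sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of token log-probabilities under GPT-2 (higher = more 'natural')."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy loss
        # over the predicted tokens; multiplying by the number of predicted
        # tokens and negating recovers the total log-probability.
        outputs = model(**inputs, labels=inputs["input_ids"])
    n_predicted = inputs["input_ids"].shape[1] - 1
    return -outputs.loss.item() * n_predicted

pair = [
    "That is the narrative we have been sold.",
    "This is the week you have been dying.",
]
scores = {s: sentence_log_prob(s) for s in pair}
for s, lp in scores.items():
    print(f"{lp:8.2f}  {s}")
print("Model prefers:", max(scores, key=scores.get))
```

BERT, by contrast, is not trained to predict text left to right, so estimating how natural it finds a sentence requires a different scoring scheme, which is one reason different models can disagree on the same pair.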
“Every model exhibited blind spots, classifying some sentences as meaningful when human participants considered them gibberish,” explained senior author Christopher Baldassano, PhD, an assistant professor of psychology at Columbia. “This should make us cautious about relying on AI systems to make important decisions, at least for the time being.”
What particularly intrigues Kriegeskorte is that so many of the models perform relatively well and yet remain imperfect. “Understanding why that gap exists and why some models outperform others can drive progress with language models,” he said.
Another crucial question for the research team is whether the computational methods employed by AI chatbots can inspire new scientific inquiries and hypotheses that might enhance our understanding of the human brain. Could these chatbots’ operations shed light on the circuitry of our brains?
Further analysis of the strengths and weaknesses of various chatbots and their underlying algorithms could contribute to answering this question.
“Ultimately, our goal is to understand how people think,” said Tal Golan, PhD, the paper’s corresponding author, who recently moved from a postdoctoral position at Columbia’s Zuckerman Institute to set up his own lab at Ben-Gurion University of the Negev in Israel. “These AI tools are increasingly powerful, but they process language differently from the way we do. Comparing their language understanding to ours gives us a new approach to thinking about how we think.”