Large language models, such as those powering popular artificial intelligence chatbots like ChatGPT, are extremely intricate. Despite their widespread use in various applications, such as customer support, code generation, and language translation, the inner workings of these models remain somewhat mysterious.
To gain a better understanding of how these massive machine-learning models retrieve stored knowledge, researchers at MIT and other institutions delved into the mechanisms involved.
Their findings were surprising: Large language models (LLMs) often employ a simple linear function to retrieve and decode stored facts, and the model uses the same decoding function for similar types of facts. Linear functions, equations with only two variables and no exponents, capture the straightforward, straight-line relationship between those variables.
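To make that claim concrete, the relationship can be written as a single equation; the notation below is an illustration of the idea, not notation taken from the paper.

```latex
% s        : the model's hidden representation of a subject (e.g., "Miles Davis")
% W_r, b_r : a weight matrix and bias vector specific to one relation r (e.g., "plays the instrument")
% o        : a representation from which the object (e.g., "trumpet") can be read out
o \approx W_r \, s + b_r
```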
The researchers demonstrated that by pinpointing linear functions for different facts, they can probe the model to uncover what it knows about new subjects and where that knowledge is stored within it.
Using a novel technique they developed to estimate these straightforward functions, the researchers discovered that even when a model provides an incorrect response to a prompt, it often retains the correct information. In the future, this approach could be used to identify and rectify inaccuracies within the model, potentially reducing instances of incorrect or nonsensical responses.
“Although these models are complex, nonlinear functions trained on vast amounts of data and difficult to comprehend, there are instances of remarkably simple mechanisms at work within them. This is one such example,” says Evan Hernandez, an electrical engineering and computer science (EECS) graduate student and co-lead author of a paper detailing these findings.
Hernandez collaborated with co-lead author Arnab Sharma, a computer science graduate student at Northeastern University; their advisor, Jacob Andreas, an EECS associate professor and member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author David Bau, an assistant professor of computer science at Northeastern; as well as other researchers from MIT, Harvard University, and the Israel Institute of Technology. The research will be presented at the International Conference on Learning Representations.
Uncovering Facts
Most large language models, also known as transformer models, are neural networks. These networks, inspired by the human brain, consist of billions of interconnected nodes or neurons organized into multiple layers for encoding and processing data.
The knowledge stored in a transformer can often be represented as relations linking subjects and objects. For example, the relation “Miles Davis plays the trumpet” connects the subject, Miles Davis, with the object, trumpet.
As a transformer accumulates more knowledge, it stores additional facts about a particular subject across various layers. When a user inquires about that subject, the model must decode the most relevant fact to respond to the query.
For example, if someone prompts a transformer with “Miles Davis plays the...”, the model should respond with “trumpet” rather than “Illinois” (the state where Miles Davis was born).
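Readers who want to try this kind of prompt can do so with a few lines of code; the sketch below uses the Hugging Face transformers library with “gpt2” purely as a convenient small example model, which may or may not complete the fact correctly.

```python
# Minimal sketch of prompting a causal language model to complete a factual statement.
# "gpt2" is only an illustrative choice; the study's experiments used larger LLMs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Miles Davis plays the"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```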
“Somewhere in the network’s computation, there must be a mechanism that retrieves the fact that Miles Davis plays the trumpet and then utilizes that information to generate the next word. We aimed to understand this mechanism,” Hernandez explains.
The researchers conducted a series of experiments to investigate LLMs and found that, despite their complexity, the models utilize a simple linear function to decode relational information. Each function is tailored to the type of fact being retrieved.
For instance, the transformer employs one decoding function when outputting the instrument a person plays and a different function when outputting the state where a person was born.
The researchers devised a method to estimate these basic functions and computed functions for 47 different relations, such as “capital city of a country” and “lead singer of a band.”
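The paper has its own estimation procedure; as a much simpler stand-in, the sketch below fits a relation-specific linear map by ordinary least squares over pairs of subject and object representations. The synthetic data, variable names, and the least-squares approach are assumptions made here for illustration, not the authors' method.

```python
import numpy as np

# Illustrative least-squares fit of a relation-specific linear map o ≈ W s + b.
# In a real setting, S would hold subject hidden states taken from the transformer
# and O the corresponding object representations; random data stands in here.
rng = np.random.default_rng(0)
d, n = 64, 200
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
S = rng.normal(size=(n, d))
O = S @ W_true.T + 0.01 * rng.normal(size=(n, d))

# Append a constant column so the bias b is fitted jointly with W.
S_aug = np.hstack([S, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)  # shape (d + 1, d)
W, b = coef[:-1].T, coef[-1]

def decode(subject_repr):
    """Apply the fitted linear function to a subject representation."""
    return subject_repr @ W.T + b
```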
While there are countless possible relations, the researchers focused on this subset because these relations are representative of the kinds of facts that can be expressed in this manner.
They tested each function by altering the subject to determine if it could retrieve the correct object information. For example, the function for “capital city of a country” should retrieve Oslo if the subject is Norway and London if the subject is England.
The functions successfully retrieved the correct information over 60% of the time, indicating that some information in a transformer is encoded and retrieved using this method.
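The kind of check described above can be sketched as follows; this is an illustrative evaluation loop, not the paper's exact protocol, and its inputs (held-out subject representations, candidate object representations, the fitted decode function) are hypothetical.

```python
import numpy as np

def nearest_object(pred, object_reprs, object_names):
    """Return the candidate object whose representation is most similar to the prediction."""
    sims = object_reprs @ pred / (
        np.linalg.norm(object_reprs, axis=1) * np.linalg.norm(pred) + 1e-9
    )
    return object_names[int(np.argmax(sims))]

def accuracy(held_out_subjects, answers, object_reprs, object_names, decode):
    """Fraction of held-out subjects for which the linear function retrieves the right object."""
    correct = sum(
        nearest_object(decode(s_repr), object_reprs, object_names) == answers[name]
        for name, s_repr in held_out_subjects.items()
    )
    return correct / len(held_out_subjects)
```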
“However, not all information is linearly encoded. For certain facts, even though the model is aware of them and generates text consistent with these facts, we are unable to identify linear functions for them. This suggests that the model employs a more sophisticated method to store that information,” Hernandez notes.
Visualizing a Model’s Knowledge
The researchers also utilized the functions to determine what a model understands to be true about different subjects.
In one experiment, they started with the prompt “Bill Bradley was a” and used the decoding functions for “plays sports” and “attended university” to test whether the model knows that Sen. Bradley was a basketball player who attended Princeton.
“We can demonstrate that, although the model may emphasize different information in its text production, it does encode all of that information,” Hernandez states.
They leveraged this probing technique to develop an “attribute lens,” a grid that visualizes where specific information regarding a particular relation is stored within the transformer’s numerous layers.
Attribute lenses can be automatically generated, offering a streamlined approach to assist researchers in gaining a deeper understanding of a model. This visualization tool could empower scientists and engineers to correct stored knowledge and prevent an AI chatbot from disseminating false information.
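In code, such a grid might look like the schematic sketch below, where a relation's decoding function is applied to the subject's hidden state at every layer; get_hidden_state and decode_at_layer are hypothetical helpers, not part of any released tool.

```python
# Schematic "attribute lens"-style grid: rows are transformer layers, columns are subjects,
# and each cell holds the attribute decoded from that layer's hidden state.
def attribute_lens_grid(subjects, num_layers, get_hidden_state, decode_at_layer):
    grid = []
    for layer in range(num_layers):
        row = []
        for subject in subjects:
            hidden = get_hidden_state(subject, layer)   # subject representation at this layer
            row.append(decode_at_layer(hidden, layer))  # decoded attribute, e.g., "trumpet"
        grid.append(row)
    return grid  # grid[layer][column] shows where in the network the attribute becomes readable
```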
In the future, Hernandez and his collaborators aim to further explore cases where facts are not stored linearly. They also hope to conduct experiments with larger models and evaluate the accuracy of linear decoding functions.
“This work is exciting as it uncovers a missing piece in our comprehension of how large language models recall factual knowledge during inference. Prior research indicated that LLMs construct information-rich representations of given subjects, from which specific attributes are extracted during inference. This study reveals that the complex nonlinear computation of LLMs for attribute extraction can be effectively approximated with a simple linear function,” says Mor Geva Pipek, an assistant professor in the School of Computer Science at Tel Aviv University, who was not involved in this study.
This research received partial support from Open Philanthropy, the Israeli Science Foundation, and an Azrieli Foundation Early Career Faculty Fellowship.