Theory of mind is a crucial aspect of emotional and social intelligence that enables us to understand people’s intentions, connect with them, and show empathy. Typically, children develop these skills between the ages of three and five.
A study was conducted to evaluate the theory of mind capabilities of two families of large language models: OpenAI’s GPT-3.5 and GPT-4, and three versions of Meta’s Llama 2. The researchers tested these models on tasks that assess human-like understanding, such as identifying false beliefs, recognizing social blunders, and interpreting implied meanings. The models’ performance was then compared against that of 1,907 human participants who completed the same tests.
The study involved five different types of tests. The first task, hinting, assessed the ability to infer intentions from indirect comments. The false-belief task evaluated the capacity to understand when someone believes something that is not true. Other tests included recognizing social blunders, interpreting unusual stories, and comprehending irony.
Each AI model completed every test 15 times in separate interactions, so that each response was independent of the others, and the responses were scored using the same criteria applied to the human answers. The results were then compared with those of the human volunteers.
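The paper’s prompting code is not reproduced here, but the procedure amounts to repeated, independent queries followed by rubric-based scoring. The sketch below illustrates that shape only; the query_model function, the keyword-based scorer, and the example prompt are hypothetical placeholders, not the study’s actual implementation.

```python
from typing import Callable

# Hypothetical stand-in for a call to a chat model's API. The study's real
# prompting setup is not described here, so this function is illustrative only.
def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a real API call to the model under test.")

def run_task(prompt: str, expected_keyword: str, n_runs: int = 15,
             ask: Callable[[str], str] = query_model) -> float:
    """Present one test item to a model in n_runs independent sessions
    and return the fraction of responses scored as correct."""
    correct = 0
    for _ in range(n_runs):
        # Each call starts a fresh conversation, so earlier answers cannot
        # influence later ones (the "separate interactions" in the study).
        response = ask(prompt)
        # Simplified scoring: the actual protocol applied the same rubric
        # used for human answers, not a keyword match.
        if expected_keyword.lower() in response.lower():
            correct += 1
    return correct / n_runs

# Illustrative use with a classic false-belief item and a stubbed model response:
if __name__ == "__main__":
    score = run_task(
        prompt=("Sally puts her ball in the basket and leaves the room. "
                "Anne moves the ball to the box. Where will Sally look for it?"),
        expected_keyword="basket",
        ask=lambda p: "She will look in the basket.",  # stub answer for demonstration
    )
    print(f"Proportion correct over 15 runs: {score:.2f}")
```

In practice the stubbed ask function would be replaced with a call to the relevant model’s API, and the scoring step would follow whatever rubric the human responses were graded against.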
Both GPT models performed on par with, or better than, humans on tasks involving indirect requests, misdirection, and false beliefs, and GPT-4 outperformed humans on the irony, hinting, and unusual-stories tasks. The Llama 2 models, by contrast, scored below the human average.
Interestingly, Llama 2 surpassed human performance in recognizing social blunders, while the GPT models struggled in this area. The researchers attribute this difference to GPT’s reluctance to draw conclusions about opinions, with the models often responding that there was not enough information to answer.