Understand Semantic Structures with Transformers and Topic Modeling
We live in the era of big data, where collection practices have made massive amounts of data available to nearly everyone. Interpreting this data, however, remains a challenge, because many current solutions offer predictions without explanations. Deep learning is effective for predictive purposes, but it doesn’t, by itself, provide a clear understanding of the underlying mechanics and structures of the data.
Textual data, in particular, is tricky to work with. Although humans have an intuitive grasp of natural language and of concepts like “topics,” defining semantic structures in computational terms is not straightforward. In this article, we will explore different conceptualizations of latent semantic structure in natural language, examine how these conceptualizations can be made operational, and demonstrate the usefulness of the approach through a case study.
Defining a topic is not as intuitive or self-explanatory as it might seem. The Oxford dictionary defines a topic as a subject that is discussed, written about, or studied; this definition, however, gives us no computational formulation. To overcome this, we can adopt a spatial theory of semantics, in which the semantic content of text is represented in a continuous space where related concepts and passages lie closer to each other than unrelated ones. Based on this theory, we can propose two definitions of topics.
The first conceptualization defines topics as semantic clusters, which are groups of passages/concepts in the semantic space that are closely related to each other but not as closely related to other texts. According to this definition, each passage can only belong to one topic at a time. This clustering approach also allows for hierarchical thinking, where topics can contain subclusters, creating a tree-like structure.
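This clustering conceptualization can be sketched in a few lines. The snippet below is a toy illustration, not a real topic model: the 2-D points stand in for document embeddings, and agglomerative clustering is used because its bottom-up merging is exactly what yields the tree-like hierarchy described above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D stand-ins for document embeddings: two well-separated groups.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # e.g. passages about one subject
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # e.g. passages about another
])

# Agglomerative clustering merges points bottom-up, producing the
# hierarchy that lets topics contain subclusters.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
print(labels)  # each passage is assigned to exactly one topic
```

Note how each passage receives exactly one label: under this definition, topic membership is exclusive.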
The second conceptualization considers topics as the underlying dimensions of the semantic space. Instead of identifying groups of documents, this approach focuses on explaining the variation in documents by finding underlying semantic signals. For example, in the context of restaurant reviews, the most important axes could be satisfaction with the food and satisfaction with the service. This approach provides a deeper understanding of the factors that differentiate documents.
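To make the axes intuition concrete, here is a small sketch with simulated data (the restaurant-review framing and the two latent signals are made up for illustration). It uses PCA, a simpler decomposition than the one discussed later, but one that likewise recovers directions of variation rather than groups of documents:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Two hypothetical latent signals driving review embeddings:
# satisfaction with the food and satisfaction with the service.
food = rng.normal(size=200)
service = rng.normal(size=200)

# Mix the two signals into a 4-dimensional "embedding" space.
mixing = np.array([[1.0, 0.2, 0.8, 0.1],
                   [0.1, 0.9, 0.2, 1.0]])
embeddings = np.column_stack([food, service]) @ mixing

# PCA recovers axes of variation: two components suffice to explain
# (almost) all variance, because only two signals generated the data.
pca = PCA(n_components=2).fit(embeddings)
print(pca.explained_variance_ratio_)
```

The point of the sketch is that a dimensional model explains *why* documents differ, instead of sorting them into bins.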
To represent the semantic content of texts computationally, we have moved beyond the traditional bag-of-words model. We now have access to models like Sentence Transformers, which can encode passages into a high-dimensional continuous space, where semantic similarity is indicated by vectors with high cosine similarity. The most widely used models in the topic modeling community, such as Top2Vec and BERTopic, are based on the clustering conceptualization of topics. These models discover topics by reducing the dimensionality of semantic representations, identifying cluster hierarchies, and estimating term importances for each cluster.
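The notion of “high cosine similarity” is easy to demonstrate. The vectors below are hand-made 3-D stand-ins, not real Sentence Transformer outputs (which typically have hundreds of dimensions), but the geometry is the same:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy "embeddings" for three concepts.
dog = np.array([0.9, 0.8, 0.1])
puppy = np.array([0.85, 0.9, 0.15])
economy = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(dog, puppy))    # close to 1: related concepts
print(cosine_similarity(dog, economy))  # much lower: unrelated concepts
```

In a real pipeline, these vectors would come from calling a Sentence Transformer’s encoding method on the passages themselves.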
While clustering models have gained popularity due to their interpretability and hierarchical structure, they may not capture the nuances in topical content or fully explain the underlying semantics. To address this limitation, a new statistical model called Semantic Signal Separation can be used. Inspired by classical topic models like Latent Semantic Analysis, Semantic Signal Separation utilizes Independent Component Analysis to find maximally independent underlying semantic signals in a corpus of text. This approach allows for the discovery of the axes of semantics and provides human-readable descriptions of topics.
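The core ICA step can be illustrated on synthetic data. This is only a toy demonstration of the principle, not the full Semantic Signal Separation pipeline (which operates on transformer embeddings and also estimates term importances): two independent non-Gaussian sources are mixed into a higher-dimensional matrix, and FastICA recovers them.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Two independent, non-Gaussian source signals standing in for
# latent semantic signals in a corpus (synthetic toy data).
sources = rng.laplace(size=(500, 2))

# Mix the sources into a 6-dimensional "embedding" matrix.
mixing = rng.normal(size=(2, 6))
embeddings = sources @ mixing

# FastICA finds maximally independent components -- the same principle
# Semantic Signal Separation applies to document embeddings.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(embeddings)
print(recovered.shape)  # one independent signal per column
```

Unlike PCA, which only decorrelates, ICA seeks full statistical independence, which is what lets it pull apart distinct semantic signals rather than mere directions of variance.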
To demonstrate the usefulness of Semantic Signal Separation, we conducted a case study using approximately 118k machine learning abstracts. By fitting a model using Turftopic, a Python library that implements various topic models using transformer representations, we were able to identify the dimensions along which the machine learning papers were distributed. The resulting topics provided insights into the underlying differences in machine learning papers.
In conclusion, understanding semantic structures in natural language is a complex task, but with the advancements in transformer models and topic modeling techniques, we can gain a deeper understanding of textual data. By exploring different conceptualizations and using computational models like Semantic Signal Separation, we can uncover latent semantic structures and improve our ability to interpret and analyze textual data effectively.