Harnessing Deep Learning to Maximize the Value of Untapped Data for Long-Term Competitiveness
In today’s digital age, large companies generate and accumulate vast amounts of data. Surprisingly, a large share of it, approximately 73%, is never used, even though data is one of the most valuable assets for companies working with Big Data.
Deep learning has emerged as a powerful tool for addressing this problem. The challenge now is to adapt these advanced techniques to specific business objectives in order to gain an edge and sustain long-term competitiveness.
Recognizing the potential of deep learning, my previous manager had the foresight to explore its application in addressing this problem. By streamlining data access, minimizing time wastage, and reducing unnecessary expenses, we aimed to unlock the full potential of untapped data.
So, why does this data go unused? The main obstacles are time-consuming access processes: rights must be verified and content must be checked before access can be granted to users.
To tackle this issue, we set out to develop an automated solution for documenting new data. Although I initially had limited knowledge of how large enterprises work, I quickly came to appreciate the scale of Big Data, particularly the Hadoop Distributed File System (HDFS). HDFS serves as a centralized repository for the company’s data; its structured data is exposed through Hive tables whose columns are referenced across the organization. Some of these columns are used to build additional tables and act as sources for other datasets, and the relationships between tables and columns are tracked through lineage.
To distinguish between physical data (column names) and business data (what a column is used for), we needed a clear understanding of their respective characteristics. For example, in a table named “Friends,” the physical data would include columns such as character, salary, and address. The business data associated with these columns would be the name of the character, the amount of the salary, and the location of the person, respectively. Once this business data is documented and categorized, accessing relevant information becomes much more efficient, saving time and resources.
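As a rough illustration, this physical-to-business mapping can be thought of as a simple lookup; here is a minimal sketch in Python, with hypothetical table and column names:

```python
# Hypothetical example: mapping physical columns of a "friends" table
# to the business terms they represent (names are illustrative only).
physical_to_business = {
    "friends.character": "Character name",
    "friends.salary": "Salary amount",
    "friends.address": "Person's location",
}

def document_column(column: str) -> str:
    """Return the business meaning of a physical column, if documented."""
    return physical_to_business.get(column, "undocumented")

print(document_column("friends.salary"))  # -> "Salary amount"
```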
During my final internship, my team and I implemented a Big Data/Graph Learning solution to document this data. Our approach involved creating a graph structure to represent the data and predict business data based on various features. This documentation process aimed to reduce the search cost and promote a more data-driven approach within the company.
To accomplish this, we needed to acquire specific data, including the characteristics of physical data (domain, name, data type), lineage information, and a mapping of physical data to business data. We used techniques like ETL (Extract, Transform, Load) to extract and process this data from Hive columns.
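A minimal sketch of what such a metadata-extraction step could look like with PySpark against a Hive metastore; the database name below is an assumption, not the actual project’s:

```python
# Sketch (not the exact pipeline): extract physical-data characteristics
# (domain, name, data type) for every column of every table in one database.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("column-metadata-etl")
         .enableHiveSupport()
         .getOrCreate())

db = "sales_db"  # hypothetical database name
rows = []
for table in spark.catalog.listTables(db):
    for col in spark.catalog.listColumns(table.name, db):
        rows.append((db, table.name, col.name, col.dataType))

# One record per physical column: (domain, table, name, data type)
metadata = spark.createDataFrame(rows, ["domain", "table", "column", "data_type"])
metadata.show(truncate=False)
```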
For the features, we applied feature hashing to three columns. Feature hashing is a machine learning technique that converts high-dimensional categorical data into a lower-dimensional numerical representation, which reduces memory and computational requirements while preserving most of the meaningful information.
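Here is a minimal sketch of that idea using scikit-learn’s FeatureHasher; the column values and the choice of n_features=128 are illustrative only:

```python
# Feature hashing on three categorical attributes of a physical column
# (domain, name, data type) into a fixed-size numerical vector.
from sklearn.feature_extraction import FeatureHasher

columns = [
    {"domain": "sales_db", "name": "customer_id", "data_type": "string"},
    {"domain": "hr_db",    "name": "salary",      "data_type": "double"},
]

hasher = FeatureHasher(n_features=128, input_type="dict")
features = hasher.transform(columns)   # sparse matrix of shape (2, 128)
print(features.shape)
```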
Understanding lineage was crucial for our project as it represented the history of physical data and the transformations applied to it. By visualizing this lineage through graph connections, we were able to establish a clear framework for organizing and accessing the data.
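As a sketch, lineage can be modeled as a directed graph in which nodes are physical columns and edges point from a source column to the columns derived from it; the column names below are hypothetical:

```python
# Illustrative lineage graph: each node is a physical column, each directed
# edge means "feeds into" (source column -> derived column).
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("raw.orders.amount", "agg.daily_revenue.total")
lineage.add_edge("raw.orders.customer_id", "agg.daily_revenue.total")

# Upstream history of a column: every column it was ultimately derived from
print(nx.ancestors(lineage, "agg.daily_revenue.total"))
```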
The mapping process played a critical role in adding value to our project. It involved associating business data with physical data, enabling the algorithm to classify new incoming data accurately. This mapping required a deep understanding of the company’s processes and the ability to recognize complex patterns without assistance.
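Concretely, the documented part of the mapping can serve as supervision for the classifier; a minimal sketch, reusing the hypothetical names from the earlier example:

```python
# Turn the physical-to-business mapping into integer training labels for the
# graph nodes. Business terms and column names are hypothetical.
business_terms = ["Character name", "Salary amount", "Person's location"]
label_of = {term: i for i, term in enumerate(business_terms)}

# Only already-documented columns get a label; the model predicts the rest.
train_labels = {
    "friends.character": label_of["Character name"],
    "friends.salary": label_of["Salary amount"],
}
```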
To simplify the graph learning process, we used GraphSAGE (GSage), a graph learning algorithm. GraphSAGE relies on embeddings: it represents nodes and their neighborhoods in a compact mathematical form, reducing the dimensionality of the dataset while preserving the essential relationships. We chose it because of its mathematical and empirical effectiveness.
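For intuition, a minimal GraphSAGE-style node classifier could look like the following PyTorch Geometric sketch; the feature dimension and class count are assumptions, not the project’s actual configuration:

```python
# Two SAGEConv layers map each node (a physical column with hashed features)
# to a business-data class by aggregating information from its neighbors.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class ColumnClassifier(torch.nn.Module):
    def __init__(self, in_dim=128, hidden_dim=64, num_classes=10):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)    # aggregate 1-hop neighbors
        self.conv2 = SAGEConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)              # class logits per node
```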
While graph learning may seem complex at first, resources such as the book [2] and the work of Maxime Labonne [3] helped me grasp the fundamental principles. By simplifying the algorithm and focusing on the core concepts, I hope to make it more accessible to those who are new to this field.
In conclusion, by harnessing the power of deep learning and graph learning techniques, we can unlock the value of untapped data and transform it into a strategic asset for long-term competitiveness. Through efficient data documentation and analysis, companies can enhance their decision-making processes, reduce costs, and gain a competitive edge in the market.