Many real-world graphs carry important temporal information. In spatial-temporal applications such as traffic and weather forecasting, both the spatial and the temporal signals are essential.
Building on the success of Graph Neural Networks (GNNs) in learning representations of static graphs, researchers have recently developed Temporal Graph Neural Networks (TGNNs) to exploit the temporal information in dynamic graphs.
TGNNs have shown superior accuracy on downstream tasks such as temporal link prediction and dynamic node classification across many kinds of dynamic graphs, including social networks, traffic graphs, and knowledge graphs, significantly outperforming static GNNs and other conventional methods.
On dynamic graphs, the number of events associated with each node grows as time passes. When this number is large, TGNNs cannot fully capture the history using either temporal attention-based aggregation or historical neighbor sampling.
To compensate for this lost history, researchers have created Memory-based Temporal Graph Neural Networks (M-TGNNs), which store node-level memory vectors that summarize each node's history independently.
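At a high level, the node memory works like a recurrent state that is refreshed whenever a node takes part in a new event. The following is a minimal sketch in the spirit of TGN-style M-TGNNs, not DistTGL's actual implementation; the class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class NodeMemory(nn.Module):
    """Illustrative node-level memory in the spirit of M-TGNNs such as TGN.

    Conceptual sketch only: each node keeps one memory vector that is
    refreshed by a recurrent cell whenever the node participates in an event.
    """

    def __init__(self, num_nodes, mem_dim, edge_dim):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_nodes, mem_dim))
        self.register_buffer("last_update", torch.zeros(num_nodes))
        # Input: the other endpoint's memory, the edge features, and the
        # elapsed time since this node was last updated.
        self.cell = nn.GRUCell(mem_dim + edge_dim + 1, mem_dim)

    def update(self, src, dst, t, edge_feat):
        # src, dst: (B,) long tensors; t: (B,) event times; edge_feat: (B, edge_dim).
        # Build messages for both endpoints from the pre-event memories, then
        # refresh both endpoints' memory vectors with the recurrent cell.
        msgs, targets = [], []
        for nodes, others in ((src, dst), (dst, src)):
            dt = (t - self.last_update[nodes]).unsqueeze(-1)
            msgs.append(torch.cat([self.memory[others], edge_feat, dt], dim=-1))
            targets.append(nodes)
        for nodes, msg in zip(targets, msgs):
            self.memory[nodes] = self.cell(msg, self.memory[nodes])
            self.last_update[nodes] = t
```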
Despite M-TGNNs’ success, their poor scalability makes them challenging to deploy in large-scale production systems.
Because the auxiliary node memory creates temporal dependencies between events, training mini-batches must be small and scheduled in chronological order.
Utilizing data parallelism in M-TGNN training is particularly difficult in two ways:
- Simply increasing the batch size loses information about the temporal dependencies between events (illustrated in the toy sketch after this list).
- A single, unified copy of the node memory must be read and updated by all trainers, which generates a massive amount of remote traffic in distributed systems.
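To make the first issue concrete, here is a toy illustration (names and data are hypothetical): every event in a mini-batch reads the memory snapshot taken before the batch, so when two events on the same node fall into one large batch, the later event no longer sees the update written by the earlier one.

```python
# Toy illustration of why large batches lose temporal dependencies.
# Each node's "memory" is just the list of event timestamps it has seen.
events = [(0, 1, 1.0), (0, 2, 2.0), (3, 4, 3.0)]  # (src, dst, time)

def process_batch(batch, memory):
    # Every event in the batch reads the memory snapshot taken *before* the
    # batch; the memory itself is only rewritten as the batch is applied.
    snapshot = {n: list(v) for n, v in memory.items()}
    for src, dst, t in batch:
        seen_by_src = snapshot.get(src, [])  # history visible to this event
        print(f"event at t={t}: node {src} sees history {seen_by_src}")
        memory.setdefault(src, []).append(t)
        memory.setdefault(dst, []).append(t)

# Two small chronological batches: the t=2.0 event on node 0 sees t=1.0.
mem = {}
process_batch(events[:1], mem)
process_batch(events[1:], mem)

# One large batch: both events on node 0 read the same stale snapshot,
# so the t=2.0 event no longer sees the update from t=1.0.
mem = {}
process_batch(events, mem)
```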
New research from the University of Southern California and AWS presents DistTGL, a scalable and efficient method for training M-TGNNs on distributed GPU clusters.
DistTGL enhances the current M-TGNN training systems in three ways:
- Model: Augmenting the dynamic node memory with an additional static node memory improves accuracy and convergence rate (a hedged sketch follows this list).
- Algorithm: A novel training algorithm addresses the accuracy loss and communication overhead that arise in distributed settings.
- System: To reduce the overhead associated with mini-batch generation, they develop an optimized system using prefetching and pipelining techniques.
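The model-side change can be pictured as pairing each node's dynamic memory with a learnable static embedding; the sketch below is a hedged interpretation of that idea, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class StaticPlusDynamicMemory(nn.Module):
    """Hedged sketch of augmenting dynamic node memory with static memory.

    The static part is a learned per-node embedding that does not depend on
    the event stream; the dynamic part is the usual M-TGNN node memory. Here
    the two are simply concatenated before the temporal attention layers.
    (Illustrative only; see the paper for DistTGL's exact formulation.)
    """

    def __init__(self, num_nodes, static_dim, dynamic_dim):
        super().__init__()
        self.static_memory = nn.Embedding(num_nodes, static_dim)
        self.register_buffer("dynamic_memory", torch.zeros(num_nodes, dynamic_dim))

    def forward(self, node_ids):
        # node_ids: (B,) long tensor of nodes appearing in the mini-batch.
        return torch.cat(
            [self.static_memory(node_ids), self.dynamic_memory[node_ids]],
            dim=-1,
        )
```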
DistTGL significantly improves on prior approaches in terms of convergence and training throughput.
DistTGL is the first effort that scales M-TGNN training to distributed GPU clusters.
DistTGL is publicly available on GitHub.
They present two novel parallel training strategies, epoch parallelism and memory parallelism, which build on the distinctive properties of M-TGNN training and allow M-TGNNs to capture the same number of dependent graph events on multiple GPUs as on a single GPU.
Based on the dataset and hardware characteristics, they offer heuristic recommendations for selecting the best training setups.
To overlap mini-batch generation with GPU training without complicated and expensive synchronization, the researchers serialize operations on the node memory and execute them efficiently in a separate daemon process (see the sketch below).
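Conceptually, this resembles a producer-consumer pipeline: a daemon process prepares mini-batches (including the serialized node-memory operations) while the main process trains on the GPU. The sketch below is a generic illustration with placeholder batches, not DistTGL's actual code.

```python
import multiprocessing as mp

def minibatch_daemon(queue, num_batches):
    # Runs in a separate daemon process: performs sampling, feature slicing,
    # and serialized node-memory operations ahead of the GPU trainer.
    for i in range(num_batches):
        batch = {"batch_id": i}   # stand-in for a real prefetched mini-batch
        queue.put(batch)
    queue.put(None)               # sentinel: no more batches

def train_loop(queue):
    # Main process keeps the GPU busy with ready-made mini-batches,
    # overlapping CPU-side batch generation with GPU training.
    while (batch := queue.get()) is not None:
        pass                      # forward/backward pass would run here

if __name__ == "__main__":
    q = mp.Queue(maxsize=4)       # bounded queue caps the prefetch depth
    producer = mp.Process(target=minibatch_daemon, args=(q, 100), daemon=True)
    producer.start()
    train_loop(q)
    producer.join()
```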
In experiments, DistTGL outperforms the state-of-the-art single-machine approach by more than 10x in convergence rate when scaling to multiple GPUs.
Check out the Paper.