Posted by Matt Barnes, Software Engineer, Google Research
Routing in Google Maps is one of its most useful and frequently used features. Determining the best route from point A to point B requires trading off factors such as estimated time of arrival, tolls, directness, road conditions, and user preferences. To understand these preferences, we analyze real-world travel patterns using inverse reinforcement learning (IRL), a technique for recovering an underlying reward function from observed sequential decision-making behavior.
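To make the idea concrete, here is a toy maximum-entropy IRL sketch on a four-state chain MDP. MaxEnt IRL is a standard variant of the technique, not necessarily the production algorithm; the chain MDP, one-hot state features, and all hyperparameters below are made up for illustration. The learner repeatedly compares the expert's state visitation counts against those of the current soft-optimal policy and nudges the rewards to close the gap.

```python
import numpy as np

# Toy maximum-entropy IRL sketch (illustrative, not the production code).
# MDP: 4 states in a chain, 2 actions (left/right), state 3 is the
# absorbing destination. Reward is linear in one-hot state features,
# so r(s) = theta[s] and matching feature expectations means matching
# state visitation counts.
n_states, n_actions, horizon, goal = 4, 2, 6, 3

def step(s, a):
    return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

def soft_policy(theta):
    # Soft (maximum-entropy) value iteration with log-sum-exp backups.
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = np.array([[theta[s] + V[step(s, a)] for a in range(n_actions)]
                      for s in range(n_states)])
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        V[goal] = theta[goal]          # destination is absorbing
    return np.exp(Q - V[:, None])      # stochastic policy pi(a|s)

def visitation(pi, start=0):
    # Expected state visitation counts when following pi from `start`.
    d, p = np.zeros(n_states), np.zeros(n_states)
    p[start] = 1.0
    for _ in range(horizon):
        d += p
        nxt = np.zeros(n_states)
        for s in range(n_states):
            if s == goal:              # mass at the destination stops
                continue
            for a in range(n_actions):
                nxt[step(s, a)] += p[s] * pi[s, a]
        p = nxt
    return d

# "Expert" demonstrations: always move right, toward the destination.
expert = visitation(np.tile([0.0, 1.0], (n_states, 1)))

theta = np.zeros(n_states)
for _ in range(150):
    # MaxEnt IRL gradient: expert visitations minus policy visitations.
    theta += 0.1 * (expert - visitation(soft_policy(theta)))

# States along the expert's path toward the goal end up with higher
# learned reward than states the expert avoids lingering in.
```

The same loop structure, with a reinforcement-learning solve inside every gradient step, is exactly what makes naive IRL expensive at scale.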
Scaling IRL algorithms has long been challenging because they require solving a reinforcement learning subroutine at every update step. World-scale Markov decision processes (MDPs) are simply too large to fit into memory for computation. Moreover, when applying IRL to routing, all reasonable routes between each origin and destination must be considered, so the MDP cannot easily be broken into smaller, independent components.
In our work, “Massively Scalable Inverse Reinforcement Learning in Google Maps,” we address these scalability limitations through advances in graph compression and parallelization, together with a new IRL algorithm called Receding Horizon Inverse Planning (RHIP). RHIP allows fine-grained control over performance trade-offs and achieves a 16–24% relative improvement in global route match rate, i.e., how often the suggested route in Google Maps matches the route travelers actually take. To the best of our knowledge, this represents the largest instance of IRL in a real-world setting to date.
One of the benefits of IRL is that it naturally handles goal-conditioned problems, where the MDP changes only slightly with the destination state. Because the learned reward function is shared across destinations, we can use a powerful inference-time trick: evaluate the rewards once in an offline batch setting and save the results to an in-memory database. This eliminates the need for online inference of a parameterized model or policy, improving both serving cost and latency.
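A minimal sketch of this serving pattern, assuming a made-up toy graph and made-up learned edge rewards (the production system, its graph, and its storage layer are of course far more involved): edge rewards are scored once offline and cached as costs, after which each routing query is just a classical shortest-path search with no model inference.

```python
import heapq

# Hypothetical road graph: node -> [(neighbor, learned_reward)].
# The rewards stand in for outputs of a trained IRL reward model.
edges = {
    "A": [("B", -1.0), ("C", -4.0)],
    "B": [("C", -1.0), ("D", -5.0)],
    "C": [("D", -1.0)],
    "D": [],
}

# Offline batch step: evaluate rewards once, cache as non-negative
# edge costs (stand-in for the in-memory database).
cost = {(u, v): -r for u, nbrs in edges.items() for v, r in nbrs}

def best_route(src, dst):
    """Serve a query with plain Dijkstra over the cached edge costs;
    the cheapest path is the highest-total-reward route."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        d, node, path = heapq.heappop(pq)
        if node == dst:
            return path, d
        if node in seen:
            continue
        seen.add(node)
        for nbr, _ in edges[node]:
            heapq.heappush(pq, (d + cost[(node, nbr)], nbr, path + [nbr]))
    return None, float("inf")

path, total = best_route("A", "D")   # -> (['A', 'B', 'C', 'D'], 3.0)
```

No parameterized model runs at query time; all learning cost is paid in the offline batch step.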
To scale IRL to world-sized MDPs, we compress the graph and shard the global MDP using a sparse Mixture of Experts (MoE) keyed on geographic regions. Classic IRL algorithms are then applied to solve the local MDPs and estimate the loss. RHIP, our generalized IRL algorithm, combines robust yet expensive stochastic policies in the local region with cheaper deterministic planners beyond a certain horizon. The horizon lets us dial computational cost up or down and discover the optimal performance sweet spot.
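A rough sketch of the receding-horizon idea as we read it (the function, the toy chain graph, and all parameters are illustrative assumptions, not the paper's implementation): value estimates beyond the horizon come from cheap deterministic hard-max backups, and only the last H backups use the more robust, more expensive stochastic log-sum-exp form. H = 0 recovers a pure deterministic planner; large H approaches a fully stochastic policy.

```python
import numpy as np

def rhip_values(reward, next_state, goal, H, n_iters=50):
    """reward[s, a], deterministic next_state[s, a] -> values V[s] and
    a stochastic policy pi[s, a]. Illustrative sketch only."""
    V = np.zeros(reward.shape[0])
    # Cheap deterministic planner: hard-max value iteration everywhere.
    for _ in range(n_iters):
        V = (reward + V[next_state]).max(axis=1)
        V[goal] = 0.0                  # destination is absorbing
    # Robust but expensive stochastic backups, for only H horizon steps.
    for _ in range(H):
        Q = reward + V[next_state]
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        V[goal] = 0.0
    # Softmax policy from the final backup.
    Q = reward + V[next_state]
    pi = np.exp(Q - Q.max(axis=1, keepdims=True))
    return V, pi / pi.sum(axis=1, keepdims=True)

# Toy road graph: 5 nodes in a chain, destination node 4, each move
# earns -1 reward (i.e., costs one unit).
n, goal = 5, 4
nxt = np.array([[max(0, s - 1), min(n - 1, s + 1)] for s in range(n)])
r = np.full((n, 2), -1.0)
r[goal] = 0.0
V0, pi0 = rhip_values(r, nxt, goal, H=0)   # pure deterministic planner
V5, pi5 = rhip_values(r, nxt, goal, H=5)   # stochastic near the horizon
```

Tuning H trades the robustness of stochastic policies against the cost of computing them, which is the control knob the sweet-spot search exploits.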
The RHIP policy delivers significant improvements in global route match rate for both driving and two-wheelers, providing more accurate routes than other IRL policies at lower computational cost. Examining road segments whose learned rewards differ substantially from the baseline rewards also surfaces concrete opportunities to further improve Google Maps routes.
In conclusion, by introducing scalability advances to classic IRL algorithms, we are able to train reward models on world-scale problems, the largest instance of IRL in a real-world setting to date. For more details on our work, please refer to the paper.
Acknowledgements: We would like to thank all the contributors and teams involved in this project for their valuable discussions and suggestions.