How to improve training beyond the “vanilla” gradient descent algorithm
In my last post, we discussed how you can improve the performance of neural networks through hyperparameter tuning:
This is the process of “tuning” hyperparameters such as the learning rate and the number of hidden layers to find the values that give our network the best performance.
Unfortunately, this tuning process for large deep neural networks (deep learning) is painstakingly slow. One way to improve upon this is to use faster optimisers than the traditional “vanilla” gradient descent method. In this post, we will dive into the most popular optimisers and variants of gradient descent that can speed up training and improve convergence, and compare them in PyTorch!
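As a quick preview, here is a minimal sketch of what swapping optimisers looks like in PyTorch. The tiny linear model and random data are placeholders purely for illustration, and Adam stands in for the faster optimisers we will cover later; only the optimiser line changes, the training step stays the same.

```python
import torch
import torch.nn as nn

# Placeholder model: a tiny one-layer network, just to show the pattern
model = nn.Linear(10, 1)

# "Vanilla" (stochastic) gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Swapping in a faster optimiser is a one-line change, e.g. Adam
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# The training step itself is identical regardless of the optimiser chosen
x, y = torch.randn(32, 10), torch.randn(32, 1)   # made-up data
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```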
Before diving in, let’s quickly brush up on our knowledge of gradient descent and the theory behind it.
The goal of gradient descent is to update the parameters of the model by subtracting the gradient (the partial derivatives) of the loss function with respect to each parameter. A learning rate, α, regulates this process so that each update happens on a reasonable scale and doesn’t overshoot or undershoot the optimal value.
θ ← θ − α∇J(θ)

where:

- θ are the parameters of the model.
- J(θ) is the loss function.
- ∇J(θ) is the gradient of the loss function. ∇ is the gradient operator, also known as nabla.
- α is the learning rate.
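To make the update rule concrete, here is a minimal sketch of vanilla gradient descent written out by hand in PyTorch, without using torch.optim. The single-parameter model and the toy data (y = 3x) are made up purely for illustration.

```python
import torch

# Made-up toy data following y = 3x, used only to illustrate the update rule
X = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 3 * X

# A single parameter θ, initialised randomly
theta = torch.randn(1, requires_grad=True)
alpha = 0.1  # learning rate α

for step in range(50):
    # Loss J(θ): mean squared error of the prediction θ·x
    loss = ((X * theta - y) ** 2).mean()

    # Compute ∇J(θ) via autograd
    loss.backward()

    # Vanilla gradient descent update: θ ← θ − α∇J(θ)
    with torch.no_grad():
        theta -= alpha * theta.grad
        theta.grad.zero_()

print(theta.item())  # should move towards 3
```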
I wrote a previous article on gradient descent and how it works if you want to familiarise yourself with it a bit more: