From the Perceptron to Adaline. Setting the foundations right | by Pan Cretan

Setting the foundations right

Photo by Einar Storsul on Unsplash

Introduction

In a previous article I tried to explain the most basic binary classifier that has likely ever existed, Rosenblatt’s perceptron. Understanding this algorithm has educational value and can serve as a good introduction in elementary machine learning courses. It is an algorithm that can be coded from scratch in a single afternoon and can spark interest, a sense of achievement and motivation to delve into more complex topics. Still, as an algorithm it leaves much to be desired because convergence is only guaranteed when the classes are linearly separable that is often not the case.

In this article we will continue the journey on mastering classification concepts. A natural evolution from the Rosenblatt’s perceptron is the adaptive linear neuron classifier, or adaline as it is colloquially known. Moving from the perceptron to adaline is not a big leap. We simply need to change the step activation function to a linear one. This small change leads to a continuous loss function that can be robustly minimised. This allows us to introduce many useful concepts in machine learning, such as vectorisation and optimisation methods.

In future articles we will also cover further subtle changes of the activation and loss functions that will take us from adaline to logistic regression, that is already a useful algorithm in daily practice. All of the above algorithms are essentially single layer neural networks and can be readily extended to multilayer ones. In this sense, this article takes the reader a step further through this evolution and builds the foundations to tackle more advanced concepts.

We will need some formulas. I used the online LaTeX equation editor to develop the LaTeX code for the equation and then the chrome plugin Maths Equations Anywhere to render the equation into an image. The only downside of this approach is that the LaTeX code is not stored in case you need to render it again. For this purpose I provide the list of equations at the end of this article. If you are not familiar with LaTex this may have its own educational value. Getting the notation right is part of the journey in machine learning.

Adaptive linear neuron classifier (adaline)

So what is the adaline algorithm? Adaline is a binary classifier as the perceptron. A prediction is made by using a set of input values for the features [x₁, .. , xₘ] where m is the number of features. The input values are multiplied with the weights [w₁, .. , wₘ] and the bias is added to obtain the net input z = w₁x₁ + .. + wₘxₘ + b. The net input is passed to the linear activation function σ(z) that is then used to make a prediction using a step function as with the perceptron:

A key difference with the perceptron is that the linear activation function is used for learning the weights, whilst the step function is only used for making the prediction at the end. This sounds like a small thing, but it is of significant importance. The linear activation function is differentiable whilst the step function is not! The threshold 0.5 above is not written in stone. By adjusting the threshold we can regulate the precision and recall according to our use case, i.e. based on what is the cost of false positives and false negatives.

In the case of adaline the linear activation function is simply the identity, i.e. σ(z) = z. The objective function (also known as loss function) that needs to be minimised in the training process is:

ℓ(w,b) = ∑(ℓ(z,y))

where w are the weights and b is the bias. The summation is over all of the examples in the training set. In some implementations the loss function also includes a 1/2 coefficient for convenience. This cancels out once we take the gradients of the loss function with respect to the weights and bias and, as we will see below, has no effect other than scaling the learning rate by a factor of 2. In this article we do not use the 1/2 coefficient.

For each example, we compute the square difference between the calculated outcome and the true class label. Note that the input vector is understood to be a matrix with shape (1, m), i.e. as we will see later is one row of our feature matrix x with shape (n, m).

The training is nothing else than an optimisation problem. We need to adjust the weights and bias so that the loss function is minimised. As with any minimisation problem we need to compute the gradients of the objective function with respect to the independent variables that in our case will be the weights and the bias. The partial derivative of the loss function with regard to the weight wⱼ is:

The last row introduces important matrix notation. The feature matrix x has shape (n, m) and we take the transpose of its column j, i.e. a matrix with shape (1, n). The true class labels y is a matrix with shape (n, 1). The net output of all samples z is also a matrix with shape (n, 1), that does not change after the activation that is understood to apply to each of its elements. The final result of the above formula is a scalar. Can you guess how we could express the gradients with respect to all weights using the matrix notation?

where the transpose of the feature matrix has shape (m, n). The end result of this operation is a matrix with shape (m, 1). This notation is important. Instead of using loops, we will be using exactly this matrix multiplication using numpy. In the era of neural networks and GPUs, the ability to apply vectorization is essential!

What about the gradient of the loss function with respect to the bias?

where the overbar denotes the mean of the vector under it. Once more, computing the mean with numpy is a vectorised operation, i.e. summation does not need to be implemented using a loop.

Once we have the gradients we can employ the gradient descent optimisation method to minimise the loss. The weights and bias terms are iteratively updated using:

where η is a suitable chosen learning rate. Too small values can delay convergence, whilst too high values can prevent convergence altogether. Some experimentation is needed, as is generally the case with the parameters of machine learning algorithms.

In the above implementation we assume that the weights and bias are updated based on all examples at once. This is known as full batch gradient descent and is one extreme. The other extreme is to update the weights and bias after each training example, that is known as stochastic gradient descent (SGD). In reality there is also some middle ground, known as mini batch gradient descent, where the weights and bias are updated based on a subset of the examples. Convergence is typically reached faster in this way, i.e. we do not need to run as many iterations over the whole training set, whilst vectorisation is still (at least partially) possible. If the training set is very large (or the model is very complex as is nowadays the case with the transformers in NLP) full batch gradient descent may simply be not an option.

Alternative formulation and closed form solution

Before we proceed with the implementation of adaline in Python, we will make a quick digression. We could absorb the bias b in the weight vector as:

in which case the net output for all samples in the training set becomes:

meaning that the feature matrix has been prepended with a column filled with 1, leading to a shape (n, m+1). The gradient with regard to the combined weights set becomes:

In principle we could derive a closed form solution given that at the minimum all gradients will be zero:

In reality the inverse of the matrix in the above equation may not exist because of singularities or it cannot be computed sufficiently accurately. Hence, such closed form solution is not used in practice neither in machine learning nor in numerical methods in general. Still, it is useful to appreciate that adaline resembles linear regression and as such it has a closed form solution.
Implementing adaline in Python
Our implementation will use mini batch gradient descent. However, the implementation is flexible and allows optimising the loss function using both stochastic gradient descent and full batch gradient descent as the two extremes. We will examine the convergence behaviour by varying the…



Source link 
						
						
														Tags: AdalineCretanFoundationsNovpanPerceptronSetting