
Machine Learning Monday: Building your first Neural Network Part 3

Welcome to part 3 of building your first neural network. In case you missed the previous jobbies:


Part 1:

Machine Learning Monday: It's Tensor Flow Time! (doctorsthatcode.com)

Part 2:

Machine Learning: The Neural Network (doctorsthatcode.com)


So for part 3:

We'll continue our neural network development journey, looking at the structure, the softmax function, the loss function, and seeing the results of training our model.

Exciting times :D !


Let's jump in!



Where we left off.

Previously on machine learning Monday... (In dramatic deep voice-over)

We'd loaded up our MNIST data, created our model, and run our first prediction on the (as yet untrained) model...


The last thing we coded was our prediction variable:

predictions = model(x_train[:1]).numpy()

So let's break this down into palatable sections:

  • model is a variable that holds our neural network model (not yet trained at this point).

  • x_train is a variable that holds the training data, which are arrays of pixel values representing the images of digits. (We looked at this in its raw matrix-like form last time and it was pretty :))

  • x_train[:1] is a way of slicing the array to get the first element, which is the first image in the training data.

Putting it all together...


  • model(x_train[:1]) is a way of calling the model on the input data, which returns the output of the model as a tensor.

  • .numpy() is a method that converts the tensor to a numpy array, which is a more common and convenient way of working with numerical data in Python.

So now our predictions variable contains the output of the model for the first image in the training data.
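If you want to convince yourself of what's going on, a quick optional sanity check is to print the shapes involved. This is just a sketch - it assumes the standard MNIST load from the previous parts, where x_train holds 60,000 images of 28x28 pixels:

print(x_train.shape)      # (60000, 28, 28) - 60,000 images of 28x28 pixels
print(x_train[:1].shape)  # (1, 28, 28) - a mini batch containing just the first image
print(predictions.shape)  # (1, 10) - one row of 10 scores, one per digit class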


We can see this if we run the line:

predictions

Output:

array([[0.06881157, 0.07851647, 0.16719425, 0.08308695, 0.07556159,
        0.17163385, 0.08344478, 0.14588797, 0.05878582, 0.06707676]],
      dtype=float32)

The output is an array of 10 numbers, each one a logit (a raw score that we'll shortly turn into a proper probability) for the image belonging to one of the 10 classes of digits (0 to 9).



Let's quickly talk about this collection of logits.


Logits

What the hell is a logit, Sam?! Good question.

Sounds complicated, but a logit is just a raw, non-normalized score that gets mapped onto a probability in the range [0, 1] ...

So these logits above are the 10 non-normalized predictions that our classification model has generated (without training it yet).


Just to deep dive this a bit... because this confused me the first time round...


But why a vector - it's just a scalar isn't it?

What may appear confusing is that this is being described as a vector - when it is a single value, a number.

A logit is indeed a single number, but it can also be thought of as a vector in certain contexts.


Logit is a vector :

Although each logit is a single number, when we have multiple classes (more than 2), the collection of logits forms a vector.

Each element of the vector corresponds to a class, and once we pass the vector through the softmax operation, its elements sum to 1.

So, while individual logits are scalar values, their arrangement in a vector allows us to handle multi-class problems effectively.


In summary, logit values are both single numbers and components of a vector, because together the ten of them represent our digit classes 0 to 9 ...
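To see which digit the (still untrained) model currently favours, you can simply ask for the index of the largest value in that vector. A tiny sketch, using the predictions variable from above:

import numpy as np

# The index of the largest logit is the class the model currently favours.
# Before training, this is essentially a coin toss between the 10 digits.
print(np.argmax(predictions[0]))  # 5 for the example output above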


Back to our output:


array([[0.06881157, 0.07851647, 0.16719425, 0.08308695, 0.07556159,
        0.17163385, 0.08344478, 0.14588797, 0.05878582, 0.06707676]],
      dtype=float32)

If our array represents the probability that the image belongs to one of the 10 classes of digits (0 to 9), why does it already contain a bunch of low probabilities?

We haven't done anything yet, have we?

We haven't :)

The model is initialized with random values for the parameters, so the output logits are also random.

You can also see that all the logits in this case have low values, such as 0.17 (17%) and 0.08 (8%).

This is because as the model trains on the data, it adjusts the parameters to minimize the loss and increase the accuracy.

This results in higher logits for the correct class and lower logits for the other classes.

At the moment, it's all just random!

Gotchya.

So without further ado, let's feed these logits into our softmax function.


The Softmax Function



So the next part is to run the softmax function:

tf.nn.softmax(predictions).numpy()

Let's talk about what a softmax function is then we'll work out how we managed to get to it. Breaking the above line down:

As I'm sure you know by now, tf is our TensorFlow library import, and nn is a module containing primitive neural network operations, including our softmax function.

So, to the rest of the code:


Predictions:

The variable 'predictions' contains the output of a neural network model

Softmax Function:

The softmax function is a mathematical operation that converts logits into probabilities.

It takes an array of logits and returns an array of probabilities, ensuring that the probabilities sum to 1


So what is the syntax doing here?

tf.nn.softmax(predictions) applies the softmax function to the predictions array.

The result is a new array where each element represents the probability of a specific class.

For example, if you have 10 classes (e.g., digits 0 to 9) as we do, the output will be an array of 10 probabilities.


Right, great, not really understanding the point, why mess with the logits?

A bit more detail, please!?

Thing is, the softmax function gets a bit mathsy: it's a mathematical function that converts a vector of real numbers into a probability distribution.

It is generally used as the last activation function of a neural network to normalize the output.
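If you'd like to see that there's no magic in it, softmax can be written in a couple of lines of NumPy. This is just a sketch of the idea, not TensorFlow's exact implementation:

import numpy as np

def softmax(logits):
    # Exponentiate each score, then divide by the total so the results sum to 1
    exps = np.exp(logits - np.max(logits))  # subtracting the max stops exp() overflowing
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # [0.659... 0.242... 0.098...]
print(probs.sum())  # 1.0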

So the last bit is something called numpy...


Numpy()



So what is the Nump(t)y method we are using and why?


You've probably heard of NumPy - NumPy is a Python library that provides efficient and convenient operations on arrays and matrices. This is the bread and butter of machine learning.

The numpy() method converts a tensor object into a numpy array object.

This allows using NumPy operations on tensors and vice versa


You can see this difference clearly if you print both versions:

print("Non numpified", tf.nn.softmax(predictions))
print("\n Numpified ",tf.nn.softmax(predictions).numpy())
Non numpified 
tf.Tensor(
[[0.06869026 0.04607732 0.15232012 0.1365249  0.04827461 0.08028805
  0.18604493 0.13675681 0.06063688 0.08438616]], shape=(1, 10), dtype=float32)

Numpified  
[[0.06869026 0.04607732 0.15232012 0.1365249  0.04827461 0.08028805
  0.18604493 0.13675681 0.06063688 0.08438616]]

It essentially strips the shape and data type and leaves us with the plain numpy array of 10 values. Cool.


The next bit is pretty important, this is the Loss function.


Loss Function



So let's talk a bit more about the loss function!

A loss function is a way of measuring how well a machine learning model can predict the expected outcome from a given input.

It is a function that takes the actual value and the predicted value as inputs and returns a numerical value that represents the difference or error between them.

The whole point of our neural network model is to minimize the loss function.

The lower the score, the better the model!


This is our next line of code:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

So what does this mean?


loss_fn is a variable name that stores the loss function object.


You can use any name you want, but loss_fn is a common convention - otherwise it just gets a bit confusing if you call it batman_function or something.

Although it is a bit like Batman in that it does all the hard work in a vigilante sense.


tf.keras.losses.SparseCategoricalCrossentropy

This is a class that implements the sparse categorical cross-entropy loss function.

What?! haha ... exactly.

This is a type of loss function that measures how well a model predicts the correct category for each input. There are different types of loss functions, but to keep things simple we'll just talk about this specific one:


Categorical Cross-Entropy (CCE): This is the loss function for multi-class classification, which means the task of predicting a value that can be one of several categories, such as handwritten digits ;) or car brands or something. CCE measures the difference between the actual and predicted probabilities of each class, and it penalizes wrong predictions more than correct predictions.

The word 'sparse' in front is just a subtype of the function. All it means is that the labels are single integers (like 0 for the handwritten digit 0) rather than one-hot encoded vectors.


Finally:

from_logits=False is a parameter that tells the loss function whether the model's output still needs a softmax applied (True) or has already been normalised into probabilities (False).

Ours has already been normalised :) so from_logits is False!
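If the name still sounds scarier than it is, here's the intuition in a few lines of NumPy: for a single example, sparse categorical cross-entropy is essentially minus the log of the probability the model gave to the correct class. This is a sketch of the idea (for the from_logits=False case, where the inputs are already probabilities), not TensorFlow's exact implementation:

import numpy as np

def sparse_categorical_crossentropy(true_label, probs):
    # Loss = -log of the probability assigned to the true class.
    # A confident correct prediction (p near 1) gives a loss near 0;
    # a confident wrong prediction (p near 0) gives a huge loss.
    return -np.log(probs[true_label])

probs = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05])
print(sparse_categorical_crossentropy(5, probs))  # about 0.60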



Back to the code - Let's go ahead and use our loss function


Next line of code:

loss_fn(y_train[:1], predictions).numpy()

Quick breakdown:

loss_fn: A loss function (specifically, Sparse Categorical Cross-Entropy).

y_train: The true labels (ground truth) for training data.

predictions: The model’s output (logits or raw scores) for a specific example (in this case, the first example in the training data).


Putting it together:

loss_fn(y_train[:1], predictions) Computes the loss between the true labels and the model’s predictions.


If you want to, you can look at the y_train[:1] slice of the y_train array to get the 'truth' of the first example in x_train.

print(y_train[:1])

So we can see it shows the first example's true label is : (Drum roll)

[5]

and if you are so inclined - you can pull out the first x_train example to see that 5 in its digital form:

print(x_train[:1])

Although this comes out a bit messy and you have to go through aligning some of the brackets if you want to see this in all its glory - you can just trust me that it is a 5 if you like :D
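If squinting at a wall of brackets isn't your thing, an optional way to see that 5 is to plot the image. This little sketch assumes you have matplotlib installed - it isn't part of the tutorial's code:

import matplotlib.pyplot as plt

# Show the first training image as a picture rather than a matrix of numbers
plt.imshow(x_train[0], cmap='gray')
plt.title(f"Label: {y_train[0]}")
plt.show()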


So to summarise our loss function:

It represents how well the model’s predictions match the ground truth labels.

Lower loss values indicate better alignment between predictions and true labels.

The specific value you get (e.g., 2.3631315) reflects the quality of the model’s performance on that particular example.

Loss functions are essential for training neural networks.

They guide the optimization process by quantifying how far the model’s predictions are from the actual targets.


Now on to the fun bit, let's train and evaluate our model



Compile the model!!!



Our next line of code:

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

The model will use the adam algorithm to update its weights based on the loss function value.


Adam

The Adam algorithm is a bit complicated and gets a bit mathsy - but a short summary is as follows:

The Adam algorithm (short for Adaptive Moment Estimation) is an optimization technique commonly used in training neural networks.

The key steps in Adam are:

  • Estimating the first moment (mean) and second moment (uncentered variance) of the gradients.

  • Computing an individual adaptive learning rate for each weight (parameter).

  • Adapting those rates as training unfolds.

That'll do on Adam... one for a deep dive another time I feel :)


The other bits of the code are straightforward. The loss is the loss_fn function we defined earlier, which measures how well the model fits the data.

Finally the metric: accuracy is simply the fraction of correctly classified samples, which the model will report as it trains.
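Accuracy really is as simple as it sounds. A tiny sketch of the idea with some made-up labels (nothing to do with our model's code):

import numpy as np

# Accuracy = fraction of predictions that match the true labels
true_labels      = np.array([5, 0, 4, 1, 9])
predicted_labels = np.array([5, 0, 4, 7, 9])
print(np.mean(true_labels == predicted_labels))  # 0.8 - four out of five correct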


Now let's train our model!

Fitting your model



model.fit(x_train, y_train, epochs=20)

So what does this model.fit business do?


It means that the model will learn from the input data (x_train) and the corresponding labels (y_train) by adjusting its parameters over 20 iterations (epochs).


For each epoch, the data will be divided into batches of a certain size (specified by the batch_size argument) and the model will update its parameters after processing each batch.


We haven't specified the batch size, so it picks up the default of 32 - our call is equivalent to writing it out explicitly:

model.fit(x_train, y_train, batch_size=32, epochs=20)

So when we look at our output after our beautiful machine crunches away with its epoch processing we see this awesome output! :


Epoch 1/20
1875/1875 [==============================] 
10s 5ms/step - loss: 2.3347 - accuracy: 0.2087

Epoch 2/20
1875/1875 [==============================] 
7s 4ms/step - loss: 2.3017 - accuracy: 0.2047

To Epoch 20 ....

Epoch 20/20
1875/1875 [==============================] 
9s 5ms/step - loss: 2.3026 - accuracy: 0.1456

Let's break down this output from the training process so it makes sense:


Epoch 20/20:

This indicates that the training process has completed 20 epochs (iterations) of training.

An epoch represents one complete pass through the entire training dataset.


1875/1875:

The first number (1875) represents the batch number within the current epoch.

The second number (1875) represents the total number of batches in the entire training dataset.

In this case, the training data is divided into 1875 batches, and the model has completed processing all of them in this epoch.


[==============================]:

Hehe :) This is a visual representation of the progress bar. It shows how far along the training process is within the current epoch - if you run the code you get to watch this tick up in a pleasurable way :)


9s 5ms/step:

How long it took: about 9 seconds for the whole epoch, at roughly 5 milliseconds per batch step.


loss: 2.3026 - accuracy: 0.1456:

These are the training metrics reported at the end of the epoch:

Loss: The value of the loss function (in this case, 2.3026).

Accuracy: The fraction of correctly classified samples (in this case, 0.1456 or 14.56%).


If you want to have some fun you can change the numbers of the iterations (Epochs) and see how it does :)


Time to Evaluate!



And our final line of code:

model.evaluate(x_test,  y_test, verbose=2)

Output :


313/313 - 1s - loss: 2.3026 - accuracy: 0.1019 - 1s/epoch - 3ms/step
[2.30259108543396, 0.10189999639987946]

The line model.evaluate(x_test, y_test, verbose=2) evaluates the trained machine learning model on a test dataset.


Let’s break it down:

x_test: This is the input data (features) for the test dataset.

y_test: These are the true labels (ground truth) corresponding to the test data.

verbose=2: This argument controls the verbosity of the evaluation output. The value 2 prints a single summary line rather than an animated progress bar.


The output of this line will include the loss value and any additional metrics (such as accuracy) computed by the model on the test data. For example:

313/313 - 1s - loss: 0.0778 - accuracy: 0.9752

Here we can see:

loss: The value of the loss function (in this case, 0.0778).

accuracy: The fraction of correctly classified samples (in this case, 0.9752 or 97.52%).

In summary, this line evaluates how well the trained model performs on unseen test data - and ours is pretty good at 97.5%. Wasn't that fun :D
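As that last line of output shows, evaluate() also hands back the loss and accuracy as a list, so you can capture them in variables if you want to use them programmatically. A small sketch:

# evaluate() returns [loss, accuracy] because we asked for the 'accuracy' metric at compile time
test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=2)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_accuracy:.2%}")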


Conclusion

Well done - you've finished your first neural network. It can now read handwritten digits ... perhaps make your own phone app to do it? Here's how you can utilise your trained model:

Package the model up:

This step involves saving the model’s architecture, weights, and any other necessary components.
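In recent versions of TensorFlow/Keras this packaging step is pretty much a one-liner. A minimal sketch (the filename here is just an example):

# Save the trained model (architecture + weights) to a single file
model.save('mnist_digits.keras')

# ... and later, in your app, load it back without retraining
import tensorflow as tf
loaded_model = tf.keras.models.load_model('mnist_digits.keras')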


Integration into Your Application:

This could be your own web app or phone app; you can use a web service such as a REST API to expose your model's predictions to your app.


Input Data:

Don't forget you'll need to clean and prepare the input data for inference (making predictions). This is often the trickiest part of the job!

You've also got to ensure that the input data format matches what the model expects (there's a small sketch of this after the next section).


Call the model:

The model will produce predictions based on the learned patterns - to about 97% accuracy! awesome :D
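Putting the last two steps together, here's a hedged sketch of what the inference code might look like. It assumes your input arrives as a 28x28 grayscale image with pixel values 0-255, that the model was trained on values scaled to [0, 1] (the standard MNIST setup), and that loaded_model is the model loaded in the sketch above (or just model if you're still in the same session) - adapt it to however your app actually captures images:

import numpy as np

def predict_digit(image_28x28):
    # image_28x28: a 28x28 array of grayscale pixel values in the range 0-255
    img = np.asarray(image_28x28, dtype='float32') / 255.0  # match the training preprocessing
    batch = img[np.newaxis, ...]                            # the model expects a batch: (1, 28, 28)
    probs = loaded_model(batch).numpy()[0]
    return int(np.argmax(probs))                            # the digit the model thinks it is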


Then blow your friends away with your own app that can sort of read handwritten digits (if you can automate the data preparation from your phone camera) ... or maybe your phone's built-in plugin. But you can now see how they did that!? Awesome.



Well, that's our first machine learning tutorial done.

Hope this has been fun! :) You've got to tie your shoe laces before you can walk, so if this was hard going, stick with it - it's worth it. See you on our next journey into machine learning!




