Classifying hand written characters with Keras
In the previous article, we saw how to apply principal component analysis to image recognition. Our performance was quite good, but clearly not state-of-the-art. Today, we are going to see how we can improve our accuracy using a convolutional neural network (CNN). The best results will be obtained by combining a CNN and support vector machines. This article is only meant as an introduction to CNNs and Keras, so feel free to jump to the last article of the series if you are already familiar with this framework.
Anatomy of a CNN
As one could guess, a simple CNN is enough to improve the results obtained in the previous post. We explain here how to build a CNN using
Keras (TensorFlow backend).
Several categories of neural networks are available on Keras, such as recurrent neural networks (RNN) or graph models. We will only use sequential models, which are constructed by stacking several neural layers.
We have several types of layers that we can stack in our model, including:
- Dense layers: The simplest layers, where all the weights are independent and the layer is fully connected to the previous and following ones. These layers work well at the top of the network, analysing the high-level features uncovered by the lower ones. However, they tend to add a lot of parameters to our model and make it slower to train.
- Convolutional layers: The layers from which the CNN takes its name. Convolutional layers work like small filters (often 3 or 4 pixels wide) that slide over the image (or the previous layer) and are activated when they find a specific pattern (such as straight lines or angles). Convolutional layers can be composed of numerous filters that learn to uncover different patterns. They offer translation invariance to our model, which is very useful for image classification. In addition, they have a reasonable number of weights (usually far fewer than dense layers), which makes the model faster to train.
- Pooling layers: Pooling layers are useful when used with convolutional layers. They return the maximum activation of the neurons they take as input. Because of this, they allow us to easily reduce the output dimension of the convolutional layers.
- Dropout layers: These layers are very different from the previous ones, as they only serve during training and are absent from the final model. Dropout layers randomly “disconnect” neurons from the previous layer during training. Doing so is an efficient regularisation technique that reduces overfitting (more details below).
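These four layer types map directly onto Keras classes. A minimal sketch (using the `tensorflow.keras` API, which is interchangeable here with standalone Keras; the parameter values are only illustrative):

```python
from tensorflow.keras.layers import Conv2D, Dense, Dropout, MaxPooling2D

dense = Dense(256, activation="relu")  # fully connected layer with 256 nodes
conv = Conv2D(32, kernel_size=(3, 3))  # 32 filters of 3x3 pixels sliding over the input
pool = MaxPooling2D(pool_size=(2, 2))  # keeps the max of each 2x2 window
drop = Dropout(0.25)                   # disconnects 25% of inputs during training
```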
Losses and metrics
Once our model is built, we need to compile it before training. Compilation is done by specifying a loss (here the categorical cross-entropy), a metric (here accuracy) and an optimization method. The loss is the objective function that the optimization method will minimize. Cross-entropy is a very popular choice for classification problems because it is differentiable, and reducing the cross-entropy leads to better accuracy. Choosing accuracy as our performance metric is fair only because the classes in our datasets are well balanced. I cannot emphasize enough what a poor choice accuracy would be if our classes were imbalanced (more of some characters than others).
Finally, we use root mean square propagation (RMSprop) as the optimization method. This method is a variant of classic gradient descent that adapts the learning rate for each weight. The optimizer still lets us tune the base learning rate: generally speaking, a smaller learning rate leads to better final results, even if the number of epochs needed for training increases. In practice this optimizer works well, and changing it has very little effect on performance.
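Putting these three choices together, compilation is a single call. A sketch, where the one-layer model is only a placeholder to make the snippet self-contained (0.001 is RMSprop's default learning rate, which can be lowered for a finer but slower search):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

# Placeholder model: 36x36 flattened inputs, 36 output classes.
model = Sequential([Dense(36, activation="softmax", input_shape=(1296,))])

# Loss, optimizer and metric as discussed above.
model.compile(loss="categorical_crossentropy",
              optimizer=RMSprop(learning_rate=0.001),
              metrics=["accuracy"])
```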
With all these tools, we define a first model for the consonants dataset (just assume we do the same for the numerals and the vowels). This model is meant to be trained from scratch without transfer learning or data-augmentation, in order to allow us to quantify the improvements brought by these techniques in another article.
A model to train from scratch
Now the fun part: we stack layers like pancakes, hoping we don’t do something stupid. If you follow this basic reasoning, nothing should go wrong:
- We start with a two-dimensional convolutional layer. We specify the number of filters we want for this layer: 32 seems like a good compromise between complexity and performance. Putting 32 filters in this layer means it will be able to identify up to 32 different patterns. It is worth noting that raising this number to 64 doesn’t improve the overall performance, but also doesn’t make the model notably harder to train. We specify a kernel size: 3 pixels by 3 pixels seems like a correct size, as it is enough to uncover simple patterns like straight lines or angles, but not too big given the size of our inputs (only 36x36 pixels, with a single channel since we work with grayscale images). Lastly, we specify an activation function for this layer. We will use rectified linear units (ReLU), as they efficiently tackle the vanishing gradient problem.
- We then add another convolutional layer to uncover more complicated patterns, this time with 64 filters (as we expect more distinct complex patterns than simple ones to emerge from our dataset). We keep the same kernel size and the same activation function.
- After that, we add a max-pooling layer to reduce the dimensionality of our inputs. The pooling layer has no weights or activation function, and will output the biggest value found in its kernel. We choose a kernel size of 2 by 2, to lose as little information as possible while reducing the dimension.
- After that pooling layer, we add a first dense layer to analyze the patterns uncovered by the convolutional layers. Being fully connected to both the previous layer and the following dense one, this layer’s size has a huge impact on the total number of trainable parameters of the model. We therefore try to keep it reasonably small, while keeping it large enough to fit the complexity of our dataset. Because our images are not very complex, we choose a size of 256 nodes. We add a ReLU activation function, as in the previous layers.
- Finally, we add the output dense layer, with one node for each class (36 for the consonant dataset). Each node of this layer should output the probability that our image belongs to the corresponding class. We therefore want our activation function to return values between 0 and 1 that sum to one across the nodes, and thus choose a softmax activation function instead of a ReLU as before.
Because of their complexity and their large number of weights, neural networks are very prone to overfitting. Overfitting can be observed when the accuracy on the training set is very high, but the accuracy on the validation set is much poorer. This phenomenon occurs when the model has learnt the training observations “by heart” but is no longer capable of generalizing its predictions to new observations. As a result, we should stop the training of our model when the accuracy on the validation set stops improving. Keras allows us to do that easily by saving the weights at each iteration, but only if the validation score improves.
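In Keras, this behaviour is provided by the `ModelCheckpoint` callback. A sketch, where the filename is illustrative (note that older Keras versions name the monitored metric `val_acc` rather than `val_accuracy`):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save the weights only when the validation accuracy improves.
checkpointer = ModelCheckpoint(filepath="best.weights.h5",
                               monitor="val_accuracy",
                               save_best_only=True,
                               save_weights_only=True)
# Later passed to model.fit(..., callbacks=[checkpointer]).
```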
However, if our model overfits too quickly, this method will stop the training too soon and the model will yield very poor results on the validation and testing sets. To counter that, we will use a regularisation method, preventing overfitting while allowing our model to perform enough iterations during the learning phase to be efficient.
The method we will use relies on dropout layers. Dropout layers are layers that will randomly “disconnect” neurons from the previous layer, meaning their activation for this training iteration will be null. By disconnecting different neurons randomly, we prevent the neural network from building overly specific structures that are only useful for memorizing the training observations rather than the “concept” behind them.
To apply this method, we insert two dropout layers in our model, one before each dense layer. Dropout layers require only one parameter: the probability that a neuron is disconnected during a training iteration. This parameter should be adjusted by trial and error, monitoring the accuracy on the validation set during training. We found that 25% for the first dropout layer and 80% for the second gives the best results.
We use Keras with a TensorFlow backend to implement our model:
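A plausible implementation of the architecture described above; note the `Flatten` layer, not mentioned explicitly in the walkthrough, which is needed to turn the 2D feature maps into a vector before the dense layers:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Dense, Dropout,
                                     Flatten, MaxPooling2D)
from tensorflow.keras.optimizers import RMSprop

model = Sequential([
    # 32 filters of 3x3 pixels over 36x36 grayscale (single-channel) images
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(36, 36, 1)),
    # 64 filters to uncover more complex patterns
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    # halve each spatial dimension, keeping the strongest activations
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),                    # first dropout layer
    Flatten(),                        # 2D feature maps -> flat vector
    Dense(256, activation="relu"),    # analyse the uncovered patterns
    Dropout(0.8),                     # second, more aggressive dropout layer
    Dense(36, activation="softmax"),  # one probability per consonant class
])

model.compile(loss="categorical_crossentropy",
              optimizer=RMSprop(),
              metrics=["accuracy"])
```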
Also, we will implement a
get_score function that will take as inputs the following:
- tensors: A whole dataset as a tensor
- labels: The corresponding labels
- model: The untrained Keras model for which we want to compute the accuracy
- epoch: An integer specifying the number of epochs for training
- batch_size: An integer, the size of a batch for learning (the greater the better, if the memory allows it)
- name: The name of the model (to save the weights)
- verbose: An optional boolean (default is False) that tells Keras whether to display progress information during training (useful for experimentation).
The function will:
- Perform one-hot encoding on the labels, so they can be understood by the model.
- Split our dataset into training, validation and testing sets as detailed above.
- Create a checkpointer which allows us to save the weights during training (only if the accuracy is still improving).
- Fit the model on the training set and monitor its performances on the validation set (to know when to save weights).
- Compute and print the accuracy on the testing set.
- Return the trained model with the best weights available.
We can now train our model and get our score:
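A sketch of `get_score` matching the description above; the 70/15/15 split ratio and the weight-file naming are assumptions, and `train_test_split` comes from scikit-learn rather than Keras:

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical


def get_score(tensors, labels, model, epoch, batch_size, name, verbose=False):
    # One-hot encode the labels so the model can use them.
    y = to_categorical(labels)
    # Assumed 70/15/15 train/validation/test split.
    x_train, x_rest, y_train, y_rest = train_test_split(tensors, y, test_size=0.3)
    x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=0.5)
    # Save the weights during training, only while the validation score improves.
    weights_file = f"{name}.weights.h5"
    checkpointer = ModelCheckpoint(filepath=weights_file, save_best_only=True,
                                   save_weights_only=True)
    # Fit on the training set, monitoring performance on the validation set.
    model.fit(x_train, y_train, epochs=epoch, batch_size=batch_size,
              validation_data=(x_val, y_val), callbacks=[checkpointer],
              verbose=1 if verbose else 0)
    # Restore the best weights, then report the accuracy on the testing set.
    model.load_weights(weights_file)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test accuracy: {accuracy:.4f}")
    return model
```

A call would then look like `model = get_score(consonant_tensors, consonant_labels, model, epoch=30, batch_size=128, name="consonants")`, where the variable names and hyperparameter values are placeholders.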
By reporting the results obtained for the three datasets, we see improvements compared to the SVC methods.
| CNN from scratch | SVC with PCA |
| --- | --- |
Ultimately, it is possible to increase this accuracy further by training our models on bigger datasets. But how can we get more training images when we only have the datasets we used? To answer this question, in the last article of the series we will:
- Use data-augmentation
- Train a generic model on the three datasets, before specializing it by replacing the last layers with support vector machines.