Developing a final model

It was clear from our previous results that the Sklearn models are more promising than the Tensorflow model. We will therefore use them to build a final model, test it on our 20% holdout data, and submit predictions to the Kaggle competition the data came from. The competition's scoring data is separate from the data used to train the model.



Voting Model

The three models we used (Row/Column, Image, Combined) had accuracies ranging from 90-97% on our test data. It's unlikely that all of these models err on the same images, so let's test what a simple voting algorithm provides. Taken individually, the models make 953 unique mispredictions. With a best-2-of-3 voting scheme, the mispredictions drop to 244/8400, or about 3%.
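As a minimal sketch of the voting step, something along the following lines works; the prediction arrays and the true-label array are hypothetical names standing in for the outputs of our three models on the holdout set:

import numpy as np

# Hypothetical label arrays from the three models, each of shape (n_samples,),
# plus y_true holding the known holdout labels.
preds = np.vstack([rowcol_preds, img_preds, combined_preds])

n_samples = preds.shape[1]
majority = np.empty(n_samples, dtype=int)
no_consensus = np.zeros(n_samples, dtype=bool)

for i in range(n_samples):
    counts = np.bincount(preds[:, i], minlength=10)
    majority[i] = counts.argmax()          # digit chosen by at least 2 of 3 models
    no_consensus[i] = counts.max() == 1    # all three models disagree

# Samples with no consensus get a placeholder here; they are handled
# by the tiebreaking schemes described later.
print(f"Voting errors: {(majority != y_true).sum()} / {n_samples}")
print(f"No consensus on {no_consensus.sum()} samples")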


Figure 1: Confusion matrix for predictions from the combined voting scheme. When the vote reaches consensus, there is roughly a 98% chance that the prediction is correct.



The question now becomes what to do when there is no consensus. First, let's look at which numbers are hard to classify, and see whether there is any insight into why:

Figure 2: Subsample of images where there is no consensus in our voting scheme.
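A subsample like this can be pulled out by masking on the no-consensus flag from the voting step and plotting the raw pixels. A rough sketch, assuming X_test holds the flattened 28x28 holdout images (the variable names are placeholders):

import numpy as np
import matplotlib.pyplot as plt

# X_test: assumed array of flattened 28x28 images, shape (n_samples, 784)
# no_consensus, y_true: from the voting step above
hard_idx = np.where(no_consensus)[0][:16]   # first 16 disputed images

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, idx in zip(axes.ravel(), hard_idx):
    ax.imshow(X_test[idx].reshape(28, 28), cmap="gray")
    ax.set_title(f"true: {y_true[idx]}")
    ax.axis("off")
plt.tight_layout()
plt.show()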



Well, it's no surprise our algorithm has issues: these numbers are tough for a human to read. This is only a subsample of the images, but a quick glance suggests they are all going to be difficult. For our purposes, errors on poorly written numbers are acceptable. We could work hard to optimize for these cases, but given sloppy handwriting, pens running out of ink, or errors in capturing the images, that is a futile effort. We can only push the algorithm so far, and we want one that focuses on matching legible numbers rather than one overfit to our dataset.

Therefore, we settle on three different schemes for breaking a tie in the vote. Looking at the breakdown of the Row/Col, Image, and Combined models, we can calculate the fraction of the time each model is correct for each digit it predicts. Some models are better at catching certain numbers, and that earns them more weight. Two tiebreakers follow from this: take the model that is most often correct for its prediction, or select randomly using these weights. Randomness is not a desirable property in a model, but it is an option. The third tiebreaker simply assumes the most accurate model is correct and takes the prediction from the Combined model.
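A sketch of the three tiebreakers, assuming a per-model, per-digit accuracy table acc built from the test set (the acc array, model ordering, and variable names here are hypothetical; preds, majority, and no_consensus come from the voting step above):

import numpy as np

rng = np.random.default_rng(0)

# acc[m, d]: assumed fraction of the time model m is correct when it predicts digit d.
# Model order assumed: 0 = Row/Col, 1 = Image, 2 = Combined.

def random_tiebreak(i):
    """Draw one of the three disagreeing predictions, weighted by each
    model's accuracy on the digit it predicted."""
    weights = np.array([acc[m, preds[m, i]] for m in range(3)])
    m = rng.choice(3, p=weights / weights.sum())
    return preds[m, i]

def max_likelihood_tiebreak(i):
    """Take the prediction from the model most likely to be correct on its predicted digit."""
    weights = np.array([acc[m, preds[m, i]] for m in range(3)])
    return preds[weights.argmax(), i]

def combined_tiebreak(i):
    """Fall back to the single most accurate model, the Combined model."""
    return preds[2, i]

final = majority.copy()
for i in np.where(no_consensus)[0]:
    final[i] = max_likelihood_tiebreak(i)   # or random_tiebreak / combined_tiebreak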

Figure 3: Our three final models. Each starts from the majority vote; if there is no consensus, each handles the remaining cases differently. The random model (left) draws from a weighted random distribution, with weights taken from each model's accuracy on the digit it predicted. The maximum likelihood model (middle) uses the same weights but takes the prediction most likely to be correct. The Combined Model tiebreaker (right) takes the prediction from the most accurate of our models, the Combined model. The accuracies, from left to right, are 97.5%, 97.6%, and 97.6%, virtually indistinguishable from one another.



Final Scores

The last thing to do is submit our final results to Kaggle; a sketch of writing the submission file follows the results table. The competition uses 25% of the test data to calculate the public score. Unfortunately, due to the nature of this problem, the leaderboard is rife with people who have cheated by labelling the values by eye and posting a 100% score. A genuine classifier is not perfect and will sit in the high 90s. Our final results are:



Classifier                          Accuracy
Combined Model                      97.9%
Combined Model as tiebreaker        97.7%
Maximum Likelihood as tiebreaker    97.7%
Randomized tiebreaker               97.6%

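For completeness, the submission file itself can be written with pandas along the lines below, assuming the competition expects an ImageId/Label CSV (check the competition's sample submission); final_preds is a hypothetical array of predicted digits for the competition test set:

import pandas as pd

# final_preds: hypothetical array of predicted digits for the Kaggle test set
submission = pd.DataFrame({
    "ImageId": range(1, len(final_preds) + 1),  # IDs assumed to be 1-indexed
    "Label": final_preds,
})
submission.to_csv("submission.csv", index=False)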


It appears our Combined Model alone may outperform our tiebreaking schemes. Approaching 98% accuracy is commendable given the simplicity of the model and the variation in handwriting. The fact that these percentages are similar to those on our test set suggests we have a robust, well-generalized model and did not overfit. Sklearn is extremely easy to work with and outperformed the Tensorflow model, making it my tool of choice for future machine learning.

Back to project home.