Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents. Think about, for example, the ZIP codes on letters at the post office and the automation needed to recognize these five digits. Perfect recognition of these codes is necessary in order to sort mail automatically and efficiently.
In this article, we look at how to use the K-Nearest Neighbours algorithm to recognize handwritten digits from 0 to 9, using the Digits dataset from the scikit-learn library.
Scikit-learn is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbours, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The scikit-learn library also provides several datasets that are useful for testing data analysis and prediction problems.
Now that we have a brief overview of the algorithm and the dataset used for this project, let's move on to the implementation.
Import necessary libraries
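A minimal sketch of the imports used in the rest of this walkthrough, assuming scikit-learn and matplotlib are installed. The exact import list is an assumption based on the steps that follow.

```python
# Assumed imports for the steps in this article
import matplotlib.pyplot as plt                       # plotting the digit images
from sklearn.datasets import load_digits              # the Digits dataset
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.neighbors import KNeighborsClassifier    # the k-NN model
from sklearn.metrics import accuracy_score            # evaluation metric
```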
Loading dataset
As mentioned before, we load the digits dataset from sklearn. The digits dataset consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale. To confirm that we have 1,797 images, we check the shape attribute.
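One way to load the dataset and check its shape is sketched below, using load_digits from sklearn.datasets. The variable name digits matches the name used in the text.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# images: one 8x8 array per sample
print(digits.images.shape)  # (1797, 8, 8)
# data: the same pixels flattened to 64 features per sample
print(digits.data.shape)    # (1797, 64)
```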
Looks good!
The images of the handwritten digits are contained in the digits.images array. Each element of this array is an image represented by an 8x8 matrix of numerical values that correspond to a grayscale intensity from 0 (white) to 16 (black). The first image, digits.images[0], is one such matrix.
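As a small sketch, the first image can be printed directly to inspect its pixel matrix. The name first_image is only illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# The first image is an 8x8 array of pixel intensities
first_image = digits.images[0]
print(first_image)
print(first_image.shape)
```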
You can check this result visually using the matplotlib library.
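A possible way to display the first image with matplotlib is sketched here. The gray_r colormap and the figure size are assumptions chosen so that low values appear white and high values appear black.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# gray_r maps low values to white and high values to black
fig, ax = plt.subplots(figsize=(3, 3))
ax.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation="nearest")
ax.set_axis_off()
plt.show()
```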
The numerical values represented by the images, i.e., the targets, are contained in the digits.target array.
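A short sketch of how the targets can be inspected:

```python
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.target.shape)  # one label per image
print(digits.target[:10])   # labels of the first ten images
```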
This is just one image. Let's visualize, say, the first 10 images with their target values.
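The first 10 images can be drawn in a grid, each titled with its target value. This is a sketch: the 2x5 layout, figure size, and colormap are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# 2 rows x 5 columns: one panel per image, titled with its target value
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, image, label in zip(axes.ravel(), digits.images[:10], digits.target[:10]):
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Target: {label}")
    ax.set_axis_off()
plt.show()
```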
So far we have just explored the dataset. Now, let's move on to the real action!
Split the dataset into training and testing sets
Let's split the dataset into two subsets: a training set and a testing set.
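A minimal sketch of the split using train_test_split. The article does not state the split ratio or a random seed, so test_size=0.25 and random_state=42 are assumed values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# digits.data holds each 8x8 image flattened into 64 features.
# test_size and random_state are assumed values, not taken from the article.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs give the same accuracy.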
Modelling
Since we have decided to use the k-nearest neighbours algorithm for this project and have already imported the necessary class, let's create an instance of KNeighborsClassifier.
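Creating the classifier might look like the sketch below. The article does not say which value of k was used, so n_neighbors=5 (scikit-learn's default) is an assumption, and the variable name knn is illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=5 is scikit-learn's default, written out explicitly here;
# the value used in the original article is not stated.
knn = KNeighborsClassifier(n_neighbors=5)
print(knn)
```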
Next, we train the model using this instance.
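Training amounts to calling fit on the training data. The sketch below repeats the loading and splitting steps so that it runs on its own; the split parameters and n_neighbors are assumed values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42  # assumed values
)

knn = KNeighborsClassifier(n_neighbors=5)  # assumed k
# For k-NN, fitting stores the training samples for later neighbour lookups
knn.fit(X_train, y_train)
print(knn.classes_)
```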
Now we have to test our classifier by having it interpret the digits in the test set. We save the predictions in the variable prediction.
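Prediction on the test set can be sketched as follows. The setup is repeated so the block is self-contained, and the split parameters and n_neighbors are again assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42  # assumed values
)
knn = KNeighborsClassifier(n_neighbors=5)  # assumed k
knn.fit(X_train, y_train)

# One predicted digit per test image
prediction = knn.predict(X_test)
print(prediction[:10])
print(y_test[:10])
```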
Measuring the accuracy of the model
To check the accuracy of the model, we use the accuracy_score() function from sklearn.metrics.
The accuracy_score() function compares the predicted values with the true test set values and returns the fraction of test values that were predicted correctly.
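An end-to-end sketch of the accuracy computation is given here. Because the split parameters and n_neighbors are assumed values, the exact score it prints may differ slightly from the figure quoted in this article.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42  # assumed values
)
knn = KNeighborsClassifier(n_neighbors=5)  # assumed k
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

# Fraction of test labels predicted correctly (a value between 0 and 1)
accuracy = accuracy_score(y_test, prediction)
print(accuracy)

# The same quantity computed by hand
manual_accuracy = (prediction == y_test).mean()
```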
The result shows the value 0.98, which means that our model can predict the target values of 98% of the images correctly.
We can also try other algorithms, such as SVM or a random forest classifier. You can find the entire code here.
I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com