The hardest thing for a Machine Learning Algorithm (MLA) to say

I don’t know (Photo by Paolo Nicolello on Unsplash)

Abstract: In this article, you will learn why machine learning algorithms find it hard to detect novelty, consider how humans are better at this, find out why some common approaches fall short, and appreciate the benefits of getting a machine learning algorithm to say "I don't know".

A thought experiment

Imagine a thought experiment where you are trying to teach a child the various types of fruit, such as bananas, apples, oranges and watermelons. You show the child a number of pictures of each fruit (in various sizes and shapes) so that the child becomes familiar with them. At the end of it, when you show the child a fruit, it is able to identify it correctly based on what it has learnt.

Images of fruits that comprise training data (Photo by Kamala Saraswathi on Unsplash)

Now assume that this child has never seen pictures of a lion or an elephant, and that you show it such a picture and ask it to identify what it sees. A likely outcome is that the child stays quiet or says "I don't know". Of course, in some cases the child may identify it as one of the fruits it has recently become familiar with, but I imagine this would be unlikely.

The MLA in the hot-seat

Do you think the same experiment, done with a Machine Learning Algorithm (MLA), would yield the same result? In other words, would the MLA be able to say "I don't know", or in machine learning parlance, classify the picture as a novel or unknown object?

In this experiment, the MLA would probably classify the lion as an orange because of the similarity in color, and the elephant as a banana because of the similar shape of the elephant's trunk.

Note that MLAs learn the patterns in the images and associate them with the appropriate image labels.

Approaches that may not work

Sometimes it is as important to know what will not work as it is to know what will. Here are some approaches we might think of, but that fall short in one way or another.

Training with more data — While the answer to most problems in machine learning may be to train with more data, this case falls outside of that. No matter how much data you have, there can always be an image that is out of distribution, i.e. outside the scope of the training or test data, but that is encountered in a real-world deployment. Even a well-trained MLA will still not identify the out-of-distribution object as an unknown.

Introducing more classification classes — Again, simply increasing the number of classes and turning this into a multi-class classification problem does not resolve the issue. Though the classification granularity may improve, it does not address the core problem of identifying an out-of-distribution object.

Adding negative classes — One approach could be to add a collection of random objects unrelated to the objective and label all of them as unknown in the training data. For the experiment discussed above, add images of cars, planes, animals, birds, pets, furniture etc. to the training data and label all of these as unknown. The question here is how exhaustive the negative class can be made, i.e. how many random objects can be included, and what the consequent impact is on the cost and complexity of the solution.

Confidence scores — MLAs may attach a confidence score to their predictions, expressed as a probability or a percentage. For example, a new image may be classified as an orange with a confidence of 92%. While this seems like a sensible approach, note that the confidence score is based on how closely the new image matches the patterns the MLA has been trained on. A classification model may therefore return a very high score for unknown classes, especially when those classes share common patterns. For example, a lion may be classified as an orange with a confidence score of 78% because of a similarity in color, as the short sketch below illustrates.
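To make this concrete, here is a minimal sketch in Python using scikit-learn on made-up two-feature data. The feature values, labels and the "lion" input are purely illustrative assumptions, not the article's data; the point is only that a standard classifier can report high confidence for an input that looks nothing like its training data.

```python
# A minimal sketch (synthetic, made-up data) showing how a standard classifier
# can report high confidence on an input far from anything it was trained on.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy training data: two made-up features (think hue and elongation) per fruit.
oranges = rng.normal(loc=[0.80, 1.0], scale=0.05, size=(50, 2))
bananas = rng.normal(loc=[0.15, 3.0], scale=0.05, size=(50, 2))
X_train = np.vstack([oranges, bananas])
y_train = np.array(["orange"] * 50 + ["banana"] * 50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# An out-of-distribution input (think "lion"): nothing like the training data,
# yet the model still puts almost all of the probability mass on a known class.
lion = np.array([[0.75, 8.0]])
for label, p in zip(clf.classes_, clf.predict_proba(lion)[0]):
    print(f"{label}: {p:.2f}")
```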

What then is a workable approach?

Let us work through an example to get some ideas on what can be done.

Consider the representative sample data below, which uses two features, Width (x1 in m) and Length (x2 in m), to label a vehicle as a Car or Not-A-Car (class labels).

Lines drawn parallel to the x and y axes at the data boundaries split the two-dimensional (2-D) space into areas that may be labeled with the appropriate class labels. Note that the test data points, shown in another table and plotted on the graph in red, are given the label of the area in which they fall even though they are quite a distance away from the training data points plotted in blue. This is why an MLA will classify any novel data point into a known class rather than call it out as Unknown; the small sketch after the figure mimics this behavior.

Training and test data plotted on a graph with classification regions (Source: youplusai.com)
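As a rough illustration of the figure's point, the sketch below trains a decision tree, which carves the feature space into axis-aligned regions much like the boundaries described above, on a handful of made-up Width/Length values (not the article's actual table). Even test points that lie far from the training cloud receive a known label.

```python
# A small sketch (illustrative numbers only) of how a model that carves the
# feature space into axis-aligned regions labels every point somewhere,
# no matter how far it lies from the training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Width (x1, m) and Length (x2, m) for a handful of made-up training vehicles.
X_train = np.array([
    [1.8, 4.5], [1.7, 4.2], [1.9, 4.8],   # Car
    [0.7, 1.8], [0.8, 2.0], [0.6, 1.7],   # Not-A-Car
])
y_train = ["Car", "Car", "Car", "Not-A-Car", "Not-A-Car", "Not-A-Car"]

tree = DecisionTreeClassifier().fit(X_train, y_train)

# Test points far outside the training cloud still fall inside one of the
# labeled regions, so they receive a known class label, never "Unknown".
X_test = np.array([[15.0, 40.0], [0.05, 0.1]])
print(tree.predict(X_test))   # e.g. ['Car' 'Not-A-Car']
```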

What we need is a mechanism to mark regions of the 2-D space that lie sufficiently far from any training data as Unknown. If that were done, the test data points in this example would fall into the Unknown region and would therefore be marked as Unknown rather than assigned to one of the known classes.

Test data points that are significantly further from training data labeled as Unknown (Source: youplusai.com)
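One simple way to approximate this idea, sketched below, is to keep the original classifier but report Unknown whenever a test point's distance to its nearest training point exceeds a chosen threshold. The data, the threshold value and the helper name are illustrative assumptions, and nearest-neighbor distance is only one possible novelty criterion; density-based or one-class methods could be substituted in the same wrapper.

```python
# A minimal sketch of the idea: if a test point's distance to its nearest
# training point exceeds a threshold, report "Unknown" instead of a class.
# The threshold and data here are assumptions for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([
    [1.8, 4.5], [1.7, 4.2], [1.9, 4.8],
    [0.7, 1.8], [0.8, 2.0], [0.6, 1.7],
])
y_train = ["Car", "Car", "Car", "Not-A-Car", "Not-A-Car", "Not-A-Car"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
nn = NearestNeighbors(n_neighbors=1).fit(X_train)

def predict_with_unknown(X, max_distance=1.0):
    """Return the classifier's label, or 'Unknown' for points far from training data."""
    distances, _ = nn.kneighbors(X)
    labels = clf.predict(X)
    return [
        label if d <= max_distance else "Unknown"
        for label, d in zip(labels, distances[:, 0])
    ]

print(predict_with_unknown(np.array([[1.75, 4.4], [15.0, 40.0]])))
# -> ['Car', 'Unknown'] with this toy data and threshold
```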

Note that what is described above is an approach and not a fool-proof solution.

Identifying what you don’t know in machine learning is a hard problem that has not been completely solved yet.

One of the interesting research papers I read on this topic, thanks to a reference from Aaron Chavez, is titled "Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?" In this paper, the authors compare several state-of-the-art uncertainty quantification methods for deep neural networks with regard to their ability to detect novel inputs. They conclude that current uncertainty quantification approaches alone are not sufficient for overall reliable out-of-distribution detection (detection of data that strongly differs from the training data).

In a real-world production deployment of a machine learning based solution, the ability to call out out-of-distribution data, i.e. novelty, as an Unknown class may have several benefits.

  • Avoidance — of misclassifying novel data.
  • Utility — in domains where it is difficult to obtain fully representative data for training.
  • Alerts — could be raised if the percentage of data inputs that got classified as an Unknown class exceeded a set threshold (say 10% of the total inputs classified).
  • Retraining — could be triggered if a significant number of data inputs get classified as Unknown (a minimal sketch of this kind of monitoring follows below).
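Here is a hedged sketch of the last two points. The function names and the 10%/25% thresholds are assumptions for illustration, not part of any particular library or the only sensible values.

```python
# Track the share of inputs classified as Unknown and flag an alert (or a
# retraining trigger) when it crosses a chosen threshold. Thresholds are
# illustrative assumptions only.
def unknown_rate(predictions):
    """Fraction of predictions labeled 'Unknown'."""
    return sum(p == "Unknown" for p in predictions) / max(len(predictions), 1)

def monitor(predictions, alert_threshold=0.10, retrain_threshold=0.25):
    rate = unknown_rate(predictions)
    if rate >= retrain_threshold:
        return f"Retraining suggested: {rate:.0%} of inputs were Unknown"
    if rate >= alert_threshold:
        return f"Alert: {rate:.0%} of inputs were Unknown"
    return f"OK: {rate:.0%} of inputs were Unknown"

print(monitor(["Car", "Unknown", "Car", "Car", "Unknown", "Car", "Car", "Car"]))
# -> "Retraining suggested: 25% of inputs were Unknown" with this toy batch
```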

With an active interest in Data Science, I will be exploring many more topics in this domain. If you’re interested in such endeavors, please join along my journey by following me on Twitter and subscribing to the youplusai YouTube channel.

Many thanks to Madhusoodhana Chari for his valuable feedback.