MineralId: On choosing the appropriate evaluation metric
This is an ongoing series about the progress on improving our mineral identification web app.
TL;DR:
Choosing the correct evaluation metric matters. In this post, I talk about why the evaluation metric nDCG@k (normalized Discounted Cumulative Gain), typically used in recommender systems, is the appropriate metric for the task.
Mineral classification is an inherently multi-class multi-label problem
Minerals typically do not occur in isolation in nature. For example, the photo below shows a rock with uvite crystals in talc schist. This means that classifying the image as either uvite or talc is “correct”, i.e., it is a multi-label problem.
However, we find that people normally do not go to such lengths to specify all the minerals in their photos. Labeling everything exhaustively would be prohibitively expensive, so we have to tackle this inherently multi-label problem as a single-label classification problem.
nDCG as the appropriate metric
A good evaluation metric helps frame the problem and gives us feedback on how well we are doing. Typically, accuracy, precision, and recall are good enough for straightforward classification problems. However, these metrics are not appropriate here, since they only consider the top prediction (the mineral with the highest probability in the output vector). In the example above, if the classifier gives talc 70% and uvite 30% but the image is labeled uvite, all three metrics would be hurt even though talc is also present in the photo.
Instead, it is more appropriate to treat this as a ranking problem where nDCG is a good evaluation metric.
\(DCG(y) = \sum_{i = 1}^{n} \frac{\mathbb{I}(y_i\text{ is relevant})}{\log_2(i + 1)}\)
Typically, you would normalize \(DCG(y)\) by the best possible \(DCG(y)\) so that the normalized score lies within \([0, 1]\). However, since each observation in our case is relevant for exactly one class, the best possible \(DCG(y)\) is \(1\) (the relevant class ranked first). Moreover, we typically write \(nDCG@k\) when we compute the nDCG score over only the top \(k\) predictions.
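The single-relevant-class case above makes nDCG@k particularly simple: it is \(1/\log_2(\text{rank} + 1)\) if the true class lands in the top \(k\), and \(0\) otherwise. A minimal sketch (the function name `ndcg_at_k` is my own, not from our codebase):

```python
import numpy as np

def ndcg_at_k(scores, true_label, k=5):
    """nDCG@k when each observation has exactly one relevant class.

    The ideal DCG is 1 (relevant class at rank 1), so the normalized
    score is 1 / log2(rank + 1) when the true label appears within the
    top-k predictions, and 0 otherwise.
    """
    # Rank classes by predicted score, highest first.
    top_k = np.argsort(scores)[::-1][:k]
    for rank, label in enumerate(top_k, start=1):
        if label == true_label:
            return 1.0 / np.log2(rank + 1)
    return 0.0

# Toy example: output probabilities over [talc, uvite, quartz].
probs = np.array([0.7, 0.3, 0.0])
ndcg_at_k(probs, true_label=1)  # uvite ranked 2nd: 1/log2(3) ≈ 0.63
```

Note that a score of exactly \(1\) means the correct class was the top suggestion.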
Interpretation of nDCG@k for mineral classification
The nDCG metric gives us much more information on how well our classifier is doing. In our use case, we typically show the user a ranking of possible mineral classes for the given image. It is likely that the user would look at the first few suggestions before moving on.
This means that we would like to have the correct mineral class within the top \(k\), say \(5\), suggestions. In our case, a non-zero nDCG@5 for an observation tells us that the correct mineral class is within the top 5 suggestions. This helps us with model selection during the training process.
Moreover, computing the average nDCG over observations with non-zero nDCG for a particular class tells us the average position of the correct mineral class when we do get it within the top \(k\) predictions. For example, if we get an average nDCG@5 of \(0.5\) for diamonds, then on average our classifier puts diamond at the third suggestion when it gets it within the top 5, since \(3 = 2^{1/0.5} - 1\) (inverting the DCG formula).
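The inversion is just solving \(s = 1/\log_2(\text{rank} + 1)\) for the rank. A tiny sketch (the helper name `implied_rank` is hypothetical; note also that since the transform is nonlinear, inverting an *average* score gives an approximate average rank, not an exact one):

```python
def implied_rank(avg_ndcg):
    # Invert s = 1 / log2(rank + 1)  =>  rank = 2 ** (1 / s) - 1.
    return 2 ** (1.0 / avg_ndcg) - 1

implied_rank(0.5)  # -> 3.0, i.e., third suggestion on average
```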
Lastly, we can use the metric to compute the precision and recall for getting the correct mineral class within the top \(k\) suggestions, by treating observations with non-zero scores as positives and those with zero scores as negatives.
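For instance, per-class recall@k under this definition is the fraction of a class's observations whose correct label lands in the top \(k\). A sketch of that computation, assuming a matrix of per-class scores (one row per observation); the helper names `topk_hits` and `recall_at_k` are my own:

```python
import numpy as np

def topk_hits(score_matrix, true_labels, k=5):
    """1 if the true class appears in the top-k predictions for a row
    (equivalently, that observation has non-zero nDCG@k), else 0."""
    top_k = np.argsort(score_matrix, axis=1)[:, ::-1][:, :k]
    return np.array([label in row for row, label in zip(top_k, true_labels)])

def recall_at_k(score_matrix, true_labels, cls, k=5):
    # Fraction of observations of class `cls` that are "positives",
    # i.e., whose correct label is within the top-k suggestions.
    true_labels = np.asarray(true_labels)
    mask = true_labels == cls
    return topk_hits(score_matrix[mask], true_labels[mask], k).mean()
```

Precision for a class follows analogously, restricting instead to the observations where that class appears among the top \(k\) suggestions.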