A second look at rock identification requests on the whatsthisrock subreddit
This is the second post about my ongoing efforts to improve the deep learning part of our mineral identification web app. It picks up where the previous post left off.
TL;DR:
This post summarizes how often minerals and rocks are mentioned in the comments on posts. It also shows the most frequently mentioned official minerals in both the submissions and the comments of the whatsthisrock subreddit.
Top minerals and rocks mentioned in comments:
The comments on submissions posted from 2016 through 2019-12-27 were processed to compile the following plots. In total, 1042 of 43932 minerals were mentioned. We show the top 100 minerals below. Next, 140 of 1432 rocks were mentioned. We show the top 100 rocks on a log scale below.
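A minimal sketch of how this kind of mention counting could be done, assuming exact, case-insensitive, whole-word matching against a list of names. The function name `count_mentions` and the sample comments are hypothetical and are not the code or data behind the plots.

```python
import re
from collections import Counter

def count_mentions(comments, names):
    """Count how many comments mention each name (case-insensitive, whole words)."""
    # Longest names first, so a multi-word name wins over a shorter name inside it.
    ordered = sorted({n.lower() for n in names}, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    counts = Counter()
    for text in comments:
        # A set, so one comment contributes at most one mention per name.
        counts.update(set(pattern.findall(text.lower())))
    return counts

# Hypothetical sample data, not taken from the subreddit.
comments = [
    "Looks like quartz to me, maybe with some pyrite.",
    "Quartz, definitely. Quartz is everywhere.",
    "Could be granite.",
]
minerals = ["quartz", "pyrite", "calcite"]

mineral_counts = count_mentions(comments, minerals)
top = mineral_counts.most_common(100)  # the "top 100" view used for a plot
```

Counting each name once per comment is a design choice in this sketch; counting every occurrence instead would only require dropping the `set(...)` call.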
Top official minerals mentioned in submissions and comments:
We can also filter the mineral mentions by the list of official minerals. There are 5554 official minerals in the IMA list. Filtering this way significantly reduces the number of mentioned minerals. As shown below, 70 unique official minerals were mentioned as identified in the submissions. In the comments, 385 minerals were mentioned; we show the top 100 below. Both plots use a log scale.
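The filtering step can be sketched as a set lookup, assuming the mention counts are already available as a mapping from name to count. The function name `filter_official`, the counts, and the three-name list standing in for the IMA list are all made up for illustration.

```python
from collections import Counter

def filter_official(counts, official_names):
    """Keep only the counts whose name appears in the official list."""
    official = {name.lower() for name in official_names}
    return Counter({name: n for name, n in counts.items() if name.lower() in official})

# Hypothetical counts and a tiny stand-in for the IMA list.
all_counts = Counter({"quartz": 120, "agate": 80, "pyrite": 40, "jasper": 15})
ima_names = ["Quartz", "Pyrite", "Calcite"]

official_counts = filter_official(all_counts, ima_names)
top_official = official_counts.most_common(100)
```

In this toy example "agate" and "jasper" drop out because they are variety names rather than entries in the stand-in list, which is the kind of reduction the filtering produces.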
Limitations:
The statistics presented above have several limitations that stem from NLP problems:
- Misspellings: I didn’t handle any possible misspellings, which means that the statistics are probably lower-bound estimates.
- Negative/positive mentions: A mention could be positive or negative with respect to a particular mineral identification request. For example, a comment that names a mineral could be ruling that mineral out rather than suggesting it: “This is definitely not a diamond!”. Making this fine-grained distinction would require better NLP techniques.
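Both limitations could be approached with simple heuristics before reaching for heavier NLP. The following is a rough sketch, not something used for the statistics above: it assumes single-word names, uses the standard library's `difflib` for fuzzy matching, and treats a mention as negated if a cue word appears within a few tokens before it. The cue list, the `cutoff` of 0.8, and the `window` of 3 are arbitrary assumptions.

```python
import difflib
import re

# Assumed, deliberately small list of negation cues.
NEGATION_CUES = {"not", "no", "never", "isn't", "doesn't", "can't", "cannot"}

def match_name(token, names, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None."""
    known = [n.lower() for n in names]
    hits = difflib.get_close_matches(token.lower(), known, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def classify_mentions(comment, names, window=3):
    """Return (name, is_negated) for each single-word mention in a comment."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    results = []
    for i, token in enumerate(tokens):
        name = match_name(token, names)
        if name is None:
            continue
        # Look at the few tokens just before the mention for a negation cue.
        preceding = tokens[max(0, i - window):i]
        results.append((name, any(t in NEGATION_CUES for t in preceding)))
    return results

names = ["diamond", "pyrite", "quartz"]
negated = classify_mentions("This is definitely not a diamond!", names)
misspelled = classify_mentions("Looks like pyrit to me", names)
```

A fixed window like this will misjudge sentences such as "not sure, but quartz", and fuzzy matching can pull in unrelated words, so the output would still need manual spot checks.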