Our experiments verified the significant differences between transformer-based models and previous state-of-the-art machine learning models, including deep learning models. The low proportion of tweets representing abuse/misuse caused past classifiers to perform poorly on this class, obtaining F1-scores similar to those presented in Tables 1 and 2 and Figure 3. While some studies reported good overall accuracies or micro-/macro-averaged F-scores over all the classes, these metrics are primarily driven by the majority class(es) and do not reflect the performance of the approaches on the important non-majority class. From our experiments, it is evident that (i) transformer-based approaches considerably improve PM abuse detection (in our case, by approximately 20 points in terms of F1-score, as shown in Figure 3); and (ii) a fusion-based approach that combines the outputs of multiple transformer-based models is more robust in terms of performance than a single transformer-based method. The second finding is unsurprising, since past research involving traditional classifiers has shown that ensemble learning typically outperforms individual classifiers (55). The improved classification performance, however, comes at a heavy cost in computing time, particularly during training, as the hyper-parameters of each model need to be fine-tuned to optimize performance.
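As an illustration of the fusion idea, a minimal majority-vote sketch over the labels produced by individual transformer-based models might look as follows (the model names and predictions here are hypothetical, and the fusion strategy actually used in our experiments may differ):

```python
from collections import Counter

def fuse_predictions(per_model_preds):
    """Majority-vote fusion: for each tweet, take the label predicted by
    the most individual models (ties broken arbitrarily)."""
    fused = []
    for labels in zip(*per_model_preds):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

# Hypothetical outputs from three transformer-based classifiers over
# four tweets (classes: A = abuse, C = consumption, M = mention, U = unrelated)
bert_preds    = ["A", "C", "M", "U"]
roberta_preds = ["A", "A", "M", "C"]
albert_preds  = ["C", "A", "M", "U"]

print(fuse_predictions([bert_preds, roberta_preds, albert_preds]))
# -> ['A', 'A', 'M', 'U']
```

More elaborate fusion schemes (e.g., weighting each model's vote by its validation F1-score, or averaging class probabilities) follow the same pattern.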
Analysis of the effect of training size on model performance
We repeated the classification experiments by varying the size of the training set and evaluating each classifier on the same test set. Our intent was to determine how the performance of each classifier varied with training set size, particularly the rate at which performance increased as more training data was added. We drew stratified samples consisting of 25%, 50%, and 75% of all the training tweets, and computed the F1-scores over class A for all the classifiers.
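The stratified subsampling step can be sketched as follows (a minimal, standard-library-only illustration over a hypothetical toy dataset; our actual experiments used the full annotated corpus and library tooling):

```python
import random
from collections import defaultdict

def stratified_sample(texts, labels, fraction, seed=42):
    """Draw a stratified subsample: keep `fraction` of the examples of each
    class, so the subsample preserves the original class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    keep = []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)
        keep.extend(idxs[: max(1, round(fraction * len(idxs)))])
    keep.sort()
    return [texts[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy training set (A = abuse; C, M, U = other classes)
texts = [f"tweet {i}" for i in range(20)]
labels = ["A"] * 4 + ["C"] * 8 + ["M"] * 6 + ["U"] * 2

sub_x, sub_y = stratified_sample(texts, labels, fraction=0.5)
print(len(sub_x), sub_y.count("A"))  # half the data; class-A ratio preserved
```

Each classifier is then retrained on the 25%, 50%, and 75% subsamples and evaluated on the unchanged test set.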
Figure 4 shows the performances of the classifiers for the different training set sizes. The figure reveals two key findings: (i) even with very small training set sizes, transformer-based models consistently outperform the other models, and (ii) while some of the traditional machine learning models appear to hit a performance ceiling, transformer-based models appear to keep improving as more training data is added. Finding (i) illustrates the strength of such models, particularly in terms of their text representations, which enable relatively high performance even with small amounts of training data. Finding (ii) is promising, as it suggests that further improvements to classification performance are possible.
Post-classification content analysis
We classified 100k tweets using the developed model, and then performed content analysis by calculating term frequency–inverse document frequency (TF-IDF) scores, as shown in Figure 5. The high-TF-IDF terms give an overview of the contents of each class. Our objective was to verify whether the classification strategy actually managed to separate tweets representing different contents. For example, in the abuse chatter (class A), we see tweets reporting how users abuse PMs, such as taking more than the typical dosage (taking, mixing) or co-consuming PMs with other substances (whiskey), and the reasons for misuse, such as recreation (first-time, took-shit). The high-frequency topics in the PM consumption chatter (class C) indicate that users take the medications for medical symptoms (panic-attacks, mental health, nerve pain) or discuss side effects they experienced (side-effect). Interestingly, the dominant theme in the PM mention chatter (class M) relates to the opioid crisis (drug-legal, prescription-needed, purdue-pharma), along with a variety of other topics regarding the medications (powerful-prescription, college-grades, need-powerful) that require further analysis to understand. This surface-level topic analysis suggests that the classifier is indeed able to distinguish content associated with abuse or nonmedical use of the PMs.
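A minimal sketch of this class-level TF-IDF profiling, treating each class's concatenated tweets as a single document (the sample text below is hypothetical, and the actual analysis may have used a different tokenizer and weighting scheme):

```python
import math
from collections import Counter

def top_tfidf_terms(class_docs, k=3):
    """Rank terms per class by TF-IDF, treating each class's concatenated
    tweets as one document (a common trick for class-level term profiling)."""
    tokenized = {c: doc.lower().split() for c, doc in class_docs.items()}
    n_docs = len(tokenized)
    df = Counter()  # document frequency: in how many classes a term appears
    for toks in tokenized.values():
        df.update(set(toks))
    out = {}
    for c, toks in tokenized.items():
        tf = Counter(toks)
        scores = {t: (tf[t] / len(toks)) * math.log(n_docs / df[t]) for t in tf}
        out[c] = [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
    return out

# Hypothetical class-aggregated tweet text
docs = {
    "A": "mixing xanax whiskey took xanax first time mixing",
    "C": "xanax for panic attacks nerve pain side effects",
    "M": "purdue pharma prescription needed opioid crisis",
}
print(top_tfidf_terms(docs))
```

Terms shared across all classes receive zero weight, so each class's list surfaces the vocabulary distinctive to it.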
Challenges and possibilities for improving classification performance
Building on our promising experimental findings, we analyzed the errors made by our best-performing classifier in order to identify the common causes of errors (as shown in the confusion matrix in Figure 6) and to explore possible mechanisms by which performance may be further improved in future research.
The confusion matrix shows 581 misclassifications across all classes on the test data: 154 tweets (26.50%) were misclassified between the abuse and consumption classes, 223 tweets (38.38%) between the abuse and mention classes, 162 tweets (27.88%) between the consumption and mention classes, and the remaining 42 tweets (7.22%) between the unrelated class and all other classes. These percentages are similar to the disagreement percentages among human annotators on the full annotated dataset analyzed in our previous paper (33) (“29.80% disagreements between abuse or mention classes, 31.95% between abuse or consumption classes, 32.66% between consumption or mention, and 5.5% disagreements between unrelated classes and all other classes”). The results illustrate the robust performance of our model. The potential reasons for model misclassification errors can be interpreted as follows.
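The per-pair error counts are obtained by summing the symmetric off-diagonal cells of the confusion matrix; a small sketch over a hypothetical three-class matrix (the actual matrix is the four-class one in Figure 6):

```python
from itertools import combinations

def pairwise_errors(classes, cm):
    """Aggregate off-diagonal confusion-matrix counts into symmetric
    class-pair totals (rows = true class, columns = predicted class)."""
    idx = {c: i for i, c in enumerate(classes)}
    return {(a, b): cm[idx[a]][idx[b]] + cm[idx[b]][idx[a]]
            for a, b in combinations(classes, 2)}

# Hypothetical 3-class confusion matrix for illustration
classes = ["A", "C", "M"]
cm = [[90, 4, 6],
      [3, 80, 7],
      [5, 2, 70]]
print(pairwise_errors(classes, cm))
# -> {('A', 'C'): 7, ('A', 'M'): 11, ('C', 'M'): 9}
```

Dividing each pair total by the sum of all off-diagonal cells gives the misclassification shares reported above.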
Lack of complete context
While BERT-like models are better able to capture context-dependent meanings of words, they still lack the commonsense/pragmatic inference (56) capabilities of humans. When it comes to PM abuse/misuse detection, the mechanism of action of a medication is purely human knowledge and is not necessarily explicitly encoded in tweets. In the example below, a human annotator recognizes that Adderall® is used for performance enhancement by students, and hence can infer that the tweet represents possible misuse/abuse. However, due to the lack of explicit contextual cues, the machine learning classifier misclassifies the tweet as M (i.e., medication mention).
Example 1: @userX1 books, @userX2 Adderall, @userX3 the place.
While there is no established method for capturing commonsense knowledge, it may be possible to add further context to a given tweet by incorporating additional tweets posted by the same user (e.g., tweets posted before and after the given tweet).
Absence of deep connections within the context
Humans can generalize and observe connections deeper than those explicitly presented in the context. They connect the words of a sentence with the speaker's social group, culture, age, and life experience to obtain a full understanding, which might be difficult even for humans from different backgrounds, ages, or social groups. In examples 2 and 3, while the annotators were able to connect the relationship between Adderall® and laundry, and between a movie (Harry Potter in this case) and Xanax®, our classifiers failed to classify them correctly.
Example 2: laundry and adderall go side by side
Example 3: we see harry potter and pop Xanax
Such tweets represent extremely difficult cases for the classifiers, particularly if there are no similar tweets annotated in the training set. In addition to annotating more data, which is a time-consuming yet consistently effective method for improving machine learning performance, future research may try to incorporate more user-level information (e.g., details from the user's profile, followers, followees, etc.) to improve performance.
Influence of the pretraining dataset
The BERT model used in this study was pre-trained using a corpus of books (800M words) and English Wikipedia (2,500M words) (30), and the other models were pre-trained on similar datasets. However, social media language differs from that in books or Wikipedia, and in past research using word2vec, we observed that word embedding models trained specifically on social media data improved system performance for social media text classification tasks (57). Therefore, conducting pre-training using large social media datasets may help improve performance further.
Users’ network influence on understanding the context
Beyond the issue of language context, social media network features, such as mentioning or retweeting, can affect the meaning of sentences. Consequently, capturing the meaning and the context only from the text is challenging. For example, writing content inside quotation marks implies that the statement is not the user’s and is quoted from another user:
Example 4: “someone send me xanax i'll pay”
Example 5: someone mail me xanax i'll pay
All BERT-based models classified both examples as abuse. However, example 4 was considered nonabuse by the human annotators because it was enclosed in quotation marks and merely reported what somebody else said, whereas example 5 was considered abuse because it represented the user's own statement. The misclassification possibly occurred because quoted statements appear much less frequently in the training dataset than general statements. In scenarios where a pattern is evident but rarely represented in the training data, incorporating a simple rule that guides the algorithm in such situations can improve performance.
This study only considered textual content in tweets; it did not consider prescription drug information within images or PM text presented as images. Future studies may investigate this direction.
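As an illustration, the quotation-mark rule discussed above could be sketched as a post-processing heuristic (the demotion target class here is an assumption for illustration, not part of our pipeline):

```python
def is_fully_quoted(tweet):
    """Heuristic: a tweet whose entire text sits inside straight or curly
    quotation marks likely reports someone else's words, not the user's own."""
    t = tweet.strip()
    return len(t) > 1 and t[0] in '"\u201c' and t[-1] in '"\u201d'

def apply_quote_rule(tweet, model_label):
    # Hypothetical post-processing: demote a fully quoted "abuse" prediction
    # to the mention class, since the user is citing someone else.
    if model_label == "A" and is_fully_quoted(tweet):
        return "M"
    return model_label

print(apply_quote_rule('"someone send me xanax i\'ll pay"', "A"))  # -> M
print(apply_quote_rule("someone mail me xanax i'll pay", "A"))     # -> A
```

Such a rule would fire regardless of how rare quoted statements are in the training data, which is exactly the situation where learned models struggle.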