Datasets and Preprocessing
As mentioned, in this study we will use two pre-coded datasets. The first dataset stems from a project on the framing of traditional Chinese medicine in non-Chinese media. We will refer to it as the TCM dataset in the following. The TCM dataset comprises 253 English-language articles. On average, each article contains 586.4 words, yielding a total of 148,369 words. However, article length varies considerably, with the smallest article comprising just 35 words and the largest 2,430. Each article was rated by two research assistants in parallel. A variety of variables were assessed, among them the seven generic frames we are focusing on. Crucially, research assistants were instructed to select one (and only one) frame for each article.
The second dataset was constructed in a research project on international reporting of the Russia-Ukraine war, which is why we will refer to it as the UKR dataset. The UKR dataset comprises 100 English-language articles from English, American, and Chinese newspapers. The original dataset also contained articles in Chinese and Russian, which are not included in this analysis. On average, each article contains 1,114 words, making a total of 111,402 words in the whole dataset. Again, article length is quite heterogeneous, ranging from 38 to 5,815 words. The articles were rated by a team of seven research assistants. A small fraction of the articles (only seven in this study) was rated by all seven research assistants in parallel; the remaining articles were each rated by a single research assistant. Again, a wide range of variables was applied to each article, including the seven generic frames. Different from the TCM dataset, research assistants were instructed to select one or several generic frames for each article, as many as they felt applied. On average, 2.6 frames were indicated. The consequences of this specific difference between the TCM and UKR datasets will be discussed further below. In both datasets, we exclude the headline, short summary, and the name of the author from each article. We also exclude other irrelevant text, such as location (e.g., ‘KIEW –’), page number, page headers and footers, image descriptions, etc. Thus, the texts used as input for the LLM exclusively consist of the body of the articles.
Figure 1 shows the distribution of the generic frames in the two datasets. It can easily be seen that both datasets are quite unbalanced, with some frames being many times more prevalent than others. The TCM dataset is dominated by stories with human interest and factual information frames, which account for a combined 70 percent of texts. Due to the multi-label nature of the UKR dataset, percentages add up to more than 100 percent, but it can easily be seen that it, too, is dominated by two frames, in this case conflict (70%) and leadership (62%). This unbalanced distribution would be an additional challenge for supervised learning, but not necessarily for the zero-shot approach we follow here.
Performance Metrics
In this study, we aim to use an NLP system to automatically detect seven generic frames in news article texts. This problem can be formalized in two different ways, namely as a multi-class or a multi-label classification problem. In multi-class classification, each instance (in our case, text) is classified into a single category (in our case, a single generic frame). In contrast, in multi-label classification, each instance may be classified as having one or several labels. These two options correspond to the different ways in which our datasets were generated (see above): the TCM dataset was coded as a multi-class classification task, whereas the UKR dataset was coded as a multi-label task. This difference in task structure will be reflected in the setup of our automated classification approach.
The first point in which the difference between multi-class and multi-label classification is reflected is the evaluation of our classification results. Statisticians and computer scientists have developed a wide variety of approaches to evaluating the quality of automated classification. All of them are based on a comparison between the automated classification and the ‘gold standard’ of human annotation. As the use of the term ‘ground truth’ for human-annotated data shows, computer scientists often take a rather uncritical approach to the quality of these data. Only recently has there been a growing realization that the performance a machine-learning (ML) system can achieve in classification tasks is capped by the quality of human annotations. In other words, if human raters do not even agree with each other (or worse, make their annotations semi-randomly), how can an ML system achieve perfect agreement? To reflect this issue, we will compare the performance of our classification approach to the agreement among our human raters.
All performance metrics are based on the analysis of a so-called confusion matrix. A confusion matrix plots the predictions of the ML system against the ‘ground truth’—in our case, the annotations of our research assistants. In case of a simple, binary classification problem, the confusion matrix consists of four boxes or cells: 1) true positives (TP): these are cases in which the model predicted the positive class (presence of a frame) correctly, meaning in correspondence with human raters; 2) true negatives (TN): these are cases in which the model predicted the negative class (absence of a frame) correctly; 3) false positives (FP), or type I errors: these are cases in which the model incorrectly predicted the positive class; and 4) false negatives (FN), or type II errors: these are cases in which the model incorrectly predicted the negative class.
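To make the four cells concrete, the following minimal Python sketch (assuming scikit-learn is available; the labels and predictions are invented for illustration) computes a binary confusion matrix for the presence or absence of a single frame:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical example: 1 = frame present, 0 = frame absent
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 'ground truth' from human raters
model_labels = [1, 0, 0, 1, 1, 0, 1, 0]  # predictions of the ML system

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(human_labels, model_labels).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```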
Accuracy, representing the ratio of correctly predicted observations (TP + TN) to total observations (TP + TN + FP + FN), is undeniably the most rudimentary and comprehensible evaluative metric for classification models. However, its effectiveness is noticeably diminished when applied to datasets with imbalanced classes. The reason for this is that, if some classes have a very high or very low proportion of positive cases (i.e., texts that belong to a specific frame), it becomes possible to achieve high accuracy simply by always predicting the most prevalent class. For example, if 90 percent of texts in the TCM dataset were instances of the human interest frame, an ML system could achieve 90 percent accuracy simply by always classifying texts as human interest, irrespective of any properties of the text. Thus, while accuracy is certainly an informative metric, it needs to be complemented by other metrics to gain further insights into the classification results.
Recall, otherwise known as the true positive rate or sensitivity, quantifies the ability of the model to predict positive instances correctly (TP / (TP + FN)). It is defined as the fraction of positive cases (in our case, texts that have been classified by human raters as having a specific frame) that have been classified by the ML system as positive. This is particularly crucial in contexts where the repercussions of overlooking a positive instance are considerable, such as in medical diagnoses.
Precision, synonymous with the positive predictive value, is determined as the ratio of true positives to the sum of true and false positives (TP / (TP + FP)). In our context, it is the ratio of texts that have been classified by human raters as having a specific frame among texts that have been labeled by the ML system as having this frame. Intuitively, precision gauges the capacity of the classifier to refrain from erroneously labeling a negative sample as positive.
It is easy to see that sensitivity and precision are difficult to optimize at the same time. An ML system with a tendency to classify ambiguous instances as positive will tend to have higher sensitivity (it will ‘catch’ more positive instances), but it will also tend to make more mistakes in classifying instances as positive, which will decrease its precision.
The F1 score combines precision and recall into a single metric. It is defined as their harmonic mean, i.e., twice the product of precision and recall, divided by their sum. Thus, a high F1 score (close to 1) can only be reached if both precision and recall are high. In machine learning, the F1 score is often preferred to accuracy as a more reliable metric, especially in instances of uneven class distribution. High precision combined with low recall means that the classifier rarely mislabels negative instances as positive, but overlooks many difficult-to-classify positive instances. Conversely, high recall but low precision implies that the classifier, despite identifying the majority of positive instances, also wrongly classifies numerous negatives as positives.
Lastly, Cohen’s kappa serves as an efficacious metric, particularly when dealing with imbalanced classes. Conventionally utilized as a measure of agreement between two human raters, it is also employed in machine learning as a comparison between a classifier and a simplistic baseline classifier. Cohen’s kappa compares the observed accuracy with the accuracy expected by random chance. Because it takes into account the possibility of agreement occurring by chance, it is a more robust measure than accuracy. The kappa score ranges between -1 and +1: a kappa score of +1 represents perfect agreement among the raters; a score of 0 implies that the agreement is equivalent to random chance; a score of -1 denotes total disagreement among the raters. It is worth noting, though, that high values of kappa are difficult to reach for unbalanced categories: since chance agreement is already high in such cases, a high kappa requires almost perfect agreement.
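As a minimal illustration (again assuming scikit-learn, with invented labels), all of the metrics discussed above can be computed as follows:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(human_labels, model_labels))
print("precision:", precision_score(human_labels, model_labels))
print("recall   :", recall_score(human_labels, model_labels))
print("F1       :", f1_score(human_labels, model_labels))
print("kappa    :", cohen_kappa_score(human_labels, model_labels))
```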
For multi-label classification problems like in our UKR project, we obtain one binary confusion matrix for each generic frame. This means we have to compute each metric described above separately for each generic frame. If we want to get an impression of the overall performance of our model, we have to combine these numbers, e.g., by taking their arithmetic mean. In contrast, for multi-class classification problems like in the TCM project, we obtain a single confusion matrix, but with the number of rows and columns equal to the number of generic frames. In this case, we compute the quality metrics only once for the whole matrix.
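Assuming the human annotations and model predictions are stored as binary indicator arrays (texts × frames) in the multi-label case and as single frame labels per text in the multi-class case, the two evaluation setups might look as in this sketch (scikit-learn, with placeholder data):

```python
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

frames = ["human interest", "responsibility", "morality",
          "economic consequences", "conflict", "leadership", "factual information"]

# Multi-label case (UKR): one binary column per frame
rng = np.random.default_rng(0)
human_ml = rng.integers(0, 2, size=(100, 7))   # placeholder annotations
model_ml = rng.integers(0, 2, size=(100, 7))   # placeholder predictions
per_frame_kappa = {frame: cohen_kappa_score(human_ml[:, i], model_ml[:, i])
                   for i, frame in enumerate(frames)}
macro_f1 = f1_score(human_ml, model_ml, average="macro")
print(per_frame_kappa, macro_f1)

# Multi-class case (TCM): exactly one frame label per text
human_mc = rng.choice(frames, size=253)        # placeholder annotations
model_mc = rng.choice(frames, size=253)        # placeholder predictions
print("global kappa:", cohen_kappa_score(human_mc, model_mc))
```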
To conclude, each of these evaluative measures caters to different aspects of model performance. Hence, their application is contingent on the specific requirements of each unique case. For example, when conducting a framing analysis, researchers have to determine whether their data correspond to a multi-label or multi-class classification problem. Furthermore, they need to decide whether all frames are of equal importance or whether more frequent frames are more relevant, and whether false negatives are more, less, or equally relevant for their analysis in relation to false positives.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model for NLP tasks that has significantly advanced the field of NLP. Developed by researchers at Google AI Language, it was open-sourced in late 2018. BERT is based on the transformer architecture, which uses an attention mechanism to learn contextual relations between words in a text. For example, BERT can learn that the meaning of the word ‘mouse’ changes radically depending on whether the context is computers (“I can’t click on this without a mouse”) or small animals (“I saw a mouse in the basement”). Likewise, it can learn that different words like ‘insult’ and ‘offense’ may have similar meanings. Depending on the context, BERT will represent the same token ‘mouse’ using different vector embeddings. Unlike previous models such as Long Short-Term Memory Networks (LSTMs), BERT is bidirectional. Conventional models process words in a sequence, either left-to-right or right-to-left. In contrast, BERT processes word context from both directions simultaneously. This bidirectional approach allows the model to gain a comprehensive understanding of the word context, resulting in more accurate predictions.
BERT is pre-trained on a large corpus of unlabeled text data collected from Wikipedia and book corpora. This pre-training step is unsupervised; it involves predicting words in a sentence (masked language modeling) and predicting sentence order (next sentence prediction). The first task trains BERT to learn the context of words, while the second task helps it to learn relationships between sentences. Through this pre-training, BERT acquires much more sophisticated knowledge of language than previous models, including syntactic regularities, information on the meaning of words in different contexts, and semantic relations between words (e.g., synonyms, antonyms). Thus, once pre-trained, BERT constitutes an ideal foundation for more specific applications, such as question answering, named entity recognition, sentiment analysis, or topic modeling. When applied to these tasks, the body of the LLM is preserved, but instead of focusing the model on masked language modeling and next sentence prediction, it is equipped with a new task head that is tailored to the specific application. With this new task head in place, it can be fine-tuned in a supervised manner on these specific tasks. This can be done faster and with smaller amounts of labeled data than would be necessary if the task were learned from scratch. The process of utilizing the language knowledge gained by unsupervised training of an LLM for different applications is also known as deep transfer learning. It has been shown that fine-tuning models with prior language knowledge to new tasks is significantly more efficient than training models from scratch (Iman et al., 2023; Ruder, 2019; Tan et al. 2018).
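To make the idea of swapping the pre-training objectives for a task-specific head concrete, the following sketch (using the Hugging Face transformers library; the checkpoint name and number of labels are illustrative assumptions) loads a pre-trained BERT encoder and attaches a freshly initialized classification head that could then be fine-tuned on labeled data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder; the classification head on top is newly initialized
# and would have to be fine-tuned on task-specific labeled data.
model_name = "bert-base-uncased"          # illustrative choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=7,                         # e.g., one class per generic frame
)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
outputs = model(**inputs)                 # logits from the (as yet untrained) head
print(outputs.logits.shape)               # -> torch.Size([1, 7])
```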
Natural Language Inference
One area for which BERT (and other language models) can be fine-tuned is known as natural language inference (NLI; MacCartney, 2009). NLI, also known as recognizing textual entailment (RTE), is a subtask in NLP that involves determining whether a given piece of text, also called the hypothesis, can be logically inferred from another text, the premise. In essence, NLI is about understanding the relationship between two texts. Three types of relationship exist: entailment, contradiction, and neutral. If the hypothesis is a logical implication of the premise, it is considered an entailment. If the premise and hypothesis convey opposing information, it is a contradiction. If the premise and the hypothesis are unrelated or the relationship is unclear, it is deemed neutral. NLI is considered a high-level task for LLMs: to solve NLI tasks, the model not only has to be able to deal with syntax and semantics, but also needs world knowledge—i.e., information on factual relations between concepts.
The BERT-NLI Model
For our analyses, we will use the BERT-NLI model, which has been developed by Laurer et al. (2024). As the name implies, the model was created by fine-tuning an advanced version of BERT, DeBERTaV3 base (He et al., 2021), on a variety of NLI tasks. More specifically, the model was fine-tuned on 1.2 million human-annotated premise-hypothesis pairs coming from eight different NLI datasets. Laurer et al. (2024) used a simplified version of the NLI task structure (see above), in which only the distinction between entailment and non-entailment (encompassing both contradiction and neutral) is considered. They found that the model reaches peak performance after only a few passes over the training dataset. This shows how much the model benefits from the basic language knowledge of DeBERTa, acquired during its pre-training.
Fine-tuning an LLM in this way has the advantage that NLI is a universal task, meaning a large variety of problems can be reformulated as NLI tasks. For example, text classification tasks can be turned into NLI problems by formulating a hypothesis like “the text is about sports” or “the text is about politics”, and using the text we want to classify as the premise. This is how Laurer et al. (2024) proceed. They apply their fine-tuned model to a variety of tasks from political science research, such as deciding whether sentences from party manifestos express a stance for or against the military, protectionism, or traditionalism, classifying US State of the Union speeches and Supreme Court cases according to policy topic, and classifying news texts as positive or negative about the economy. For each of these tasks, Laurer et al. (2024) develop hypothesis statements based on the categories in the original codebooks that were used in the generation of these datasets. For example, for analyzing the stance on the military represented in party programs, the hypotheses are “The quote is positive towards the military” and “The quote is negative towards the military”; for classifying news articles into topics, one hypothesis is “The quote is about economy, or technology, or infrastructure, or free market”. This process, called label verbalization, constitutes a significant difference between NLI models and classic text classification. In classic supervised machine learning, the algorithm is trained purely on numeric ‘ground truth’ labels, which only tell it which texts belong in which category. The actual differences between these categories need to be inferred by the algorithm from scratch. In contrast, the hypotheses in NLI actually inform the language model about the meaning of the different categories. This makes training an NLI system much more akin to training a human rater. Laurer et al. (2024) validate their BERT-NLI LLM in two distinct ways: firstly, in a zero-shot approach without any further fine-tuning on the task dataset; and secondly, by fine-tuning the model on ever larger batches of task-specific training data. Depending on the task, the zero-shot approach yields F1 values between 0.1 and 0.65, with an average of around 0.38. Further training increases the F1 drastically, all the way up to an average of around 0.75, given a training set of 2,500 data points. Over most tasks and for most training sample sizes, BERT-NLI outperforms a standard BERT text classification model (not fine-tuned on NLI data).
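In practice, this zero-shot reformulation is conveniently exposed through the zero-shot classification pipeline of the Hugging Face transformers library. The sketch below (the model identifier is an assumption; any NLI-fine-tuned checkpoint can be substituted) shows how candidate labels are verbalized into hypotheses via a template:

```python
from transformers import pipeline

# Any NLI-fine-tuned checkpoint can be used here; the identifier below is an
# illustrative assumption, not necessarily the exact model used in this study.
classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

text = "The government announced new tariffs on imported steel on Monday."
labels = ["economy", "sports", "weather"]

result = classifier(text,
                    candidate_labels=labels,
                    hypothesis_template="The text is about {}.",
                    multi_label=True)
print(result["labels"], result["scores"])
```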
To sum up, the BERT-NLI LLM has several features that fit the requirements of our project very well. 1) It is based on the BERT LLM, which enables it to capture the meaning of words, phrases, and whole documents, rather than just the frequencies of words, as was the case in traditional NLP methods. 2) It is fine-tuned on large NLI datasets, which enables it to judge whether a statement (hypothesis) is justified, given a text (premise). By formulating appropriate hypotheses, this ability can be used to classify statements in many different ways (e.g., topic, sentiment, political stance, etc.). 3) Laurer et al. (2024) have demonstrated that BERT-NLI can reach high levels of performance on different political science datasets, which makes it likely that it can also be applied to other social science tasks. However, there are two major obstacles that make it challenging to apply BERT-NLI to our datasets. In the following, we will discuss them one at a time.
Challenges
Abstract variables. One of the main challenges in our project is the relatively abstract nature of the variables we are trying to detect, which is also a key issue in identifying frames (Guo et al., 2023; Nicholls and Culpepper, 2021). Traditionally, NLP has been focused on topic modeling, i.e., the categorization of a given text into one of a number of pre-defined topics. For example, news articles might be categorized into foreign news, sports, weather, society, economy, etc. While not exactly trivial, the difficulty of this task is somewhat attenuated by the fact that certain words or phrases are highly correlated with specific topics but not with others. If an article contains the terms ‘storm’, ‘rain’, ‘high pressure area’, and ‘tomorrow’, we can safely assume that it falls into the category ‘weather’. This close relationship between categories and phrases does not exist to the same degree for generic frames. An article written under a leadership frame might have a higher likelihood of containing words like ‘leader’ or ‘leadership’, but it is entirely possible to write a leadership-framed article without ever mentioning these or other pertinent phrases. This issue is also acknowledged by Nicholls and Culpepper (2021), who assert that computational methods largely return definite topics but not frames. Furthermore, while specific phrases might be correlated with a generic frame in the context of one topic (say, traditional Chinese medicine), they may no longer be correlated with the same frame in a different context (the Ukraine War, for example). For these reasons, frame detection is extremely challenging for traditional, word frequency-based NLP techniques.
As mentioned, the new generation of large language models goes beyond analyzing word frequencies. By taking the context into account, these models manage to extract the semantic content of a text. Thus, it is plausible to assume that they are better suited to the task of identifying generic frames. However, if we look at the variables in the political science datasets used by Laurer et al. (2024) to evaluate their BERT-NLI model, they still seem rather concrete in comparison to our generic frames. In particular, they mostly concern what is being said and not how the story is told, as is the case for the generic frames.
Input length constraints. As described above, the NLI model used by Laurer et al. (2024) is based on BERT. As such, it has the same constraints regarding the maximum size of textual input that the model can handle, namely 512 tokens. A token is a piece of text that results from tokenization—i.e., the process of segmenting a text into units that an LLM can process. Often, tokens correspond to complete words, especially if they are simple and occur frequently in the training corpus of the LLM (like ‘house’). Infrequent compound words like ‘tokenization’, however, are cut into their more frequent constituent parts - ‘token’ and ‘ization’ in this case. If these tokens are contained in the vocabulary of the LLM, they are then translated into an index number; so ‘token’ and ‘ization’ might become 183,372 and 457, for example. These index numbers, and not the words or tokens themselves, constitute the input of the LLM.
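The following sketch (using the Hugging Face tokenizer for an illustrative BERT checkpoint; the exact subword splits and index numbers depend on the model's vocabulary) shows how a text is turned into tokens and then into the index numbers the LLM actually receives:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

text = "Tokenization segments a text into units."
tokens = tokenizer.tokenize(text)   # frequent words stay whole, rare words are
print(tokens)                       # split into subword pieces

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)                          # the index numbers passed to the LLM

# Inputs longer than the model's limit (512 tokens for BERT) are truncated:
encoded = tokenizer(text, truncation=True, max_length=512)
print(encoded["input_ids"][:10])
```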
On average, the texts in our datasets contain 753.5 (TCM) and 1,372.7 (UKR) tokens, significantly exceeding the input limit of BERT and BERT-NLI. Thus, if we were to pass our articles to the LLM as a whole, they would be truncated, meaning they would be cut off after the 512th token. In consequence, we would lose large chunks of most of the articles in our dataset, which potentially contain relevant information about the frames. However, even the 512-token limit is optimistic: the premises in the eight textual entailment datasets on which BERT-NLI has been fine-tuned, as well as the political science datasets on which it has been tested, typically comprise only a few sentences. For example, the premises in the MultiNLI dataset are on average only 22.3 words long, and the hypotheses only 11.4 words (Williams et al., 2017). In contrast, our articles on average consist of 586.4 (TCM) and 1,114.1 (UKR) words. Based on this, it stands to reason that BERT-NLI should achieve the best performance when given rather short input texts as the premise. In the next section, we will discuss how we tackled these challenges.
Hypothesis Formulation
In contrast to Laurer et al. (2024), who usually formulate only a single hypothesis for each variable they assess, we formulate three hypotheses for each of the seven generic frames we analyze. This allows us to cover a wider range of aspects of each generic frame without formulating hypotheses that are too lengthy and complicated, and which might thus be challenging for the model to process. We developed these statements based on the technical definitions of the generic frames (see above), as well as the operationalization of the frames in the codebooks of our two studies. In accordance with the recommendations of Laurer et al. (2024), we tried to keep each statement as short and simple as possible, using everyday language instead of technical or academic terms whenever possible. The statements can be seen in Table 1.
Table 1. Generic frames and associated hypothesis statements
Human interest
- “The quote tells a person's story”
- “The quote talks about someone's own experiences”
- “The quote includes personal words that make you feel emotions”

Responsibility
- “The quote is about a government being responsible for an issue”
- “The quote is about an individual being responsible for an issue”
- “The quote is about a group being responsible for an issue”

Morality
- “The quote talks about the ethics related to a problem”
- “The quote talks about the moral aspects of a problem”
- “The quote is about morality”

Economic consequences
- “The quote is about the economic consequences of an event”
- “The quote is about the economic consequences of a decision”
- “The quote is about economic impact”

Conflict
- “The quote is about a conflict”
- “The quote points out different opinions between people”
- “The quote stresses the 'us against them' story”

Leadership
- “The quote is about a leader”
- “The quote talks about how leaders affect things”
- “The quote is about a government”

Factual information
- “The quote contains only factual information”
- “The quote is written in an objective way”
- “The quote discusses an issue in a purely factual manner”
Developing these statements can be seen as analogous to the development of scales in questionnaire design. Typically, an abstract construct like ‘authoritarianism’ will not be measured with a single item but with a number of different items, which cover different aspects of the construct and, in combination, make the measurement more reliable.
We decided to use the same hypothesis statements for both of our datasets. However, it might be advantageous to think of ways in which these statements can be adapted further to the specific context of a study, especially if the researcher has a priori information about how the frames are predominantly expressed in this specific context. On the other hand, making the hypothesis statements too specific can cause more unusual and unexpected manifestations of the frames to be overlooked. Since we formulate several hypothesis statements for each generic frame, the model will give us separate predictions for each of these statements. This raises the question of how these predictions can be combined to obtain a single prediction for the frame. We will address this issue further below.
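For later processing, it is convenient to keep the hypothesis statements from Table 1 in a simple data structure, e.g., a Python dictionary mapping each generic frame to its three statements; the sketch below shows the idea for two of the frames:

```python
# Hypothesis statements per generic frame (see Table 1; only two frames shown here)
HYPOTHESES = {
    "human interest": [
        "The quote tells a person's story",
        "The quote talks about someone's own experiences",
        "The quote includes personal words that make you feel emotions",
    ],
    "conflict": [
        "The quote is about a conflict",
        "The quote points out different opinions between people",
        "The quote stresses the 'us against them' story",
    ],
    # ... the remaining five frames follow the same pattern
}
```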
Text Chunking
As described above, the texts used as input for the LLM consist exclusively of the body of the articles. As discussed, these texts are significantly longer than the maximum input window of BERT, which in turn is much larger than the average length of the texts on which BERT-NLI was fine-tuned and validated (Laurer et al., 2024). For this reason, we decided to cut each text in our dataset into chunks of one or several sentences, using the English sentence tokenizer from the Python NLTK package. By choosing whole sentences as constituent parts, we avoid confusing the language model with statements that end abruptly mid-sentence. By varying the size of these chunks from one to fifteen sentences, we can compare which chunk size yields the best outcomes. Given that the average numbers of sentences per text in our datasets are 21.1 (TCM) and 46.8 (UKR), this almost always leads to a text being split into several chunks. If a text contains fewer sentences than the chunk size, it is treated as a single chunk. To validate that chunking does in fact improve the outcomes, we compare the results with a model to which we fed the whole texts, truncated to the maximum input window.
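A minimal sketch of this chunking step, assuming NLTK and its 'punkt' sentence tokenizer are installed, might look as follows:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the English sentence tokenizer


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split a text into chunks of `chunk_size` whole sentences."""
    sentences = sent_tokenize(text, language="english")
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]


article = "First sentence. Second sentence. Third sentence. Fourth sentence."
print(chunk_text(article, chunk_size=2))
# -> ['First sentence. Second sentence.', 'Third sentence. Fourth sentence.']
```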
Analyzing the Text Chunks with BERT-NLI
After dividing each text in our datasets into text chunks, we add the phrase ‘The quote: "’ to the beginning of each text chunk in our dataset, and the phrase ‘" - end of the quote’ to the end, as recommended by Laurer et al. (2024). This allows the language model to connect the text of our chunks to our hypotheses, which all use the term ‘the quote’ when referring to the premise (text chunk) under scrutiny (see Table 1). We then tokenize the texts using the tokenizer paired with BERT-NLI, and pass the token indices to the BERT-NLI LLM. Likewise, we tokenize each of the 21 hypothesis statements (see Table 1). The model then analyzes each text chunk together with each of the hypothesis statements and returns a probability (from 0 to 1), reflecting whether the model believes that the hypothesis is entailed by the text chunk. We will refer to these numbers as the model ratings to distinguish them from human ratings. Thus, if a text has been divided into four chunks, we obtain 4 x 21 = 84 separate model ratings. In contrast, our datasets only contain human ratings of the seven generic frames for the text as a whole. The challenge is now to aggregate the 84 probability values in such a way that they can be compared to the seven frame ratings.
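A sketch of how a single chunk-hypothesis pair could be scored with an NLI model via the transformers library is shown below (the model identifier is an assumption, and the position of the entailment label in the output depends on the specific model's configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative NLI checkpoint; not necessarily the exact model used in this study
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = 'The quote: "Fighting intensified around the city overnight." - end of the quote'
hypothesis = "The quote is about a conflict"

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# Which output index corresponds to 'entailment' depends on the model's config
entail_id = model.config.label2id.get("entailment", 0)
print("entailment probability:", probs[entail_id].item())
```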
Re-aggregating Outcomes
The re-aggregation of the model ratings involves two steps: 1) aggregating the ratings of the three hypothesis statements into a rating of the generic frame (one for each text chunk); and 2) aggregating the ratings for each text chunk into a rating for the whole text. For step 1), we simply take the arithmetic mean across the model ratings of the three hypothesis statements belonging to a specific generic frame. Step 2) is more complex: if we averaged the model ratings over all the chunks belonging to a specific text, the signal might get drowned out. This is because, even if a text has a certain generic frame, we do not assume that each chunk in this text reflects this fact equally. Some chunks will receive high model ratings, but many more might receive ratings close to zero. We avoid this by using the maximum model rating across chunks instead of the average. We compared this procedure to other possible aggregation schemes (e.g., taking the maximum or the average across both text chunks and hypothesis statements) and found it to perform best.
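Assuming the raw entailment probabilities for one text are stored in an array of shape (number of chunks, number of frames, number of hypotheses per frame), the two aggregation steps reduce to a mean and a maximum along the appropriate axes, as in this sketch:

```python
import numpy as np

# Hypothetical ratings for one text: 4 chunks x 7 frames x 3 hypotheses
rng = np.random.default_rng(0)
ratings = rng.random((4, 7, 3))

# Step 1: average over the three hypothesis statements of each frame
chunk_frame_ratings = ratings.mean(axis=2)             # shape: (4 chunks, 7 frames)

# Step 2: take the maximum over all chunks of the text
text_frame_ratings = chunk_frame_ratings.max(axis=0)   # shape: (7 frames,)
print(text_frame_ratings)
```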
After re-aggregating, we obtain one model rating per generic frame per text. As discussed above, in the TCM dataset, the human raters were asked to choose only one frame per text. Analogously, for each text we choose the frame with the highest model rating as the predicted frame. In contrast, in the UKR dataset, the human raters were allowed to choose several frames. Thus, for the UKR dataset, we select as model predictions all frames whose model ratings are above a threshold of 0.5. Having done this, we can now compare the model’s predictions to the human annotations.
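The two decision rules (single best frame for the multi-class TCM setup, thresholding at 0.5 for the multi-label UKR setup) can be expressed as follows, using made-up re-aggregated ratings:

```python
import numpy as np

frames = ["human interest", "responsibility", "morality",
          "economic consequences", "conflict", "leadership", "factual information"]

# Hypothetical re-aggregated ratings for one text (one value per frame)
text_frame_ratings = np.array([0.81, 0.12, 0.05, 0.33, 0.64, 0.58, 0.07])

# TCM (multi-class): pick the single frame with the highest rating
tcm_prediction = frames[int(np.argmax(text_frame_ratings))]

# UKR (multi-label): pick every frame whose rating exceeds 0.5
ukr_predictions = [frame for frame, rating in zip(frames, text_frame_ratings)
                   if rating > 0.5]

print(tcm_prediction)    # 'human interest'
print(ukr_predictions)   # ['human interest', 'conflict', 'leadership']
```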
Outcomes
Optimal Chunk Length
For both of our datasets, we tested the performance of BERT-NLI on different chunk sizes, varying their length from one to fifteen sentences. After each model run, we re-aggregate the model ratings in the way described above to make them comparable to the human ratings. The lines in Figure 2 represent the agreement between the model predictions and the human annotations, measured by Cohen’s kappa, for different chunk sizes, while the dots represent Cohen’s kappa for the model without chunking. For both datasets, the picture is quite complex. On the left, we can see the results for the TCM dataset. Since this dataset represents a multi-class classification problem, the overall model performance can be expressed as global Cohen’s kappa (black line/dot). The highest value of this metric is achieved for the smallest chunk size, where each chunk comprises only a single sentence. The Cohen’s kappa value for this chunk size is 0.31, which is conventionally interpreted as fair agreement. With increasing chunk size, it falls to values of around 0.18, reflecting only slight agreement—i.e., levels of agreement that are not much above chance. Even this is still higher than the kappa of 0.15 for the model run without chunking, where the model was simply fed with the first 512 tokens of each text (black dot).
The model performance regarding specific frames is very heterogeneous. For a chunk size of 1, the highest agreement between human and model ratings is achieved for the morality and human interest frames, with kappa values around 0.4. An intermediate value of 0.2 is achieved for the economic consequences frame, while the leadership, responsibility, factual information, and conflict frames are all between 0 and 0.1 (slight agreement). It might seem surprising that the model’s overall performance (global kappa) is relatively high, given that most frames are detected rather poorly. This is due to the fact that the frames are represented very unevenly in the TCM dataset, with human interest accounting for 40 percent of all news articles. The falling trend in global kappa with increasing chunk size is also mostly due to the same trend in human interest, while the other frames show no clear correlation with chunk size.
The picture is similarly complex when we look at the agreement between model predictions and human ratings for the UKR dataset (Figure 2, right). Again, most of the generic frames show no clear trend with regard to chunk size; it is worth mentioning, though, that human interest shows a negative correlation with chunk size similar to that in the TCM dataset. The maximum value of Cohen’s kappa is reached at a different chunk size for almost every frame. If we look at the weighted average of the frame-specific kappa values (black line; weights are the frequencies of the frames in the UKR dataset), there is also no clear trend. It hovers around 0.3 across the whole spectrum of chunk sizes. The maximum kappa value of 0.36 (fair agreement) is reached for a chunk size of 4. For the UKR dataset, running the model on the whole texts, without any chunking, does not produce outcomes quite as bad as for the TCM dataset; but still, with a kappa of 0.31, this model performs worse than most of the models with smaller chunk sizes. We see a clearer pattern when we look at the differences in kappa between the different generic frames. At their peak, human interest, economic consequences, and leadership all achieve values between 0.5 and 0.6 (moderate agreement). The values for responsibility, morality, and conflict all peak around 0.3, while the factual information frame peaks below 0.2 (slight agreement) and mostly hovers around zero. Our outcomes demonstrate that dividing the texts in our datasets into chunks indeed improved the model performance significantly. The optimal chunk size, however, seems to differ for each generic frame. Based on these outcomes, we select the model runs with chunk size 1 (TCM) and 4 (UKR), and analyze the model’s performance in more detail.
Model Performance under Optimal Chunk Size
We can learn more about the performance of the model with regard to the TCM dataset by looking at the confusion matrix (Figure 3). The rows of the confusion matrix give us the generic frame chosen by the human annotators, while the columns express the model’s predictions. The numbers in the matrix are row-normalized, meaning they add up to 1 for each row (human predictions). If the model’s predictions were completely in accordance with human ratings, all entries would be on the diagonal of the matrix. We can easily see that this is far from the case. Roughly two-thirds of texts that were categorized by the human annotators as human interest and morality were categorized in the same way by the model. For the factual information frame, it is roughly half; and it is between 17 percent and 28 percent for economic consequences, conflict, and leadership frames. The only frame for which the model prediction is slightly worse than chance (14.3%) is responsibility, with only a 10 percent overlap. A far larger proportion of texts that were classified by the human annotators as responsibility frame were mis-classified as conflict (50%). Even more remarkably, 43 percent of human-annotated conflict texts were mis-classified as responsibility. The model therefore seems systematically to confuse these two generic frames (we could improve the model performance by switching their labels). Unsurprisingly, the model performance on the TCM dataset as a whole is not very high. The overall accuracy is 0.47, meaning 47 percent of texts were classified correctly. However, as the associated Cohen’s kappa of 0.31 indicates, this is still well above chance level.
For the UKR dataset, we will not look at seven separate confusion matrices, but instead focus on Table 2. For most generic frames, the model seems to have performed better on the UKR dataset. Accuracy ranges from 66 percent (responsibility) to 84 percent (human interest), with an average of 75.1 percent, meaning that three quarters of human annotations were predicted correctly by the model. The spread of precision values is much broader, ranging between 14 percent (factual information) and 86 percent (human interest). The factual information frame also has the lowest recall at 25 percent, and thus the lowest F1 score at 0.18. Luckily, with only 12 percent of texts, factual information is also the rarest category in the UKR dataset, so these suboptimal outcomes do not impact the overall model performance too much. At the other extreme, conflict and leadership reach recall values of 99 percent and 94 percent, respectively, meaning that almost all texts labeled by humans as having these frames were labeled similarly by the model. With regard to diagnosing human interest and morality frames, the model seems to have been too cautious (precision > recall), whereas it seems to have been too eager to diagnose texts as having conflict, leadership, or responsibility frames (recall > precision).
Table 2. Performance metrics for the UKR dataset
Frame                  | Accuracy | Precision | Recall | F1   | Kappa
Human interest         | 0.84     | 0.86      | 0.46   | 0.60 | 0.51
Responsibility         | 0.66     | 0.52      | 0.69   | 0.60 | 0.31
Morality               | 0.69     | 0.60      | 0.34   | 0.44 | 0.24
Economic consequences  | 0.82     | 0.64      | 0.78   | 0.70 | 0.57
Conflict               | 0.76     | 0.75      | 0.99   | 0.85 | 0.28
Leadership             | 0.76     | 0.74      | 0.94   | 0.83 | 0.45
Factual information    | 0.73     | 0.14      | 0.25   | 0.18 | 0.03
The F1 score, which combines precision and recall into a comprehensive measure of model performance, is widely considered good above a value of 0.7. The model reaches this threshold for three generic frames, namely conflict, leadership, and economic consequences. It is interesting to note that the message of the F1 score in some cases deviates considerably from Cohen’s kappa. In particular, with regard to the conflict frame, the F1 is quite high (0.85) whereas the kappa is mediocre at best (0.28). This occurs in cases where a high degree of agreement between human and model ratings is expected by chance, i.e., in cases where the proportion of texts belonging to a particular frame is very high (as is the case for both the conflict and leadership frames).
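The divergence between F1 and kappa under high prevalence can be illustrated with an extreme, made-up example: if 90 of 100 texts carry a frame and a model simply predicts the frame for every text, the F1 score is about 0.95 while Cohen's kappa is exactly 0, since the model performs no better than the chance baseline.

```python
from sklearn.metrics import f1_score, cohen_kappa_score

# Made-up extreme case: 90% of texts carry the frame, model always predicts 'present'
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

print("F1:   ", round(f1_score(y_true, y_pred), 2))           # ~0.95
print("kappa:", round(cohen_kappa_score(y_true, y_pred), 2))  # 0.0
```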