Datasets and Preprocessing
As mentioned, in this study we will use two pre-coded datasets. The first dataset stems from a project on the framing of traditional Chinese medicine in non-Chinese media. We will refer to it as the TCM dataset in the following. The TCM dataset comprises 253 English-language articles. On average, each article contains 586.4 words, yielding a total of 148,369 words. However, article length varies considerably, with the smallest article comprising just 35 words and the largest 2,430. Each article was rated by two research assistants in parallel. A variety of variables were assessed, among them the seven generic frames we are focusing on. Crucially, research assistants were instructed to select one (and only one) frame for each article.
The second dataset was constructed in a research project on international reporting of the Russia-Ukraine war, which is why we will refer to it as the UKR dataset. The UKR dataset comprises 100 English-language articles from English, American, and Chinese newspapers. The original dataset also contained articles in Chinese and Russian, which are not included in this analysis. On average, each article contains 1,114 words, making a total of 111,402 words in the whole dataset. Again, article length is quite heterogeneous, ranging from 38 to 5,815 words. The articles were rated by a team of seven research assistants. A small fraction of the articles (only seven in this study) was rated by all seven research assistants in parallel; the remaining articles were each rated by a single research assistant. Again, a wide range of variables was applied to each article, including the seven generic frames. Different from the TCM dataset, research assistants were instructed to select one or several generic frames for each article, as many as they felt applied. On average, 2.6 frames were indicated. The consequences of this specific difference between the TCM and UKR datasets will be discussed further below. In both datasets, we exclude the headline, short summary, and the name of the author from each article. We also exclude other irrelevant text, such as location (e.g., ‘KIEW –’), page number, page headers and footers, image descriptions, etc. Thus, the texts used as input for the LLM exclusively consist of the body of the articles.
Figure 1 shows the distribution of the generic frames in the two datasets. It can easily be seen that both datasets are quite unbalanced, with some frames being many times more prevalent than others. The TCM dataset is dominated by stories with human interest and factual information frames, which account for a combined 70 percent of texts. Due to the multi-label nature of the UKR dataset, percentages add up to more than 100 percent, but it can easily be seen that it, too, is dominated by two frames, in this case conflict (70%) and leadership (62%). This unbalanced distribution would be an additional challenge for supervised learning, but not necessarily for the zero-shot approach we follow here.
Performance Metrics
In this study, we aim to use an NLP system to automatically detect seven generic frames in news article texts. This problem can be formalized in two different ways, namely as a multi-class or a multi-label classification problem. In multi-class classification, each instance (in our case, text) is classified into a single category (in our case, a single generic frame). In contrast, in multi-label classification, each instance may be classified as having one or several labels. These two options correspond to the different ways in which our datasets were generated (see above): the TCM dataset was coded as a multi-class classification task, whereas the UKR dataset was coded as a multi-label task. This difference in task structure will be reflected in the setup of our automated classification approach.
The first point in which the difference between multi-class and multi-label classification is reflected is the evaluation of our classification results. Statisticians and computer scientists have developed a wide variety of approaches to evaluating the quality of automated classification. All of them are based on a comparison between the automated classification and the ‘gold standard’ of human annotation. As the use of the term ‘ground truth’ for human-annotated data shows, computer scientists often take a rather uncritical approach to the quality of these data. Only recently has there been a growing realization that the performance a machine-learning (ML) system can achieve in classification tasks is capped by the quality of human annotations. In other words, if human raters do not even agree with each other (or worse, make their annotations semi-randomly), how can an ML system achieve perfect agreement? To reflect this issue, we will compare the performance of our classification approach to the agreement among our human raters.
All performance metrics are based on the analysis of a so-called confusion matrix. A confusion matrix plots the predictions of the ML system against the ‘ground truth’—in our case, the annotations of our research assistants. In case of a simple, binary classification problem, the confusion matrix consists of four boxes or cells: 1) true positives (TP): these are cases in which the model predicted the positive class (presence of a frame) correctly, meaning in correspondence with human raters; 2) true negatives (TN): these are cases in which the model predicted the negative class (absence of a frame) correctly; 3) false positives (FP), or type I errors: these are cases in which the model incorrectly predicted the positive class; and 4) false negatives (FN), or type II errors: these are cases in which the model incorrectly predicted the negative class.
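To make the four cells concrete, the following minimal Python sketch (assuming scikit-learn is available; the labels and predictions are invented for illustration) computes a binary confusion matrix for the presence or absence of a single frame:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical example: 1 = frame present, 0 = frame absent
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 'ground truth' from human raters
model_labels = [1, 0, 0, 1, 1, 0, 1, 0]  # predictions of the ML system

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(human_labels, model_labels).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```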
Accuracy, representing the ratio of correctly predicted observations (TP + TN) to total observations (TP + TN + FP + FN), is undeniably the most rudimentary and comprehensible evaluative metric for classification models. However, its effectiveness is noticeably diminished when applied to datasets with imbalanced classes. The reason for this is that, if some classes have a very high or very low proportion of positive cases (i.e., texts that belong to a specific frame), it becomes possible to achieve high accuracy simply by always predicting the most prevalent class. For example, if 90 percent of texts in the TCM dataset were instances of the human interest frame, an ML system could achieve 90 percent accuracy simply by always classifying texts as human interest, irrespective of any properties of the text. Thus, while accuracy is certainly an informative metric, it needs to be complemented by other metrics to gain further insights into the classification results.
Recall, otherwise known as the true positive rate or sensitivity, quantifies the ability of the model to predict positive instances correctly (TP / (TP + FN)). It is defined as the fraction of positive cases (in our case, texts that have been classified by human raters as having a specific frame) that have been classified by the ML system as positive. This is particularly crucial in contexts where the repercussions of overlooking a positive instance are considerable, such as in medical diagnoses.
Precision, synonymous with the positive predictive value, is determined as the ratio of true positives to the sum of true and false positives (TP / (TP + FP)). In our context, it is the ratio of texts that have been classified by human raters as having a specific frame among texts that have been labeled by the ML system as having this frame. Intuitively, precision gauges the capacity of the classifier to refrain from erroneously labeling a negative sample as positive.
It is easy to see that sensitivity and precision are difficult to optimize at the same time. An ML system with a tendency to classify ambiguous instances as positive will tend to have higher sensitivity (it will ‘catch’ more positive instances), but it will also tend to make more mistakes in classifying instances as positive, which will decrease its precision.
The F1 score combines precision and recall into a single metric. It is defined as their harmonic mean, i.e., twice the product of precision and recall, divided by their sum. Thus, a high F1 score (close to 1) can only be reached if both precision and recall are high. In machine learning, the F1 score is often preferred to accuracy as a more reliable metric, especially in instances of uneven class distribution. High precision combined with low recall means that the classifier rarely mislabels negative instances as positive, but overlooks many difficult-to-classify positive instances. Conversely, high recall but low precision implies that the classifier, despite identifying the majority of positive instances, also wrongly classifies numerous negatives as positives.
Lastly, Cohen’s kappa serves as an efficacious metric, particularly when dealing with imbalanced classes. Conventionally utilized as a measure of agreement between two human raters, it is also employed in machine learning as a comparison between a classifier and a simplistic baseline classifier. Cohen’s kappa compares the observed accuracy with the accuracy expected by random chance. Because it takes into account the possibility of agreement occurring by chance, it is a more robust measure than accuracy. The kappa score ranges between -1 and +1: a kappa score of +1 represents perfect agreement among the raters; a score of 0 implies that the agreement is equivalent to random chance; a score of -1 denotes total disagreement among the raters. It is worth noting, though, that high values of kappa are difficult to reach for unbalanced categories: since chance agreement is already high in such cases, a high kappa requires almost perfect agreement.
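As a minimal illustration (again assuming scikit-learn, with invented labels), all of the metrics discussed above can be computed as follows:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(human_labels, model_labels))
print("precision:", precision_score(human_labels, model_labels))
print("recall   :", recall_score(human_labels, model_labels))
print("F1       :", f1_score(human_labels, model_labels))
print("kappa    :", cohen_kappa_score(human_labels, model_labels))
```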
For multi-label classification problems like in our UKR project, we obtain one binary confusion matrix for each generic frame. This means we have to compute each metric described above separately for each generic frame. If we want to get an impression of the overall performance of our model, we have to combine these numbers, e.g., by taking their arithmetic mean. In contrast, for multi-class classification problems like in the TCM project, we obtain a single confusion matrix, but with the number of rows and columns equal to the number of generic frames. In this case, we compute the quality metrics only once for the whole matrix.
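Assuming the human annotations and model predictions are stored as binary indicator arrays (texts × frames) in the multi-label case and as single frame labels per text in the multi-class case, the two evaluation setups might look as in this sketch (scikit-learn, with placeholder data):

```python
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

frames = ["human interest", "responsibility", "morality",
          "economic consequences", "conflict", "leadership", "factual information"]

# Multi-label case (UKR): one binary column per frame
rng = np.random.default_rng(0)
human_ml = rng.integers(0, 2, size=(100, 7))   # placeholder annotations
model_ml = rng.integers(0, 2, size=(100, 7))   # placeholder predictions
per_frame_kappa = {frame: cohen_kappa_score(human_ml[:, i], model_ml[:, i])
                   for i, frame in enumerate(frames)}
macro_f1 = f1_score(human_ml, model_ml, average="macro")
print(per_frame_kappa, macro_f1)

# Multi-class case (TCM): exactly one frame label per text
human_mc = rng.choice(frames, size=253)        # placeholder annotations
model_mc = rng.choice(frames, size=253)        # placeholder predictions
print("global kappa:", cohen_kappa_score(human_mc, model_mc))
```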
To conclude, each of these evaluative measures caters to different aspects of model performance. Hence, their application is contingent on the specific requirements of each unique case. For example, when conducting a framing analysis, researchers have to determine whether their data correspond to a multi-label or multi-class classification problem. Furthermore, they need to decide whether all frames are of equal importance or whether more frequent frames are more relevant, and whether false negatives are more, less, or equally relevant for their analysis in relation to false positives.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model for NLP tasks that has significantly advanced the field of NLP. Developed by researchers at Google AI Language, it was open-sourced in late 2018. BERT is based on the transformer architecture, which uses an attention mechanism to learn contextual relations between words in a text. For example, BERT can learn that the meaning of the word ‘mouse’ changes radically depending on whether the context is computers (“I can’t click on this without a mouse”) or small animals (“I saw a mouse in the basement”). Likewise, it can learn that different words like ‘insult’ and ‘offense’ may have similar meanings. Depending on the context, BERT will represent the same token ‘mouse’ using different vector embeddings. Unlike previous models such as Long Short-Term Memory Networks (LSTMs), BERT is bidirectional. Conventional models process words in a sequence, either left-to-right or right-to-left. In contrast, BERT processes word context from both directions simultaneously. This bidirectional approach allows the model to gain a comprehensive understanding of the word context, resulting in more accurate predictions.
BERT is pre-trained on a large corpus of unlabeled text data collected from Wikipedia and book corpora. This pre-training step is unsupervised; it involves predicting words in a sentence (masked language modeling) and predicting sentence order (next sentence prediction). The first task trains BERT to learn the context of words, while the second task helps it to learn relationships between sentences. Through this pre-training, BERT acquires much more sophisticated knowledge of language than previous models, including syntactic regularities, information on the meaning of words in different contexts, and semantic relations between words (e.g., synonyms, antonyms). Thus, once pre-trained, BERT constitutes an ideal foundation for more specific applications, such as question answering, named entity recognition, sentiment analysis, or topic modeling. When applied to these tasks, the body of the LLM is preserved, but instead of focusing the model on masked language modeling and next sentence prediction, it is equipped with a new task head that is tailored to the specific application. With this new task head in place, it can be fine-tuned in a supervised manner on these specific tasks. This can be done faster and with smaller amounts of labeled data than would be necessary if the task were learned from scratch. The process of utilizing the language knowledge gained by unsupervised training of an LLM for different applications is also known as deep transfer learning. It has been shown that fine-tuning models with prior language knowledge to new tasks is significantly more efficient than training models from scratch (Iman et al., 2023; Ruder, 2019; Tan et al. 2018).
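To make the idea of swapping the pre-training objectives for a task-specific head concrete, the following sketch (using the Hugging Face transformers library; the checkpoint name and number of labels are illustrative assumptions) loads a pre-trained BERT encoder and attaches a freshly initialized classification head that could then be fine-tuned on labeled data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder; the classification head on top is newly initialized
# and would have to be fine-tuned on task-specific labeled data.
model_name = "bert-base-uncased"          # illustrative choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=7,                         # e.g., one class per generic frame
)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
outputs = model(**inputs)                 # logits from the (as yet untrained) head
print(outputs.logits.shape)               # -> torch.Size([1, 7])
```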
Natural Language Inference
One area for which BERT (and other language models) can be fine-tuned is known as natural language inference (NLI; MacCartney, 2009). NLI, also known as recognizing textual entailment (RTE), is a subtask in NLP that involves determining whether a given piece of text, also called the hypothesis, can be logically inferred from another text, the premise. In essence, NLI is about understanding the relationship between two texts. Three types of relationship exist: entailment, contradiction, and neutral. If the hypothesis is a logical implication of the premise, it is considered an entailment. If the premise and hypothesis convey opposing information, it is a contradiction. If the premise and the hypothesis are unrelated or the relationship is unclear, it is deemed neutral. NLI is considered a high-level task for LLMs: to solve NLI tasks, the model not only has to be able to deal with syntax and semantics, but also needs world knowledge—i.e., information on factual relations between concepts.
The BERT-NLI Model
For our analyses, we will use the BERT-NLI model, which has been developed by Laurer et al. (2024). As the name implies, the model was created by fine-tuning an advanced version of BERT, DeBERTaV3 base (He et al., 2021), on a variety of NLI tasks. More specifically, the model was fine-tuned on 1.2 million human-annotated premise-hypothesis pairs coming from eight different NLI datasets. Laurer et al. (2024) used a simplified version of the NLI task structure (see above), in which only the distinction between entailment and non-entailment (encompassing both contradiction and neutral) is considered. They found that the model reaches peak performance after only a few passes over the training dataset. This shows how much the model benefits from the basic language knowledge of DeBERTa, acquired during its pre-training.
Fine-tuning an LLM in this way has the advantage that NLI is a universal task, meaning a large variety of problems can be reformulated as NLI tasks. For example, text classification tasks can be turned into NLI problems by formulating a hypothesis like “the text is about sports” or “the text is about politics”, and using the text we want to classify as the premise. This is how Laurer et al. (2024) proceed. They apply their fine-tuned model to a variety of tasks from political science research, such as deciding whether sentences from party manifestos express a stance for or against the military, protectionism, or traditionalism, classifying US State of the Union speeches and Supreme Court cases according to policy topic, and classifying news texts as positive or negative about the economy. For each of these tasks, Laurer et al. (2024) develop hypothesis statements based on the categories in the original codebooks that were used in the generation of these datasets. For example, for analyzing the stance on the military represented in party programs, the hypotheses are “The quote is positive towards the military” and “The quote is negative towards the military”; for classifying news articles into topics, one hypothesis is “The quote is about economy, or technology, or infrastructure, or free market”. This process, called label verbalization, constitutes a significant difference between NLI models and classic text classification. In classic supervised machine learning, the algorithm is trained purely on numeric ‘ground truth’ labels, which only tell it which texts belong in which category. The actual differences between these categories need to be inferred by the algorithm from scratch. In contrast, the hypotheses in NLI actually inform the language model about the meaning of the different categories. This makes training an NLI system much more akin to training a human rater. Laurer et al. (2024) validate their BERT-NLI LLM in two distinct ways: firstly, in a zero-shot approach without any further fine-tuning on the task dataset; and secondly, by fine-tuning the model on ever larger batches of task-specific training data. Depending on the task, the zero-shot approach yields F1 values between 0.1 and 0.65, with an average of around 0.38. Further training increases the F1 drastically, all the way up to an average of around 0.75, given a training set of 2,500 data points. Over most tasks and for most training sample sizes, BERT-NLI outperforms a standard BERT text classification model (not fine-tuned on NLI data).
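In practice, this zero-shot reformulation is conveniently exposed through the zero-shot classification pipeline of the Hugging Face transformers library. The sketch below (the model identifier is an assumption; any NLI-fine-tuned checkpoint can be substituted) shows how candidate labels are verbalized into hypotheses via a template:

```python
from transformers import pipeline

# Any NLI-fine-tuned checkpoint can be used here; the identifier below is an
# illustrative assumption, not necessarily the exact model used in this study.
classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

text = "The government announced new tariffs on imported steel on Monday."
labels = ["economy", "sports", "weather"]

result = classifier(text,
                    candidate_labels=labels,
                    hypothesis_template="The text is about {}.",
                    multi_label=True)
print(result["labels"], result["scores"])
```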
To sum up, the BERT-NLI LLM has several features that fit the requirements of our project very well. 1) It is based on the BERT LLM, which enables it to capture the meaning of words, phrases, and whole documents, rather than just the frequencies of words, as was the case in traditional NLP methods. 2) It is fine-tuned on large NLI datasets, which enables it to judge whether a statement (hypothesis) is justified, given a text (premise). By formulating appropriate hypotheses, this ability can be used to classify statements in many different ways (e.g., topic, sentiment, political stance, etc.). 3) Laurer et al. (2024) have demonstrated that BERT-NLI can reach high levels of performance on different political science datasets, which makes it likely that it can also be applied to other social science tasks. However, there are two major obstacles that make it challenging to apply BERT-NLI to our datasets. In the following, we will discuss them one at a time.
Challenges
Abstract variables. One of the main challenges in our project is the relatively abstract nature of the variables we are trying to detect, which is also a key issue in identifying frames (Guo et al., 2023; Nicholls and Culpepper, 2021). Traditionally, NLP has been focused on topic modeling, i.e., the categorization of a given text into one of a number of pre-defined topics. For example, news articles might be categorized into foreign news, sports, weather, society, economy, etc. While not exactly trivial, the difficulty of this task is somewhat attenuated by the fact that certain words or phrases are highly correlated with specific topics but not with others. If an article contains the terms ‘storm’, ‘rain’, ‘high pressure area’, and ‘tomorrow’, we can safely assume that it falls into the category ‘weather’. This close relationship between categories and phrases does not exist to the same degree for generic frames. An article written under a leadership frame might have a higher likelihood of containing words like ‘leader’ or ‘leadership’, but it is entirely possible to write a leadership-framed article without ever mentioning these or other pertinent phrases. This issue is also acknowledged by Nicholls and Culpepper (2021), who assert that computational methods largely return definite topics but not frames. Furthermore, while specific phrases might be correlated with a generic frame in the context of one topic (say, traditional Chinese medicine), they may no longer be correlated with the same frame in a different context (the Ukraine War, for example). For these reasons, frame detection is extremely challenging for traditional, word frequency-based NLP techniques.
As mentioned, the new generation of large language models goes beyond analyzing word frequencies. By taking the context into account, these models manage to extract the semantic content of a text. Thus, it is plausible to assume that they are better suited to the task of identifying generic frames. However, if we look at the variables in the political science datasets used by Laurer et al. (2024) to evaluate their BERT-NLI model, they still seem rather concrete in comparison to our generic frames. In particular, they mostly concern what is being said and not how the story is told, as is the case for the generic frames.
Input length constraints. As described above, the NLI model used by Laurer et al. (2024) is based on BERT. As such, it has the same constraints regarding the maximum size of textual input that the model can handle, namely 512 tokens. A token is a piece of text that results from tokenization—i.e., the process of segmenting a text into units that an LLM can process. Often, tokens correspond to complete words, especially if they are simple and occur frequently in the training corpus of the LLM (like ‘house’). Infrequent compound words like ‘tokenization’, however, are cut into their more frequent constituent parts - ‘token’ and ‘ization’ in this case. If these tokens are contained in the vocabulary of the LLM, they are then translated into an index number; so ‘token’ and ‘ization’ might become 183,372 and 457, for example. These index numbers, and not the words or tokens themselves, constitute the input of the LLM.
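The following sketch (using the Hugging Face tokenizer for an illustrative BERT checkpoint; the exact subword splits and index numbers depend on the model's vocabulary) shows how a text is turned into tokens and then into the index numbers the LLM actually receives:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

text = "Tokenization segments a text into units."
tokens = tokenizer.tokenize(text)   # frequent words stay whole, rare words are
print(tokens)                       # split into subword pieces

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)                          # the index numbers passed to the LLM

# Inputs longer than the model's limit (512 tokens for BERT) are truncated:
encoded = tokenizer(text, truncation=True, max_length=512)
print(encoded["input_ids"][:10])
```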
On average, the texts in our datasets contain 753.5 (TCM) and 1,372.7 (UKR) tokens, significantly exceeding the input limit of BERT and BERT-NLI. Thus, if we were to pass our articles to the LLM as a whole, they would be truncated, meaning they would be cut off after the 512th token. In consequence, we would lose large chunks of most of the articles in our dataset, which potentially contain relevant information about the frames. However, even the 512-token limit is optimistic: the premises in the eight textual entailment datasets on which BERT-NLI has been fine-tuned, as well as the political science datasets on which it has been tested, typically comprise only a few sentences. For example, the premises in the MultiNLI dataset are on average only 22.3 words long, and the hypotheses only 11.4 words (Williams et al., 2017). In contrast, our articles on average consist of 586.4 (TCM) and 1,114.1 (UKR) words. Based on this, it stands to reason that BERT-NLI should achieve the best performance when given rather short input texts as the premise. In the next section, we will discuss how we tackled these challenges.
Hypothesis Formulation
In contrast to Laurer et al. (2024), who usually formulate only a single hypothesis for each variable they assess, we formulate three hypotheses for each of the seven generic frames we analyze. This allows us to cover a wider range of aspects of each generic frame without formulating hypotheses that are too lengthy and complicated, and which might thus be challenging for the model to process. We developed these statements based on the technical definitions of the generic frames (see above), as well as the operationalization of the frames in the codebooks of our two studies. In accordance with the recommendations of Laurer et al. (2024), we tried to keep each statement as short and simple as possible, using everyday language instead of technical or academic terms whenever possible. The statements can be seen in Table 1.
Table 1. Generic frames and associated hypothesis statements
Human interest
- “The quote tells a person's story”
- “The quote talks about someone's own experiences”
- “The quote includes personal words that make you feel emotions”

Responsibility
- “The quote is about a government being responsible for an issue”
- “The quote is about an individual being responsible for an issue”
- “The quote is about a group being responsible for an issue”

Morality
- “The quote talks about the ethics related to a problem”
- “The quote talks about the moral aspects of a problem”
- “The quote is about morality”

Economic consequences
- “The quote is about the economic consequences of an event”
- “The quote is about the economic consequences of a decision”
- “The quote is about economic impact”

Conflict
- “The quote is about a conflict”
- “The quote points out different opinions between people”
- “The quote stresses the 'us against them' story”

Leadership
- “The quote is about a leader”
- “The quote talks about how leaders affect things”
- “The quote is about a government”

Factual information
- “The quote contains only factual information”
- “The quote is written in an objective way”
- “The quote discusses an issue in a purely factual manner”
Developing these statements can be seen as analogous to the development of scales in questionnaire design. Typically, an abstract construct like ‘authoritarianism’ will not be measured with a single item but with a number of different items, which cover different aspects of the construct and, in combination, make the measurement more reliable.
We decided to use the same hypothesis statements for both of our datasets. However, it might be advantageous to think of ways in which these statements can be adapted further to the specific context of a study, especially if the researcher has a priori information about how the frames are predominantly expressed in this specific context. On the other hand, making the hypothesis statements too specific can cause more unusual and unexpected manifestations of the frames to be overlooked. Since we formulate several hypothesis statements for each generic frame, the model will give us separate predictions for each of these statements. This raises the question of how these predictions can be combined to obtain a single prediction for the frame. We will address this issue further below.
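For later processing, it is convenient to keep the hypothesis statements from Table 1 in a simple data structure, e.g., a Python dictionary mapping each generic frame to its three statements; the sketch below shows the idea for two of the frames:

```python
# Hypothesis statements per generic frame (see Table 1; only two frames shown here)
HYPOTHESES = {
    "human interest": [
        "The quote tells a person's story",
        "The quote talks about someone's own experiences",
        "The quote includes personal words that make you feel emotions",
    ],
    "conflict": [
        "The quote is about a conflict",
        "The quote points out different opinions between people",
        "The quote stresses the 'us against them' story",
    ],
    # ... the remaining five frames follow the same pattern
}
```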
Text Chunking
As described above, the texts used as input for the LLM consist exclusively of the body of the articles. As discussed, these texts are significantly longer than the maximum input window of BERT, which in turn is much larger than the average length of the texts on which BERT-NLI was fine-tuned and validated (Laurer et al., 2024). For this reason, we decided to cut each text in our dataset into chunks of one or several sentences, using the English sentence tokenizer from the Python NLTK package. By choosing whole sentences as constituent parts, we avoid confusing the language model with statements that end abruptly mid-sentence. By varying the size of these chunks from one to fifteen sentences, we can compare which chunk size yields the best outcomes. Given that the average numbers of sentences per text in our datasets are 21.1 (TCM) and 46.8 (UKR), this almost always leads to a text being split into several chunks. If a text contains fewer sentences than the chunk size, it is treated as a single chunk. To validate that chunking does in fact improve the outcomes, we compare the results with a model to which we fed the whole texts, truncated to the maximum input window.
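A minimal sketch of this chunking step, assuming NLTK and its 'punkt' sentence tokenizer are installed, might look as follows:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the English sentence tokenizer


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split a text into chunks of `chunk_size` whole sentences."""
    sentences = sent_tokenize(text, language="english")
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]


article = "First sentence. Second sentence. Third sentence. Fourth sentence."
print(chunk_text(article, chunk_size=2))
# -> ['First sentence. Second sentence.', 'Third sentence. Fourth sentence.']
```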
Analyzing the Text Chunks with BERT-NLI
After dividing each text in our datasets into text chunks, we add the phrase ‘The quote: "’ to the beginning of each text chunk in our dataset, and the phrase ‘" - end of the quote’ to the end, as recommended by Laurer et al. (2024). This allows the language model to connect the text of our chunks to our hypotheses, which all use the term ‘the quote’ when referring to the premise (text chunk) under scrutiny (see Table 1). We then tokenize the texts using the tokenizer paired with BERT-NLI, and pass the token indices to the BERT-NLI LLM. Likewise, we tokenize each of the 21 hypothesis statements (see Table 1). The model then analyzes each text chunk together with each of the hypothesis statements and returns a probability (from 0 to 1), reflecting whether the model believes that the hypothesis is entailed by the text chunk. We will refer to these numbers as the model ratings to distinguish them from human ratings. Thus, if a text has been divided into four chunks, we obtain 4 x 21 = 84 separate model ratings. In contrast, our datasets only contain human ratings of the seven generic frames for the text as a whole. The challenge is now to aggregate the 84 probability values in such a way that they can be compared to the seven frame ratings.
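A sketch of how a single chunk-hypothesis pair could be scored with an NLI model via the transformers library is shown below (the model identifier is an assumption, and the position of the entailment label in the output depends on the specific model's configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative NLI checkpoint; not necessarily the exact model used in this study
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = 'The quote: "Fighting intensified around the city overnight." - end of the quote'
hypothesis = "The quote is about a conflict"

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# Which output index corresponds to 'entailment' depends on the model's config
entail_id = model.config.label2id.get("entailment", 0)
print("entailment probability:", probs[entail_id].item())
```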
Re-aggregating Outcomes
The re-aggregation of the model ratings involves two steps: 1) aggregating the ratings of the three hypothesis statements into a rating of the generic frame (one for each text chunk); and 2) aggregating the ratings for each text chunk into a rating for the whole text. For step 1), we simply take the arithmetic mean across the model ratings of the three hypothesis statements belonging to a specific generic frame. Step 2) is more complex: if we averaged the model ratings over all the chunks belonging to a specific text, the signal might get drowned out. This is because, even if a text has a certain generic frame, we do not assume that each chunk in this text reflects this fact equally. Some chunks will receive high model ratings, but many more might receive ratings close to zero. We avoid this by using the maximum model rating across chunks instead of the average. We compared this procedure to other possible aggregation schemes (e.g., taking the maximum or the average across both text chunks and hypothesis statements) and found it to perform best.
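Assuming the raw entailment probabilities for one text are stored in an array of shape (number of chunks, number of frames, number of hypotheses per frame), the two aggregation steps reduce to a mean and a maximum along the appropriate axes, as in this sketch:

```python
import numpy as np

# Hypothetical ratings for one text: 4 chunks x 7 frames x 3 hypotheses
rng = np.random.default_rng(0)
ratings = rng.random((4, 7, 3))

# Step 1: average over the three hypothesis statements of each frame
chunk_frame_ratings = ratings.mean(axis=2)             # shape: (4 chunks, 7 frames)

# Step 2: take the maximum over all chunks of the text
text_frame_ratings = chunk_frame_ratings.max(axis=0)   # shape: (7 frames,)
print(text_frame_ratings)
```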
After re-aggregating, we obtain one model rating per generic frame per text. As discussed above, in the TCM dataset, the human raters were asked to choose only one frame per text. Analogously, for each text we choose the frame with the highest model rating as the predicted frame. In contrast, in the UKR dataset, the human raters were allowed to choose several frames. Thus, for the UKR dataset, we select as model predictions all frames whose model ratings are above a threshold of 0.5. Having done this, we can now compare the model’s predictions to the human annotations.
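The two decision rules (single best frame for the multi-class TCM setup, thresholding at 0.5 for the multi-label UKR setup) can be expressed as follows, using made-up re-aggregated ratings:

```python
import numpy as np

frames = ["human interest", "responsibility", "morality",
          "economic consequences", "conflict", "leadership", "factual information"]

# Hypothetical re-aggregated ratings for one text (one value per frame)
text_frame_ratings = np.array([0.81, 0.12, 0.05, 0.33, 0.64, 0.58, 0.07])

# TCM (multi-class): pick the single frame with the highest rating
tcm_prediction = frames[int(np.argmax(text_frame_ratings))]

# UKR (multi-label): pick every frame whose rating exceeds 0.5
ukr_predictions = [frame for frame, rating in zip(frames, text_frame_ratings)
                   if rating > 0.5]

print(tcm_prediction)    # 'human interest'
print(ukr_predictions)   # ['human interest', 'conflict', 'leadership']
```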
Outcomes
Optimal Chunk Length
For both of our datasets, we tested the performance of BERT-NLI on different chunk sizes, varying their length from one to fifteen sentences. After each model run, we re-aggregate the model ratings in the way described above to make them comparable to the human ratings. The lines in Figure 2 represent the agreement between the model predictions and the human annotations, measured by Cohen’s kappa, for different chunk sizes, while the dots represent Cohen’s kappa for the model without chunking. For both datasets, the picture is quite complex. On the left, we can see the results for the TCM dataset. Since this dataset represents a multi-class classification problem, the overall model performance can be expressed as global Cohen’s kappa (black line/dot). The highest value of this metric is achieved for the smallest chunk size, where each chunk comprises only a single sentence. The Cohen’s kappa value for this chunk size is 0.31, which is conventionally interpreted as fair agreement. With increasing chunk size, it falls to values of around 0.18, reflecting only slight agreement—i.e., levels of agreement that are not much above chance. Even this is still higher than the kappa of 0.15 for the model run without chunking, where the model was simply fed with the first 512 tokens of each text (black dot).
The model performance regarding specific frames is very heterogeneous. For a chunk size of 1, the highest agreement between human and model ratings is achieved for the morality and human interest frames, with kappa values around 0.4. An intermediate value of 0.2 is achieved for the economic consequences frame, while the leadership, responsibility, factual information, and conflict frames are all between 0 and 0.1 (slight agreement). It might seem surprising that the model’s overall performance (global kappa) is relatively high, given that most frames are detected rather poorly. This is due to the fact that the frames are represented very unevenly in the TCM dataset, with human interest accounting for 40 percent of all news articles. The falling trend in global kappa with increasing chunk size is also mostly due to the same trend in human interest, while the other frames show no clear correlation with chunk size.
The picture is similarly complex when we look at the agreement between model predictions and human ratings for the UKR dataset (Figure 2, right). Again, most of the generic frames show no clear trend with regard to chunk size; it is worth mentioning, though, that human interest shows a negative correlation with chunk size similar to that in the TCM dataset. The maximum value of Cohen’s kappa is reached at a different chunk size for almost every frame. If we look at the weighted average of the frame-specific kappa values (black line; weights are the frequencies of the frames in the UKR dataset), there is also no clear trend. It hovers around 0.3 across the whole spectrum of chunk sizes. The maximum kappa value of 0.36 (fair agreement) is reached for a chunk size of 4. For the UKR dataset, running the model on the whole texts, without any chunking, does not produce outcomes quite as bad as for the TCM dataset; but still, with a kappa of 0.31, this model performs worse than most of the models with smaller chunk sizes. We see a clearer pattern when we look at the differences in kappa between the different generic frames. At their peak, human interest, economic consequences, and leadership all achieve values between 0.5 and 0.6 (moderate agreement). The values for responsibility, morality, and conflict all peak around 0.3, while the factual information frame peaks below 0.2 (slight agreement) and mostly hovers around zero. Our outcomes demonstrate that dividing the texts in our datasets into chunks indeed improved the model performance significantly. The optimal chunk size, however, seems to differ for each generic frame. Based on these outcomes, we select the model runs with chunk size 1 (TCM) and 4 (UKR), and analyze the model’s performance in more detail.
Model Performance under Optimal Chunk Size
We can learn more about the performance of the model with regard to the TCM dataset by looking at the confusion matrix (Figure 3). The rows of the confusion matrix give us the generic frame chosen by the human annotators, while the columns express the model’s predictions. The numbers in the matrix are row-normalized, meaning they add up to 1 for each row (human predictions). If the model’s predictions were completely in accordance with human ratings, all entries would be on the diagonal of the matrix. We can easily see that this is far from the case. Roughly two-thirds of texts that were categorized by the human annotators as human interest and morality were categorized in the same way by the model. For the factual information frame, it is roughly half; and it is between 17 percent and 28 percent for economic consequences, conflict, and leadership frames. The only frame for which the model prediction is slightly worse than chance (14.3%) is responsibility, with only a 10 percent overlap. A far larger proportion of texts that were classified by the human annotators as responsibility frame were mis-classified as conflict (50%). Even more remarkably, 43 percent of human-annotated conflict texts were mis-classified as responsibility. The model therefore seems systematically to confuse these two generic frames (we could improve the model performance by switching their labels). Unsurprisingly, the model performance on the TCM dataset as a whole is not very high. The overall accuracy is 0.47, meaning 47 percent of texts were classified correctly. However, as the associated Cohen’s kappa of 0.31 indicates, this is still well above chance level.
For the UKR dataset, we will not look at seven separate confusion matrices, but instead focus on Table 2. For most generic frames, the model seems to have performed better on the UKR dataset. Accuracy ranges from 66 percent (responsibility) to 84 percent (human interest), with an average of 75.1 percent, meaning that three quarters of human annotations were predicted correctly by the model. The spread of precision values is much broader, ranging between 14 percent (factual information) and 86 percent (human interest). The factual information frame also has the lowest recall at 25 percent, and thus the lowest F1 score at 0.18. Luckily, with only 12 percent of texts, factual information is also the rarest category in the UKR dataset, so these suboptimal outcomes do not impact the overall model performance too much. At the other extreme, conflict and leadership reach recall values of 99 percent and 94 percent, respectively, meaning that almost all texts labeled by humans as having these frames were labeled similarly by the model. With regard to diagnosing human interest and morality frames, the model seems to have been too cautious (precision > recall), whereas it seems to have been too eager to diagnose texts as having conflict, leadership, or responsibility frames (recall > precision).
Table 2. Performance metrics for the UKR dataset
Frame                  | Accuracy | Precision | Recall | F1   | Kappa
Human interest         | 0.84     | 0.86      | 0.46   | 0.60 | 0.51
Responsibility         | 0.66     | 0.52      | 0.69   | 0.60 | 0.31
Morality               | 0.69     | 0.60      | 0.34   | 0.44 | 0.24
Economic consequences  | 0.82     | 0.64      | 0.78   | 0.70 | 0.57
Conflict               | 0.76     | 0.75      | 0.99   | 0.85 | 0.28
Leadership             | 0.76     | 0.74      | 0.94   | 0.83 | 0.45
Factual information    | 0.73     | 0.14      | 0.25   | 0.18 | 0.03
The F1 score, which combines precision and recall into a comprehensive measure of model performance, is widely considered good above a value of 0.7. The model reaches this threshold for three generic frames, namely conflict, leadership, and economic consequences. It is interesting to note that the message of the F1 score in some cases deviates considerably from Cohen’s kappa. In particular, with regard to the conflict frame, the F1 is quite high (0.85) whereas the kappa is mediocre at best (0.28). This occurs in cases where a high degree of agreement between human and model ratings is expected by chance, i.e., in cases where the proportion of texts belonging to a particular frame is very high (as is the case for both the conflict and leadership frames).
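The divergence between F1 and kappa under high prevalence can be illustrated with an extreme, made-up example: if 90 of 100 texts carry a frame and a model simply predicts the frame for every text, the F1 score is about 0.95 while Cohen's kappa is exactly 0, since the model performs no better than the chance baseline.

```python
from sklearn.metrics import f1_score, cohen_kappa_score

# Made-up extreme case: 90% of texts carry the frame, model always predicts 'present'
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

print("F1:   ", round(f1_score(y_true, y_pred), 2))           # ~0.95
print("kappa:", round(cohen_kappa_score(y_true, y_pred), 2))  # 0.0
```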