Probabilistic or statistical topic models (TM), pioneered by Latent Dirichlet Allocation (LDA; Blei et al., 2003), are tools for analysing and understanding large text corpora based on word co-occurrence. TM are known as unsupervised techniques because they infer the content of topics from a collection of texts, or corpus, rather than requiring the ex-ante definitions of topics that supervised techniques demand (Roberts et al., 2014). Since we only observe the documents, TM aim to infer the latent or hidden topics by estimating how words are distributed over topics and topics over documents. Conceptually, topics are distributions or mixtures of words, where each word belongs to a topic with a certain probability or weight; these weights indicate how important a word is in a given topic. Documents, in turn, are distributed over topics: a single document can be composed of multiple topics, and words can be shared across topics. Thus, we can represent a document as a vector of proportions showing the share of words belonging to each topic (Roberts et al., 2014).
TM allow us to evaluate the importance of topics in documents. The shares of topics within a document, the so-called document-topic proportions, sum to one; equally, the word probabilities or topic-word distributions for a given topic also sum to one (Roberts et al., 2019). The input for TM is the collection of raw job postings transformed into a document-term matrix (DTM) representation, which represents the corpus as a bag of words or terms. The DTM is usually sparse and allows us to analyse the data using vector and matrix algebra to filter and weigh the essential features of our document collection. Another critical input is the number of topics to be considered in the model. The researcher must either choose this number based on some criterion (e.g., the held-out log-likelihood proposed by Wallach et al. (2009)) or estimate it following strategies developed for this purpose (e.g., the Anchor Words algorithm developed by Lee & Mimno, 2014).
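As a toy illustration of the DTM input, the following Python sketch turns three hypothetical postings into a matrix of term counts (the study itself builds the DTM in R with quanteda; documents and terms here are made up):

```python
from collections import Counter

# Hypothetical, already-cleaned postings.
docs = [
    "software engineer java",
    "construction engineer site",
    "java developer software",
]
tokenised = [d.split() for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})

# Document-term matrix: one row per document, one column per term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenised]

# Each row sums to the document's length N_d.
assert all(sum(row) == len(doc) for row, doc in zip(dtm, tokenised))
```

Each row of `dtm` is the bag-of-words representation of one posting; word order within a document is discarded, which is exactly the exchangeability the models below rely on.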
Most TM assume that document collections are unstructured, in the sense that all documents arise from the same generative model without considering additional information (Roberts et al., 2014). Instead, in this study we implement the STM developed by Roberts et al. (2013, 2016). STM incorporates document metadata into the standard TM approach to structure the document collection, i.e., STM accommodates corpus structure through document-level covariates affecting topical prevalence. This feature contrasts with other TM such as LDA. Thus, the critical contribution of STM is to include the covariates in the prior distributions for document-topic proportions and topic-word distributions. These document-level covariates can affect topical prevalence, i.e., the proportion of each document devoted to a given topic, and we can measure these changes (Roberts et al., 2013). STM also allows the evaluation of topical content, which refers to the rate of word use within a given topic, but we do not implement that evaluation here.
In this study, we apply the STM topical prevalence model, which examines how much each topic contributes to a document as a function of explanatory variables or topical prevalence covariates. In our case, the covariate corresponds to the dummy \(27F\) defined by Eq. (3.1), indicating whether each job posting comes from the pre- or post-disaster period. Next, we examine the topical prevalence variation between these two periods by applying a treatment effect regression.
In the next sections, we describe the specification and estimation of the STM topical prevalence model.
4.1. STM Topic-prevalence model specification
This section and the subsequent 4.2 follow the descriptions and technical guidelines detailed in Roberts et al. (2013, 2014, 2016, 2019) and Grajzl & Murrell (2019). As a model based on word counts, STM defines a data-generating process for each document, and the observed data are used to find the most likely values for the parameters specified by the model.
The specification starts by indexing the documents by \(d\in \left\{1\dots D\right\}\) and each word in a document by \(n\in \left\{1\dots {N}_{d}\right\}\) in our DTM representation. The observed words, \({w}_{d,n}\), are instances of terms from a vocabulary of size \(V\) (our corpus of interest) indexed by \(v\in \left\{1\dots V\right\}\). Regarding the addition of covariates for examining topical prevalence, a design matrix denoted by \(X\) holds this information. Each row of \(X\), denoted \({x}_{d}\), is the vector of document covariates for a given document, so \(X\) has dimension \(D\times P\), where \(p\in \left\{1\dots P\right\}\) indexes the covariates. Finally, the \(K\) topics are indexed by \(k\in \left\{1\dots K\right\}\).
Overall, the generative process treats each document, \(d\), as beginning with a collection of \({N}_{d}\) empty positions, which are filled with terms. Since our data is represented as a DTM or bag-of-words representation, we can assume that, for a given document, all positions are interchangeable, i.e., the choice of topic for any empty position follows the same distribution for all positions in that document (Grajzl & Murrell, 2019). The filling process starts from the number of topics chosen by the researcher (details below in section 4.2.2) to build a parameter vector of dimension \(K\) for a distribution that produces one of the topics \(k\in \left\{1\dots K\right\}\) for each position in \(d\). This vector is the so-called topic-prevalence vector, since it contains the probabilities that each of the \(K\) topics is assigned to a given empty position. STM models the topic-prevalence vector as a function of the covariates to estimate the influence of document properties on topic prevalence. The process continues by selecting terms from the vocabulary \(V\) to generate a \(k\)-specific vector of dimension \(V\), which contains the probabilities of each term being chosen to fill an empty position.
Formally, the generative process for each \(d\), given the vocabulary of size \(V\) and observed words \(\left\{{w}_{d,n}\right\}\), the number of topics \(K\), and the design matrix \(X\), for our STM Topic-prevalence model specification can be represented as a four-step method. First, we draw the topic-prevalence vector from a logistic-normal generalised linear distribution (Roberts et al., 2019), with a mean vector parameterised as a function of the vector of covariates. This specification allows the expected topic proportions to vary as a function of the document-level covariates, as follows:
\({\overrightarrow{\theta }}_{d}|{X}_{d}\gamma , {\Sigma }\sim \mathrm{LogisticNormal}\left({X}_{d}\gamma , {\Sigma }\right)\), | (4.1) |
where \({\overrightarrow{\theta }}_{d}\) is the topic-prevalence vector for document \(d\), \({X}_{d}\) is the 1-by-\(P\) vector of covariates, and \(\gamma\) is the \(P\)-by-\((K-1)\) matrix of coefficients. \({\Sigma }\) is a \(\left(K-1\right)\)-by-\((K-1)\) covariance matrix that allows for correlations in the topic proportions across documents. The addition of covariates to the model allows the observed metadata to influence the frequency with which a given topic is discussed in the corpus. In our specification, the covariate corresponds to the \(27F\) dummy stated by Eq. (3.1).
Secondly, given the topic-prevalence vector \({\overrightarrow{\theta }}_{d}\) from Eq. (4.1), for each word \(n\) within document \(d\) (the process of filling the empty positions \(n\in \left\{1\dots {N}_{d}\right\}\)), a topic is sampled and assigned to that position from a multinomial distribution as follows:
\({z}_{d,n}\sim \mathrm{Multinomial}\left({\overrightarrow{\theta }}_{d}\right)\), | (4.2) |
where \({z}_{d,n}\) is the topic assignment of the word based on the document-specific distribution over topics: the \({k}^{th}\) element of \({z}_{d,n}\) is one for the selected topic \(k\) and the rest are zero.
Thirdly, we form the document-specific distribution over terms representing each topic \(k\), choosing specific vocabulary words \(v\), as follows:
\({\beta }_{d,k,v}|{z}_{d,n}\propto \exp\left({m}_{v}+{k}_{k,v}\right),\) | (4.3) |
where \({\beta }_{d,k,v}\) is the probability of drawing the \(v\)-th word in the vocabulary to fill a position in document \(d\) for topic \(k\). \({m}_{v}\) is the marginal log frequency estimated from the total word counts of term \(v\) in the vocabulary \(V\), representing the baseline word distribution across all documents. \({k}_{k,v}\) is the topic-specific deviation of topic \(k\) for term \(v\) over the baseline log-transformed rate of term \(v\); it represents the importance of the term given the topic. Exponentiating and normalising the sum of \({m}_{v}\) and \({k}_{k,v}\) converts it into probabilities for use in the subsequent and final step, which draws an observed word conditional on the chosen topic.
Fourthly, the observed word \({w}_{d,n}\) is drawn from its distribution over the vocabulary \(V\) to fill a position \(n\) in document \(d\) as follows:
\({w}_{d,n}\sim \mathrm{Multinomial}\left({\beta }_{d,k,1},\dots ,{\beta }_{d,k,V}\right)\) | (4.4) |
Also, default regularising prior distributions are used for \(\gamma\) in Eq. (4.1) and \(k\) in Eq. (4.3). These priors are zero-mean Gaussian distributions with a shared variance parameter, i.e., \({\gamma }_{p,k}\sim \mathrm{Normal}\left(0,{\sigma }_{k}^{2}\right)\) and \({\sigma }_{k}^{2}\sim \mathrm{Inverse\text{-}Gamma}\left(1,1\right)\) (Roberts et al., 2016), where \(p\) and \(k\) index the covariates and topics, respectively, as shown above.
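The four-step generative process above can be sketched numerically. The following Python/NumPy sketch draws one position of one document under made-up parameter values (toy sizes for \(K\), \(V\) and \(P\); random \(\gamma\), \(m\) and \(\kappa\) values are fabricated for illustration, not estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, P = 3, 6, 2            # topics, vocabulary size, covariates (toy sizes)
x_d = np.array([1.0, 1.0])   # covariate row: intercept and the 27F dummy
gamma = rng.normal(0, 1, (P, K - 1))   # fabricated coefficient values
Sigma = np.eye(K - 1)

# Step 1: logistic-normal draw of the topic-prevalence vector theta_d,
# with mean X_d @ gamma, mapping a (K-1)-dim Gaussian to the K-simplex.
eta = rng.multivariate_normal(x_d @ gamma, Sigma)
eta = np.append(eta, 0.0)               # K-th topic as the reference
theta_d = np.exp(eta) / np.exp(eta).sum()

# Step 2: sample the topic assignment z_{d,n} for one empty position.
z = rng.choice(K, p=theta_d)

# Step 3: form the topic's word distribution from the baseline log
# frequencies m_v plus the topic-specific deviations (Eq. 4.3).
m = rng.normal(0, 1, V)
dev = rng.normal(0, 0.5, (K, V))
beta_z = np.exp(m + dev[z])
beta_z /= beta_z.sum()

# Step 4: draw the observed word w_{d,n} from that distribution.
w = rng.choice(V, p=beta_z)
```

Both `theta_d` and `beta_z` are proper probability vectors (they sum to one), mirroring the constraint that document-topic proportions and topic-word distributions each sum to one.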
4.2. STM Topic-prevalence Model and effect estimation
This section outlines the techniques used to process our text data, estimate the number of topics, infer the parameters of our STM Topic-prevalence model and, based on these parameters, estimate the effect of our natural experiment on topic prevalence. We use R packages such as quanteda (Benoit et al., 2018) to manage and analyse text data. The STM specification, estimation, and treatment effect analysis are performed using the stm R package (Roberts et al., 2016, 2019, 2020).
4.2.1. Pre-processing and DTM representation
We perform standard pre-processing procedures on our collection of 4,136 job postings (see section 3 for details). As pointed out above, since our analysis does not deal directly with text data but is performed on specific text features such as word frequencies, we construct a DTM representation (Welbers et al., 2017). We apply cleaning, tokenisation, and stemming, among other standard pre-processing procedures, to construct our DTM. We use unigrams (single words) and bigrams (two consecutive words) as tokens or features. Using bigrams allows us to capture text structure or context that we cannot see using single words. For example, for job titles with generic words like "Engineer", including bigrams makes the tokens more comprehensible, since we then observe terms like "Software Engineer" or "Construction Engineer". We also remove infrequent terms by dropping features that do not appear in at least ten documents.
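The token-level steps can be sketched as follows. This is a simplified Python sketch with hypothetical postings; the actual pipeline uses quanteda in R, stemming is omitted, and the document-frequency threshold is lowered to 2 so the toy corpus of three postings has something to trim (the study uses a 10-document threshold):

```python
import re

postings = [
    "Software Engineer wanted",      # hypothetical postings
    "Construction Engineer role",
    "Software Engineer position",
]

def tokenise(text):
    """Lowercase and keep alphabetic unigrams (cleaning + tokenisation)."""
    return re.findall(r"[a-z]+", text.lower())

def with_bigrams(tokens):
    """Unigram features plus two-consecutive-word (bigram) features."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

features = [with_bigrams(tokenise(p)) for p in postings]

# Trim infrequent features: keep only those appearing in >= 2 documents.
doc_freq = {}
for feats in features:
    for f in set(feats):
        doc_freq[f] = doc_freq.get(f, 0) + 1
kept = {f for f, n in doc_freq.items() if n >= 2}
trimmed = [[f for f in feats if f in kept] for feats in features]
```

Note how the bigram `software_engineer` survives trimming while generic one-off tokens such as `wanted` are dropped, which is the context-capturing behaviour described above.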
4.2.2. Estimating the number of topics, \(K\), and the STM topic-prevalence model parameters
We estimate \(K\) by applying the Anchor Words algorithm (Lee & Mimno, 2014). This technique infers \(K\) by finding an approximate convex hull, or smallest convex polygon, in the multi-dimensional word co-occurrence space given by our DTM representation. The central assumption of the Anchor Words algorithm is separability, i.e., each topic has a specific term that appears only in the context of that topic. This separability assumption implies that the terms corresponding to the vertices are anchor words for topics; the non-anchor words correspond to points within the convex hull. We expect a \(K\) between 5 and 50, which is the range suggested for a small collection of documents, i.e., a few hundred to a few thousand (Roberts et al., 2020), like our sample.
Also, since there is no true \(K\) parameter (Lee & Mimno, 2014; Roberts et al., 2016, 2019), we apply a data-driven search for \(K\) as a confirmatory analysis. Therefore, we conduct an examination across different numbers of topics to select the proper specification from the computation of diagnostics such as the held-out log-likelihood (Wallach et al., 2009) and residual analysis (Taddy, 2012). The held-out log-likelihood test evaluates the prediction of words within a document when those words have been removed from the document, estimating the probability of unseen held-out documents given some training data. For the best specification, we will observe, on average, a higher probability of held-out documents, indicating a better predictive model. In practical terms, we plot the number of topics against the held-out likelihood and look for breaks in this relationship, as a diagnostic showing that additional topics do not improve this likelihood much. The residual analysis, in turn, evaluates the variance overdispersion of the multinomial described by Eq. (4.2) within the data-generating process. An appropriate number of topics will restrict this dispersion, so we are interested in the numbers of topics with lower values in a plot showing \(K\) against the estimated dispersion or residual level.
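The idea behind the held-out comparison can be shown in miniature: score held-out word counts under each candidate model's word distribution and prefer the specification with the higher log-likelihood. This Python sketch uses two fabricated candidate distributions over a four-term vocabulary, standing in for fitted models with different \(K\):

```python
import numpy as np

# Fabricated held-out term counts for one document.
held_out_counts = np.array([5, 3, 2, 0])

# Two fabricated candidate word distributions (stand-ins for fitted models).
model_a = np.array([0.50, 0.30, 0.15, 0.05])  # close to the held-out counts
model_b = np.array([0.10, 0.10, 0.10, 0.70])  # far from them

def heldout_loglik(counts, probs):
    """Multinomial log-likelihood (up to a constant) of held-out counts."""
    return float((counts * np.log(probs)).sum())

# The better-matching model scores a higher held-out log-likelihood.
assert heldout_loglik(held_out_counts, model_a) > heldout_loglik(held_out_counts, model_b)
```

In the real diagnostic, this score is averaged over many held-out documents and plotted against \(K\); the sketch only conveys why a higher held-out log-likelihood indicates a better predictive model.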
Regarding the STM Topic-prevalence model estimation, the strategy takes the DTM, \(K\) and the covariate and returns fitted model parameters. To put it differently, given the observed data, \(K\) and our \(27F\) dummy, we estimate the most likely values for the model parameters by maximizing the posterior likelihood (see section 4.1). As a result, we can examine the proportion of job postings devoted to a given topic, or topical prevalence, over the \(27F\) dummy. However, as occurs with this kind of probabilistic model, the STM posterior distribution is intractable. Therefore, we apply the approximate inference method implemented by Roberts et al. (2019): the so-called partially collapsed variational expectation-maximization (variational EM) algorithm, which gives us, upon convergence, the estimates of our STM Topic-prevalence model. We discuss our convergence evaluation below.
Another complexity that follows from the intractable nature of the posterior is the starting values of the parameters: in our case, the initial mixture of words for a given topic. This is known as initialization, and our estimation depends on how we approach it. We specified the initialization method using the default choice, named "Spectral". The spectral algorithm is recommended for a large number of documents like ours (Roberts et al., 2020). The described estimation is executed with a maximum of 200 variational EM iterations, subject to meeting convergence. Convergence is examined by observing the change in the approximate variational lower bound; the model is considered converged when this change between iterations becomes very small (the default tolerance is 1e-5). We use functionalities included in the R package stm (Roberts et al., 2020) to estimate \(K\) and the STM topic-prevalence model parameters.
In practical terms, the STM Topic-prevalence estimation described above allows us to measure how much a given topic contributes to each of our online job postings. We interpret our results by inspecting the estimated mixture of terms associated with each topic. We include the most important terms for each topic using metrics like the highest probability and FREX terms (Roberts et al., 2019), where FREX weighs a term's frequency against its exclusivity to a given topic. This association between terms, documents and topics results from the estimated model. However, for the sake of clarity, we name each topic according to our interpretation of the set of terms that motivates it. Thus, we can find topics associated with ICT labour. Since we specified the topical prevalence as a function of the \(27F\) dummy (see Eq. (4.1) and related statements), we can measure the variation in ICT labour topic prevalence between the pre- and post-disaster periods.
4.2.3. Treatment effect estimation and evaluation
Once we have estimated our STM Topic-prevalence model, the fitted parameters allow us to estimate a regression using the online job postings as units or documents, \(d\), to evaluate the influence of our dummy \(27F\), defined by Eq. (3.1), on the topic prevalence of a topic \(j\) (Roberts et al., 2019). Since \(27F\) indicates whether the job posting was published in the period before the earthquake or after it, i.e., in the post-disaster or "treated" period (see section 3), we can study how the prevalence of topics changes in the aftermath of the disaster. In other words, we evaluate the "treatment effect" of the disaster on topical prevalence by examining changes in topic proportions over our sample of job postings published after the earthquake. The effect estimates are analogous to Generalized Linear Model (GLM) coefficients (Roberts et al., 2013).
We compute the topic proportions from the \(\theta\) matrix, where each row is the topic-prevalence vector \({\overrightarrow{\theta }}_{d}\) for document \(d\) (see Eq. (4.1)) and the columns correspond to topics. Thus, each element \({\theta }_{d,j}\) is the proportion of job posting \(d\) assigned to topic \(j\). As an illustration, in a model with only two topics, we consider the probability of each job posting for each of these two topics: for job posting \(d\), we can denote its proportions over the two topics as \({\theta }_{d,1}\) and \({\theta }_{d,2}\), where \({\theta }_{d,1}+{\theta }_{d,2}=1\). Thus, the regression to evaluate the treatment effect, where the topic proportions for a given topic are the outcome variable, can be represented as
\({\theta }_{d,j}=\alpha +\beta \, {27F}_{d}\) | (4.5) |
where \(\alpha\) is the intercept and \(\beta\) is the coefficient to be estimated. A significant \(\beta\) indicates a change (positive or negative) in topical prevalence associated with our dummy standing for the post-disaster period.
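The regression in Eq. (4.5) amounts to an OLS fit of one topic's proportions on the \(27F\) dummy, in which \(\beta\) equals the post-minus-pre difference in mean prevalence. The following is a minimal NumPy sketch with fabricated proportions; it ignores the measurement uncertainty in \(\theta\) that the stm estimation propagates:

```python
import numpy as np

# Fabricated topic proportions theta_{d,j} for one topic j across 8 postings.
theta_j = np.array([0.10, 0.12, 0.08, 0.10, 0.20, 0.22, 0.18, 0.20])
d27f    = np.array([0,    0,    0,    0,    1,    1,    1,    1   ])  # 27F dummy

# OLS for Eq. (4.5): theta_{d,j} = alpha + beta * 27F_d
X = np.column_stack([np.ones(len(d27f)), d27f])
alpha, beta = np.linalg.lstsq(X, theta_j, rcond=None)[0]

# With a single binary covariate, beta equals the post-disaster mean
# minus the pre-disaster mean of the topic's proportions.
```

With these fabricated numbers the pre-period mean is 0.10 and the post-period mean is 0.20, so the fitted \(\beta\) is the 0.10 difference in means.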
The effect estimation procedure in the stm R package relies on simulated draws of topic proportions from the variational EM posterior (see section 4.2.2) to compute the coefficients. We use the default value of 25 simulated draws and average over the results. This procedure repeatedly samples topic proportions at random from the estimated topic proportion distributions for each job posting to estimate a given effect. Also, as suggested by the software's authors, we include the estimation uncertainty of the topic proportions in the uncertainty estimates, or "Global" uncertainty, using the method of composition (Roberts et al., 2019). The regression tables display the various quantities of interest (e.g., coefficients, standard errors, a t-distribution approximation). The procedure uses 500 simulations (the default value) to obtain the required confidence intervals in the standard error computation (drawn from the covariance matrix of each simulation) and the t-distribution approximation (Roberts et al., 2020). We also show our results visually by displaying the contrast produced by the change in topical prevalence when shifting from the pre-disaster to the post-disaster period, using the mean difference estimates in topic proportions.
Regarding the evaluation of our estimation, although the robustness of the treatment effect estimation implemented here against spurious effects has been validated using several tests (e.g., the Monte Carlo experiments detailed by Roberts et al., 2014), we still apply a permutation test to evaluate the robustness of our findings. The procedure estimates our model 100 times, where each run applies a random permutation of our \(27F\) dummy to the job postings or documents. Then, the largest effect on our topics of interest is calculated. If the results connecting treatment to topics were an artefact of the model, we would find a substantial effect regardless of how the treatment was assigned to documents (Roberts et al., 2014). Alternatively, we would find a treatment effect only when the assignment of our \(27F\) dummy aligns with the true data. We present the results of our permutation tests by plotting the contrast between the permuted models and the true model for our topics of interest.
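The logic of the permutation check can be sketched as follows. This is an illustrative Python sketch with fabricated topic proportions: the real test refits the full STM on each permuted dummy, whereas here a simple difference in means stands in for the effect estimate:

```python
import numpy as np

rng = np.random.default_rng(27)

# Fabricated topic proportions with a genuine pre/post shift of about 0.1.
pre  = rng.normal(0.10, 0.02, 50)
post = rng.normal(0.20, 0.02, 50)
theta_j = np.concatenate([pre, post])
d27f = np.concatenate([np.zeros(50), np.ones(50)])

def effect(y, treat):
    """Difference in mean topic prevalence, post minus pre."""
    return y[treat == 1].mean() - y[treat == 0].mean()

true_effect = effect(theta_j, d27f)

# Re-estimate the effect under 100 random permutations of the 27F dummy.
perm_effects = [effect(theta_j, rng.permutation(d27f)) for _ in range(100)]

# A genuine effect should dwarf anything random assignment produces;
# an artefact of the model would show large effects under permutation too.
assert true_effect > max(abs(e) for e in perm_effects)
```

Plotting `true_effect` against the spread of `perm_effects` gives the contrast between the permuted models and the true model described above.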