Introduction
State Sequence Analysis, often referred to simply as Sequence Analysis (SA), is a set of statistical methods for studying longitudinal data [1]. It enables a holistic view of processes modelled as a succession of categorical states [2]. Over the past few years, SA has been increasingly used in health research to study trajectories, or pathways [3, 4, 5]. In these studies, the standard analysis identifies typical trajectories in two steps: a dissimilarity measure is defined to quantify the variation across trajectories [6], and similar trajectories are then grouped with a clustering algorithm based on the pairwise dissimilarities [7]. The relationships between covariates and the trajectories are subsequently studied by including the typology of trajectories in regressions, either as a dependent or an independent variable [8]. This paper proposes to recognize the uncertainty involved in this standard framework and to evaluate its impact on the analysis’ results.
Typologies are a popular tool as they summarize the information available by reducing the diversity of trajectories into a few ideal types (see Herle et al. [9] for a review of modelling strategies, including SA, to identify typical trajectories). For instance, this makes it possible to highlight segments of the population that can be targeted for public health interventions [10]. While typologies are mainly descriptive by themselves, they also allow the inclusion of the complex concept of trajectories in subsequent inferential analyses. Such analyses investigate, for instance, why individuals follow certain trajectories rather than others, or how previous trajectories influence later outcomes [11, 12, 13].
Assuming an ideal situation with no missing information, two major sources of uncertainty can impact the derivation and the subsequent use of a sequence typology. The first is linked to data reduction and the potentially imperfect assignment of sequences to clusters. The second is the sampling error, common to most statistical problems, as any sample is an inexact representation of the underlying population. The two issues are, however, closely interrelated, as we are less confident making inferences in a context with a higher approximation risk.
These two sources of uncertainty might affect the analysis on two levels. First, they may distort the description of the processes observed in the data, which is generally generalized to the population. Second, they may lead to wrong conclusions when relating trajectories to covariates in subsequent regression models [7, 14, 15, 16].
As will be established in the remainder of this section, previous methodological developments in SA have mostly focused on issues related to the data simplification induced by cluster analysis, and the associated assignment error. Issues related to the error in the estimation of the typology due to sampling uncertainty have, on the other hand, been overlooked by the SA literature. However, this estimation error can have a serious impact on inference, as has been shown in the latent class literature [17, 18, 19]. In this article, we propose an innovative method to take sampling variation into account in the analysis. Our contribution goes beyond SA and is relevant for any study combining cluster analysis with regressions.
Data reduction risk
Before considering the impact of sampling error on sequence typology inference, we consider in more detail the uncertainty involved in building the typology. SA often aims to reduce the large diversity of sequences into a few ideal types, while losing as little information as possible. However, this data reduction might raise two kinds of issues that are again closely interrelated.
First, cluster analysis always produces a typology, and therefore imposes a structure on the data even if the data is not structured into subgroups [20, 21]. When there is a clear clustering structure in the data, i.e., when the observed sequences are grouped into clearly distinguished types, we expect the clustering procedure to recover the underlying clustering structure precisely enough. In this case, there should be little doubt about whether a given sequence belongs to one type or another. On the contrary, when the observed data is unstructured, i.e., when there is a lot of diversity across sequences with no apparent homogeneity, the grouping produced by the clustering, and the associated assignment of sequences to clusters, might be uncertain. In such a context, small variations of the data could lead to very different typologies.
Second, once the typology has been created, one commonly assumes that all individuals in a given cluster are perfectly represented by their corresponding typical trajectory. As a result, all the remaining within-cluster variation is ignored. However, while some sequences might be close to their cluster centre, and therefore well represented by it, others may be far away and poorly represented [7, 15]. Such data reduction is a strength, as many real-world problems can be simplified by uncovering fundamental structures, but it is also a risk: one should not simplify away the relevant variation of the trajectories. Excessive simplification could lead to a wrong description of the trajectories. Furthermore, it raises additional risks when the typology is used in subsequent analyses, such as regression models. Indeed, cluster analysis might remove exactly the variation that is relevant to understand the relationships between trajectories and key covariates of interest. In such cases, the use of the typology in subsequent regressions might lead to wrong conclusions [7, 14, 15, 16].
Several tools have been developed to handle these two methodological issues. First, cluster quality indices are commonly used to validate an SA typology. They measure the quality of a partition, usually by relating within-cluster homogeneity and/or between-cluster separation [7]. They are useful to evaluate the statistical quality of a partition and allow comparing results from different clustering algorithms. However, these indices lack clear interpretation thresholds [22, 23]. Furthermore, they only indirectly inform on the underlying clustering structure of the data, i.e., whether the data is organized in subgroups or not. Testing the obtained typology against clustering applied to non-clustered data generated by a null model provides such interpretation thresholds and allows better inference on the quality of the observed typology [17].
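The null-model idea can be sketched as follows. This is a hedged illustration only: synthetic numeric data and k-means stand in for the sequence dissimilarities and clustering pipeline of an actual SA study, and the Calinski-Harabasz index is used as the quality index. The observed index is compared with its distribution under clustering of unstructured (uniform) data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)

# "Observed" data: simulated here with a clear 3-group structure.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
observed_chi = calinski_harabasz_score(X, labels)

# Null model: uniform data over the same range, i.e., no clustering structure.
null_chi = []
for _ in range(50):
    X_null = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    null_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_null)
    null_chi.append(calinski_harabasz_score(X_null, null_labels))

# If the observed index clearly exceeds the null distribution, the typology
# captures more structure than clustering unstructured data would produce.
exceeds_null = observed_chi > np.quantile(null_chi, 0.95)
```

The comparison against the null distribution is what supplies the interpretation threshold that raw index values lack.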
Second, several authors have advocated the use of other clustering approaches, including fuzzy clustering [24] and the “representativeness” approach [14]. In these methods, sequences may belong to more than one cluster, with gradual membership strengths. This allows a better description of sequences that can be understood as a mixture of several ideal types; with “traditional” crisp clustering such as Ward’s method or Partitioning Around Medoids, these sequences are necessarily assigned to a single type. In addition, fuzzy approaches might better capture the within-cluster variation [24]. The “representativeness” approach further allows describing complete outliers, i.e., sequences that are far from any type. According to the simulations presented in Helske et al. [14], these two approaches provide better estimates of the relationships between sequence typologies and dependent or independent variables, as they better account for the imperfect cluster assignment of individual sequences.
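Graded membership can be illustrated with a minimal fuzzy c-means sketch. This is a generic fuzzy algorithm applied to hypothetical numeric data, not the specific method of [24]: each observation receives a membership weight in every cluster rather than a single crisp assignment.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns memberships U (rows sum to 1) and centers."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))  # random initial memberships
    for _ in range(iters):
        W = U ** m
        # Centers are membership-weighted means of the observations.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances of each observation to each center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard membership update: closer centers get higher membership.
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

# Hypothetical data: two groups, plus points in between that would be
# "mixtures" of the two ideal types.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
U, centers = fuzzy_cmeans(X, c=2)
```

A point lying between the two groups receives intermediate memberships (e.g., roughly 0.5/0.5), which is exactly the information that crisp clustering discards.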
Finally, Unterlerchner et al. [16] proposed a procedure to measure the potential impact of the simplification performed by cluster analysis on subsequent analyses. More precisely, the method measures the share of the statistical relationship between sequences and covariates, estimated without prior clustering, that is accounted for by the clustering. When the relationship is poorly accounted for by the clustering, one should be careful when interpreting the subsequent results.
Previous research has therefore focused on the issues stemming from the simplification of the data induced by the clustering itself. These proposals either measure the extent of the simplification issue using cluster quality indices, or advocate the use of other clustering approaches such as fuzzy clustering. However, none of these works addresses the issue of inference on the typology. The typology is ordinarily estimated using a sample of the population, and its generalization to the population is subject to error. In other words, the typology might be sample-dependent, meaning that a different sample could lead to a different typology.
Sampling uncertainty
The proposals and developments introduced above emphasize that the typology should not be blindly relied upon, as the simplification may lead to wrong conclusions. These risks should be understood at two levels. First, one might draw wrong conclusions on the descriptive level by incorrectly summarizing the observed processes, because of excessive simplification or imprecise cluster assignment. Second, it might affect subsequent analyses making use of the typology.
However, these developments do not consider the sampling error. As in most statistical analyses, a given result may be specific to a given sample and poorly represent the properties of the underlying population. One clear way to illustrate this, experienced by many researchers, is that if the sample is modified (for instance, by removing corrupted data or adding new observations), chances are that the resulting typology will look somewhat different. Such changes in the typology may be explained by several reasons, which are closely related to the issues presented earlier. First, we can expect bigger changes in the typology when the data does not contain a clear clustering structure. Indeed, in such cases, several clustering solutions might describe the underlying structure equally well, and the best solution could differ between samples, resulting in different typologies. Second, when the assignment of individual sequences to clusters is dubious, this assignment is likely to change between samples. This is typically the case for sequences that can be seen as a mixture of ideal types. Third, and most obviously, a new sample will include new observations, which may lead to the estimation of a completely different typology. This is particularly expected with small samples, as additional observations have a stronger influence in this case.
This discussion highlights two points that are common to many statistical methods. The estimation process depends on the sample. In such a context, we can expect more reliable estimation when the underlying structure is strong. Furthermore, we can also be more confident when using larger samples, as we have a more complete dataset. This is akin to the estimation of the population mean using the sample mean: the standard error depends on the variance (which can be linked to the underlying clustering structure) and the sample size.
Sampling error raises further issues when the typology is used in subsequent regressions. All the methodological improvements previously mentioned still handle typologies as measures, i.e., singular realizations of the partition of interest (even if their validity is then investigated). However, one can argue that they should be handled as estimates, i.e., random realizations of the partition of interest that can be reproduced to derive the distribution of instructive quantities. This recognizes the fact that exhaustive information is never available to derive the typology. One should therefore consider the estimation error of the typology in subsequent analyses to draw correct inferences.
While not common in SA, several tools have been proposed in the data mining literature. These propositions generally rely on different kinds of resampling schemes to account for the sample variation of the results. Monti [25] proposed a clustering procedure, called consensus clustering, which aims to avoid sample dependence of the resulting typology. The method starts by creating several typologies of the same underlying population in many subsamples. Then, it looks for a consensus clustering among these typologies. While the method supports the creation of a robust typology, it does not allow considering the typology as an estimate in subsequent regressions.
Other authors have proposed to measure the stability of the clustering across multiple subsamples of the data [26, 27]. Stability measures provide an estimate of the sample dependence of the results, and indirectly, of the underlying clustering strength of the data. For this reason, they are also used for cluster validation, as a complement to other validation techniques [28]. Generally speaking, these stability measures are estimated as follows [29]:
1. Cluster the original data to obtain the typology to be evaluated.
2. Resample or alter the original data.
3. Cluster the new sample using the same clustering method as in step 1.
4. Repeat steps 2 and 3 many times.
5. Measure the variations of the clusterings between the data resamples.
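These steps can be sketched as follows. This is a toy illustration under explicit assumptions: synthetic numeric data and k-means stand in for sequence dissimilarities and the clustering method of an actual study, and the per-type Jaccard matching follows the cluster-wise logic of Hennig's approach.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def jaccard(a, b):
    """Jaccard coefficient between two sets of observation indices."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
k = 3
# Step 1: cluster the original data.
base = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
base_clusters = [set(np.flatnonzero(base == c)) for c in range(k)]

rng = np.random.default_rng(1)
n_boot = 30
stability = np.zeros(k)
for _ in range(n_boot):
    # Step 2: resample the data with replacement (bootstrap).
    idx = rng.choice(len(X), size=len(X), replace=True)
    # Step 3: cluster the bootstrap sample with the same method.
    boot = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X[idx])
    boot_clusters = [set(idx[np.flatnonzero(boot == c)]) for c in range(k)]
    # Step 5: for each original type, keep the best Jaccard match,
    # comparing only observations actually drawn into the bootstrap sample.
    drawn = set(idx)
    for c, orig in enumerate(base_clusters):
        stability[c] += max(jaccard(orig & drawn, b) for b in boot_clusters)
# Steps 2-3 were repeated n_boot times (step 4); average per type.
stability /= n_boot  # values near 1 indicate types recovered across resamples
```

The per-type averages play the role of the cluster-wise stability values: a type with low mean Jaccard is rarely recovered and should be interpreted with caution.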
Several propositions have been made, which differ in the resampling procedure (bootstrap, data jittering, etc.) and in how the variation of the clustering is measured (see Liu et al. [30] for a review). In this article, we rely on the method proposed by Hennig [26], called “Cluster-wise stability assessment”. This method primarily uses random sampling with replacement from the data, i.e., bootstrapping, to replicate the original sample, as no true underlying distribution is known. The strength of the approach is that it estimates stability separately for each type. Indeed, while some types might be very well defined in the data, others may be more dubious. This is achieved by measuring the number of times a given type was recovered in the bootstraps using the Jaccard coefficient [31]. A type is considered recovered when the same observations were regularly clustered together in the bootstrap clusterings.
The aim of this paper is to extend this approach to the use of the typology in the subsequent step of traditional SA, i.e., to assess the robustness of the relationship between sequence patterns and covariates. This assessment delivers new model diagnostics and alternative estimates for an association of interest. We focus on the situation where the clustering is used as dependent variable in a regression.
The remainder of the article is organized as follows. We begin by presenting an illustrative application, which serves as a case study throughout this paper. A classical SA framework, used to construct a typology of trajectories on these illustration data and to estimate its association with covariates, is then described, implemented, and commented on. In the Methods section, we lay out a novel approach to assess the robustness of typology-based inference in two steps: first, a bootstrap procedure to reproduce the typology over an ensemble of perturbed datasets, a step shared with previous work on the evaluation of cluster-wise stability; second, an adjuvant multilevel model to pool the estimates obtained from the typology replications. In the Results section, we show how this methodology allows us to revisit the illustrative application and to derive new quantities that inform on the impact of sampling uncertainty on the original results. We also consider a range of possible relationships between clusters and covariates to further clarify the significance of the methods. Finally, we demonstrate how our contribution makes it possible to gauge the analysis’ reproducibility, and we contextualise it within a general cluster validation framework.
Illustrative application
Problem setting
The presentation of the methodological developments is illustrated by applying them to the study of diabetic patients’ healthcare utilisation trajectories. Diabetes is one of the most common chronic diseases of our times, with an estimated global prevalence of 10.5% among persons aged 20-79 years in 2021 [32]. It is considered an ambulatory care sensitive condition (ACSC), meaning that adverse and costly events such as emergencies and hospitalisations for diabetes complications can be avoided by high-quality primary care [33, 34, 35]. We are interested in studying how compliance with recommended screening processes is related to subsequent healthcare utilisation patterns. For illustration purposes, we concentrate our investigation on regular lipid screening.
Studying healthcare utilisation patterns as longitudinal processes provides a holistic perspective, situating each healthcare event within the whole trajectory [13]. This is key as, for instance, a hospitalisation should be interpreted differently if it occurs repeatedly over time. Moreover, healthcare utilisation trajectories can in some cases be modelled as a succession of categorical states and hence studied with SA methods. To the best of our knowledge, two previous studies have applied SA in the context of diabetes and health services research, with the aim of identifying typical care pathways and characterizing the patients on each pathway [12, 36].
Study design, population and measurements
We used data from a prospective cohort study (CoDiab-VD) of non-institutionalised adult patients with diabetes diagnosed for at least a year and residing in the canton of Vaud, Switzerland [37]. We considered the individuals recruited by community-based pharmacies in 2011-2012 and followed up yearly until 2017. This cohort included 519 participants at baseline, of whom 428 participated in at least one follow-up. We focused on the 348 participants with no more than two missing observations during the follow-up period. The data were collected by postal questionnaires encompassing different aspects of participants’ health status, diabetes care and daily life.
The trajectories of healthcare utilisation were measured between 2013 and 2017 by looking at self-reported emergency visits, hospitalisations, and deaths (retrieved from registries), resulting in the following states: no utilisation, emergency visit, hospitalisation, emergency visit & hospitalisation, dead, and missing. As multiple occurrences of emergency visits and hospitalisations during a year were rare, we grouped them together with single events, thus limiting the number of distinct states.
Quality of diabetes care was measured by looking at compliance with processes of care, which should be conducted yearly [38]. We looked at six processes: foot examination, microalbuminuria screening (from urine sample), lipid testing (from blood sample), influenza immunisation, eye examination (by an ophthalmologist), and glycated haemoglobin (HbA1c) measurement. We considered that a patient complied with the recommendations if these processes were reported for two successive years, i.e., both at baseline in 2011 and at the first follow-up in 2012.
In our analysis, we controlled for a set of confounding factors based on existing literature, the investigators’ domain-specific knowledge, preliminary analyses (not reported here) and statistical considerations [39]. The selected variables, detailed in Table S1 in the Supplementary Material, were all collected before the start of the trajectories: age category, household income, diabetes treatment, diabetes-related complications, and comorbidities. Covariates such as gender and education were discarded as they were associated neither with the outcome nor with the exposure in this context.
The typology
Before presenting our methodological proposal, let us follow a standard SA framework to derive a typology of healthcare utilisation trajectories. Figure 1 presents the individual sequences that serve as basis for our illustrative application, together with their state distribution, which shows the cumulative proportion of participants in each state at a given time point. The most frequent trajectories correspond to patients reporting no emergency visits nor hospitalisations throughout follow-up and patients who died soon after inclusion in the cohort.
<Insert Figure 1>
We used Optimal Matching as the dissimilarity measure, which focuses on the duration in each state and their ordering [6]. As transitions from no utilisation to emergency visit or hospitalisation were more common than transitions directly to death, we used substitution costs based on the observed transition rates and set indel costs to half the maximum substitution cost [40]. Following Halpin [41], we used a maximal substitution cost between missing states and any other state, including other nonresponses, to avoid treating missing data as a source of similarity between trajectories. Clustering was performed with Partitioning Around Medoids [42]. We selected the three-group solution, which yielded the best partition quality according to the average silhouette width and the Calinski-Harabasz index (CHI) [7].
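The mechanics of Optimal Matching can be sketched as follows. This is a generic edit-distance implementation on toy states, not the exact costs or software used in the study: substitution costs are derived from observed transition rates (cost(a, b) = 2 - p(b|a) - p(a|b)), the indel cost is set to half the maximum substitution cost, and the distance is computed by dynamic programming.

```python
import numpy as np

def transition_costs(sequences, states):
    """Substitution costs from transition rates: cost(a, b) = 2 - p(b|a) - p(a|b)."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    rates = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    sub = 2 - rates - rates.T  # symmetric by construction
    np.fill_diagonal(sub, 0)
    return sub, idx

def om_distance(s1, s2, sub, idx, indel):
    """Optimal Matching distance: edit distance with custom substitution costs."""
    n, m = len(s1), len(s2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * indel
    D[0, :] = np.arange(m + 1) * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j] + indel,  # deletion
                D[i, j - 1] + indel,  # insertion
                D[i - 1, j - 1] + sub[idx[s1[i - 1]], idx[s2[j - 1]]],  # substitution
            )
    return D[n, m]

# Toy yearly states (hypothetical): N = no utilisation, E = emergency visit,
# H = hospitalisation, D = dead.
states = ["N", "E", "H", "D"]
seqs = [list("NNNEN"), list("NNNNN"), list("NEHHN"), list("NNDDD")]
sub, idx = transition_costs(seqs, states)
indel = sub.max() / 2
d = np.array([[om_distance(a, b, sub, idx, indel) for b in seqs] for a in seqs])
```

The resulting pairwise distance matrix `d` is the kind of input a clustering algorithm such as Partitioning Around Medoids would then operate on.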
<Insert Figure 2>
Figure 2 presents the resulting typology using state distribution plots. The clusters identified correspond to individuals with “low” (n = 206) and “intensive” (n = 111) healthcare utilisation, as well as those who died early in the study (n = 31). The largest cluster contains trajectories congruent with diabetes under control, and the two others feature high levels of adverse healthcare events. Figure S2 in the Appendix shows the most representative sequences in each cluster.
Association study
<Insert Table 1>
Table 1: Demographic, socioeconomic, health characteristics and processes of care indicators of study participants, by cluster
| Characteristic | Category | Whole sample | LHU | IHU | ED | p-value |
|---|---|---|---|---|---|---|
| Total | N | 348 | 206 | 111 | 31 | |
| | % | 100 | 59.2 | 31.9 | 8.9 | |
| Age* (years) | <65 | 48.9% | 55.3% | 45.9% | 16.1% | <0.001 |
| | 65–74 | 37.1% | 35.4% | 38.7% | 41.9% | |
| | >=75 | 14.1% | 9.2% | 15.3% | 41.9% | |
| Sex | Female | 36.8% | 38.3% | 37.8% | 22.6% | 0.228 |
| | Male | 63.2% | 61.7% | 62.2% | 77.4% | |
| Educational level | Basic | 14.9% | 15% | 16.2% | 9.7% | 0.963 |
| | Secondary | 57.5% | 56.8% | 57.7% | 61.3% | |
| | Higher | 25.9% | 26.7% | 24.3% | 25.8% | |
| | Other | 1.7% | 1.5% | 1.8% | 3.2% | |
| Household income* | Low | 17.5% | 15% | 16.2% | 38.7% | 0.042 |
| | Lower-middle | 26.1% | 26.7% | 27.9% | 16.1% | |
| | Upper-middle | 28.4% | 30.1% | 27% | 22.6% | |
| | High | 17% | 19.4% | 15.3% | 6.5% | |
| | Unknown | 10.9% | 8.7% | 13.5% | 16.1% | |
| Diabetes treatment* | OAD | 52% | 56.3% | 47.7% | 38.7% | 0.186 |
| | Insulin | 20.4% | 18.4% | 20.7% | 32.3% | |
| | Both | 26.7% | 24.3% | 31.5% | 25.8% | |
| | Missing | 0.9% | 1% | 0% | 3.2% | |
| Diabetes-related complications* (N) | Mean | 0.7 | 0.6 | 0.8 | 1 | 0.014 |
| | SD | 0.9 | 0.9 | 1.1 | 0.8 | |
| Comorbidities* (N) | Mean | 1.9 | 1.7 | 2.1 | 2.3 | 0.003 |
| | SD | 1.3 | 1.2 | 1.5 | 1.4 | |
| Foot examination | Yes | 56.9% | 54.4% | 61.3% | 58.1% | 0.493 |
| | No | 43.1% | 45.6% | 38.7% | 41.9% | |
| Microalbuminuria screening | Yes | 62.9% | 66% | 60.4% | 51.6% | 0.571 |
| | No | 29% | 26.7% | 30.6% | 38.7% | |
| | Unknown | 8% | 7.3% | 9% | 9.7% | |
| Lipid testing* | Yes | 92.5% | 95.6% | 88.2% | 87.1% | 0.022 |
| | No | 6.6% | 3.4% | 10.8% | 12.9% | |
| | Missing | 0.9% | 1% | 0.9% | 0% | |
| Influenza immunisation | Yes | 60.6% | 58.3% | 64% | 64.5% | 0.482 |
| | No | 39.1% | 41.7% | 36% | 32.3% | |
| | Missing | 0.3% | 0% | 0% | 3.2% | |
| Eye examination | Yes | 89.4% | 87.4% | 91.9% | 93.5% | 0.225 |
| | No | 10.1% | 12.1% | 8.1% | 3.3% | |
| | Missing | 0.6% | 0.5% | 0% | 3.3% | |
| HbA1c measurement | Yes | 82.2% | 84% | 80.2% | 77.4% | 0.542 |
| | No | 4.3% | 3.4% | 4.5% | 9.7% | |
| | Unknown | 13.5% | 12.6% | 15.3% | 12.9% | |
Legend: Bivariate relationships are evaluated with chi-squared tests for categorical variables and ANOVA for numerical variables. LHU cluster is low healthcare utilisation, IHU is intensive healthcare utilisation, ED is early death. OAD stands for oral antidiabetic medication. A star indicates that the variable is selected in the regression models.
Table 1 describes the variables from our illustrative application, for the whole sample and by cluster. We picked as a case study the association between compliance with lipid screening recommendations and subsequent healthcare utilisation patterns, controlling for known confounders. However, the same reasoning and methods could be applied to any covariate.
To investigate this relationship, one generally estimates regression models specifying the typology (through cluster membership) as the dependent variable. This can be achieved using either a multinomial regression or a set of logistic regressions. We rely on the second approach, in which a separate logistic regression estimates the probability of belonging to each cluster versus any other. By doing so, the coefficients for a specific cluster do not directly depend on the coefficients estimated for the other clusters, which will be key for assessing their robustness thereafter. There were fewer than 2% missing values (n = 6) in the regression models overall, so the corresponding individuals were excluded.
Figure 3 presents the average marginal effects (AMEs) of the multivariable logistic regressions, which measure the expected change in the probability of belonging to a trajectory group for a change in the level of a variable. The numerical estimates are provided in Supplementary Table S3. We rely on AMEs here for two reasons. First, they can be interpreted on the probability scale, and are therefore easier to comprehend. Second, they can be compared across subsamples and studies, which is not the case for the logistic regressions’ coefficients or odds ratios [43].
<Insert Figure 3>
To avoid overloading the illustration, we focus on the probability of being in the “low” healthcare utilisation trajectory group against any other. Any other combination of clusters is possible, as demonstrated with further analyses, and the procedure is the same. Thus, our main quantity of interest is 0.271 (95% CI 0.066 to 0.476), which represents the expected change in the probability of reporting low healthcare utilisation during follow-up for diabetic patients who complied with lipid screening recommendations, controlling for known confounders. These values indicate evidence towards a positive effect of regular lipid testing on subsequent healthcare utilisation patterns.
However, this inference does not account for the sampling error, which could impact the results’ reliability. Indeed, as introduced earlier, the typology is estimated from the available sample, and the uncertainty involved in this estimation has carry-over effects on the subsequent use of the typology in a regression model, especially when the sample is small, as is the case here. Next, we detail how resampling from the data allows us to assess the robustness of the regression results and to derive new quantities that more adequately account for the sampling variation and its impact on a standard SA framework.