2.6.1 Simulation data
We adopted a data-generating process similar to that of Choi [7]. Two scenarios were considered: one in which the outcome was affected by treatment (effect \(\ne\) 0) and one in which it was independent of treatment (effect = 0). In each scenario, we considered three missing-data mechanisms. First, we generated two continuous covariates, \({X}_{1}\) and \({X}_{2}\), for each subject. \({X}_{1}\) follows a normal distribution with mean 0 and standard deviation 1, and \({X}_{2}\) depends on \({X}_{1}\):
$${X}_{2i}=0.5{X}_{1i}+{\epsilon }_{i}\text{ with }{\epsilon }_{i}\sim N\left(0, 0.75\right)$$
Here 0.75 is the variance of \({\epsilon }_{i}\), so the variance of \({X}_{2}\) is \(0.5^{2}\times 1+0.75=1\). In this way, the standard deviation of \({X}_{2}\) is also 1, and the correlation between \({X}_{1}\) and \({X}_{2}\) is equal to 0.5. The treatment \(T\) was generated from the binomial distribution, with the probability that subject \(i\) receives the treatment given by:
$$\text{logit}\left(P\left({T}_{i}=1|{X}_{1i},{X}_{2i}\right)\right)=-0.8+0.5{X}_{1i}+0.5{X}_{2i}$$
By this equation, about 30% of subjects were treated.
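The covariate and treatment model above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names, the seed, and the sample size are hypothetical, and it assumes that 0.75 is the variance (not the standard deviation) of the noise term, which is what makes the standard deviation of \({X}_{2}\) equal to 1.

```python
import math
import random


def expit(z):
    """Inverse logit, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_covariates_and_treatment(n, seed=0):
    """Draw (X1, X2, T) for n subjects following the equations above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        # Assumption: 0.75 is the noise variance, so its sd is sqrt(0.75).
        x2 = 0.5 * x1 + rng.gauss(0.0, math.sqrt(0.75))
        p_treat = expit(-0.8 + 0.5 * x1 + 0.5 * x2)
        t = 1 if rng.random() < p_treat else 0
        rows.append((x1, x2, t))
    return rows


def summarize(rows):
    """Empirical sd of X2, correlation of X1 and X2, and treated fraction."""
    n = len(rows)
    x1 = [r[0] for r in rows]
    x2 = [r[1] for r in rows]
    m1, m2 = sum(x1) / n, sum(x2) / n
    v1 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    v2 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    return {
        "sd_x2": math.sqrt(v2),
        "corr": cov / math.sqrt(v1 * v2),
        "treated_fraction": sum(r[2] for r in rows) / n,
    }


stats = summarize(simulate_covariates_and_treatment(20000, seed=1))
```

With a large sample, `stats["sd_x2"]` and `stats["corr"]` should be close to 1 and 0.5, and `stats["treated_fraction"]` should be roughly in line with the treated share quoted above.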
We constructed two scenarios:
Scenario 1: the outcome is affected by treatment. We assume, without loss of generality, that treatment has an effect of 1 on the subject's outcome.
$${Y}_{i}={X}_{1i}+{X}_{2i}+{T}_{i}+{\epsilon }_{i},\text{ with }{\epsilon }_{i}\sim N\left(0, 1\right)$$
Scenario 2: the outcome is unrelated to the treatment.
$${Y}_{i}={X}_{1i}+{X}_{2i}+{\epsilon }_{i},\text{ with }{\epsilon }_{i}\sim N\left(0, 1\right)$$
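The two outcome models differ only in the treatment term, which the following sketch makes explicit. The function name and the toy inputs are hypothetical. Reusing the same noise draw for both scenarios is a convenience of this illustration, not something stated in the text: it makes the two outcomes differ by exactly the treatment indicator.

```python
import random


def simulate_outcomes(x1, x2, t, seed=0):
    """Return (y_scenario1, y_scenario2) for given covariates and treatment."""
    rng = random.Random(seed)
    y_effect, y_null = [], []
    for a, b, ti in zip(x1, x2, t):
        eps = rng.gauss(0.0, 1.0)           # outcome noise, N(0, 1)
        y_effect.append(a + b + ti + eps)   # Scenario 1: treatment effect of 1
        y_null.append(a + b + eps)          # Scenario 2: no treatment effect
    return y_effect, y_null


# Toy inputs for illustration only.
x1 = [0.0, 1.0, -1.0]
x2 = [0.5, 0.5, 0.5]
t = [1, 0, 1]
y1, y2 = simulate_outcomes(x1, x2, t)
```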
To test the effect of the missing rate on effect estimation in the simulated datasets, we preset seven missing rates: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. Missing values in \({X}_{2}\) were generated using three mechanisms:
(1) MCAR: In \({X}_{2}\), a given proportion of randomly selected observations is set to missing.
(2) MAR: The higher the value of \({X}_{1}\), the more likely \({X}_{2}\) is to be missing. With \(M\) denoting the missing indicator of \({X}_{2}\), the probability that \({X}_{2}\) is missing is:
$$\text{logit}\left(P\left({M}_{i}=1\right)\right)={X}_{1i}+C$$
(3) MNAR: The higher the value of \({X}_{2}\), the more likely it is to be missing. The probability that \({X}_{2}\) is missing is:
$$\text{logit}\left(P\left({M}_{i}=1\right)\right)={X}_{2i}+C$$
\(C\) is a constant that controls the missing rate. For example, for a missing rate of about 50%, \(C\) can be set to 0, because \({X}_{1}\) and \({X}_{2}\) are both symmetric around 0.
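The three mechanisms can be sketched as follows. The text only says that \(C\) controls the missing rate. Choosing \(C\) by bisection so that the average missing probability matches the target rate is an assumption of this sketch, and all function names are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse logit, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def mcar_mask(n, rate, rng):
    """MCAR: mark exactly round(rate * n) randomly chosen positions as missing."""
    missing = set(rng.sample(range(n), round(rate * n)))
    return [i in missing for i in range(n)]


def expected_rate(driver, c):
    """Average missing probability implied by logit(P(M_i = 1)) = driver_i + c."""
    return sum(expit(d + c) for d in driver) / len(driver)


def calibrate_c(driver, target_rate, lo=-20.0, hi=20.0, tol=1e-8):
    """Assumed calibration step: bisection on c until expected_rate hits the target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_rate(driver, mid) < target_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def logistic_mask(driver, c, rng):
    """Draw M_i ~ Bernoulli(expit(driver_i + c)); True marks a missing X2."""
    return [rng.random() < expit(d + c) for d in driver]


rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x2 = [0.5 * a + rng.gauss(0.0, math.sqrt(0.75)) for a in x1]

mcar = mcar_mask(len(x2), 0.2, rng)                      # (1) MCAR
mar = logistic_mask(x1, calibrate_c(x1, 0.2), rng)       # (2) MAR: driven by X1
mnar = logistic_mask(x2, calibrate_c(x2, 0.2), rng)      # (3) MNAR: driven by X2
x2_mar = [None if m else v for v, m in zip(x2, mar)]     # X2 with MAR gaps
```

The only difference between MAR and MNAR in this sketch is which variable drives the missing probability: the always-observed \({X}_{1}\) or the partially missing \({X}_{2}\) itself.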
2.6.2 Real-world data
The real-world data consist of a subset of the treated group from the National Supported Work Demonstration (NSWD) and a comparison sample from the Panel Study of Income Dynamics (PSID). This dataset has been widely used to evaluate propensity score analysis methods [31, 32]. It contains 614 subjects (185 treated and 429 controls), each described by nine variables. Table S1 provides more details. Treat is the intervention variable, re78 is the outcome, and the other 7 variables are covariates. Table S2 summarizes the distribution of covariates across treatment groups.
We used the inverse-probability-weighted effect estimate, with propensity scores calculated from the complete data, as the true value. Simulations were then performed to assess how well this value is recovered under the three missing mechanisms. Missing values were introduced in both re74 and re75, and missingness was generated independently for each of the two variables. As for the simulated datasets, we used 7 missing rates for the real-world dataset: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8.
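For orientation, an inverse-probability-weighted effect estimate of the kind used as the reference value can be sketched as below. The text does not state the estimand or the form of the weights. This sketch assumes the normalized (Hajek) estimator of the average treatment effect and takes the propensity scores as given. The function name and the toy numbers are hypothetical.

```python
def ipw_ate(y, t, ps):
    """Normalized IPW estimate of the average treatment effect.

    y: outcomes, t: 0/1 treatment indicators, ps: propensity scores strictly
    between 0 and 1 (assumed to come from a model fitted on complete data).
    """
    w1 = [ti / p for ti, p in zip(t, ps)]              # weights for treated
    w0 = [(1 - ti) / (1 - p) for ti, p in zip(t, ps)]  # weights for controls
    mean1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean1 - mean0


# Toy numbers for illustration only, not taken from the NSWD/PSID data.
y = [3.0, 1.0, 5.0, 2.0]
t = [1, 0, 1, 0]
ps = [0.5, 0.5, 0.5, 0.5]
estimate = ipw_ate(y, t, ps)
```

With equal propensity scores the weights cancel and the estimate reduces to the plain difference in group means.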
- MCAR: In each of the variables re74 and re75, a given proportion of randomly selected observations is set to missing.
- MAR: The probability of missingness is assumed to depend, on the logit scale, on a linear combination of age and years of education. To make the missing probability easier to set, we standardize age and years of education so that each has mean 0. Let \({M}_{1}\) and \({M}_{2}\) denote the missing indicators of re74 and re75, respectively. Their missing probabilities are:
$$\text{logit}\left(P\left({M}_{i1}=1\right)\right)={\text{age}}_{i}+{\text{educ}}_{i}+C$$
$$\text{logit}\left(P\left({M}_{i2}=1\right)\right)={\text{age}}_{i}+{\text{educ}}_{i}+C$$
- MNAR: The higher the value of a variable, the more likely that value is to be missing. As with age and years of education, we standardize re74 and re75. The probabilities that re74 and re75 are missing are then:
$$\text{logit}\left(P\left({M}_{i1}=1\right)\right)={\text{re74}}_{i}+C$$
$$\text{logit}\left(P\left({M}_{i2}=1\right)\right)={\text{re75}}_{i}+C$$
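The MAR and MNAR constructions for re74 and re75 can be sketched as follows. The records are made-up toy values, not the NSWD/PSID data, and the function names are hypothetical. Two further assumptions are labeled in the code: standardization divides by the sample standard deviation in addition to centering, and the two indicators are drawn independently even when they share the same missing probability.

```python
import math
import random
import statistics


def expit(z):
    """Inverse logit, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def standardize(values):
    """Center to mean 0; scaling to unit sd is an assumption of this sketch."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def draw_mask(driver, c, rng):
    """M_i ~ Bernoulli(expit(driver_i + c)); True marks a missing value."""
    return [rng.random() < expit(d + c) for d in driver]


def mar_masks(age, educ, c, rng):
    """MAR: M1 and M2 share the driver age + educ but are drawn independently."""
    driver = [a + e for a, e in zip(standardize(age), standardize(educ))]
    return draw_mask(driver, c, rng), draw_mask(driver, c, rng)


def mnar_masks(re74, re75, c, rng):
    """MNAR: each earnings variable drives its own missingness."""
    return (draw_mask(standardize(re74), c, rng),
            draw_mask(standardize(re75), c, rng))


# Hypothetical toy records for illustration only.
age = [23, 31, 45, 28, 37, 52]
educ = [10, 12, 8, 16, 11, 9]
re74 = [0.0, 5200.0, 12000.0, 800.0, 0.0, 25000.0]
re75 = [0.0, 4100.0, 15000.0, 1500.0, 300.0, 22000.0]

rng = random.Random(0)
m1, m2 = mar_masks(age, educ, 0.0, rng)  # c = 0.0 is an arbitrary example value
re74_mar = [None if miss else v for v, miss in zip(re74, m1)]
re75_mar = [None if miss else v for v, miss in zip(re75, m2)]
```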