The Observational Health Data Sciences and Informatics (OHDSI) collaboration has developed an end-to-end framework for developing patient-level prediction (PLP) models [1]. The framework requires data in a standardized format, the OMOP common data model (CDM) [2], and enables transparent, reproducible, and rapid development and validation of prediction models across diverse sets of data, allowing previously intractable patient-level prediction questions to be evaluated. Briefly, the OMOP CDM unifies data from heterogeneous electronic health record and medical insurance claims sources with respect to terminologies and overall structure, allowing us to incorporate data from multiple health care systems into our analysis. The PLP framework applies best practices for model development and evaluation, but subjective choices must still be made during the model development process. One example is the choice of feature engineering that converts the observational data into the labelled data required for binary classification.
Observational healthcare data consist of timestamped records, which need to be converted into features for a prediction model. Because of this temporality, it is possible either to fully preserve the temporal nature of the data (‘temporal features’: for example, a feature matrix per patient with rows corresponding to medical events, columns corresponding to time, and entries giving the medical event value at the specific time) or to create a summary of the patient’s history (‘non-temporal features’: a feature vector per patient with one entry per medical event, for example binary values indicating the presence or absence of an event in the patient’s history). Temporal features can be used with classifiers such as neural networks (deep learning); however, this is not possible with many conventional classifiers (such as logistic regression). In addition, developing models using temporal data from healthcare claims and electronic health record databases is difficult because the data come from a diversity of sources and are recorded at irregular frequencies, with data often sparsely represented. This can present issues to classifiers such as neural networks when implementing the feature engineering [3], especially if the data are not large. In this paper we therefore focus on engineering non-temporal features.
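The two representations can be illustrated with a minimal sketch. The event names, dates, and monthly time grid below are hypothetical, chosen only to contrast a per-patient event-by-time matrix (temporal) with a single presence/absence vector (non-temporal):

```python
from datetime import date

# Hypothetical timestamped records for one patient: (event_concept, event_date)
records = [
    ("diabetes", date(2018, 3, 1)),
    ("hypertension", date(2019, 6, 15)),
    ("diabetes", date(2019, 7, 1)),
]

concepts = ["diabetes", "hypertension", "asthma"]
months = [date(2019, m, 1) for m in range(1, 13)]  # coarse monthly time grid

# Temporal features: one row per medical event, one column per time bin,
# entries indicating whether the event was recorded in that bin.
temporal = [
    [int(any(c == concept and d.year == m.year and d.month == m.month
             for c, d in records))
     for m in months]
    for concept in concepts
]

# Non-temporal features: one binary entry per medical event, indicating
# presence anywhere in the patient's observed history.
non_temporal = [int(any(c == concept for c, _ in records)) for concept in concepts]

print(non_temporal)  # [1, 1, 0]
```

Note how the temporal matrix is mostly zeros even for this toy patient; with thousands of concepts and fine time bins, this sparsity is one reason temporal representations are harder to use when data are not large.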
Converting observational data into non-temporal features requires specifying a static lookback window during which the values of medical events are observed. The lookback time can be fixed, such as 365 days prior to index, meaning that only data recorded in the 365 days before each patient’s index date are used when constructing the features. Alternatively, the lookback window can include all time prior, meaning all data recorded before the index date are used to construct the features. The benefit of a longer lookback is a more complete picture of each patient, but there are several drawbacks: i) a recent illness is treated the same as an illness experienced years ago, ii) left censoring may arise because patients often do not have the same length of complete lookback, and iii) issues may occur when implementing the model in a new healthcare system where the mean complete lookback is shorter. Figure 1 depicts a subject with left censoring (subject A) and a subject without left censoring (subject B). For subject B there are no missing data in the feature construction, but for subject A the left censoring means we are unable to observe her for part of the lookback time (effectively missing data).
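The effect of the lookback choice can be sketched as follows. The patients, observation start dates, events, and index date are hypothetical; the point is that a fixed 365-day window and an all-time-prior window can yield different feature vectors, and that a left-censored patient (subject A below) is only partially observed within either window:

```python
from datetime import date, timedelta

# Hypothetical per-patient data: observation start and timestamped events.
patients = {
    "A": {"obs_start": date(2019, 1, 1),   # left-censored: short observed history
          "events": [("hypertension", date(2019, 2, 1))]},
    "B": {"obs_start": date(2015, 1, 1),   # long, complete observed history
          "events": [("hypertension", date(2016, 5, 1)),
                     ("diabetes", date(2018, 12, 1))]},
}
index_date = date(2019, 6, 1)
concepts = ["hypertension", "diabetes"]

def features(events, lookback_days=None):
    """Binary feature vector over `concepts`, using only events inside the
    lookback window ending at the index date (None = all time prior)."""
    start = index_date - timedelta(days=lookback_days) if lookback_days else date.min
    return [int(any(c == concept and start <= d < index_date for c, d in events))
            for concept in concepts]

for pid, p in patients.items():
    print(pid, features(p["events"], 365), features(p["events"], None))
# A: [1, 0] under both windows, but only ~5 months of A's window are observed
# B: [0, 1] with a 365-day lookback vs. [1, 1] with all time prior
```

For subject B the all-time window recovers the 2016 hypertension record that the 365-day window discards, while for subject A the two windows agree only because the unobserved (left-censored) portion of the window is effectively missing data.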
Studies using administrative data to investigate variations in the length of the lookback period have been conducted in the context of incidence and effect estimation [4–6]. In a study of cancer cumulative incidence estimation, the authors recommended a lookback of two or more years and discouraged a one-year lookback, but cautioned that general recommendations are not possible because the appropriate lookback depends on the characteristics of the cancer site, the available data, and the underlying research question [5]. A Korean study using a cohort database to examine lookback when estimating the incidence of three gynecological diseases (uterine leiomyoma, endometriosis, and adenomyosis) found that the proportion of misclassified incident cases decreased as the lookback increased, but advised that the optimal lookback for annual incidence depends on the nature and stage of the respective diseases [6]. A comparative effect study using the Medicare beneficiary database to evaluate the effect of statin initiation on cancer incidence recommended a three-year lookback, but noted that if this is infeasible, all available lookback is preferable to short fixed lookbacks [4]. Although these studies did not use the PLP methodology, they illustrate that longer lookback reduces data noise for the diseases examined.
Few studies have evaluated the impact of the length of the lookback time on predictive ability [7–9]. A Korean study using the National Health Insurance Database to evaluate in-hospital mortality among patients aged 40 and older who underwent percutaneous coronary intervention compared comorbidity measurements (Charlson comorbidity index, Elixhauser’s comorbidity, and comorbidity selection) derived from three years versus one year of inpatient records, and concluded that the longer lookback period offered no improvement in predictive capacity [8]. An evaluation of a one-year versus two-year lookback in the Charlson score for mortality among elderly Medicare beneficiaries using claims data reported nearly identical C-statistics [9]. An Australian study using population-based hospital data examined the prediction of hemorrhage in pregnancy among eight chronic disease cohorts and evaluated six lookback periods, concluding that although longer ascertainment periods improved identification of chronic disease history, they did not change the resulting C-statistics [7]. These studies evaluated a limited set of outcomes (mortality and hemorrhage during pregnancy), and for these outcomes the lookback period did not materially impact the results.
Thus, no systematic evaluation has been conducted to determine the optimal lookback period for prediction models in acute and chronic disease areas. The intent of this study is to evaluate the performance of prediction models using acute and chronic disease cohorts, several lookback periods, and multiple databases, in order to provide a recommendation for the optimal lookback period. We hypothesize that a 365-day lookback will yield well-performing prediction models that are more transportable across databases, as this balances gaining a sufficient picture of each patient’s health history against reducing issues with left censoring.