4.1 Main findings
We found that algorithms using administrative health and structured EMR data to determine breast and colorectal cancer recurrence had high to moderate sensitivity and PPV; high specificity, NPV, and correct classification; but low accuracy after adjusting for the prevalence of the outcome in the cohort. As expected, results were higher in the training cohorts than in the validation cohorts because the algorithms were optimized on the training cohorts. We included breast and colorectal cancers because these sites have relatively high survival rates, are the second and third most commonly diagnosed cancers in Manitoba (which makes chart reviews even more costly and time-consuming), and are historically more likely than aggressive cancers with poorer survival to have recurrences that can be effectively treated. Whether these algorithms can replace chart reviews for determining cancer recurrence requires weighing the cost of conducting a chart review against the benefit of quickly applying an algorithm with less than optimal accuracy.
4.2 Comparison with other studies
Prior studies that evaluated cancer recurrence algorithms using structured data found moderate to high sensitivities and specificities but have several important limitations. Lamont et al. (2006) used Medicare claims data to measure disease-free survival in individuals ≥ 65 years of age diagnosed with breast cancer (N = 52, 15 recurrences) (3). Algorithm sensitivity and specificity were 83% and 97%, respectively. A more recent study (2016) developed a medical claims-based algorithm to identify ovarian cancer recurrence (N = 94, 32 recurrences) (4). Sensitivity was 100% and specificity was 89%, but only a training cohort was assessed. Because a validation cohort was not included, the generalizability of the algorithm was not evaluated. Both studies had small cohorts. Chubak et al. (2012) developed algorithms to determine recurrence among women diagnosed with stage I or II breast cancer (n = 3,152, 407 recurrences) (5). Sensitivity (89% and 96%) and specificity (99% and 95%) were higher than in our study. However, they did not distinguish between cancer recurrence and a second primary (i.e., a new primary cancer unrelated to the prior cancer). This distinction is important when using the algorithms to evaluate outcomes such as the effectiveness of treatments in preventing a cancer recurrence. We attempted to distinguish recurrence from a second primary in our chart reviews, although this was difficult in some cases. Another large US study (2014) evaluated recurrence algorithms for lung, colorectal, breast, and prostate cancer (n = 6,227, 736 recurrences) (7). Sensitivity ranged from 75% to 85%. In 2017, the study was extended to include additional data; the AUROC was > 0.92 (8). Rasmussen et al. (2019) used national data in Denmark to identify breast cancer recurrence (n = 471, 149 recurrences) (6). Sensitivity was 97.3%, specificity was 97.2%, and PPV was 94.4%.
These studies also did not distinguish between recurrence and second primaries, and their results were optimized on the training cohorts, which would have produced overly optimistic estimates of performance.
To our knowledge, only two Canadian studies have developed and validated algorithms for identifying cancer recurrence using administrative health data. Xu et al. (2019) developed algorithms to identify breast cancer recurrence among women ≤ 40 years of age or those who received neoadjuvant chemotherapy in Alberta (N = 598, 121 recurrences) (9). They found higher measures of sensitivity (94.2%) and PPV (93.4%) and similar measures of specificity (98.3%) and NPV (98.5%) compared to our study. They also did not distinguish between a recurrence and a second breast cancer primary and excluded patients with second primary non-breast tumours. This may have introduced bias and reduced the generalizability of the algorithms. Cairncross et al. (2020) randomly selected 200 women (26 recurrences) who were diagnosed with cancer and had ever had a pregnancy between 2003 and 2012 (10). Sensitivity was higher than in our study (80.8%), specificity and PPV were lower (81.0% and 38.9%), and NPV was similar (96.6%). However, the data used to determine recurrence in this study were incomplete (e.g., hospitalizations were not included) and the study was limited to women of reproductive age.
Importantly, none of the prior studies from the US or Canada used metrics that are optimal for measuring algorithm performance, such as the scaled Brier score. Sensitivity and specificity are useful because they provide context about how an algorithm can be improved by identifying areas of weakness. For example, some recurrences in our study were missed because the individual did not receive treatment, which decreased sensitivity. In addition, chemotherapy for a second primary was often found among false positives, which decreased specificity. However, sensitivity and specificity ignore the rate of events in a cohort, which makes assessing the overall performance of an algorithm challenging. For example, an algorithm with a specificity or correct classification of 95% will produce a high rate of false positives when the rate of events is low (e.g., 1%) but will perform substantially better when the rate of events is higher (e.g., 50%). The scaled Brier score, a summary measure that accounts for the rate of events in a cohort, does not have this limitation. Moreover, if a proposed algorithm is expected to replace a chart review, metrics of accuracy should also indicate the amount of measurement error involved. The scaled Brier score, which has a similar interpretation to the R², indicates random association with a value of 0 and perfect prediction with a value of 1. This provides more informative output to describe accuracy than measures that use only subsections of the cohort (e.g., sensitivity and specificity).
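The prevalence dependence described above, and the definition of the scaled Brier score, can be made concrete with a short sketch. The performance figures below are illustrative only and are not taken from our cohorts or from any cited study:

```python
# Sketch (illustrative numbers, not study results): how the rate of events
# affects PPV, and the scaled Brier score as a prevalence-aware summary.

def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    tp = sens * prev                 # true positives per person screened
    fp = (1 - spec) * (1 - prev)     # false positives per person screened
    return tp / (tp + fp)

def scaled_brier(y_true, y_prob):
    """Scaled Brier score: 1 - Brier / Brier_max, where Brier_max is the
    Brier score of a reference model that predicts the event rate for
    everyone. 0 indicates random association; 1 indicates perfect prediction."""
    n = len(y_true)
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n
    prev = sum(y_true) / n
    brier_max = prev * (1 - prev)
    return 1 - brier / brier_max

# Identical sensitivity/specificity, very different PPV across prevalences:
print(round(ppv(0.85, 0.95, 0.01), 3))  # 1% event rate  -> 0.147
print(round(ppv(0.85, 0.95, 0.50), 3))  # 50% event rate -> 0.944
```

The contrast between the two printed values shows why sensitivity and specificity alone can obscure a high false-positive burden in cohorts where recurrence is rare, whereas the scaled Brier score incorporates the event rate directly.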
Other methods, such as those that use natural language processing (NLP) to capture recurrence from unstructured EMR data, have been used to determine breast cancer recurrence with sensitivities ranging from 83% to 92% (23–26). These results are not very different from those obtained with structured administrative data and, therefore, may also not be accurate enough at this time to replace a chart review. Another option is to use recurrence algorithms as a screening tool to reduce the number of charts that need to be manually reviewed. However, more research is needed to create algorithms with higher sensitivities before this possibility can be evaluated.
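The screening-tool idea can be sketched with a rough calculation. If only algorithm-positive charts are reviewed manually, the review workload shrinks to the fraction of charts the algorithm flags, at the cost of missing the algorithm's false negatives. The figures below are hypothetical and do not come from our study or any cited study:

```python
# Illustrative sketch (hypothetical performance figures): using a recurrence
# algorithm as a screening step, with manual review of algorithm positives only.

def screening_workload(sens, spec, prev):
    """Return (fraction of charts flagged for manual review,
    fraction of true recurrences missed)."""
    flagged = sens * prev + (1 - spec) * (1 - prev)  # all algorithm positives
    missed = 1 - sens                                # false-negative fraction
    return flagged, missed

flagged, missed = screening_workload(sens=0.95, spec=0.90, prev=0.10)
print(f"Review {flagged:.1%} of charts; miss {missed:.0%} of recurrences")
# -> Review 18.5% of charts; miss 5% of recurrences
```

This illustrates why higher sensitivity is the key requirement for the screening use case: the workload saving comes almost entirely from specificity, but every point of sensitivity lost translates directly into undetected recurrences.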
4.3 Strengths and limitations
We used data from previously validated, high-quality, complete, population-based administrative health databases (12, 13, 27, 28). However, our gold standard was a chart review, which is subject to human error. Inter-rater reliability was strong for breast cancer and moderate for colorectal cancer (29); therefore, there was some disagreement among the chart reviewers about what constitutes a recurrence. When we investigated samples of false positives and false negatives in the training cohort, we found that some misclassifications of recurrence status had occurred. This was often due to the difficulty of distinguishing recurrence from a second primary, which is expected because it is sometimes challenging for physicians to definitively make this determination. We also found that additional chemotherapy may have been due to a second cancer primary rather than a cancer recurrence, leading to false positive cases. Like some prior studies, our definition of recurrence was not time dependent. We chose not to incorporate time dependence because it would only have led to poorer results.