Performance of ML in identifying the RA cases as diagnosed by the treating rheumatologist
The flexible nature of the SVM binarization cutoff (RA yes/no) enables us to choose a very precise, a very sensitive or a balanced approach to the performance of the algorithm (Table 1 and Figure S1). To identify the largest number of definite cases, we took a balanced approach between PPV and sensitivity of the SVM, which resulted in a probability cutoff of 0.83 based on our training data. We then applied this cutoff to the independent set of 1,000 annotated patients. In this set, the SVM-based ML classifier had an AUC-ROC and AUC-PRC of 0.97 and 0.90, respectively (Fig. 2). The classifier performed very well at identifying patients who were diagnosed with RA by their rheumatologists: sensitivity 0.85, specificity 0.99, PPV 0.86, NPV 0.99 (Table 1). The most discriminatory features that contribute to the SVM’s decision can be found in previously published work [4].
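As an illustration of how a probability cutoff translates into the four reported metrics, the following is a minimal sketch in plain Python. The function names, the toy data and, in particular, the rule used in `balanced_cutoff` (choosing the candidate cutoff where PPV and sensitivity are closest) are assumptions made for this example; the text does not specify how the balance was operationalized.

```python
import math
from typing import Dict, Sequence


def metrics_at_cutoff(probs: Sequence[float], labels: Sequence[int],
                      cutoff: float) -> Dict[str, float]:
    """Binarize probabilities at `cutoff` (RA yes/no) and summarize performance."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_ra = p >= cutoff
        if predicted_ra and y == 1:
            tp += 1
        elif predicted_ra and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def balanced_cutoff(probs: Sequence[float], labels: Sequence[int],
                    candidates: Sequence[float]) -> float:
    """Hypothetical balancing rule: smallest gap between PPV and sensitivity."""
    def gap(cutoff: float) -> float:
        m = metrics_at_cutoff(probs, labels, cutoff)
        g = abs(m["ppv"] - m["sensitivity"])
        return math.inf if math.isnan(g) else g

    return min(candidates, key=gap)


# Toy data, not study data: ten patients with classifier probabilities
# and the rheumatologist's diagnosis (1 = RA, 0 = no RA).
toy_probs = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

print(metrics_at_cutoff(toy_probs, toy_labels, 0.83))
print(balanced_cutoff(toy_probs, toy_labels, [0.53, 0.83, 0.99]))
```

Raising the cutoff generally moves the classifier toward higher PPV at the cost of sensitivity, and lowering it does the opposite, which is the trade-off the stringent and lenient cutoffs explore.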
Extent of overlap between machine learning and criteria based selections
A total of 17,662 new patients visited the Leiden outpatient clinic since the EHR initiation in 2011. In this set, the ML identified 1,508 patients with a diagnosis of RA by their rheumatologist after one year of follow-up. In the same period, the prospective cohort included 1,376 patients with early arthritis. Patients in whom neither the 2010 nor the 1987 criteria were assessed were excluded, leaving 1,212 patients for this paper’s analyses (Fig. 1).
To visualize the overlap of the ML-defined RA cases with the 2010 and 1987 RA criteria selections, we rendered an upset plot (Fig. 3). In our set of 1,212 patients with both EHR data and criteria-based annotation, 583 unique RA cases were identified. Of these, 406 (69.6%) were identified by our ML as having RA. In the same set, 457 (78.4%) fulfilled the 2010 criteria and 386 (66.2%) the 1987 criteria. The overlap between the different selection methods was substantial: 254 (43.6%) were identified with all three methods, and an additional 94 (16.1%) were identified by both ML and one of the classification criteria (56 (9.6%) and 38 (6.5%) for 2010 and 1987, respectively). The ML identified 58 (9.9%) patients for whom all classification criteria were assessed, but who were negative on both sets, whereas 84 (14.4%) and 31 (5.3%) patients met a single set of classification criteria (2010 or 1987, respectively) and were not identified by the ML. A final group of 63 (10.8%) patients met both classification criteria but not the ML cutoff. The ML-defined set overlapped slightly more with the 2010 criteria than with the 1987 criteria (310 (53.2%) and 292 (50.1%), respectively).
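The per-method totals and pairwise overlaps quoted above follow from the exclusive intersection counts of the upset plot. The sketch below reproduces that arithmetic from the counts as reported in the text; the dictionary layout and the helper names are illustrative choices for this example, not part of the study's pipeline.

```python
from typing import Dict, FrozenSet

# Exclusive intersection sizes as reported in the text: each patient
# appears in exactly one region of the upset plot.
REGIONS: Dict[FrozenSet[str], int] = {
    frozenset({"ML", "2010", "1987"}): 254,
    frozenset({"ML", "2010"}): 56,
    frozenset({"ML", "1987"}): 38,
    frozenset({"ML"}): 58,
    frozenset({"2010"}): 84,
    frozenset({"1987"}): 31,
    frozenset({"2010", "1987"}): 63,
}

# Denominator used for the percentages in the text.
UNIQUE_CASES = 583


def method_total(method: str) -> int:
    """All patients selected by `method`, alone or with other methods."""
    return sum(n for members, n in REGIONS.items() if method in members)


def overlap(a: str, b: str) -> int:
    """Patients selected by both `a` and `b` (regardless of a third method)."""
    return sum(n for members, n in REGIONS.items() if a in members and b in members)


def criteria_only() -> int:
    """Patients meeting at least one criteria set but not the ML cutoff."""
    return sum(n for members, n in REGIONS.items() if "ML" not in members)


def pct(n: int) -> float:
    """Percentage of the unique RA cases, rounded to one decimal."""
    return round(100 * n / UNIQUE_CASES, 1)


for name in ("ML", "2010", "1987"):
    print(name, method_total(name), pct(method_total(name)))
print("ML and 2010:", overlap("ML", "2010"), pct(overlap("ML", "2010")))
print("ML and 1987:", overlap("ML", "1987"), pct(overlap("ML", "1987")))
print("criteria only:", criteria_only())
```

Storing exclusive regions rather than per-method sets keeps every derived total a simple sum, which makes it straightforward to check the figures in the text against the plot.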
Demographic and baseline differences in machine learning and criteria based selections
In Table 2 we compared the baseline characteristics of the RA cases identified by ML to the patients fulfilling the two sets of criteria. The group of patients that was diagnosed with RA by their rheumatologists had the same median age (57.4), DAS44 at baseline (2.8), prevalence of women (64%), anti-CCP-positivity (52%) and RF-positivity (57%) as patients selected based on fulfilling the 2010 or 1987 classification criteria. We found no statistically significant differences between the three groups.
Description of patients exclusively found by either the ML or criteria
To further characterize the cases exclusively identified by the ML and those exclusively identified by the criteria, we investigated the baseline characteristics for these subgroups as well (Table 3). The ML identified 58 patients who were not found by the criteria. This group was predominantly seronegative, with an anti-CCP-positivity of 6% and an RF-positivity of 19%. The criteria-based approach identified 178 patients who were not found by the ML. The majority of cases that were only found by the criteria were also anti-CCP2- and RF-negative (positivity of 16% and 34%, respectively). There were no clear differences with regard to other patient characteristics.
Upset plots and baseline tables for different cutoffs
In addition to the balanced cutoff of the ML probability, we studied the effect of a more stringent and a more lenient cutoff. The ML with the stringent cutoff (0.99) was, as expected, much more precise, but less sensitive (Table 1). With this cutoff the ML identified 303 patients (Table S1), of whom 209 overlapped with both the 1987 and 2010 criteria selections (Figure S2). This group of patients had a similar age (57.6), prevalence of women (65%) and RF-positivity (62%) as the criteria-based selections. The anti-CCP-positivity prevalence (58%) was substantially higher than in both the 1987 (P = 0.014) and 2010 criteria selections (P = 0.013).
With the lenient cutoff of 0.53 the ML was very sensitive but less precise (Table 1). Here, we identified 466 patients (Table S2), of whom 266 fulfilled both criteria (Figure S3). The group of ML-identified cases maintained a similar prevalence of women (64%), RF-positivity (53%) and anti-CCP-positivity (47%) as those who fulfilled one or both of the two classification criteria. We did, however, find differences in the disease characteristics. The median number of swollen joints (5) was significantly lower than in the 1987 criteria-based selection (P = 0.022), and the median DAS44 of 2.7 was significantly lower than in the 2010 criteria-based selection (P = 0.049).
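The effect of the three cutoffs on the size of the ML-defined selection can be sketched as follows. Only the cutoff values (0.99, 0.83 and 0.53) come from the text; the probabilities are invented toy values and `cases_per_cutoff` is a hypothetical helper.

```python
from typing import Dict, Sequence

# Cutoffs taken from the text.
CUTOFFS = {"stringent": 0.99, "balanced": 0.83, "lenient": 0.53}


def cases_per_cutoff(probs: Sequence[float],
                     cutoffs: Dict[str, float]) -> Dict[str, int]:
    """Count patients whose ML probability reaches each cutoff."""
    return {
        name: sum(1 for p in probs if p >= cutoff)
        for name, cutoff in cutoffs.items()
    }


# Toy ML probabilities for ten patients (not study data).
toy_probs = [0.995, 0.97, 0.90, 0.84, 0.80, 0.60, 0.55, 0.50, 0.20, 0.05]

counts = cases_per_cutoff(toy_probs, CUTOFFS)
print(counts)
```

Because every patient who passes a higher cutoff also passes any lower one, the selections are nested: the stringent set is contained in the balanced set, which is contained in the lenient set.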