Differentiation between phenotypic ESRI developers susceptible to P/T and their isogenic P/T-resistant E. coli
Three supervised machine learning algorithms, partial least squares discriminant analysis (PLS-DA), linear support vector machines (SVMs) and random forest (RF), were applied to the peak matrices generated by the full spectrum and threshold methods from 37 ESRI developers susceptible to P/T and their isogenic P/T-resistant isolates. The average results of 20 repetitions of the 10-fold internal validation are recorded in Table 1.
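The repeated 10-fold validation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the data are synthetic, a nearest-centroid rule stands in for the PLS-DA, SVM and RF classifiers, the function names are hypothetical, and the ± value is assumed to be the standard deviation across repetitions.

```python
import random
import statistics


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def fit_centroids(train_x, train_y):
    """Stand-in model (assumption): one mean vector per class."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, sample):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((p - q) ** 2 for p, q in zip(center, sample))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def repeated_cv_accuracy(x, y, k=10, repetitions=20, seed=0):
    """Mean and standard deviation of accuracy over repeated k-fold validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repetitions):
        correct = 0
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train_x = [x[i] for i in range(len(x)) if i not in held_out]
            train_y = [y[i] for i in range(len(y)) if i not in held_out]
            model = fit_centroids(train_x, train_y)
            correct += sum(predict(model, x[i]) == y[i] for i in fold)
        accuracies.append(correct / len(y))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Synthetic stand-in for a peak matrix: 37 + 37 isolates, 5 peak intensities each.
data_rng = random.Random(42)
x = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(37)]
x += [[data_rng.gauss(3.0, 1.0) for _ in range(5)] for _ in range(37)]
y = [0] * 37 + [1] * 37

mean_acc, sd_acc = repeated_cv_accuracy(x, y)
print(f"accuracy: {100 * mean_acc:.2f}% ± {100 * sd_acc:.2f}%")
```

Reshuffling the folds at each repetition is what makes the spread across the 20 runs informative; with a single 10-fold split, the reported accuracy would depend on one arbitrary partition of the isolates.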
Correct separation between both categories was achieved by applying the RF algorithm using the full spectrum and threshold methods, with 90.13% ± 0.22% and 87.50% ± 0.40% accuracy, respectively. In turn, PLS-DA and SVM achieved accuracies lower than 81% in both methods.
The 10-fold internal validation of the RF full spectrum method yielded an F1 score (the harmonic mean of the sensitivity and the positive predictive value of the model) of 91.43%, with a sensitivity and specificity of 86.49% and 97.30%, respectively. The positive and negative predictive values were 96.97% and 87.80%, respectively, assuming the isogenic P/T-resistant isolates as the positive category. Similar 10-fold scores, sensitivity and specificity percentages were obtained with the threshold method. Additionally, in the case of PLS-DA and SVM, the 10-fold precision was lower than 83%.
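The relationship between these metrics can be checked with a short calculation. The sketch below uses hypothetical confusion-matrix counts (32 true positives, 5 false negatives, 36 true negatives, 1 false positive) that are an assumption: they were chosen to be consistent with 37 isolates per category and with the percentages reported above, since the confusion matrix itself is not given in the text.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return binary-classification metrics as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)  # recall of the positive category
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Assumed counts (not from the source): resistant isolates are the positive category.
metrics = binary_metrics(tp=32, fn=5, tn=36, fp=1)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these assumed counts, the F1 score follows from the sensitivity and the positive predictive value alone; the specificity and negative predictive value do not enter its formula.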
The distance plots of the RF and PLS-DA analyses for discrimination between ESRI developers susceptible to P/T and their isogenic isolates resistant to P/T showed that the RF full spectrum and threshold methods separated the two groups of isolates more clearly than the PLS-DA full spectrum and threshold methods (Fig. 1). In the case of the SVM full spectrum and threshold methods, distance plot analysis was not performed because principal component analysis (PCA) must be applied beforehand, and PCA did not discriminate between samples (data not shown).
Differentiation between phenotypic P/T-susceptible and P/T-resistant E. coli
The same three supervised algorithms were applied to the peak matrix generated by the full spectrum and threshold methods of 157 P/T-susceptible (53 of which were ESRI developers susceptible to P/T) and 53 P/T-resistant E. coli.
In the first step of the assay for the determination of P/T resistance, RF achieved 96.92% ± 0.08% and 95.52% ± 0.14% accuracy for the full spectrum and threshold methods, respectively (Table 1). In this step, the PLS-DA (in full spectrum and threshold methods) and SVM (in threshold method) algorithms also showed more than 90% accuracy (Table 1). The 10-fold internal validation of the RF full spectrum method yielded an F1 score of 94.23%, with a sensitivity and specificity of 92.45% and 98.73%, respectively. The positive and negative predictive values were 96.08% and 97.48%, respectively, assuming resistant isolates as the positive category. Similar results were achieved with PLS-DA in both methods. Additionally, in the case of SVM, the 10-fold precision was 70% and 90.09% in the full spectrum and threshold methods, respectively.
The distance plots of the RF and PLS-DA analyses for discrimination between P/T-susceptible and P/T-resistant E. coli showed that the RF full spectrum and threshold methods separated the two groups of isolates more clearly than the PLS-DA full spectrum and threshold methods (Fig. 2). In the case of the SVM full spectrum and threshold methods, distance plot analysis was not performed because PCA did not discriminate between samples (data not shown).
Differentiation between phenotypic P/T-susceptible and ESRI developer E. coli
To determine whether the MALDIpiptaz test can differentiate between P/T-susceptible and ESRI developer E. coli, we analysed two collections: P/T-susceptible E. coli (N=76) and ESRI developers, i.e. isolates susceptible to P/T but capable of ESRI development (N=81).
RF analysis achieved the best accuracy for discriminating ESRI developers from susceptible isolates compared with PLS-DA and SVM analyses (Table 1). The RF algorithm achieved 87.77% ± 0.25% and 81.75% ± 0.29% accuracy for the full spectrum and threshold methods, respectively. In turn, PLS-DA and SVM showed accuracies lower than 75% and 62%, respectively, in both methods.
The 10-fold internal validation of the RF full spectrum method yielded an F1 score of 86.79%, with 85.19% and 88.16% sensitivity and specificity, respectively, assuming ESRI developers were the positive category. The positive and negative predictive values were 88.46% and 84.81%, respectively. Additionally, in the case of PLS-DA and SVM, the 10-fold accuracy was lower than 75%.
The distance plots of the RF and PLS-DA analyses for discrimination between P/T-susceptible and ESRI developer E. coli showed that the RF full spectrum and threshold methods separated the two groups of isolates more clearly than the PLS-DA full spectrum and threshold methods (Fig. 3). In the case of the SVM full spectrum and threshold methods, distance plot analysis was not performed because PCA did not discriminate between samples (data not shown).
Differentiation between phenotypic P/T-susceptible and ESRI developer or P/T-resistant E. coli by peak feature importance analysis
Further discrimination in the three assays between (i) ESRI developer isolates susceptible to P/T and their isogenic isolates resistant to P/T, (ii) P/T-susceptible isolates and P/T-resistant isolates, and (iii) P/T-susceptible isolates and ESRI developer isolates susceptible to P/T was attempted using a different approach to peak analysis, based on the feature importance of RF (Fig. 4). The higher the importance value of a feature, the more it contributes to splitting the samples in the classifier trees. The importance of a feature is computed as the total reduction of the splitting criterion brought about by that feature, normalized so that the values across all features sum to 100%; it is also known as the Gini importance. The feature importance obtained from RF in the assay between phenotypic ESRI developers susceptible to P/T and their isogenic P/T-resistant isolates showed a region of high importance from 2000 m/z to 3500 m/z (Fig. 4A). The feature importance obtained from RF in the assay between phenotypic P/T-susceptible isolates and P/T-resistant isolates also showed a region of high importance from 2000 m/z to 3500 m/z (Fig. 4B). In contrast, the distribution of the feature importance in the differentiation between P/T-susceptible isolates and ESRI developer isolates susceptible to P/T was more homogeneous and did not show any characteristic range for the classification (Fig. 4C).
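The idea behind the Gini importance can be illustrated with a deliberately simplified sketch. The assumptions are as follows: each tree is reduced to a single split (a stump) rather than a full tree, the data are synthetic, the m/z bin centres are hypothetical, and the function names are illustrative only. It is not the RF implementation used in the study; it only shows how impurity reductions are accumulated per feature and normalized to sum to 100%.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_decrease(rows, labels, feature):
    """Largest weighted impurity decrease from thresholding one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        decrease = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        best = max(best, decrease)
    return best


def stump_forest_importance(rows, labels, n_trees=50, seed=0):
    """Gini importance (in %) accumulated over a forest of one-split trees."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    totals = [0.0] * n_features
    for _ in range(n_trees):
        # Bootstrap sample of the isolates, as in bagging.
        picks = [rng.randrange(len(rows)) for _ in rows]
        boot_rows = [rows[i] for i in picks]
        boot_labels = [labels[i] for i in picks]
        # Each stump only considers a random subset of the features.
        candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
        decreases = {f: best_split_decrease(boot_rows, boot_labels, f) for f in candidates}
        chosen = max(decreases, key=decreases.get)
        totals[chosen] += decreases[chosen]
    total = sum(totals)
    # Normalize so that the importances sum to 100%.
    return [100.0 * t / total if total else 0.0 for t in totals]


# Synthetic peak matrix: only the bins at 2500 and 3000 m/z carry class signal.
mz_bins = [1500, 2500, 3000, 4000, 5000, 6000]  # hypothetical bin centres
data_rng = random.Random(1)
rows, labels = [], []
for i in range(40):
    label = i % 2
    row = [data_rng.random() for _ in mz_bins]
    row[1] += 2.0 * label
    row[2] += 2.0 * label
    rows.append(row)
    labels.append(label)

importance = stump_forest_importance(rows, labels)
for mz, value in zip(mz_bins, importance):
    print(f"{mz} m/z: {value:.1f}%")
```

In this toy setting, the bins that carry class signal accumulate most of the importance, which is the pattern a concentrated region of interest would produce; a homogeneous distribution would instead spread the importance across all bins.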