Per- and polyfluoroalkyl substances (PFAS) contamination has become a global concern owing to the widespread environmental occurrence of these compounds and elevated human exposure1. PFAS form a complex chemical class that covers almost all compounds with at least one perfluorinated methyl (-CF3) or methylene (-CF2-) group, according to the revised definition by the Organisation for Economic Co-operation and Development (OECD)2. To date, over 14,000 chemicals have been registered in the U.S. EPA PFAS structure list3. Growing evidence indicates that the abundant, subtle structural variations among PFAS can affect their partitioning4, transfer5, elimination6, transformation7 and bioaccumulation8 behaviours, and may pose distinct health risks9. Hence, it is imperative to determine the concentrations of multiple structurally similar PFAS species in various environmental matrices simultaneously.
Current analytical methods for PFAS, however, struggle to achieve precise quantification, high resolution and wide coverage concurrently. Although great advances have been made over the past decades in the resolution and detection limits of high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS)10,11 and gas chromatography-mass spectrometry (GC-MS)12,13, such quantitative analysis relies on reference standards, of which just over 120 are commercially available14, less than 1% of the total PFAS. In contrast, progress in high-resolution mass spectrometry (HRMS)15,16 opens the possibility of comprehensive, nontargeted screening of unknown PFAS, especially when coupled with ion mobility spectrometry (IMS)17,18. Nevertheless, the structures identified by HRMS are tentative19, and the concentrations can, at best, be semi-quantified. So far, quantification of PFAS without standards remains challenging.
Herein, a nanopore-based single-molecule electrochemical sensor is proposed as an emerging technology to bridge the gap between the urgent demand for quantitative PFAS monitoring and the severe shortage of authentic standards. By measuring the change in ion flow through a nanopore, a single-molecule sensor (SMS) can identify and quantify a specific analyte from the characteristics and frequency of the resulting current blockades20. In this work, a linear correlation was established between the magnitude of the ionic current blockade and the volume of PFAS simulated by molecular dynamics (MD), so that the current response of unknown PFAS could be accurately predicted without the need for standards. Previous studies have demonstrated positive correlations between the magnitude of current blockade and the volume21 or mass22-26 of peptides21-24 and proteins25,26 using a variety of protein pores, e.g., α-hemolysin22, aerolysin21, FraC23, ClyA25, CytK24 or YaxAB26, but, to the best of the authors' knowledge, a strict linear relationship (R² = 0.9998) was realized here for the first time, and the predicted blockade values were almost identical to the experimental measurements. More importantly, a custom machine learning algorithm based on frequency-modulated multi-dimensional feature extraction was developed to enhance the structural resolution of SMS, reaching an overall accuracy of 99.9% for a total of 13 per- and polyfluoroalkyl carboxylic acids (PFCA). Further optimisation of the feature combination, reducing it from 43 to 21 dimensions, required only 13% of the training set for 99% accuracy. As a result, even with an interferent at 100 times the target concentration, nanopore SMS maintained 78% quantification reliability, over an order of magnitude better than ensemble analysis.
In addition, a wide, interference-free linear-response range of 0.5 nM to 100 μM was achieved for trifluoroacetic acid, corresponding to a detection limit of 57 ng·L⁻¹, which was comparable to the state-of-the-art performance of UPLC-MS/MS27 or GC28 for this ultra-short PFCA.
Establishment of a linear volume-current relationship using perfluoroalkyl carboxylic acids
Building a structure-activity relationship was the prerequisite for developing standard-free quantification methods, and thus a linear correlation was first established between the volume of PFCA molecules and the magnitude of current blockade measured by nanopore SMS. Although many factors have been proposed to influence the translocation-induced current response, they could be condensed into two dominant contributions, i.e., steric exclusion and counterion enhancement, acting alone or in combination29. As for steric exclusion, the effective volume of the nanopore for signal transduction could change dynamically, indicating that the residence position of the analyte in the nanopore was crucial when sensing small molecules30. Therefore, polycationic peptide probes were employed in this study to control the location of the tethered PFCA in the nanopore, while concentrated electrolytes were adopted to reduce the contribution of surface charge, so that the magnitude of current blockade was mainly determined by steric exclusion. More specifically, eight linear perfluoroalkyl carboxylic acids (C2 to C9) were chosen as typical PFCA to establish the structure-current relation (Fig. 1a). They were connected to the N-terminus of an oligo-arginine leader (PFCA-R6) and measured with wild-type aerolysin (WT AeL) in 4 M KCl solution (Fig. 1b). It was found previously that WT AeL formed positive electrostatic barriers at both its trans exit and cis entry under negative applied voltages36, and so the polycationic -R6 probe might be able to drive the non-ionic PFCA targets to an identical position within the AeL nanopore. Typical current traces of C2-C9 PFCA-R6 at -50 mV, together with that of the R6 probe (C0), were shown in Fig. 1c. Both the magnitude of current blockade and the dwell time clearly increased with the chain length of PFCA. Histograms of the current blockade for C0 and C2 to C9 were given in Supplementary Fig. 1.
The resulting magnitude of current blockade for the C2-C9 PFCA-R6 complexes exhibited a strong linear correlation (R² = 0.9998) with their molecular volume (shown as shallow squares in Fig. 1d). The measured difference between the blockades of C2- and C9-R6 was 11.8%, corresponding to an increase of 1.68% per -CF2- (ca. 73.5 Å³), or a slope of 0.023%·Å⁻³ for the fitted line. The effective transduction volume of 4.82 nm³, estimated by extrapolating the fit to a blockade of 100%, was identical to the inner pore volume enclosed between the A224 and S236 residues of WT AeL (4.82 nm³), supporting the use of the R6 probe to direct the movement of PFCA targets. A total of 61 individual measurements (at least three for each sample) were conducted to reduce experimental errors, using perfluorohexanoic acid (C6) as an internal standard for calibration31. The average error between different measurements was as low as 0.022%, one order of magnitude smaller than the overall standard deviation (0.198%) of the current-blockade histogram, confirming the reliability of the approach. The hydrodynamic volume of PFCA-R6 was calculated via all-atom molecular dynamics simulations using GROMACS. More details of the simulation process were summarized in the “Methods” section, and the raw volume data for C0 and C2 to C9 were given in Supplementary Figs. 2 and 3.
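The per-unit increment and the slope quoted above follow from simple arithmetic on the reported blockade span, and the short Python sketch below reproduces that check. It uses only the three numbers stated in the text; the function and constant names are illustrative and not taken from the original work.

```python
# Values quoted in the text above.
BLOCKADE_SPAN_PCT = 11.8   # blockade difference between C9-R6 and C2-R6, in %
N_CF2_STEPS = 9 - 2        # -CF2- units added when going from C2 to C9
CF2_VOLUME_A3 = 73.5       # approximate volume of one -CF2- unit, in cubic angstrom


def per_cf2_increment(span_pct: float, steps: int) -> float:
    """Average blockade increase per added -CF2- unit, in %."""
    return span_pct / steps


def slope_per_cubic_angstrom(increment_pct: float, unit_volume_a3: float) -> float:
    """Slope of the blockade-volume line, in % per cubic angstrom."""
    return increment_pct / unit_volume_a3


increment = per_cf2_increment(BLOCKADE_SPAN_PCT, N_CF2_STEPS)
slope = slope_per_cubic_angstrom(increment, CF2_VOLUME_A3)

print(f"increment per -CF2-: {increment:.3f} %")
print(f"slope: {slope:.4f} % per cubic angstrom")
```

The increment computed from the rounded 11.8% span differs from the quoted 1.68% only in the third significant figure, and the slope rounds to the quoted 0.023%·Å⁻³.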
Accurate prediction of current response of H- / Cl-substituted polyfluoroalkyl carboxylic acids
Following the establishment of a linear correlation between the volume of perfluoroalkyl carboxylic acids and their magnitude of current blockade, the next step was to examine its prediction accuracy for other polyfluoroalkyl carboxylic acids. Among all the 622 carboxylic PFAS structures that were identified previously (CAS numbers available)5, 49% contained at least one C-H group, while 5.7% had one or more C-Cl groups. Therefore, five typical H- or Cl-substituted analytes, with either terminal or internal substitution, were examined in this study, including 3H-tetrafluoropropionic acid (3H), 5H-octafluoropentanoic acid (5H), 7H-dodecafluoroheptanoic acid (7H), 3Cl-tetrafluoropropionic acid (3Cl), and 3:3 fluorotelomer carboxylic acid (FTA), which were increasingly discovered in the wastewater from the fluorochemical industry as well as in the surrounding surface water32-34. Based on the MD-simulated molecular volume, the predicted current blockades of H- or Cl-PFCA were in excellent agreement with the experimental measurements (Fig. 1d and inset), with a negligible deviation (0.022%) close to the observational error, which clearly demonstrated the capability of nanopore SMS for standard-free identification of PFAS. The current traces, histograms of blockade, and simulated molecular volumes of H- or Cl-PFCA were provided in Supplementary Figs. 4 and 5.
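The prediction step described above can be sketched as an ordinary least-squares line fitted to calibration pairs of (simulated volume, measured blockade), which is then evaluated at the simulated volume of an analyte without a standard. The Python sketch below is a minimal illustration under stated assumptions: the calibration numbers are synthetic placeholders generated on an exact line (only the 73.5 Å³ spacing and the 0.023%·Å⁻³ slope echo the text), and the function names are hypothetical.

```python
def fit_line(volumes, blockades):
    """Ordinary least-squares fit: blockade = slope * volume + intercept."""
    n = len(volumes)
    mean_v = sum(volumes) / n
    mean_b = sum(blockades) / n
    s_vv = sum((v - mean_v) ** 2 for v in volumes)
    s_vb = sum((v - mean_v) * (b - mean_b) for v, b in zip(volumes, blockades))
    slope = s_vb / s_vv
    return slope, mean_b - slope * mean_v


def predict_blockade(volume, slope, intercept):
    """Predicted blockade (%) of an analyte from its simulated volume."""
    return slope * volume + intercept


# PLACEHOLDER calibration set, NOT data from the paper: eight points spaced by
# one -CF2- volume (73.5 cubic angstrom) on a line with slope 0.023 % per
# cubic angstrom. The 1000 cubic angstrom and 40 % offsets are arbitrary.
calib_volumes = [1000.0 + 73.5 * i for i in range(8)]
calib_blockades = [40.0 + 0.023 * (v - 1000.0) for v in calib_volumes]

slope, intercept = fit_line(calib_volumes, calib_blockades)

# Hypothetical analyte whose volume falls between two calibration points.
unknown_volume = 1100.0
predicted = predict_blockade(unknown_volume, slope, intercept)
print(f"predicted blockade: {predicted:.2f} %")
```

In practice the calibration pairs would be the measured C2 to C9 blockades and their MD volumes, and the residual between `predicted` and the measured blockade would be compared with the measurement error.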
Factors that affected the standard-free prediction and single-molecule identification of PFCA
To further explore the origin of the observed linear correlation, different peptide structures (R6K-, R5K-, -R7 and -R6) were compared to analyse the roles of probe length and connection orientation. Perfluoropentanoic (C5), perfluorohexanoic (C6) and perfluoroheptanoic acid (C7) were linked to the lysine side chain of the R6K- or R5K- probes, and to the N-termini of -R7 or -R6. The deviation of the measured blockade from the prediction was much smaller for C6-R6 or -R7 than for R6K- or R5K-C6 (Fig. 1e), probably due to the narrow lumen of WT AeL (diameter of 1-1.4 nm)35. Meanwhile, the slope of the linear fit for the -R6 probe, i.e., the signal sensitivity to the change of analyte volume, was 70% higher than that for -R7 (Fig. 1e), suggesting an enlarged transduction volume for the latter. Although it brought little improvement in linearity (R² = 0.9995-0.9999 for the linear fit of C5-, C6- and C7-R6 in 2-4 M KCl, Supplementary Fig. 6), the elevated salt concentration reduced the standard deviation of blockade from 0.47% at 2 M to 0.165% at 4 M (Fig. 1f), which could nearly triple the resolution of identification. One possible reason was the prolonged dwell time of PFCA in WT AeL, caused by the higher cis-to-trans driving force in 4 M KCl (or the lower trans-to-cis electroosmotic force)20. Nevertheless, it was noted that the sum of the three-sigma widths of a polyfluoroalkyl carboxylic acid and the adjacent perfluoroalkyl carboxylic acid (highlighted as the error bars in Fig. 1d) was still greater than the difference between their blockades, indicating that current blockade alone was unlikely to fully resolve all 13 PFCA.
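The three-sigma argument above can be written as a simple separation test: two blockade peaks are treated as resolved when the gap between their means exceeds the sum of their three-sigma widths. The Python sketch below applies this test to the numbers quoted in the text. It assumes, purely for illustration, that both neighbouring peaks share the same standard deviation and that they are one -CF2- step (1.68%) apart; the function name is hypothetical.

```python
def resolved_by_blockade(mean_a, sigma_a, mean_b, sigma_b, k=3.0):
    """True if the gap between two peaks exceeds the sum of their k-sigma widths."""
    return abs(mean_a - mean_b) > k * (sigma_a + sigma_b)


CF2_STEP_PCT = 1.68   # blockade increase per -CF2-, from the text
SIGMA_2M_PCT = 0.47   # blockade standard deviation in 2 M KCl, from the text
SIGMA_4M_PCT = 0.165  # blockade standard deviation in 4 M KCl, from the text

# Two peaks one -CF2- step apart, expressed relative to the lower peak.
# Equal sigma for both peaks is an ASSUMPTION made for this illustration.
resolved_2m = resolved_by_blockade(0.0, SIGMA_2M_PCT, CF2_STEP_PCT, SIGMA_2M_PCT)
resolved_4m = resolved_by_blockade(0.0, SIGMA_4M_PCT, CF2_STEP_PCT, SIGMA_4M_PCT)
sigma_ratio = SIGMA_2M_PCT / SIGMA_4M_PCT

print(f"resolved in 2 M KCl: {resolved_2m}")
print(f"resolved in 4 M KCl: {resolved_4m}")
print(f"sigma ratio 2 M / 4 M: {sigma_ratio:.2f}")
```

Under these assumptions, peaks one -CF2- step apart separate only at the higher salt concentration. A poly- and perfluoroalkyl pair differs by less than a full -CF2- step, which is why the same test fails for them even in 4 M KCl.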
Frequency-modulated multi-dimensional feature extraction for 100% classification accuracy
An inherent advantage of SMS over ensemble methods was the ability to record multi-dimensional features of the signal generated by an individual molecule36. The use of five or more signal features was demonstrated recently as an effective approach to distinguish structurally similar compounds, e.g., achieving 92.4%-99.9% accuracy for the determination of saccharides37, riboses38, alditols39 or benzenediols40. Herein, frequency modulation (five low-pass filters of 2000, 800, 500, 200 or 100 Hz, plus the wavelet transform) was applied to extend the eight common features of single-molecule signals to a total of 43 (Fig. 2a). The eight features were the magnitude (ΔI/I0), duration (τon), standard deviation (Iσ) and peak-to-peak amplitude (Ipp) of the current blockade, and the peak (Hpeak), full width at half maximum (HFWHM), skewness (Hskew) and kurtosis (Hkurt) of the all-points histogram of the current blockade; τon remained constant at all frequencies and was therefore counted only once. All the feature inputs were normalized by an internal standard (C6, C5 or C3) and averaged over at least three parallel measurements to minimize experimental errors. The resulting 43 features of 14 analytes were all fitted with a Gaussian distribution (Supplementary Figs. 7 to 20). Interestingly, despite the closely related physical meaning of ΔI/I0 and Hpeak, or of Iσ, Ipp and HFWHM, the resolving power of these features and their frequency-dependent behaviours differed remarkably (Supplementary Figs. 21 to 29), implying the possibility of integrating multi-dimensional features for enhanced resolution.
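The count of 43 features follows from seven frequency-dependent features evaluated on six versions of the signal (five low-pass cut-offs plus the wavelet transform), with τon counted once. The Python sketch below enumerates the feature names and computes the four time-domain features for one event. It is a minimal illustration: the naming scheme, the definition ΔI/I0 = (I0 − I)/I0, the use of current magnitudes, and the example numbers are assumptions, and the filtering, wavelet and histogram steps are not implemented.

```python
from statistics import mean, pstdev

# Six versions of each event: five low-pass cut-offs plus the wavelet transform.
BANDS = ["2000Hz", "800Hz", "500Hz", "200Hz", "100Hz", "wavelet"]
# Seven features that change with filtering; tau_on does not and is counted once.
BAND_DEPENDENT = ["dI_over_I0", "I_sigma", "I_pp",
                  "H_peak", "H_FWHM", "H_skew", "H_kurt"]


def feature_names():
    """Enumerate the feature labels: 1 duration + 7 features x 6 bands."""
    names = ["tau_on"]
    for band in BANDS:
        for feature in BAND_DEPENDENT:
            names.append(f"{feature}@{band}")
    return names


def time_domain_features(samples, open_pore_current, dt):
    """Time-domain features of one blockade event.

    samples: current magnitudes inside the event; dt: sampling interval in s.
    Assumes dI/I0 = (I0 - <I>) / I0, a common convention not stated in the text.
    """
    blocked = mean(samples)
    return {
        "dI_over_I0": (open_pore_current - blocked) / open_pore_current,
        "tau_on": len(samples) * dt,
        "I_sigma": pstdev(samples),
        "I_pp": max(samples) - min(samples),
    }


names = feature_names()
# Hypothetical four-sample event, for illustration only.
example = time_domain_features([20.0, 22.0, 18.0, 20.0],
                               open_pore_current=50.0, dt=1e-5)
print(len(names))
print(example)
```

The same function would be applied to each filtered version of the event, and the histogram-based features would be added from an all-points histogram of the same samples.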
In fact, combining the eight features (extracted from the raw data at 2000 Hz) increased the identification accuracy from 89.1% (using ΔI/I0 only) to 96.7% for the 11 short-chain PFCA (excluding C8 and C9), and the further incorporation of frequency modulation (using all 43 features) pushed it to 99.9% (Fig. 2b). A total of 2400 sets of data were collected for each analyte (2000 for training/validation and 400 for testing). In total, 31 classifiers were evaluated, among which the Bagged trees model showed the highest identification accuracy (Supplementary Table 1). All accuracies were calculated from five or more repetitions with random holdout test sets, and 10-fold cross-validation was applied during training. It was worth noting that the order of feature addition that achieved the highest identification accuracy, from two to eight dimensions (left part of Fig. 2b), was distinct from the ranking of the features by their individual one-dimensional accuracy (Supplementary Fig. 30), which suggested the significance of correlation and complementarity between features. The enhancement by feature addition reached a plateau when the number of dimensions exceeded four, meaning that certain features were less relevant or easier to replace in the classification of PFCA. Once frequency modulation was included, the identification accuracy quickly rose above 99% (right part of Fig. 2b). The confusion matrix of all 13 PFCA plus the R6 probe (also with an overall accuracy of 99.9%) showed that the largest errors came from the confusion between FTA and C5 or C6 (Fig. 2c). Strikingly, small but non-negligible false identifications (0.01-0.04%) were consistently observed for pairs of PFCA that could be resolved completely using blockade only (Supplementary Fig. 22). Such phenomena stressed the necessity of reducing the model complexity or the number of features.
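The observation that the best order of feature addition differs from the one-dimensional ranking is what a greedy forward selection would produce when features are complementary. The Python sketch below illustrates the idea with a scoring function supplied by the caller. It is not the authors' procedure: the accuracy table is a hypothetical toy, and in a real workflow `score` would train and validate a classifier on the given feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(features: Sequence[str],
                   score: Callable[[Tuple[str, ...]], float],
                   k: int) -> List[str]:
    """Greedy forward selection: at each step add the feature that gives the
    highest score when combined with the features already chosen."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(tuple(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


# HYPOTHETICAL accuracies for three generic features. f2 is the second-best
# feature alone, but f3 complements f1 better than f2 does.
TOY_ACCURACY = {
    frozenset({"f1"}): 0.89,
    frozenset({"f2"}): 0.80,
    frozenset({"f3"}): 0.70,
    frozenset({"f1", "f2"}): 0.90,
    frozenset({"f1", "f3"}): 0.97,
    frozenset({"f1", "f2", "f3"}): 0.98,
}


def toy_score(subset: Tuple[str, ...]) -> float:
    """Look up the toy accuracy of a feature subset."""
    return TOY_ACCURACY[frozenset(subset)]


order = forward_select(["f1", "f2", "f3"], toy_score, k=3)
one_dimensional_rank = sorted(["f1", "f2", "f3"],
                              key=lambda f: toy_score((f,)), reverse=True)
print(order)
print(one_dimensional_rank)
```

In the toy table the greedy order places f3 before f2 even though f2 scores higher on its own, mirroring the complementarity effect described in the text.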
Shortlisting high-priority features to minimise the size of training set for precise determination
To optimise the combination of model inputs, the change of maximal validation accuracy with the number of features was examined for C5, C6 and FTA under various training-set sizes, i.e., 2000, 200 or 20 signals for each PFCA. A similar trend was found in all three curves (Fig. 2d): as more features were included, the accuracy increased initially, then reached a plateau, and finally dropped slightly, indicating that the optimal number of features should be neither too small (e.g., to avoid the deviation in training set) nor too large (e.g., to reduce the model complexity). For smaller training sets, the plateau became lower, shorter and left-shifted, decreasing from 99.97% for 2000 signals with 28±10 features to 98.40% for 20 signals with 21±5 features. In the meantime, the priority of features was also affected by training size: when only 20 signals were used, a general order of importance 2000 Hz < wavelet < 100 Hz < 200 Hz < 500 Hz < 800 Hz was followed; but in the case of 200 or 2000 signals, other features such as kurtosis at 200 and 500 Hz stood out, while all the filtered blockades (either wavelet or 100-800 Hz) were no longer critical (Supplementary Tables 2-4). Taking into account their ranks in all three scenarios, a total of 21 features were shortlisted (Fig. 2e), which decreased the amount of training data required by 7.6-fold compared with the full 43-feature model and increased the maximal identification accuracy from 99.58% to 99.92%.
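One simple way to combine feature rankings from several training-set sizes into a single shortlist is to average each feature's rank position and keep the top entries. The Python sketch below shows this. The text does not state how the three rankings were actually combined, so the mean-rank rule, the function name and the toy rankings are all assumptions made for illustration.

```python
def shortlist_by_mean_rank(rankings, n_keep):
    """Keep the n_keep features with the lowest mean rank position.

    rankings: list of lists, each ordering the SAME features from most to
    least important. Ties are broken alphabetically for reproducibility.
    """
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    return sorted(features, key=lambda f: (mean_rank[f], f))[:n_keep]


# HYPOTHETICAL rankings of four generic features for three training sizes.
rank_2000_signals = ["a", "b", "c", "d"]
rank_200_signals = ["b", "a", "d", "c"]
rank_20_signals = ["a", "c", "b", "d"]

shortlist = shortlist_by_mean_rank(
    [rank_2000_signals, rank_200_signals, rank_20_signals], n_keep=2)

# A 7.6-fold reduction in training data corresponds to this remaining fraction.
remaining_fraction = 1 / 7.6
print(shortlist)
print(f"{remaining_fraction:.3f}")
```

The last two lines tie the 7.6-fold reduction quoted above to the figure of about 13% of the training set given in the introduction.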
Maintaining nearly 80% quantification reliability with interference of 100 times concentration
The aforementioned few-shot learning mode of nanopore SMS facilitated the accurate quantification of trace targets in the presence of structurally similar interferents at much higher concentrations, which was often difficult for ensemble measurements. For instance, for the determination of FTA in the presence of more concentrated C5 and C6, the pre-developed 21-feature model maintained over 78% accuracy with the interferents at 100 times the FTA concentration (Fig. 3a), and a substantial 44% was still retained even at 1000 times. Such performance was at least one order of magnitude better than the multi-peak fit using the 2000 Hz blockade only, which was the best one-dimensional feature in this study and was used to mimic ensemble analysis; it showed an accuracy of 69% with 10 times interference and dropped rapidly to zero at 20 times. In the meantime, the quantification reliability of the shortlisted 21-feature model was also consistently higher than that of the full 43-feature counterpart, in accordance with the above analysis of the influence of feature number.
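A base-rate calculation shows why quantifying a trace target next to a large excess of interferent is so demanding: every interferent event misclassified as the target inflates the apparent target count in proportion to the excess. The Python sketch below illustrates this effect. It is not the paper's definition of quantification reliability, and the classification rates used are hypothetical.

```python
def apparent_over_true(excess, true_positive_rate, false_positive_rate):
    """Ratio of events called 'target' to true target events.

    With N target events and excess * N interferent events:
        called = true_positive_rate * N + false_positive_rate * excess * N
    so the ratio is independent of N.
    """
    return true_positive_rate + false_positive_rate * excess


# HYPOTHETICAL classifier: every target event is recognised, and 0.1 % of
# interferent events are mistaken for the target.
ratio_100x = apparent_over_true(100, 1.0, 0.001)
ratio_1000x = apparent_over_true(1000, 1.0, 0.001)

print(f"apparent / true count at 100x excess: {ratio_100x:.2f}")
print(f"apparent / true count at 1000x excess: {ratio_1000x:.2f}")
```

Even a 0.1% false-positive rate overestimates the target by about 10% at a 100-fold excess and doubles the apparent count at a 1000-fold excess, which is consistent with reliability decreasing as the interferent concentration rises.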
Interference-free quantification of trifluoroacetic acid towards a detection limit of 57 ng·L⁻¹
Trifluoroacetic acid (C2) was the simplest form of PFCA, with a much higher environmental level than other PFCA due to its more diverse sources28. Recent evidence suggested that the rate of increase of C2 was considerable, and that its potential adverse health effects could be greater than expected41. The quantitative analysis of C2, however, was arduous, and reported limits of detection (LOD) ranged from 10 to 500 ng·L⁻¹ (Supplementary Table 5)27,28,42-46. Herein, by relating the logarithm of the C2 concentration to the logarithm of the interval time measured at -50 mV and 20 °C, a linear response range was established between 50 nM and 100 μM C2 (inset of Fig. 3b). Interferents such as C4, C5, C6 or tap water caused a deviation of less than 1%. Further adoption of an elevated voltage of -80 mV and a temperature of 30 °C pushed the LOD of C2 down to 0.5 nM or 57 ng·L⁻¹ (Fig. 3b), comparable to the state of the art27,28.
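The equivalence of 0.5 nM and 57 ng·L⁻¹ follows from the molar mass of trifluoroacetic acid (CF3COOH, about 114.02 g·mol⁻¹), and a log-log calibration can be inverted to read a concentration from a measured interval time. The Python sketch below shows both steps. The calibration slope and intercept are hypothetical placeholders (a slope of -1 would correspond to a capture rate proportional to concentration, which is an assumption and not a value from the text), and the function names are illustrative.

```python
import math

TFA_MOLAR_MASS_G_PER_MOL = 114.02  # CF3COOH


def nanomolar_to_ng_per_litre(conc_nM, molar_mass_g_per_mol):
    """Convert nmol/L to ng/L: (nmol/L) * (g/mol) = ng/L."""
    return conc_nM * molar_mass_g_per_mol


def interval_from_concentration(conc_nM, slope, intercept):
    """Forward calibration: log10(interval) = intercept + slope * log10(conc)."""
    return 10 ** (intercept + slope * math.log10(conc_nM))


def concentration_from_interval(interval_s, slope, intercept):
    """Inverse calibration: recover the concentration from an interval time."""
    return 10 ** ((math.log10(interval_s) - intercept) / slope)


lod_ng_per_litre = nanomolar_to_ng_per_litre(0.5, TFA_MOLAR_MASS_G_PER_MOL)

# PLACEHOLDER calibration parameters, NOT fitted values from the paper.
SLOPE, INTERCEPT = -1.0, 3.0
interval = interval_from_concentration(50.0, SLOPE, INTERCEPT)
recovered = concentration_from_interval(interval, SLOPE, INTERCEPT)

print(f"0.5 nM TFA = {lod_ng_per_litre:.1f} ng/L")
print(f"recovered concentration: {recovered:.1f} nM")
```

With fitted values in place of the placeholders, the inverse function would give the C2 concentration of an unknown sample from its mean interval time within the linear range.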