Background: Missing values are a major issue in quantitative proteomics data analysis. Although many methods have been developed for imputing missing values in high-throughput proteomics data, comparative assessments of their accuracy remain inconclusive, mainly because the true missing mechanisms are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook on current and future development.
Results: We first report an assessment of eight representative methods that collectively target three typical missing mechanisms. The selected methods are compared on both realistic simulation and real proteomics datasets, and performance is evaluated using three quantitative measures. We then discuss fused regularization matrix factorization, a popular low-rank matrix factorization framework with similarity and/or biological regularization, which can be extended to integrate multi-omics data such as gene expression or clinical variables. We further explore the potential application of convex analysis of mixtures, a biologically inspired latent variable modeling strategy, to missing value imputation. Preliminary results on proteomics data are provided, together with an outlook on future development directions.
Conclusion: While a few winners emerged from our comparative assessment, data-driven evaluation of imputation methods is imperfect because performance is evaluated indirectly on artificially missing or masked values rather than on authentic missing values. Imputation accuracy may vary with signal intensity. Fused regularization matrix factorization makes it possible to incorporate external information. Convex analysis of mixtures presents a biologically plausible new approach.
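
The masked-value evaluation mentioned in the conclusion can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names (`mask_observed`, `impute_row_mean`, `nrmse`), the toy intensity matrix, the row-mean imputer (a placeholder, not one of the eight assessed methods), and the choice of a normalized root-mean-square error, which may differ from the three measures used in the assessment. It shows only why the evaluation is indirect: entries that are authentically missing have no known truth and are never scored.

```python
import math
import random


def mask_observed(matrix, fraction, seed=0):
    """Hide a random fraction of the observed (non-None) entries.

    Returns a masked copy and a dict {(row, col): true_value} of hidden entries.
    Authentic missing values (None in the input) are left untouched.
    """
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(round(fraction * len(observed))))
    hidden = rng.sample(observed, n_hide)
    masked = [list(row) for row in matrix]
    truth = {}
    for i, j in hidden:
        truth[(i, j)] = matrix[i][j]
        masked[i][j] = None
    return masked, truth


def impute_row_mean(matrix):
    """Placeholder imputer (assumption): fill each gap with its row mean."""
    filled = []
    for row in matrix:
        seen = [v for v in row if v is not None]
        fill = sum(seen) / len(seen) if seen else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled


def nrmse(imputed, truth):
    """RMSE over the masked entries, divided by the std. dev. of their true values."""
    true_vals = list(truth.values())
    n = len(true_vals)
    mse = sum((imputed[i][j] - t) ** 2 for (i, j), t in truth.items()) / n
    mean = sum(true_vals) / n
    var = sum((t - mean) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var) if var > 0 else float("nan")


# Toy log-intensity matrix (made-up numbers); None marks authentic missing values.
toy = [
    [20.1, 19.8, None, 20.4],
    [15.2, None, 15.0, 15.5],
    [22.3, 22.0, 22.6, None],
]

masked, truth = mask_observed(toy, fraction=0.3, seed=1)
imputed = impute_row_mean(masked)
score = nrmse(imputed, truth)
print(f"NRMSE on {len(truth)} masked entries: {score:.3f}")
```

Swapping `impute_row_mean` for another imputer leaves the rest of the loop unchanged, so several methods can be compared on the same mask. The score still reflects only the artificially masked entries, which is the limitation the conclusion points out.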