Background
Imputation is one of the strategies for dealing with missing values (MVs) in microarray data, and choosing the best subset of genes for imputation is very important. In this study, we applied mutual-information-based gene selection before imputation to select the best subset of genes for imputing MVs, and then classified the imputed data. Two datasets were used, and we generated MVs at missing rates from 1 to 7 percent. Three methods were employed: K-nearest neighbor imputation, row mean imputation, and a method incorporating Feature Selection with Missing data by Mutual Information (FSM-MI). We used Root Mean Square Error (RMSE) to evaluate the performance of the methods. We classified the complete and imputed data with a random forest classifier and compared them by accuracy.
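The evaluation loop described above (remove known values, impute them, score the imputation by RMSE) can be illustrated with a minimal sketch. It covers only row mean imputation, not K-nearest neighbor or FSM-MI. The function names, the toy matrix, and the choice to compute RMSE over the artificially removed entries only are assumptions made for illustration, not details taken from the study.

```python
import math
import random


def mask_entries(matrix, missing_rate, seed=0):
    """Return a copy of `matrix` with a fraction of entries set to None,
    together with the list of (row, column) positions that were removed."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = max(1, round(missing_rate * len(cells)))
    masked = rng.sample(cells, n_missing)
    damaged = [row[:] for row in matrix]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked


def row_mean_impute(matrix):
    """Replace each None with the mean of the observed values in its row (gene)."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Arbitrary fallback of 0.0 for a row with no observed values (assumption).
        fill = sum(observed) / len(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed


def rmse(original, imputed, masked):
    """Root mean square error over the masked positions only (assumed definition)."""
    squared = [(original[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))


# Toy expression matrix: rows are genes, columns are samples (made-up values).
expression = [
    [0.10, 0.20, 0.30, 0.40],
    [1.00, 1.10, 0.90, 1.00],
    [0.50, 0.50, 0.50, 0.50],
]
damaged, masked = mask_entries(expression, missing_rate=0.07)
restored = row_mean_impute(damaged)
error = rmse(expression, restored, masked)
```

Because the original values at the masked positions are known, the same `rmse` call could score any other imputation method on the same `damaged` matrix, which is what makes the methods comparable.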
Results
The FSM-MI imputation method performed best, with mean RMSE values of 0.0364 and 0.0083 on the GSE510 and GSE1063 datasets, respectively. The classification accuracy on the complete data was 100 percent for GSE510 and 80 percent for GSE1063. On the imputed data, the accuracy at missing rates of 1, 3, 5, and 7 percent was 100, 100, 83.3, and 66.7 percent for GSE510 and 80, 80, 70, and 70 percent for GSE1063, respectively.
Conclusion
Feature selection before imputing MVs is as important as selecting the best imputation method for improving the results of subsequent analyses.