2.1 Data source
A sample of 177 pediatric patients (age ≤16 years; DF+: 69; DF−: 108) was extracted from a previous article [4]; see Appendix 1. Feature variables were extracted from the 19 DF-related symptoms collected: (1) personal history of DF, (2) family history of DF, (3) mosquito bites within the previous two weeks, (4) fever ≥39°C, (5) biphasic fever, (6) erythema, (7) skin rash, (8) petechiae, (9) headache, (10) myalgia, (11) abdominal pain, (12) vomiting, (13) soft (watery) stool, (14) cough, (15) sore throat, (16) anorexia, (17) weak sense, (18) bone pain (arthralgia), and (19) flushed skin.
All data used in this study were downloaded from a previous article [4]. Given its design, this study does not require ethical approval according to the regulations of the Taiwan Ministry of Health and Welfare.
2.2 Combination of Algorithms to Improve DF Classification
Four prediction models (CNN, ANN, KNN, and LR) were proposed so that the DF classification of their combination could be compared with that of the individual algorithms. The CNN and ANN models, implemented as Microsoft (MS) Excel modules, have been described in previous studies [24-26].
2.2.1 The KNN Model Deposited In MS Excel
A KNN model with an MS Excel module is shown in Figure 1. After extracting feature variables, the KNN algorithm was applied with the following steps:
Step 1: Computing the Distance for Each Paired Case (panel A in Figure 1)
In the n-case training sample, n rows and n columns record the Euclidean distance between each pair of cases (players). For instance, cell D2 (=0) is the distance between the first player and itself, and cell E2 (=9.83) is the distance between the first and second players.
Step 2: Sorting the Distances in Columns for Each Player (panel B in Figure 1)
For each player (row), the distances in the columns were sorted in ascending order. The shortest distance (=0) is placed in column D, followed by the next shortest distances in the row (e.g., 6.48 and 6.84 in columns E and F for the first player in row 2).
Step 3: Labeling the Classifications Sorted by Distances in Columns for Each Row (panel C in Figure 1)
All sorted distances in the columns were replaced with the corresponding numeric class labels (e.g., 1 and 0). For instance, the last four cases are labeled 1 in the first three columns (D to F), and the first five cases are labeled 0.
Step 4: Determining the k Value
We varied k (i.e., the number of columns used to predict the classification) from 1 to 10 and selected the value with the highest accuracy rate as the k used for classification.
Step 5: Using the Mode Function in MS Excel to Classify the Case Label for the Chosen k Value
An example with k=3 is shown at the bottom of Figure 1. Before classification, the red circle could be assigned to either Class A or Class B. The class labels of the three nearest cases are compared using the MODE function in MS Excel. In this case, the circle is assigned to Class A because the modal class, with two of the three neighbors (the yellow squares), is Class A. As such, the red circle is classified by the majority vote of its k (=3) nearest neighbors in KNN.
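The five steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' Excel module: the function names (`distance_matrix`, `sorted_neighbor_labels`, `knn_predict`, `best_k`) are hypothetical, and ties in the vote are assumed to go to the label met first in the sorted row, which the text does not specify.

```python
import math
from collections import Counter


def distance_matrix(X):
    """Step 1: n x n Euclidean distances between every pair of cases."""
    return [[math.dist(a, b) for b in X] for a in X]


def sorted_neighbor_labels(dist_row, labels):
    """Steps 2-3: sort one row by ascending distance, then swap in the labels."""
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])
    return [labels[j] for j in order]


def knn_predict(X, labels, k):
    """Step 5: mode (majority vote) over the first k sorted labels of each row."""
    preds = []
    for row in distance_matrix(X):
        ranked = sorted_neighbor_labels(row, labels)
        # Assumption: on a tie, the label encountered first (nearest) wins.
        preds.append(Counter(ranked[:k]).most_common(1)[0][0])
    return preds


def best_k(X, labels, k_values=range(1, 11)):
    """Step 4: try k = 1..10 and keep the k with the highest accuracy."""
    def accuracy(k):
        preds = knn_predict(X, labels, k)
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(k_values, key=accuracy)
```

As in the layout described above, each case's nearest neighbor is the case itself (distance 0), so training accuracy at k=1 is perfect by construction; this only matters when k is tuned on the training sample alone.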
2.2.2 The LR Model Deposited In MS Excel
The LR model with an MS Excel module is shown in Figure 2 with the following steps:
Step 1: Actual Labels (in Quadrant III of Figure 2)
In the n-case training sample, the actual labels of classes 0 and 1 are shown in green and red, respectively.
Step 2: LR Model Building (in Quadrant IV of Figure 2)
The LR model was built in Quadrant IV of Figure 2. The logit formula (logit = a + WX) was set for each case.
Step 3: The Probability of Classification (in Quadrant I of Figure 2)
The probability (prob = 1/(1 + exp(−logit)) = exp(logit)/(1 + exp(logit))) was also computed for each case.
Step 4: The Predicted Labels (in Quadrant II of Figure 2)
The predicted labels were set to 0 if prob < 0.5 and to 1 otherwise.
Step 5: Minimizing the Model Residual (in Quadrant III of Figure 2)
The model residual was computed with the MS Excel function SUMXMY2(range1, range2), where range1 comprised the actual labels of each case in two columns (i.e., (0,1) for DF+ and (1,0) for DF−), and range2 comprised the corresponding probabilities of DF− and DF+.
The MS Excel Solver was applied to estimate parameters a and W in Quadrant IV. That is, the intercept and the variable coefficients were calibrated by iterating through Steps 1 to 4 during model optimization.
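The LR steps can be illustrated with the short Python sketch below. It is not the authors' Excel workbook: plain gradient descent is swapped in for the MS Excel Solver, and the function names and the settings `lr` and `epochs` are hypothetical choices made for illustration only.

```python
import math


def prob(a, w, x):
    """Steps 2-3: logit = a + w.x, then prob = 1 / (1 + exp(-logit))."""
    logit = a + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-logit))


def residual(a, w, X, y):
    """Step 5: sum of squared differences over both label columns,
    i.e. what SUMXMY2 returns for (DF-, DF+) labels vs. probabilities."""
    total = 0.0
    for x, yi in zip(X, y):
        p = prob(a, w, x)
        total += ((1 - yi) - (1 - p)) ** 2 + (yi - p) ** 2
    return total


def fit(X, y, lr=0.05, epochs=5000):
    """Assumed stand-in for Solver: gradient descent on the residual."""
    a, w = 0.0, [0.0] * len(X[0])
    for _ in range(epochs):
        grad_a, grad_w = 0.0, [0.0] * len(w)
        for x, yi in zip(X, y):
            p = prob(a, w, x)
            g = -4.0 * (yi - p) * p * (1.0 - p)  # d(residual)/d(logit)
            grad_a += g
            for j, xj in enumerate(x):
                grad_w[j] += g * xj
        a -= lr * grad_a
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return a, w


def predict(a, w, X):
    """Step 4: label 0 if prob < 0.5, otherwise 1."""
    return [0 if prob(a, w, x) < 0.5 else 1 for x in X]
```

Because the two label columns are complementary, the residual equals twice the squared error on the DF+ column alone, which is why the gradient carries a factor of 4.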
After the parameters were estimated, the model accuracies in the training and testing sets were obtained through the following equations [27,28].
Model performance was assessed by sensitivity (SENS), specificity (SPEC), precision, accuracy (ACC), and AUC in both models, with higher values indicating better performance. The definitions are listed below:
True positive (TP) = the number of cases predicted as DF that are truly DF, (1)
True negative (TN) = the number of cases predicted as Non-DF that are truly Non-DF, (2)
False positive (FP) = the number of Non-DF cases minus TN, (3)
False negative (FN) = the number of DF cases minus TP, (4)
SENS = sensitivity = true positive rate (TPR) = TP ÷ (TP + FN), (5)
SPEC = specificity = true negative rate (TNR) = TN ÷ (TN + FP), (6)
Precision = positive predictive value (PPV) = TP ÷ (TP + FP), (7)
ACC = accuracy = (TP + TN) ÷ N, (8)
N = TP + TN + FP + FN, (9)
AUC = (1 − SPEC) × SENS ÷ 2 + (SENS + 1) × SPEC ÷ 2, (10)
SE for AUC = √(AUC × (1 − AUC) ÷ N), (11)
95% CI = AUC ± 1.96 × SE for AUC, (12)
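Equations (1) to (12) can be computed directly from the actual and predicted labels, as in the following Python sketch. The function name `classification_metrics` and the dictionary keys are hypothetical, and the sketch assumes both classes occur and at least one case is predicted positive, so that no denominator is zero.

```python
import math


def classification_metrics(actual, predicted):
    """Equations (1)-(12) for binary labels (1 = DF, 0 = Non-DF)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # (1)
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # (2)
    fp = actual.count(0) - tn                                       # (3)
    fn = actual.count(1) - tp                                       # (4)
    n = tp + tn + fp + fn                                           # (9)
    sens = tp / (tp + fn)                                           # (5)
    spec = tn / (tn + fp)                                           # (6)
    precision = tp / (tp + fp)                                      # (7)
    acc = (tp + tn) / n                                             # (8)
    auc = (1 - spec) * sens / 2 + (sens + 1) * spec / 2             # (10)
    se = math.sqrt(auc * (1 - auc) / n)                             # (11)
    ci = (auc - 1.96 * se, auc + 1.96 * se)                         # (12)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn, "N": n,
            "SENS": sens, "SPEC": spec, "Precision": precision,
            "ACC": acc, "AUC": auc, "SE": se, "CI95": ci}
```

Equation (10) is the area under a ROC curve with a single cut point, and it simplifies algebraically to (SENS + SPEC) ÷ 2.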
2.3 Three Tasks to Be Achieved
The following three tasks were to be achieved:
2.3.1 Extracting Feature Variables (Task 1)
From the 19 observed DF variables described in Section 2.1, we performed LR to extract the feature variables associated with DF, using a Type I error of <0.05 as the criterion, with the results shown on a forest plot [40-42].
The extraction followed two steps: (i) standardize each variable to a mean of 0 and a standard deviation (SD) of 1, and (ii) compare the standardized mean difference (SMD) between groups on a forest plot [40-42].
The Chi-square test was conducted to assess the heterogeneity between variables. Forest plots (i.e., confidence interval [CI] plots) were drawn to display the effect estimate and its CI for each variable.
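The standardization and SMD comparison can be sketched in Python as below. The text does not give the exact formulas, so this sketch makes labeled assumptions: the population SD is used for standardization, the SMD is the difference in standardized group means (DF+ minus DF−), and its 95% CI uses an unpooled standard error. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev, variance


def standardize(values):
    """Step (i): rescale one variable to mean 0 and SD 1.
    Assumption: population SD; a constant variable would divide by zero."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def smd_with_ci(values, groups):
    """Step (ii): standardized mean difference (DF+ minus DF-) with a 95% CI.
    Assumption: unpooled standard error of the difference in means."""
    z = standardize(values)
    pos = [zi for zi, g in zip(z, groups) if g == 1]
    neg = [zi for zi, g in zip(z, groups) if g == 0]
    diff = mean(pos) - mean(neg)
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return diff, (diff - 1.96 * se, diff + 1.96 * se)
```

On a forest plot, each variable would contribute one point (the SMD) and one horizontal bar (the CI); a bar that crosses zero indicates no significant difference at the 0.05 level under these assumptions.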
2.3.2 Comparing the Combined Scheme With Individual Algorithms
2.3.2.1 Comparison Between Algorithms
Two scenarios, non-normalized data (i.e., raw data with present and absent responses on DF) and normalized data (i.e., mean = 0, SD = 1), were applied to compare model accuracy and stability among algorithms, including CNN, ANN, KNN, LR, and others available in WEKA software (University of Waikato, Hamilton, New Zealand) [43], such as Support Vector Machines (SVM) [44], LIBSVM [45], BayesNet, Naïve Bayes [45], Random Forest Classification [47], REPTree [48], logistic regression [48], artificial neural network (ANN) [49], and CNN [24-26]; see Appendix 1. The criteria of AUC ≥0.8 in the training set and AUC ≥0.7 in the testing set were used to determine acceptable model accuracy and stability in the prediction of DF.
2.3.2.2 Comparison Within the Combined Scheme
The model accuracies and stabilities within the combined scheme (i.e., CNN/ANN/KNN/LR) were compared under several scenarios (e.g., including and excluding KNN), using the mode of the individual predictions to classify each case as DF or Non-DF.
Because we hypothesized that a combined scheme of algorithms can improve the prediction accuracy of DF in children, the combined effects on accuracy and stability were examined based on AUC. That is, the hypothesis is supported only if the accuracy and stability of the combined scheme exceed those of the individual algorithms.
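The mode-based combination can be illustrated with the Python sketch below. It assumes each model's 0/1 predictions are already available; the function name `combine_by_mode` and the `tie_label` parameter are hypothetical. With four models a 2-2 tie is possible, and since the text does not say how ties are resolved, the sketch assumes they go to DF+ (1).

```python
from collections import Counter


def combine_by_mode(model_predictions, tie_label=1):
    """Classify each case by the mode of the individual model predictions.

    model_predictions: dict mapping a model name to its list of 0/1 labels.
    tie_label: label used when no single mode exists (an assumption).
    """
    combined = []
    for votes in zip(*model_predictions.values()):
        counts = Counter(votes)
        top = max(counts.values())
        winners = [label for label, c in counts.items() if c == top]
        combined.append(winners[0] if len(winners) == 1 else tie_label)
    return combined
```

A scenario that excludes one algorithm (e.g., KNN) corresponds to passing the same dictionary without that key, which also removes the tie problem when three models remain.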
2.3.3 Developing an App for Patients, Family Members, and Clinicians
An app for the detection of DF in children was designed and developed. Model parameters were embedded in the computer module. The results of the classification (i.e., DF+ and DF−) instantly appear on smartphones. The visual representation with binary (i.e., DF+ and DF−) categories is shown on a dashboard displayed on Google Maps.
2.4 Statistical Tools and Data Analysis
IBM SPSS Statistics 22.0 for Windows (SPSS Inc., Chicago, US) and MedCalc 9.5.0.0 for Windows (MedCalc Software, Ostend, Belgium) were used to obtain the descriptive statistics and frequency distributions among groups and to compute the model prediction indicators expressed in Equations (1) to (12). The significance level for Type I error was set at 0.05. The four proposed models (CNN, ANN, KNN, and LR) were implemented in MS Excel and deposited in Appendix 1. The study flowchart is presented in Figure 3. The abstract video is provided in Appendices 2 to 5.