Study population
Data on confirmed cases of conjunctivitis between August 1, 2014 and August 1, 2019 were obtained from
the Eye Center of the hospital, one of the largest ophthalmology clinics in Zhejiang Province. Collected data
included visit date, gender, age, home address, and whether the visit was the patient's first visit or a
re-visit. Conjunctivitis was diagnosed according to the International Classification of Diseases, 10th Revision
(codes H10.901, H10.301, and H10.402).
Air Pollution and Weather Data
Six air quality monitoring stations in Hangzhou provide daily values of PM2.5 (µg/m3), PM10 (µg/m3),
SO2 (µg/m3), O3 (µg/m3), NO2 (µg/m3), and CO (µg/m3), together with the highest and lowest air temperatures.
Daily air pollution parameters and temperatures for Hangzhou between January 1, 2014 and August 31, 2019 were
downloaded from the China Meteorological Administration (http://data.cma.cn/). Hourly pollutant concentrations
were first averaged across the six stations, and the 24-hour average pollutant concentration was then calculated.
Pollution severity was assigned to one of four categories of the Air Quality Index (AQI), which was calculated
from the six air indicators mentioned above: AQI ≤ 100 indicates no pollution, 101 ≤ AQI ≤ 150 mild pollution,
151 ≤ AQI ≤ 200 moderate pollution, and AQI ≥ 201 severe pollution (USEPA, https://www.gpo.gov/). Considering
the effect of air pollution on conjunctivitis, the environmental and weather parameters for each day were
calculated as the average of the previous three days. For example, the value of the environmental factors for
January 4, 2015 was calculated as the average of the values of January 1–3, 2015. The details of the Affiliated
Hospital of Hangzhou Normal University and the specific air quality monitoring stations are shown in
Appendix Fig. 1.
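The severity categories and the three-day exposure window can be illustrated with a minimal Python sketch. This is not the authors' code (the analyses were run in R). The function names, the treatment of boundary values (an AQI of exactly 100, 150, or 200 is assigned to the lower category), and the PM2.5 numbers are assumptions made for illustration only.

```python
from statistics import mean


def aqi_category(aqi):
    """Map an AQI value to one of the four severity categories.

    Assumption: boundary values (100, 150, 200) fall in the lower category.
    """
    if aqi <= 100:
        return "none"
    if aqi <= 150:
        return "mild"
    if aqi <= 200:
        return "moderate"
    return "severe"


def prior_three_day_mean(daily_values, day_index, window=3):
    """Average of the `window` days before day_index (the day itself is excluded).

    Returns None when fewer than `window` earlier days are available.
    """
    if day_index < window:
        return None
    return mean(daily_values[day_index - window:day_index])


# Hypothetical daily PM2.5 values for January 1-4, 2015 (illustrative numbers only).
pm25 = [30.0, 60.0, 90.0, 45.0]

# Exposure assigned to January 4 (index 3) is the mean of January 1-3.
exposure_jan4 = prior_three_day_mean(pm25, 3)
```

Excluding the current day from the window matches the January 4 example in the text, where only January 1–3 contribute to the value.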
Machine Learning Algorithms
We mainly consider the multiple regression model \(y={X}^{T}\beta\), \(X=({x}_{1},{x}_{2},{x}_{3},{x}_{4},{x}_{5},{x}_{6},{x}_{7},{x}_{8})\), where \(y\) represents the number of patients and \({x}_{1},{x}_{2},{x}_{3},{x}_{4},{x}_{5},{x}_{6},{x}_{7},{x}_{8}\) represent, respectively,
the environmental variables CO, O3, NO2, SO2, PM10, PM2.5, and the highest and
lowest temperatures. Seven typical machine learning algorithms were used to train the regression model: Lasso
penalized linear model [15], decision tree [16], boosting regression [17], bagging regression [18],
random forest [19], support vector machine [20], and artificial neural network [21]. For each method, the reliability of the results was assessed using 10-fold
cross validation. Data were randomly split into test and training datasets at a ratio of 3:7, respectively. Machine
learning models were built on the training dataset, and predictions were evaluated on the test dataset. Figure 1 shows the
flowchart of this work. The machine learning techniques were implemented in R using the packages lars, rpart.plot,
mboost, ipred, randomForest, rminer, and nnet, respectively.
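The data partitioning described above can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation. The text does not state whether the 10-fold cross validation was run on the training portion or on the full dataset, so applying it to the training indices is an assumption here, as are the function names, the fixed seed, and the sample size of 1000.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Randomly split range(n) into (training, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross validation."""
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Hypothetical dataset of 1000 daily records, split 7:3 into training and test.
train_idx, test_idx = train_test_split_indices(1000)

# Ten (fit, validation) partitions of the training indices.
folds = list(k_fold_indices(train_idx, k=10))
```

Each model would be fitted on the `fit` indices of a fold and scored on its `validation` indices, with the held-out test indices reserved for the final comparison of the seven methods.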
Statistical Analyses
The accuracy of each of the seven methods was assessed using the normalized mean square error (NMSE):
\(NMSE=\frac{\sum {(y-\widehat {y})}^{2}}{\sum {(y-\overline {y})}^{2}}\),
where \(y\) represents the actual number of patients per day, \(\overline {y}\) represents the actual average
number of patients per day, and \(\widehat {y}\) is the predicted number of patients for the test set based
on the model from the training set. The NMSE ranges from 0 to 1; the smaller the value, the higher the accuracy.
The distributions of prediction deviations were also compared, where the deviation is the absolute value of
the predicted value minus the actual value. The Pearson correlation coefficient was used to compare the predicted
and actual numbers of patients for the selected optimal machine learning method. The importance of each variable
was measured as the average decline in NMSE caused by deleting that variable; the larger the value, the
more important the variable. The overall predicted number of cases for all patients was calculated using all seven
methods. Further analyses by sex (men and women) and age at diagnosis (<45, 45–60, >60 years) were also
conducted. All analyses were performed using the statistical programming environment R (version 3.6.0).
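The three evaluation quantities can be illustrated with a short Python sketch. It is not the authors' R code. It assumes the common definition of NMSE as the sum of squared prediction errors divided by the sum of squared deviations of the actual values from their mean. The function names and the patient counts are hypothetical.

```python
from math import sqrt


def nmse(actual, predicted):
    """Sum of squared errors divided by total sum of squares of the actual values."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    denominator = sum((y - mean_actual) ** 2 for y in actual)
    return numerator / denominator


def absolute_deviations(actual, predicted):
    """Absolute value of predicted minus actual, per observation."""
    return [abs(y_hat - y) for y, y_hat in zip(actual, predicted)]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical daily patient counts and model predictions (illustrative only).
actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 17]

score = nmse(actual, predicted)
deviations = absolute_deviations(actual, predicted)
correlation = pearson_r(actual, predicted)
```

Under this definition, a model that always predicts the mean of the actual values has an NMSE of exactly 1, so values closer to 0 indicate an improvement over that baseline.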