We optimized the geometries of complexes 1 to 15 (one of which, t-DCTN, is an individual molecule). Their interatomic distances and atomic charges complement the dataset used for primary testing in both codes, which also includes data from the complexes studied in our previous works on LPED.[6, 7, 23]
From the optimized geometries, we analyzed the QTAIM properties needed to obtain the local potential energy density by means of the molecular graphs displayed in Fig. 2. The element symbols in this figure mark the interacting atoms of each complex or of the individual molecule (8). The bond paths in the molecular graphs indicate chemical bonds and intra/intermolecular interactions. Each pair of bond paths is joined at a bond critical point (the smaller red point in the molecular graph).
The interactions in 8 (t-DCTN) are intramolecular, and they are responsible for its most stable conformation, which has already been analyzed by experimental and theoretical techniques.[26] Two of these intramolecular interactions (hydrogen-hydrogen bonds) are less well known in chemistry textbooks, but they have been widely explored in the recent literature.[27, 28] Hydrogen-hydrogen bonds are interactions between hydrogen atoms bonded to carbon atoms (CH···HC).
Complex 11 was taken from the CYP17-pregnenolone complex, which underwent classical molecular dynamics, docking, and ONIOM calculations. From this optimized complex, two selected CYP17 amino acids interacting with pregnenolone, together with pregnenolone itself, were isolated from the rest of the enzyme, and their QTAIM data were collected. Complex 11 is therefore made up of pregnenolone, arginine, and asparagine (the latter two from CYP17). In this complex, there are hydrogen bonds, hydrogen-hydrogen bonds, and a usual H-H bond.[29]
Table 1 lists all the parameters used in the LPED calculation. These are the atomic charges of the pair of interacting atomic basins, q(A) and q(B), where A and B are the atomic basins identified in each entry of Table 1. Note that, whenever possible, atomic basin A is the more electronegative of the two interacting basins and carries the negative atomic charge. Table 1 also contains the localization index of the interacting atomic basins, LI(A) and LI(B); the effective atomic number of the interacting atoms, Zeff(A) and Zeff(B); the charge density at the bond critical point that connects the two bond paths leading to atomic basins A and B, ρbcp; and the lengths of the two bond paths of the interaction or bond, from each atomic basin to the bond critical point, r(A−bcp) and r(B−bcp). The LPED of the complexes or dimers 1 to 21 is given in atomic units (au.) and in kcal mol−1 Bohr−3. All values in Table 1 were obtained from the topological analysis of the ωB97X-D[17]/6-311++G(2d,2p) wave function.
Table 1
QTAIM atomic charge, q, localization index, LI, and effective atomic number, Zeff, of the interacting atomic basins A and B, in au.; charge density at the bond critical point, ρbcp, of the inter/intramolecular interaction, in au.; bond path length, in Bohr, from atomic basin A to the bcp, r(A−bcp), and from atomic basin B to the bcp, r(B−bcp); and local potential energy density, LPED, in au. and in kcal mol−1 Bohr−3, of all studied dimers and complexes (1 to 21) at the ωB97X-D[17]/6-311++G(2d,2p) level of theory. Whenever possible, atomic basin A corresponds to the most electronegative atom of the interacting pair.
Entry | q(A) / au. | LI(A) / au. | Zeff(A) / au. | q(B) / au. | LI(B) / au. | Zeff(B) / au. | ρbcp / au. | r(A−bcp) / Bohr | r(B−bcp) / Bohr | LPED / au. | LPED(a)
1(b) | -1.280 | 8.420 | 0.860 | 0.915 | 9.960 | 0.125 | -0.0290 | 2.320 | 1.8700 | -0.0063 | -3.98
2(c) | -1.067 | 6.593 | 1.474 | 0.372 | 0.162 | 0.466 | -0.0170 | 2.630 | 1.5810 | -0.0073 | -4.56
3(d1) | -0.584 | 5.946 | 1.638 | 0.62 | 0.058 | 0.322 | -0.0280 | 2.429 | 1.3600 | -0.0128 | -8.00
3(d2) | -0.789 | 7.698 | 1.091 | 0.606 | 0.065 | 0.329 | -0.0150 | 2.590 | 1.6560 | -0.0046 | -2.92
4(e) | -0.961 | 6.464 | 1.497 | 0.025 | 16.161 | 0.814 | -0.0296 | 2.450 | 2.4300 | -0.0140 | -8.79
5(f) | -0.012 | 6.827 | 1.185 | 0.25 | 0.224 | 0.526 | -0.0165 | 2.574 | 1.5260 | -0.0066 | -4.17
6(g) | -1.133 | 8.175 | 0.958 | 0.613 | 0.059 | 0.328 | -0.0238 | 2.316 | 1.4830 | -0.0076 | -4.74
7(h1) | -1.098 | 8.151 | 0.947 | 1.050 | 3.271 | 1.679 | -0.014 | 2.654 | 2.4690 | -0.0073 | -4.55
7(h2) | -1.121 | 8.202 | 0.919 | 0.058 | 0.400 | 0.542 | -0.0110 | 2.677 | 1.9740 | -0.0034 | -2.13
8(i1) | -1.148 | 8.188 | 0.960 | 0.042 | 0.394 | 0.564 | -0.0110 | 2.744 | 1.9900 | -0.0035 | -2.19
8(i2) | -0.019 | 0.449 | 0.570 | 0.022 | 0.421 | 0.557 | -0.0079 | 2.283 | 2.3630 | -0.0019 | -1.20
9(j) | 0.018 | 16.238 | 0.744 | 0.001 | 0.437 | 0.563 | -0.004 | 3.610 | 2.320 | -0.0009 | -0.56
10(k1) | -0.323 | 16.643 | 0.680 | 0.893 | 9.967 | 0.140 | -0.012 | 3.147 | 2.156 | -0.0017 | -1.06
10(k2) | -0.877 | 17.681 | 0.196 | 0.133 | 0.318 | 0.549 | -0.014 | 3.133 | 1.713 | -0.0027 | -1.68
11(l1) | -1.074 | 6.453 | 1.621 | 0.032 | 0.400 | 0.568 | -0.008 | 3.074 | 2.725 | -0.0029 | -1.80
11(l2) | 0.422 | 0.146 | 0.432 | -0.026 | 0.458 | 0.568 | -0.003 | 2.624 | 2.725 | -0.0006 | -0.39
11(l3) | -1.134 | 8.132 | 1.002 | 0.537 | 0.085 | 0.378 | -0.023 | 2.357 | 1.390 | -0.0080 | -5.03
11(l4) | -1.134 | 8.132 | 1.002 | 0.032 | 0.400 | 0.568 | -0.008 | 3.192 | 2.127 | -0.0022 | -1.37
11(l5) | 0.414 | 0.143 | 0.443 | 0.036 | 0.398 | 0.566 | -0.008 | 2.417 | 2.620 | -0.0015 | -0.96
11(l6) | -1.225 | 6.635 | 1.590 | 0.049 | 0.392 | 0.559 | -0.008 | 2.919 | 2.553 | -0.0030 | -1.87
11(l7) | -0.895 | 7.785 | 1.110 | 0.477 | 0.110 | 0.413 | -0.018 | 2.511 | 1.482 | -0.0065 | -4.07
11(l8) | -0.895 | 7.785 | 1.110 | 0.056 | 0.383 | 0.561 | -0.008 | 2.876 | 1.998 | -0.0025 | -1.59
12(m) | -1.126 | 8.177 | 0.949 | 0.598 | 0.064 | 0.338 | -0.025 | 2.381 | 1.323 | -0.0082 | -5.15
12(m) | -1.158 | 8.432 | 0.726 | 0.069 | 0.379 | 0.552 | -0.009 | 2.722 | 1.968 | -0.0024 | -1.51
13(n) | -0.067 | 6.646 | 1.421 | 0.010 | 3.932 | 2.058 | -0.013 | 2.446 | 2.871 | -0.0084 | -5.25
14(o) | -1.131 | 8.162 | 0.969 | 0.641 | 0.051 | 0.308 | -0.047 | 2.155 | 1.130 | -0.0169 | -10.58
15(p1) | -0.033 | 3.966 | 2.067 | 0.020 | 0.411 | 0.569 | -0.006 | 3.191 | 2.287 | -0.0026 | -1.66
15(p2) | 0.020 | 0.411 | 0.569 | 0.020 | 0.408 | 0.572 | -0.011 | 2.127 | 2.126 | -0.0029 | -1.80
15(p3) | -0.005 | 3.975 | 2.030 | 0.0104 | 3.963 | 2.027 | -0.004 | 3.449 | 3.706 | -0.0024 | -1.53
16(q) | -1.208 | 6.875 | 1.333 | 0.597 | 0.065 | 0.338 | -0.0195 | 2.532 | 1.415 | -0.0075 | -4.68
17(r) | -1.167 | 8.422 | 0.745 | 0.535 | 0.089 | 0.376 | -0.0450 | 2.136 | 1.095 | -0.0156 | -9.77
18(s1) | -1.096 | 7.899 | 1.197 | 0.560 | 0.077 | 0.363 | -0.0180 | 2.493 | 1.470 | -0.0065 | -4.11
18(s2) | -1.167 | 8.217 | 0.950 | 0.607 | 0.061 | 0.332 | -0.0250 | 2.365 | 1.314 | -0.0082 | -5.13
19(t1) | -1.177 | 8.238 | 0.939 | 0.078 | 0.369 | 0.553 | -0.0082 | 2.818 | 2.062 | -0.0025 | -1.55
19(t2) | 1.557 | 2.822 | 1.621 | 1.559 | 2.820 | 1.621 | -0.0054 | 3.027 | 3.027 | -0.0029 | -1.81
19(t3) | -1.087 | 7.905 | 1.182 | -1.083 | 7.899 | 1.184 | -0.0076 | 2.877 | 2.877 | -0.0031 | -1.96
20(u) | -1.092 | 7.898 | 1.194 | 0.601 | 0.062 | 0.337 | -0.0293 | 2.320 | 1.254 | -0.0115 | -7.20
21(v1) | -1.154 | 8.185 | 0.969 | 0.598 | 0.064 | 0.338 | -0.0243 | 2.363 | 1.322 | -0.0081 | -5.08
21(v2) | -1.152 | 8.429 | 0.723 | 0.032 | 0.411 | 0.557 | -0.0068 | 2.929 | 2.254 | -0.0017 | -1.05
(a) Unit in kcal mol−1 Bohr−3;
(b) O-Na interaction;
(c) N-H interaction;
(d1) N-H interaction; (d2) O-H interaction;
(e) N-Cl interaction;
(f) O-H interaction;
(g) O-H interaction;
(h1) O-C interaction; (h2) O-HC interaction;
(i1) O-HC interaction; (i2) H-H interaction;
(j) Cl-H interaction;
(k1) Cl-Na interaction; (k2) Cl-HC interaction;
(l1) N-H (ASN-PREG) interaction; (l2) H-H (ASN-PREG) interaction; (l3) O-H (ASN-PREG) interaction; (l4) O-H (ASN-PREG) interaction; (l5) NH-H (ARG-PREG) interaction; (l6) N-H (ARG-PREG) interaction; (l7) O-HN (PREG-ARG) interaction; (l8) O-H (PREG-ARG) interaction;
(m) O-H interaction;
(n) O-C interaction;
(o) O-H interaction;
(p1) C-H interaction; (p2) H-H interaction; (p3) C-C interaction;
(q) N-H interaction;
(r) O-H interaction;
(s1) O(sp3)-H interaction ; (s2) O(sp2)-H interaction;
(t1) O(sp2)-H interaction; (t2) O(sp2)-O(sp2) interaction; (t3) O(sp3)-O(sp3) interaction;
(u) O-H interaction;
(v1) O-H interaction; (v2) O-H(C) interaction.
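The two LPED columns of Table 1 differ only by a unit conversion. The sketch below assumes the standard factor 1 hartree ≈ 627.5095 kcal mol−1 (the factor is not stated in the text), and the function name `lped_au_to_kcal` is hypothetical. Because the au. column is rounded to four decimals, the converted values agree with the tabulated ones only to within a few hundredths.

```python
# Assumed conversion factor (not given in the text): 1 hartree in kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5095


def lped_au_to_kcal(lped_au: float) -> float:
    """Convert an energy density from hartree Bohr^-3 to kcal mol^-1 Bohr^-3."""
    return lped_au * HARTREE_TO_KCAL_PER_MOL


# (LPED in au., LPED in kcal mol^-1 Bohr^-3) pairs copied from Table 1.
examples = {
    "1(b)": (-0.0063, -3.98),
    "4(e)": (-0.0140, -8.79),
    "14(o)": (-0.0169, -10.58),
}

for entry, (lped_au, tabulated) in examples.items():
    converted = lped_au_to_kcal(lped_au)
    print(f"{entry}: converted {converted:.2f}, tabulated {tabulated:.2f}")
```

The small residual differences come from the rounding of the au. values, not from the conversion itself.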
Table 2 shows the dataset used to train the ML models in the primary testing (first code) and to train the best ML model in the code implemented in the Flask framework. Each row is identified by an index. The intra/intermolecular interaction in each row is given in the Interaction column (or variable). The interatomic distance is in Angstrom. The atomic charges of the interacting atoms A and B were obtained from the MK scheme. The LPED column (or variable) holds the actual value for each interaction, which is used in the train/test split and in the predictive algorithm.
The complexes whose optimized geometries and LPED values were already available from our previous works [5–7] correspond to indices 1 to 24 of the “Interaction” variable. The remaining indices (25 to 53) correspond to the complexes included in this work. In some rows of the “Interaction” categorical feature (as used before data wrangling), ‘avg’ indicates that the numerical features are averages over similar interactions in the corresponding complex. It is worth noting that the interatomic distance is a geometrical property and the MK atomic charges are derived from the electrostatic potential; neither has any physical relation to the topological data used to obtain LPED (Table 1).
Table 2
Dataset for training the ML models in both web application codes. Interatomic distances are in Angstrom. Atomic charges were obtained from the MK scheme. The LPED unit is kcal mol−1 Bohr−3.
Index | Interaction | Interatomic distance | Atomic charge A (MK) | Atomic charge B (MK) | LPED
1 | Cl-Ar (HCl-Ar) | 3.976 | -0.212 | 0.001 | -0.18
2 | Cl-H (chloromethane dimer) (avg)(a) | 3.121 | -0.165 | 0.154 | -0.88
3 | F-F (tetrafluoromethane dimer) (avg) | 3.063 | -0.210 | -0.184 | -0.53
4 | O-H (methyl ether dimer) (avg) | 2.771 | -0.295 | 0.087 | -1.17
5 | O-O (methyl ether dimer) | 3.220 | -0.295 | -0.271 | -1.42
6 | O-Cl (formaldehyde-Cl2) | 3.304 | -0.439 | 0.004 | -0.80
7 | O-H (water-ethyne) | 2.179 | -0.824 | 0.410 | -2.48
8 | H-H (propane dimer) (avg) | 2.636 | 0.083 | 0.082 | -0.46
9 | C-C (benzene dimer) (avg) | 3.529 | -0.166 | -0.083 | -2.07
10 | O-H (nitrophenol) | 1.723 | -0.471 | 0.461 | -10.93
11 | O-H (formic acid dimer) | 1.634 | -0.653 | 0.571 | -11.11
12 | O-H (formamide dimer) | 1.987 | -0.539 | 0.350 | -4.56
13 | O-H (methanol dimer) | 1.921 | -0.582 | 0.391 | -5.24
14 | N-H (methylamine dimer) | 2.247 | -0.818 | 0.301 | -4.21
15 | S-H (methanethiol dimer) | 2.852 | -0.301 | 0.165 | -2.07
16 | O-H (water dimer) | 1.916 | -0.771 | 0.396 | -3.72
17 | N-H (ammonia dimer) | 2.317 | -0.777 | 0.264 | -3.16
18 | F-H (HF dimer) | 1.811 | -0.402 | 0.392 | -2.43
19 | S-H (H2S dimer) | 2.845 | -0.233 | 0.118 | -1.82
20 | O-Ca (Fosfomycin-Ca2+) (avg) | 2.209 | -0.780 | 1.475 | -13.18
21 | O-H (guanine-cytosine) (avg) | 1.820 | -0.715 | 0.629 | -7.83
22 | N-H (guanine-cytosine) | 1.891 | -0.931 | 0.484 | -10.95
23 | O-Ca (EDTA-Ca2+) (avg) | 2.325 | -0.957 | 1.497 | -8.86
24 | N-Ca (EDTA-Ca2+) (avg) | 2.511 | 0.531 | 1.497 | -10.38
25 | O-Na (sodium-formate) (avg) | 2.223 | -0.776 | 0.892 | -6.60
26 | N-H (methanimine dimer) | 2.202 | -0.652 | 0.343 | -4.56
27 | N-H (methanoxime-dimer) | 1.972 | -0.101 | 0.277 | -8.00
28 | O-H (methanoxime-dimer) | 2.196 | -0.326 | 0.393 | -2.92
29 | O-HC (DCTN) | 2.456 | -0.561 | 0.076 | -2.19
30 | H-H (DCTN) | 2.285 | 0.119 | 0.086 | -1.20
31 | N-Cl (ammonia-chlorine) | 2.582 | -0.344 | -0.155 | -8.79
32 | O-H (hydroxy-acetaldehyde - dimer) | 1.898 | -0.416 | 0.358 | -4.74
33 | O-H (oxygen-hydrogen chloride) | 2.149 | 0.055 | 0.147 | -4.17
34 | O-C (formaldehyde-dimer) | 2.708 | -0.431 | 0.315 | -4.55
35 | O-HC (formaldehyde-dimer) | 2.417 | -0.427 | 0.016 | -2.13
36 | N-H (ASN-Preg)(b) | 2.707 | -1.365 | -0.072 | -1.80
37 | H-H (ASN-Preg) | 2.599 | -0.065 | 0.494 | -0.39
38 | O-HO (ASN-Preg) | 1.954 | -0.668 | 0.511 | -5.03
39 | O-HC (ASN-Preg) | 2.704 | -0.668 | -0.072 | -1.37
40 | NH-H (ARG-Preg) | 2.222 | 0.004 | 0.518 | -0.96
41 | N-H (ARG-Preg)(c) | 2.726 | -1.626 | 0.067 | -1.87
42 | O-HN (Preg-ARG) | 2.089 | -0.947 | 0.695 | -4.07
43 | O-H (Preg-ARG) | 2.549 | -0.947 | 0.103 | -1.59
44 | Cl-H (trimethylamine-chlorine) | 3.120 | -0.002 | 0.154 | -1.87
45 | Cl-Na (methyl chloride-NaCl) | 2.810 | -0.120 | 0.765 | -4.07
46 | Cl-H (NaCl-methyl chloride) | 2.540 | -0.785 | 0.000 | -1.59
47 | O-H (butanal-water) | 1.935 | -0.465 | 0.359 | -5.15
48 | O-H (water-butanal) | 2.455 | -0.828 | 0.060 | -1.51
49 | O-C (+NO2-benzene) (avg) | 2.801 | 0.367 | -0.224 | -5.25
50 | O-H (β-hydroxy acrolein) | 1.713 | -0.631 | 0.546 | -10.58
51 | C-H (cis-butene-dimer) | 2.862 | -0.219 | 0.001 | -1.66
52 | H-H (cis-butene-dimer) | 2.090 | 0.094 | 0.001 | -1.80
53 | C-C (cis-butene-dimer) | 3.510 | -0.013 | 0.010 | -1.53
(a) (avg) means averaged values
(b) ASN = asparagine from CYP17, Preg = pregnenolone
(c) ARG = arginine from CYP17
We validated the optimal size of our dataset using three metrics: the coefficient of determination, R², the mean absolute error, MAE, and the root mean squared error, RMSE. We tested dataset sizes of 33, 43, 53, and 63 samples for the best ML model, Linear Regression (see below), each at its optimal random_state value (see Table 3). Bouasria et al. carried out a similar analysis on a much more complex dataset with 29 numerical features, using 25, 50, 100, 200, 300, and 500 samples, and found that 300 samples gave the best dataset size.[30] Our dataset is much simpler, with only 3 numerical features, and its optimal size is 53 samples. The rule of thumb for linear and logistic regression models is a minimum of 10–20 samples per feature,[31, 32] and our optimal dataset size falls within this range.
Table 3
Metrics of the Linear Regression model (coefficient of determination, R², mean absolute error, MAE, in kcal mol−1 Bohr−3, and root mean squared error, RMSE, in kcal mol−1 Bohr−3) for different dataset sizes and their corresponding optimal random_state values.
Dataset size | Random_state | R² | MAE(a) | RMSE(a)
33 | 5 | 0.75 | 1.42 | 2.04
43 | 5 | 0.75 | 1.12 | 1.46
53 | 8 | 0.88 | 0.72 | 0.91
63 | 0 | 0.83 | 0.94 | 1.26
(a) In kcal mol−1 Bohr−3
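The rule-of-thumb check quoted above is simple arithmetic, sketched here for three numerical features. The helper name `recommended_sample_range` is hypothetical, and the 10–20 samples-per-feature bounds are the ones cited in the text.

```python
def recommended_sample_range(n_features: int, low: int = 10, high: int = 20) -> tuple[int, int]:
    """Return the (minimum, maximum) sample counts implied by a samples-per-feature rule."""
    return n_features * low, n_features * high


n_features = 3       # two MK atomic charges plus the Lennard_Jones potential
dataset_size = 53    # optimal size reported in Table 3

lower, upper = recommended_sample_range(n_features)
within_range = lower <= dataset_size <= upper
print(f"Recommended range: {lower}-{upper} samples; dataset size {dataset_size} within range: {within_range}")
```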
The statistical summary of this dataset after data cleaning and data wrangling is given in Table 4, where ‘count’ is the total number of rows; ‘mean’ is the arithmetic mean; ‘std’ is the standard deviation; ‘min’ is the smallest value of each variable; ‘25%’ is the first quartile, the value below which one quarter of the observations lie once they are sorted in ascending order; ‘50%’ (the median) and ‘75%’ are defined analogously for one half and three quarters of the observations; and ‘max’ is the largest value of each variable.
For the ‘Atomic charge A’ and ‘Atomic charge B’ columns, the mean and median (50%) are close, which is consistent with an approximately normal distribution and low dispersion of values. In the ‘Lennard_Jones potential’ and LPED columns, however, the mean and median are not close, indicating a non-normal distribution and more dispersed values. The histograms in Fig. 4 confirm the normal distribution of the ‘Atomic charge A’ and ‘Atomic charge B’ features and the non-normal distribution of the Lennard_Jones potential and LPED variables.
Table 4
Statistical analysis of the dataset used in the web application.
Statistic | Atomic charge A (MK) | Atomic charge B (MK) | Lennard_Jones potential | LPED
count | 53 | 53 | 53 | 53
mean | -0.447 | 0.299 | -0.0382 | -4.01
std | 0.409 | 0.392 | 0.0430 | 3.37
min | -1.626 | -0.271 | -0.1991 | -13.18
25% | -0.771 | 0.016 | -0.0476 | -5.15
50% | -0.431 | 0.264 | -0.0200 | -2.48
75% | -0.166 | 0.461 | -0.0088 | -1.59
max | 0.531 | 1.497 | -0.0010 | -0.18
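As an illustration of how such a summary is obtained, the sketch below recomputes the LPED column of Table 4 from the LPED values listed in Table 2. It uses only Python's standard `statistics` module, which is an assumption made for this example and not necessarily the tool used to build Table 4. The `inclusive` quantile method is chosen because it interpolates linearly between the sorted observations.

```python
import statistics

# LPED values (kcal mol^-1 Bohr^-3) copied from Table 2, indices 1 to 53.
lped = [
    -0.18, -0.88, -0.53, -1.17, -1.42, -0.80, -2.48, -0.46, -2.07, -10.93,
    -11.11, -4.56, -5.24, -4.21, -2.07, -3.72, -3.16, -2.43, -1.82, -13.18,
    -7.83, -10.95, -8.86, -10.38, -6.60, -4.56, -8.00, -2.92, -2.19, -1.20,
    -8.79, -4.74, -4.17, -4.55, -2.13, -1.80, -0.39, -5.03, -1.37, -0.96,
    -1.87, -4.07, -1.59, -1.87, -4.07, -1.59, -5.15, -1.51, -5.25, -10.58,
    -1.66, -1.80, -1.53,
]

# Quartiles with linear interpolation between order statistics.
q1, median, q3 = statistics.quantiles(lped, n=4, method="inclusive")

summary = {
    "count": len(lped),
    "mean": statistics.mean(lped),
    "std": statistics.stdev(lped),  # sample standard deviation (n - 1 in the denominator)
    "min": min(lped),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(lped),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

The gap between the mean (about −4.01) and the median (−2.48) is the asymmetry discussed in the text for the LPED variable.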
In Fig. 5, the edges of each box represent the first quartile (Q1) and the third quartile (Q3), which delimit the interquartile range (IQR). The upper and lower whiskers mark the variability outside the upper and lower quartiles. The variables ‘Atomic charge A’, ‘Atomic charge B’, and ‘Lennard_Jones potential’ have zero, two, and five outliers, respectively. These are mild outliers because they lie close to the upper or lower whiskers, so they were not removed.
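A minimal sketch of how box-plot outliers are commonly flagged follows. It assumes the conventional 1.5 × IQR whisker rule, which the text does not state explicitly. The function name `iqr_outliers` and the sample values are hypothetical and serve only to illustrate the rule.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] together with the two fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    flagged = [v for v in values if v < lower_fence or v > upper_fence]
    return flagged, (lower_fence, upper_fence)


# Hypothetical sample, not taken from the dataset.
sample = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.5]

flagged, (lower_fence, upper_fence) = iqr_outliers(sample)
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}], flagged: {flagged}")
```

A point just beyond a fence, as here, is what the text calls a mild outlier.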
The correlation measures the strength and direction of a linear relationship between two variables. In Table 5, the correlation values between variables are shown. The feature ‘Atomic charge B’ (where the great majority of positive atomic charges are placed) has a negative and highest (in magnitude) correlation with LPED, followed by Lennard_Jones potential and ‘Atomic charge A’ variables with positive correlations.
Table 5
Correlation values between the variables ‘Atomic charge A’, ‘Atomic charge B’, ‘Lennard_Jones potential’, and LPED.
Variable | Atomic charge A (MK) | Atomic charge B (MK) | Lennard_Jones potential | LPED
Atomic charge A (MK) | 1.00 | -0.12 | 0.20 | 0.15
Atomic charge B (MK) | -0.12 | 1.00 | -0.32 | -0.66
Lennard_Jones potential | 0.20 | -0.32 | 1.00 | 0.62
LPED | 0.15 | -0.66 | 0.62 | 1.00
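To make the correlation values concrete, the sketch below computes the Pearson coefficient between ‘Atomic charge B (MK)’ and LPED from the values listed in Table 2. It assumes that Table 5 reports Pearson coefficients, which the text implies by describing a linear relationship. The helper `pearson` is written out by hand for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Columns copied from Table 2, indices 1 to 53.
charge_b = [
    0.001, 0.154, -0.184, 0.087, -0.271, 0.004, 0.410, 0.082, -0.083, 0.461,
    0.571, 0.350, 0.391, 0.301, 0.165, 0.396, 0.264, 0.392, 0.118, 1.475,
    0.629, 0.484, 1.497, 1.497, 0.892, 0.343, 0.277, 0.393, 0.076, 0.086,
    -0.155, 0.358, 0.147, 0.315, 0.016, -0.072, 0.494, 0.511, -0.072, 0.518,
    0.067, 0.695, 0.103, 0.154, 0.765, 0.000, 0.359, 0.060, -0.224, 0.546,
    0.001, 0.001, 0.010,
]
lped = [
    -0.18, -0.88, -0.53, -1.17, -1.42, -0.80, -2.48, -0.46, -2.07, -10.93,
    -11.11, -4.56, -5.24, -4.21, -2.07, -3.72, -3.16, -2.43, -1.82, -13.18,
    -7.83, -10.95, -8.86, -10.38, -6.60, -4.56, -8.00, -2.92, -2.19, -1.20,
    -8.79, -4.74, -4.17, -4.55, -2.13, -1.80, -0.39, -5.03, -1.37, -0.96,
    -1.87, -4.07, -1.59, -1.87, -4.07, -1.59, -5.15, -1.51, -5.25, -10.58,
    -1.66, -1.80, -1.53,
]

r_b_lped = pearson(charge_b, lped)
print(f"Pearson r (Atomic charge B vs LPED): {r_b_lped:.2f}")
```

The negative sign means that larger positive charges on atom B go with more negative LPED values.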
In the primary testing with our first code, we searched for the best ML model and the best random seed by varying the random_state value from 0 to 9 (each value is called an iteration). The metrics used to evaluate the ML models are R², MAE, MSE, and RMSE. According to the metrics in Table 6, the best combination of model and random seed is Linear Regression with random_state = 8, which gives R² = 0.88, MAE = 0.72 kcal mol−1 Bohr−3, MSE = 0.82 kcal² mol−2 Bohr−6, and RMSE = 0.91 kcal mol−1 Bohr−3.
Table 6
R², MAE, MSE, and RMSE metrics for the tested ML models over the random_state range 0–9 (iteration).
Model | Iteration | Split | R² | MAE(a) | MSE(b) | RMSE(a)
Linear Regression | 0 | 1 | 0.54 | 1.12 | 2.07 | 1.44
Linear Regression | 1 | 2 | 0.45 | 1.92 | 9.31 | 3.05
Linear Regression | 2 | 3 | 0.58 | 1.58 | 4.63 | 2.15
Linear Regression | 3 | 4 | -0.30 | 2.12 | 10.65 | 3.26
Linear Regression | 4 | 5 | 0.33 | 1.62 | 4.46 | 2.11
Linear Regression | 5 | 6 | 0.54 | 1.13 | 3.15 | 1.77
Linear Regression | 6 | 7 | 0.56 | 1.77 | 5.32 | 2.31
Linear Regression | 7 | 8 | 0.42 | 1.46 | 6.78 | 2.60
Linear Regression | 8 | 9 | 0.88 | 0.72 | 0.82 | 0.91
Linear Regression | 9 | 10 | 0.67 | 1.46 | 3.73 | 1.93
Decision Tree | 0 | 1 | -1.91 | 2.66 | 13.26 | 3.64
Decision Tree | 1 | 2 | -0.18 | 3.14 | 19.93 | 4.46
Decision Tree | 2 | 3 | 0.07 | 2.60 | 10.20 | 3.19
Decision Tree | 3 | 4 | -1.14 | 3.03 | 17.48 | 4.18
Decision Tree | 4 | 5 | -1.13 | 2.89 | 14.18 | 3.77
Decision Tree | 5 | 6 | -1.11 | 2.76 | 14.47 | 3.80
Decision Tree | 6 | 7 | 0.69 | 1.37 | 3.77 | 1.94
Decision Tree | 7 | 8 | 0.35 | 1.62 | 7.53 | 2.74
Decision Tree | 8 | 9 | -0.73 | 2.50 | 11.47 | 3.39
Decision Tree | 9 | 10 | 0.11 | 2.38 | 9.98 | 3.16
Random Forest | 0 | 1 | -2.07 | 2.85 | 13.98 | 3.74
Random Forest | 1 | 2 | 0.43 | 2.20 | 9.68 | 3.11
Random Forest | 2 | 3 | 0.55 | 1.88 | 5.01 | 2.24
Random Forest | 3 | 4 | -0.23 | 2.36 | 10.10 | 3.18
Random Forest | 4 | 5 | 0.31 | 1.87 | 4.60 | 2.15
Random Forest | 5 | 6 | 0.32 | 1.68 | 4.65 | 2.16
Random Forest | 6 | 7 | 0.50 | 1.58 | 5.98 | 2.44
Random Forest | 7 | 8 | 0.56 | 1.26 | 5.10 | 2.26
Random Forest | 8 | 9 | 0.83 | 0.87 | 1.15 | 1.07
Random Forest | 9 | 10 | 0.53 | 1.75 | 5.27 | 2.30
XGBoost | 0 | 1 | -0.99 | 2.24 | 9.04 | 3.01
XGBoost | 1 | 2 | 0.20 | 2.64 | 13.45 | 3.67
XGBoost | 2 | 3 | 0.18 | 2.34 | 9.00 | 3.00
XGBoost | 3 | 4 | -0.63 | 2.88 | 13.37 | 3.66
XGBoost | 4 | 5 | -0.21 | 2.15 | 8.08 | 2.84
XGBoost | 5 | 6 | -0.71 | 2.47 | 11.73 | 3.42
XGBoost | 6 | 7 | 0.58 | 1.34 | 4.99 | 2.23
XGBoost | 7 | 8 | 0.44 | 1.55 | 6.52 | 2.55
XGBoost | 8 | 9 | 0.64 | 1.27 | 2.39 | 1.54
XGBoost | 9 | 10 | 0.31 | 1.99 | 7.80 | 2.79
Gradient Boosting | 0 | 1 | -1.65 | 2.56 | 12.07 | 3.47
Gradient Boosting | 1 | 2 | 0.46 | 2.07 | 9.18 | 3.03
Gradient Boosting | 2 | 3 | 0.55 | 1.85 | 4.99 | 2.23
Gradient Boosting | 3 | 4 | -0.35 | 2.48 | 11.06 | 3.33
Gradient Boosting | 4 | 5 | -0.25 | 2.16 | 8.35 | 2.89
Gradient Boosting | 5 | 6 | -0.92 | 2.29 | 13.15 | 3.63
Gradient Boosting | 6 | 7 | 0.74 | 1.23 | 3.15 | 1.78
Gradient Boosting | 7 | 8 | 0.46 | 1.35 | 6.22 | 2.49
Gradient Boosting | 8 | 9 | 0.80 | 0.96 | 1.31 | 1.15
Gradient Boosting | 9 | 10 | 0.32 | 1.99 | 7.68 | 2.77
(a) kcal mol−1 Bohr−3
(b) kcal² mol−2 Bohr−6
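The selection step can be sketched as a maximum over R². The records below are the best row of each model copied from Table 6, and the `Result` container is a hypothetical convenience for this example. The sketch only reproduces the comparison; it does not retrain any model.

```python
import math
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    random_state: int
    r2: float
    mae: float   # kcal mol^-1 Bohr^-3
    mse: float   # kcal^2 mol^-2 Bohr^-6
    rmse: float  # kcal mol^-1 Bohr^-3


# Highest-R2 row of each model, copied from Table 6.
best_per_model = [
    Result("Linear Regression", 8, 0.88, 0.72, 0.82, 0.91),
    Result("Decision Tree", 6, 0.69, 1.37, 3.77, 1.94),
    Result("Random Forest", 8, 0.83, 0.87, 1.15, 1.07),
    Result("XGBoost", 8, 0.64, 1.27, 2.39, 1.54),
    Result("Gradient Boosting", 8, 0.80, 0.96, 1.31, 1.15),
]

# Pick the model/seed pair with the largest coefficient of determination.
best = max(best_per_model, key=lambda result: result.r2)
print(f"Best: {best.model} with random_state = {best.random_state} (R2 = {best.r2:.2f})")

# RMSE is the square root of MSE; the tabulated pair agrees to rounding.
print(f"sqrt(MSE) = {math.sqrt(best.mse):.2f}, tabulated RMSE = {best.rmse:.2f}")
```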
The regression equation relating LPED to the MK atomic charges and the Lennard_Jones potential feature is:
y = -1.5993 - 0.2092 x1 - 4.3116 x2 + 35.4112 x3
where y = LPED, x1 = Atomic charge A (MK), x2 = Atomic charge B (MK), and x3 = Lennard_Jones potential.
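The regression equation can be evaluated directly, as in the sketch below. The coefficients are the ones printed above. The function name `predict_lped` and the input values in the usage line are hypothetical and are not taken from the dataset.

```python
# Coefficients copied from the regression equation in the text.
INTERCEPT = -1.5993
COEF_CHARGE_A = -0.2092
COEF_CHARGE_B = -4.3116
COEF_LJ = 35.4112


def predict_lped(charge_a: float, charge_b: float, lj_potential: float) -> float:
    """Evaluate the linear model y = b0 + b1*x1 + b2*x2 + b3*x3 (kcal mol^-1 Bohr^-3)."""
    return (
        INTERCEPT
        + COEF_CHARGE_A * charge_a
        + COEF_CHARGE_B * charge_b
        + COEF_LJ * lj_potential
    )


# Hypothetical input: negative charge on A, positive charge on B,
# small negative Lennard_Jones potential.
example = predict_lped(charge_a=-0.5, charge_b=0.4, lj_potential=-0.02)
print(f"Predicted LPED: {example:.2f} kcal mol^-1 Bohr^-3")
```

The large negative coefficient of x2 matches the correlation analysis: a more positive charge on atom B lowers the predicted LPED.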
After finding the best model and the best random seed, we applied secondary testing to validate the model further. We used complexes 16 to 21 (Fig. 3) to obtain the MK atomic charges of their interacting atoms, the corresponding interatomic distances in Angstrom, and the corresponding LPED values from their topological data (see Table 1), which we compared with the predicted LPED. The input data for the secondary testing contain the “Complex” categorical variable and the numerical variables ‘Interatomic distance’, ‘Atomic charge A’, and ‘Atomic charge B’, where most values of ‘Atomic charge A’ are negative and most values of ‘Atomic charge B’ are positive (see Table 7). This table, originally an xlsx file, was converted to a csv file for upload to our web application [10].
Table 7
Input data for the secondary testing.
Complex | Interatomic distance | Atomic charge A | Atomic charge B
N-H (propanenitrile-water) | 2.063 | -0.440 | 0.382
O-H (methylammonium-water) | 1.688 | -1.039 | 0.412
O(a)-H (methyl acetate-water) | 2.070 | -0.136 | 0.262
O(b)-H (methyl acetate-water) | 1.920 | -0.550 | 0.331
O(b)-H (methyl acetate-dimer) (avg) | 2.543 | -0.584 | 0.132
C(b)-C(b) (methyl acetate-dimer) | 3.172 | 0.820 | 0.830
O(a)-O(a) (methyl acetate-dimer) | 3.043 | -0.377 | -0.382
O-H (diethyl ether - water) | 1.866 | -0.422 | 0.404
O-H (2-pentanone-water) | 1.922 | -0.564 | 0.367
O-HC (2-pentanone-water) | 2.642 | -0.870 | 0.013
(a) Tetrahedral oxygen
(b) Trigonal oxygen
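A minimal sketch of preparing such an input file as CSV text follows, using Python's standard `csv` module and the first three rows of Table 7. The column headers mirror Table 7. The exact header names the web application expects are not stated in the text, so they are an assumption here, as is the helper name `to_csv_text`.

```python
import csv
import io

# Column names as they appear in Table 7 (assumed to match the upload format).
HEADER = ["Complex", "Interatomic distance", "Atomic charge A", "Atomic charge B"]

# First three rows of Table 7.
rows = [
    ("N-H (propanenitrile-water)", 2.063, -0.440, 0.382),
    ("O-H (methylammonium-water)", 1.688, -1.039, 0.412),
    ("O(a)-H (methyl acetate-water)", 2.070, -0.136, 0.262),
]


def to_csv_text(header, records):
    """Serialize a header and records into comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(HEADER, rows)
print(csv_text)
```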
After uploading the file and obtaining the predictions, we downloaded a CSV file called ‘predictions’ and converted it back to xlsx. The ‘predictions’ file has two additional columns: Predicted_LPED and Predicted_SME. We compared the Predicted_LPED values with the corresponding actual LPED values and applied the same metrics as in the primary testing. The results are displayed in Table 8. For the secondary testing, we computed the metrics in a spreadsheet. First, we obtained the absolute and squared differences between the actual and predicted LPED values, given in the Abs_diff and Sqr_diff columns of Table 8, respectively. From these, we obtained MAE (1.19 kcal mol−1 Bohr−3), MSE (1.59 kcal² mol−2 Bohr−6), and RMSE (1.26 kcal mol−1 Bohr−3). These values are moderately close to those from the primary testing (MAE = 0.72 kcal mol−1 Bohr−3, MSE = 0.82 kcal² mol−2 Bohr−6, and RMSE = 0.91 kcal mol−1 Bohr−3).
We could thus validate the Linear Regression model with random_state = 8 as the best ML model for predicting the LPED values of intermolecular interactions in complexes or intramolecular interactions in individual molecules and, consequently, for obtaining the corresponding local supramolecular energy (SME) from the linear relation between LPED and SME established in previous work[7].
If there is more than one local SME, the total SME of the complex is the sum of all local SMEs. This web application offers a simpler way to obtain the local SMEs of a complex with multiple interactions without resorting to QTAIM calculations, and it shows the utility of machine learning modeling for obtaining topologically derived information, such as LPED, from non-topological data.
Table 8
Predicted and actual values of LPED along with their corresponding absolute and squared differences for the input data of the secondary testing.
Predicted_LPED | Actual_LPED | Abs_diff | Sqr_diff
-4.97 | -4.68 | 0.29 | 0.08
-9.02 | -9.77 | 0.75 | 0.57
-4.48 | -4.11 | 0.37 | 0.14
-5.68 | -5.13 | 0.55 | 0.30
-2.57 | -1.55 | 1.02 | 1.05
-5.49 | -1.81 | 3.68 | 13.51
-0.05 | -1.96 | 1.91 | 3.65
-6.53 | -7.20 | 0.67 | 0.45
-5.82 | -5.08 | 0.74 | 0.55
-1.89 | -1.05 | 0.84 | 0.70
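The spreadsheet steps described for the secondary testing (absolute differences, squared differences, then MAE, MSE, and RMSE) can be sketched in a few lines. The inputs are the two-decimal Predicted_LPED and Actual_LPED values printed in Table 8, so the results reflect those rounded entries and need not coincide with the figures quoted in the text. The helper name `regression_errors` is hypothetical.

```python
import math

# Columns copied from Table 8 (kcal mol^-1 Bohr^-3).
predicted = [-4.97, -9.02, -4.48, -5.68, -2.57, -5.49, -0.05, -6.53, -5.82, -1.89]
actual = [-4.68, -9.77, -4.11, -5.13, -1.55, -1.81, -1.96, -7.20, -5.08, -1.05]


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)   # mean of the Abs_diff column
    mse = sum(d * d for d in diffs) / len(diffs)    # mean of the Sqr_diff column
    return mae, mse, math.sqrt(mse)


mae, mse, rmse = regression_errors(actual, predicted)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```

The single C(b)-C(b) row, with an absolute difference of 3.68, dominates the squared-error terms, so MSE and RMSE are far more sensitive to it than MAE.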