In this work, the proposed method consists of three steps to recognize structural features: pre-treatment of the molecules and images, feature extraction, and post-treatment with convex hull constructions, as shown in Fig. 1.
1. Pre-treatment of molecules and images
To remove redundancy, the chemical bonds are eliminated from molecular structure images, since the molecular structures are determined solely by the atom positions. As a result, the problem of recognizing molecules is equivalent to identify a set of scattered atoms/dots.
The molecular image is the basis for pattern recognition. Providing the atoms have been well aligned(25), the molecular 3D image is generated from its XYZ coordinates (Fig. 2), taking the C60 system as an example. The 60 carbon atoms are scattered after removing all chemical bonds. For easier visualization and treatment in the latter stage, the azimuthal angle and the elevation angle were set as 90 degrees for exhibiting the image (along Z axis). Unless otherwise stated, the following discussion is based on this projection angle.
For complicated molecules, it is increasingly difficult to find a projection angle that all atoms can be projected on a plane without overlaps. One may recognize the molecular features by analyzing its facet of profile. However, the core structure cannot be recognized by this way. To circumvent this problem, we sliced the whole molecule into layers, and took snapshots for each layer to extract features (Fig. 2). Nonetheless, it is not a trivial work to slice the molecule, as the double counting of atoms may take place. Eventually, we sliced the molecule along the projection angle, and set the distance between layers to be 0.7 angstroms. This value is close to a H-H bond distance. For any reasonably determined structure, it is not possible to have two atoms with a distance smaller than 0.7 angstroms. Therefore, the layers separated by 0.7 angstroms can well slice the whole molecule into different layers.
As any other pattern recognition applications, the quality of the picture is essential. Following the parameters given by the blob detection study, we set the picture height and width of 10 * 10 inches with 80 dots per inch (dpi). Accordingly, the final resolution of the figure is 800 * 800 pixels.
2. Feature extraction
To recognize the atoms on each layer, we took advantage of cluster analysis to filter the image matrices. The image matrices are non-diagonal sparse matrices, with dimensions equal to the resolution. In the previous blob detection study, the colored image was first converted to gray scale. Consequently, if atoms were assigned with close color codes, the grayscale conversion would mistakenly consider the different atoms as identical ones.
In this work, the colored image matrix was first separated into three primary-color matrices, namely the R(ed) matrix, G(reen) matrix and B(lue) matrix. As a result, if an atom in the molecule is substituted by its isotope or its family member, the color matrices can reveal its trace. And the atoms with close color codes can be distinguished.
By converting a graph into an image matrix, an atom in the graph is represented by a group of pixel coordinates. Ideally, the atom size determines the number of pixel coordinates. However, such a number is not unambiguously determined, as the boundary of an atom may be blurred especially if the resolution is low. The number of pixel coordinates is thus subject to the round-off error. As a consequence, it is generally not helpful to directly compare the image matrices.
Instead, the K-means algorithm(26) of cluster analysis was used in this work. For a given primary-color matrix, the local extreme values were first filtered out. The local extremes represent pixel occupation for each atom in the image matrix. Whether the local extreme is a maximum or a minimum depends on the background color being black or white. Supposing the background color is black, the non-zero elements were first fileted out as the basis for cluster analysis.
For the K-means algorithm, it clusters the matrix elements by relocating each point to its new nearest center. In the context of feature extraction, this corresponds to determine the center of each set of pixel coordinates. The metric mean of the member points to corresponding cluster centers was calculated, and such relocating-and-updating process iterated until the desired number of cluster centers was found, which was the number of atoms in each layer.
By clustering the centers, the arrays containing each center position were obtained. The Euclidean norm between centers of two structures was compared. If the norm differed by more than 5 pixels, the corresponding atoms were considered as occupying different locations.
To find out which atoms differ in the two structures, a register table was first established to map 2D pixel coordinates and 3D atomic coordinates. The table was constructed by mapping atomic coordinates and pixel coordinates atom by atom. Next, the cluster analysis was carried out for the second structure. By differentiating the cluster centers out of two structures, the atoms at different positions can be filtered out by mapping with the register table.
3. Post-treatments of constructing convex hulls
The convex hull is the smallest polyhedron that encloses a set of points, where intersections between any points in the polyhedron are still in the polyhedron. Originally, the concept of convex hulls was used in other disciplines such as computational geometry, functional analysis, image processing, etc. It depicts a set of n-dimensional (usually 2-dimensional) data, with many mathematically proved theorems or properties such as the separating hyperplane theorem. Such theorems are very appealing in the context of molecule recognition that if properly used, one may readily know the molecular properties without complicated calculations. Therefore, we are particularly interested in studying the convex hull for molecules, as established theorems and properties of convex hulls can be borrowed from other disciplines to study molecular interactions.
To construct the 3-dimentional convex hull for a molecule, the QHull algorithm was used.(27) A polyhedron enclosed the molecule was generated based on atomic coordinates. The molecular surface area was calculated as the total surface area of all facets of the convex hull. The specific molecular area was calculated as the molecular surface area over the molecular mole mass. Similarly, the molecular density was calculated as the molecular mole mass over the total volume of the convex hull.