Herein, I discuss the use of the local invariants and propose nonlocal invariants and improvements of the refinement procedure that can resolve some ambiguities in the Weininger’s method of the corresponding primes [3]. Then, I describe a procedure to obtain a canonical absolute SMILES (with chirality).
Local invariants. Degree, atomic number, and bond type are used as the local invariants of atom in the original Morgan algorithm [7]. Weininger has proposed to use number of connections, number of non-hydrogen bonds (degree), atomic number, the sign of charge, absolute charge, and number of attached hydrogens as the local invariants [3]. For complete representation of the local invariants of atom, we can classify them as properties of the atom proper (atomic number – count of protons in atomic nuclei, atomic mass – the sum of counts of protons and neutrons in atomic nuclei, charge – count of protons in atomic nuclei minus count of electrons of the atom) and properties of the nearest neighborhood of the atom (degree – count of explicit connections, count of attached hydrogens, connectivity – total count of connections, valence – the sum of bond orders of all connections). For use as an initial rank of the atom, we must combine all local invariants in the atomic vector in some order. Weininger has pointed out [3] that a terminal atom is preferred to start of traversal of the molecular graph (if it is possible). For this reason, the degree of the atom must be at the start of a combined atomic vector (1 character). Next positions concatenated consecutively: atomic number (3 characters), count of attached hydrogens (1 character), the sign of charge (1 character: ‘0’ – for no or positive charge, ‘1’ – for a negative charge), absolute charge (1 character), connectivity (1 character), valence (1 character), atomic mass (3 characters, ‘000’ if unspecified). The atomic vector of the local invariants has 12 characters.
Chirality invariant. In SMILES, a designation of chirality depends on the enumeration order of atoms, and from this reason, simple parity symbols ‘@’ and ‘@@’ cannot be used as invariant since its meaning changes with the transposition of the atoms. But we can determine the order of the atoms around a chiral center relative to the symmetry classes of its neighbors [15, 16]. To determine this order we should use the extended connectivity algorithm to the atomic invariants (local and ring and distance nonlocal described further) without chirality designation and obtain the symmetry classes for the non-chiral structure. For all atoms with no chiral parity, the chirality invariant is equal to 0 (by definition). If two or more neighbors of the atom with the chiral parity have the same symmetry class then the chirality invariant of this atom is equal to 0 too. If the order of the symmetry classes can be obtained by an even number of swaps from the direct ascending order of these numbers then the chirality invariant is equal to count of the sign ‘@’ in parity, and otherwise the chirality invariant is equal to 3 - count of the sign ‘@’ in parity. Such a definition of the chirality invariant gives us the truly invariant property of the chiral atom that is independent of the order of atoms in SMILES because the symmetry classes of the atoms are independent of its order. This invariant can be concatenated to the atomic vector of each atom (1 character). After concatenation of the chirality invariant, we must recalculate all atomic ranks in molecular graph.
Thus, the complete atomic vector of each atom must contain 13 characters: 12 for the local invariants and 1 character for the chirality invariant. Since all characters of the atomic vector are digits then we can transform the atomic vector to number.
Ring invariant of atom. If we use only traditional local invariants for each atom of the molecular graph from Fig.1 we will obtain only two symmetry classes after the refinement procedure of the extended connectivity algorithm: one for the atoms with degree = 3 and another one for the atoms with degree = 2. But, there are three symmetry classes in this structure. How do we know that? Because eight atoms with degree = 2 belong to the 6-membered rings, but another four atoms with degree = 2 do not. This very simple observation gives us obvious nonlocal invariant for each ring atom – the ring size of the smallest ring to which this atom belongs (for non-ring atom this invariant is 0 by definition). This value is invariant, i.e. it does not depend on the order of the atoms in molecular graph. Now, for molecular graph from Fig.1, for the eight atoms with degree = 2 this invariant is 6, and for other four atoms with degree = 2 this invariant is 12. Taking into account the local invariants of each atom we obtain three symmetry classes for the atoms of this molecular graph at once and the refinement procedure does not change this.
Fig.1 ‘Pathological’ graph with the sizes of the smallest rings for the atoms
But if we look at the structure in Fig.2, we see that this invariant is insufficient because all atoms belong to the 3-membered rings and therefore have the same ring size of the smallest ring to which this atom belongs.
Fig.2 Complex graph with the minimal and maximal ring sizes of the bonds incident with all atoms
To solve this issue, we can first calculate an invariant for each bond – the size of the smallest ring to which it belongs. We can use a breadth-first search (BFS) from the first atom incident with this bond to the second atom incident with this bond for obtaining the shortest non-trivial path (if such path does not exist then this is non-ring bond). After obtaining this ring invariant for each bond, we can choose for each atom minimal and maximal values of the ring sizes of the bonds incident with this atom. The minimal ring size of the bonds incident with the atom will be the ring size of the smallest ring to which this atom belongs, but maximal will be the maximum of the minimal ring sizes of the bonds incident with this atom. Both of these values will be invariants of each atom. For the molecular graph in Fig.2, four atoms have these invariants equal to 3, but four other atoms have the minimal ring size 3, but the maximal ring size is 6. The minimal and maximal ring sizes of the bonds incident with the atom in combination with the local invariants are usually sufficient for distinguishing all symmetry classes of the atoms in ‘pathological’ molecular graphs. But to avoid any possible ambiguity we propose to use the product of the corresponding primes for the ring size of each bond of the atom as unambiguous ring invariant of the atom (contribution for a non-ring bond to the product will be 1). As an example, for the molecular graph in Fig.2, this ring invariant will be 5*5*5 = 125 for the atoms with the maximum ring size of the bonds equals to 3 and 5*5*13 = 325 for the atoms with the maximum ring size of the bonds equals to 6 (since 5 is third prime and 13 is the sixth prime number). The atomic ring invariant will be unambiguous since reverse factorization of it gives us such invariants as count of the ring bonds of the atom and the minimal sizes of rings to which all bonds of the atom belong.
Distance invariant of the atom. Unfortunately, our local and ring invariants are insufficient for the molecular graphs of the fullerenes from the article of Laidboeur et al [17]. For example, for the molecular graph in Fig.3 (SMILES: C1(C2C3C4C15)C6C7C2C8C3C9C%10C4C%11C5C6C%12C%11C%10C%13C%12C7C8C9%13) our algorithm with the ring invariants of the atoms determines only one symmetry class for the atoms since all atoms of this molecular graph have degree equal to 3 and all bonds of this molecular graph belong to the smallest 5-membered rings while a complete search of the automorphisms gave us two symmetry class of the atoms [17].
Fig.3 Example of the 24-fullerene molecular graph for which the ring invariant is not sufficient
To solve this problem, we tried to use various invariants of the atom which reflect the topological distance of each atom from all other atoms in the molecular graph. The first invariant is the distance sum – the sum of the topological distances from the current atom to all other atoms in the molecular graph [1]. This is a powerful atom invariant but it is hampered by the degeneracy (i.e., two or more topologically distinct vertices can have identical numerical values) [1]. Another proposed distance invariant of the atom is the number of the outermost occupied neighbor sphere (NOON) of an atom [18]. This is the minimum number of neighbor spheres necessary to accommodate all atoms of a molecule starting at that atom (in fact, the eccentricity of the vertex in the molecular graph – the maximum of the topological distances from the current atom to all other atoms in molecular graph). For the determination of all topological distances from the current atom to all other atoms in the molecular graph we can use the breadth-first search (BFS) from the current atom. Our experiments showed that the weighted sum of the topological distances from the current atom to all other atoms in molecular graph has the best separation ability: we define the distance invariant of the atom as the sum of the products of the count of the atoms which have the topological distance d from the current atom to Nd, where N – some integer. For example, we can use N = 10, and thus the decimal digits of our distance invariant are the counts of the atoms in such topological distance from the current atom as the digit position number from the end of the decimal representation. The NOON of the atom will be equal to the overall count of the digits in the decimal representation of the distance invariant. As example, for the 24-fullerene 12 atoms have the distance invariant equal to 25763, but 12 other atoms have the distance invariant equal to 35663 (but all atoms have the NOON equal to 5). Such defined the distance invariant of an atom with the refinement step allows us to easily determine all symmetry classes of the atoms in the fullerenes from the article of Laidboeur et al [17], but, unfortunately, all our local, ring and distance invariants of the atoms are not sufficient for the accurate determination of all symmetry classes for some non-planar regular graphs (see below, Results and Discussion).
Now, we have the 13-digit atomic vector, the numeric ring invariant, and the numeric distance invariant for each atom. We can use the product of the numeric value of the atomic vector, the atomic ring invariant, and the distance invariant as the initial rank of the atom in the further refinement step of the extended connectivity algorithm. Another possible approach is to sort all atoms of the molecular graph by the numeric values of their atomic vector, ring, and distance invariants and assign the rank in the results of sorting as the initial rank of each atom for the refinement step.
Refinement procedure. After obtaining the initial ranks for all atoms we can rank them to obtain consecutive small integer ranks, and this manipulation does not change their relative order. Further manipulations, in general, correspond to the CANON algorithm described by Weininger et al [3], but some changes in this algorithm are necessary. If you look at the simple structure in Fig.4 then you see that after obtaining the initial ranks we have four symmetry classes. If we follow the original description of the CANON algorithm then after first iteration we obtain products of the corresponding primes for the atoms: CH3 – 7, CH – 2*3*5 = 30, CH2 – 5*7 = 35, O – 3*7 = 21. After ranking of these integers we obtain new ranks:
Fig.4 Example of the ranks rotation by the original CANON algorithm
As you can see, there is a rotation of the ranks in the ring. This is an infinite loop of the ranks rotations that cannot give us stable symmetry classes of the atoms in such type of structures. The first idea for solving this problem is to multiply the corresponding prime of the current rank of each atom with the product of the corresponding primes for the ranks of its neighbors. Employing this approach, the products for first iteration are: CH3 – 2*7 = 14, CH – 7*2*3*5 = 210, CH2 – 3*5*7 = 105, O – 5*3*7 = 105. After ranking we obtain that now O and CH2 have the same rank, although previously they had different ranks! This issue arises since the products of primes are the unambiguous function only if we do not consider the transposition of the number in the product. But in the 3-membered rings, this fact led to the leveling of the previously different ranks. A solution to this problem for the 3-membered rings is to multiply the square of the corresponding prime of the current rank of each atom with the product of the corresponding primes for the ranks of its neighbors. Using this approach, the products for the first iteration are: CH3 – (2*2)*7 = 28, CH – (7*7)*2*3*5 = 1,470, CH2 – (3*3)*5*7 = 315, O – (5*5)*3*7 = 525. After ranking: CH3 – 1, CH – 4, CH2 – 2, O – 3. As you can see, the ranks are stable after such calculation for this structure. It could be easily proved that such a method of obtaining new ranks from previous ones has never led to leveling or swapping the previously different ranks in the 3-membered rings. But for structure in Fig.5, squaring the corresponding prime of the current rank does not help us: after the first round of the refinement, we have the previously different ranks leveled.
Fig.5 Example of the ranks leveling at the refinement step
If we try to use the third power of the corresponding prime of the current rank for this structure then there is no leveling. Generally speaking, this issue arises when we have two or more neighbors of the atom with the same rank. Since in the molecular graphs we never have atom degree higher than 6 (structures like SF6), we propose to use the eighth power of the corresponding prime of the current rank to guarantee that leveling or swapping the ranks in the structure never occurs at the refinement step (since 8 = 23, the eighth power can be calculated quickly). Another approach, for which one of the reviewers of this article pointed out, is using the comparator to track the previous rank and current rank which allows to split ties only within each cell of the partition. Such a comparator is also sufficient to prevent ranks swapping at the refinement step.
Another issue with the original CANON algorithm is that it does not take into account the bond orders between the atoms. For example, it determined only four symmetry classes for the structure in Fig.6 while there are five.
Fig.6 Incorrect (left) and correct (right) symmetry classes for the example of annulene
This fact has led to arbitrary choosing between the atoms and generation of the different SMILES by the traversal of the molecular graph. The common workaround of this problem is to choose the bond at the branch point when generating the SMILES, for example, O'Boyle proposes [4]: "Rule C: At each branch point, multiple bonds are favoured over single or aromatic bonds, and lower canonical labels over higher". Such a solution lets to avoid ambiguity of the generation of SMILES but does not solve the issue of the incorrect symmetry perception and this fact can lead to the problems during the canonical traversal of other parts of the molecular graph. We propose to solve this issue by multiplying the eighth power of the corresponding prime of the current rank with the corresponding primes of the ranks of its neighbors in the powers equal to the orders of the bonds connecting the current atom with its neighbors.
Pseudocode of the Modified CANON algorithm is the following:
Set atomic ranks to the initial invariants;
Rank these values to obtaining the consecutive small integer ranks;
Do
Set the new value of the rank for each atom as a product of the eighth power of the corresponding prime of the current rank with the corresponding primes of the ranks of its neighbors in the powers equal to the orders of the bonds connecting the current atom with its neighbors;
Rank these values to obtaining the consecutive small integer ranks;
Loop until no invariant partitioning observed.
The Modified CANON algorithm terminates when the count of the different values of the ranks ceases to increase from the previous step of the algorithm. After its termination, we must have the same ranks only on the symmetrically equivalent atoms, i.e. the ranks of atoms are their symmetry classes [3]. These symmetry classes can be used for the calculation of the chirality invariant. For acyclic and simple cyclic structures they also can be used directly for further generation of the canonicalized SMILES. If after the use of the Modified CANON algorithm the count of various ranks in the molecular graph is less than the count of atoms, then to obtain unambiguous order of the molecular graph traversal we need to perform breaking ties procedure, as Weininger et al pointed out [3]. Our method for this is the same as described: doubling all ranks and reducing the value of the atom, which is tied, by one. The set is then treated as a new invariant set, and the previous algorithm for generating an invariant partitioning is repeated. But Weininger et al assumption that “the double-and-tie-break step does not introduce ambiguity into the ordering since only otherwise equivalent atoms will be tied at any point” can be wrong in the general case. Ivanciuc pointed out [1]: “Two vertices from different atom invariant class cannot be automorphic, while two vertices from the same atom invariant class are not necessarily automorphic. Despite numerous efforts, no vertex graph invariant is known which is sufficient to establish the automorphism partitioning, because for certain graphs non-automorphic vertices are partitioned in the same class.” For this reason, we suggest that only a rigorous way to produce canonical ordering for any molecular graph is to generate permutations for all atoms with the same atom invariant class, obtain SMILES for each permutation by traversal and select lexicographically minimal (or maximal) [1]. Our approach to doing this is: after obtaining symmetry classes choose the ring atoms with the maximal tied rank, perform tiebreaking for each of them, refine each partition by the Modified CANON algorithm, and save it. If these new partitions have ring atoms with tied ranks then repeat tiebreaking for each of them as long as the ring atoms with the tied ranks exist in the partitions. For all partitions with the completely partitioned ranks for the ring atoms generate SMILES by the traversals and select the lexicographically minimal one as the canonical SMILES for this molecular graph.
After establishing the canonical order of the atoms by the Modified CANON algorithm with, possibly, the tiebreaking further procedure to obtain canonical SMILES corresponds to described by Weininger et al [3] (procedure GENES): the structure is treated as a tree and a SMILES string is generated that corresponds to a depth-first search (DFS) of this tree. The lowest canonically numbered atom is chosen as the starting point. This atom becomes the root of a tree for a subsequent DFS. Branching decisions are making by directs branching toward the lowest labeled atom at the fork in the branch. For the cyclic structures, Weininger et al have used a two-pass method: on the first pass, the DFS algorithm detects which bonds will become ring bonds (chords) and on the second pass, ring closure indicators (digits) could be appended to the chords node symbols.
After the second pass of the DFS algorithm, there is complete information about the canonical order of the atoms in the neighborhood of chiral centers. We use the third pass of the DFS algorithm to obtain canonical SMILES with the correct parity (absolute SMILES): if the canonical order of the atoms on the second pass of the DFS algorithm can be obtained by even number of swaps from the original order of atoms in the molecular graph then the parity symbol (‘@’ or ‘@@’) is preserved. If not then the parity symbol is inverted. We cannot use order relative to the canonical ranks for the correct designation of the chirality because of the structures that have only relative configurations and no chiral centers (structure in Fig.7 for example).
Fig.7 Example of the structure with only dependent chirality and no chiral centers
For such structure, all chirality invariants are equal to 0 and this fact has led to two possible traversals of this molecular graph. Previously described tiebreaking procedure allows us to avoid this ambiguity by the selection of lexicographically minimal SMILES as the canonical one.