Software Description
PIKAChU is implemented in Python (v3.9.7). Its only dependency is the common Python package matplotlib (v3.4.3). PIKAChU can be run on Windows, MacOS, and Linux systems.
Parsing molecules from SMILES
PIKAChU takes a SMILES string as input and from it builds a graph object, in which nodes represent atoms and edges represent bonds (Figure 1). For each atom, PIKAChU initially stores information on chirality, aromaticity, charge, and connectivity. For each bond, it stores bond type (single, double, triple, quadruple, or aromatic), neighbouring atoms, and cis-trans stereochemistry for double bonds. Once all atoms, bonds, and their connectivities have been stored, implicit hydrogens are added to the structure, electron shells and orbitals are constructed for each atom, electrons are allocated to σ or π bonds, and atom hybridisation is determined. Next, all cycles in the graph are detected using an open-source Python implementation [16] of the simple cycle detection algorithm described by D. Johnson in 1975 [17]. From these cycles, PIKAChU identifies the set of unique cycles larger than two atoms, and defines cyclic systems, where cycles are considered part of a cyclic system if they share a bond with at least one other cycle in the system. Where aromaticity is not directly inferred from a SMILES string, it is determined for the atoms and bonds of each cycle and each cyclic system by applying Hückel’s 4n + 2 rule on planar ring systems [18]. Five-membered aromatic heterocycles where a lone pair of electrons occupies a p-orbital are considered on a case-to-case basis: if exactly four atoms are sp2-hybridised and exactly one sp3-hybridised heteroatom carries a lone pair, then the lone pair of the heteroatom is promoted to a p-orbital and the heteroatom’s hybridisation is adjusted to sp2.
Finally, a structure object is returned which can be visualised, kekulised, analysed through substructure matching and molecular fingerprinting, and altered through an assortment of built-in and custom chemical reactions.
If a SMILES string yields a structure object that is chemically incorrect due to too many or too few bonds being attached to an atom or valence shells not being filled appropriately in the case of organic atoms, PIKAChU gives a SmilesError, informing the user that the SMILES is wrong, and what is wrong with it.
Visualisation and kekulisation
Prior to visualisation, aromatic systems within a structure are kekulised so that aromatic systems can be represented by alternating single and double bonds. PIKAChU kekulises aromatic systems using a Python implementation (Yorkyer 2020) of Edmonds’ Blossom Algorithm for maximum matching [20]. Next, atoms are positioned using PIKAChU’s powerful drawing software, capable of visualising complex molecules, such as heavily cyclised molecules, chiral centres, and cis- and trans double bonds. PIKAChU’s python-based drawing algorithm was adapted and improved from SmilesDrawer [21], a JavaScript library that constitutes one of the best open-source molecular drawing software packages currently available. While written in different programming languages, the algorithms underlying the drawing software of PIKAChU and SmilesDrawer are virtually identical. We will briefly recap this algorithm below; more detailed descriptions of the algorithm’s elements can be found in the SmilesDrawer paper [21].
First, if indicated, PIKAChU’s drawing algorithm removes hydrogens from the graph. Next, it finds the smallest set of smallest rings in the structure graph and classifies them into one of three groups: simple rings, overlapping rings, and bridged rings. Simple rings are standalone rings that do not have any overlapping atoms with any other rings. Overlapping rings are rings that overlap with one or more other rings, where the overlap between any two rings can comprise at most two atoms, any atom in the overlap is part of at most two rings, and no atoms in the ring overlap with bridged rings. Finally, bridged rings are rings that share more than two atoms with another ring, contain atoms that are part of three or more rings, or share atoms with another bridged ring (Figure 2A).
After ring systems have been identified, atoms are placed onto a 2D coordinate system. If the molecule contains rings, positioning starts with the placement of an atom in a ring, prioritising bridged rings over simple and overlapping rings. Then, the graph is traversed one atom at a time in depth-first fashion. If an atom is part of a ring, the entire ring or ring system get placed at once. In the case of simple and overlapping rings, ring placement can be done using simple polygon geometry. For bridged rings, atoms are positioned using the force-spring model described by Kamada and Kawai [22], where all atoms of the bridged system are initially placed in a circle, and then pulled towards their optimal positions by minimising the difference between the desired bond length and the distance between neighbouring atoms, and maximising distances between non-neighbouring atoms. Non-ring atoms are positioned a bond length away from the previous atom, where the angle with respect to the previous atom is determined by the number of neighbours the atom has (Figure 2B), and the size of the molecular subtree behind each neighbouring atom (Figure 2C). Stereochemically restricted double bonds are always forced into the appropriate cis- or trans conformation. Unlike SmilesDrawer, which directly infers bond stereochemistry from the SMILES string, PIKAChU draws this information from bond objects stored in the molecular graph.
Once all atoms have been assigned initial coordinates, atoms adjacent to rings are flipped outside of their ring where possible. Then, the drawing is checked for overlaps between atoms, and these overlaps are resolved by rotating branches of the molecule around single bonds. Finally, some bonds are replaced with backward and forward wedges around chiral centres. They are placed such that they do not neighbour more than one chiral centre where possible, they are not part of a ring, and point in the direction of the shortest branch leading from a chiral centre, in that order of priority. The resulting image is subsequently written to a .svg or .png file or displayed directly in matplotlib.
Substructure matching
PIKAChU detects occurrences of a substructure in a superstructure in five steps. In all steps, hydrogens are ignored. First, PIKAChU checks for each atom type in the substructure if enough atoms of these types are accounted for in the superstructure. Second, it assesses for each atom in the substructure whether an atom exists in the superstructure with the same connectivity, looking at directly neighbouring bonds and atoms. Third, using the atom with the most diverse connectivity as a seed, it finds matches of the substructure in the superstructure using a depth-first search algorithm, ignoring stereochemistry. By first looking at atom type and atom connectivity, and by using atoms of diverse connectivity as seeds for substructure matching, the number of calls to the computationally expensive depth-first search function is minimised. Fourth, for each match, it determines if all chiral centres in the substructure have the same orientation as corresponding chiral centres in the superstructure. Fifth, PIKAChU checks if cis-trans orientation of double bonds in the substructure matches that of double bonds in the superstructure. Chiral centre and double bond stereochemistry checks can be toggled by the user independently of one another. If chirality of bonds and atoms are considered, substructures with undefined stereochemistry will still match to parent structures with defined stereochemistry. This does not apply in reverse: if a stereocentre or stereobond is defined for a substructure, it will not match to parent structures with undefined stereochemistry.
Substructures can be easily visualised through a range of functions in PIKAChU’s general library.
Fingerprinting
PIKAChU uses an improved version of the classical Morgan fingerprinting, ECFP [15], to perform similarity searches and convert molecules to bit vectors for machine learning featurisation. Using Python’s inbuilt hashlib library, PIKAChU initialises each atom to a 32-bit hash, derived from a tuple containing information on heavy neighbours, valence, atomic number, atomic weight, charge, hydrogen neighbours, and ring membership. Then, each atom hash is iteratively updated with hashes from its neighbours, as well as the distance from the neighbour to the atom and stereochemical information if the atom is a chiral centre. The number of iterations depends on a radius which can be set to any number (default = 2 for ECFP-4 fingerprinting). The ECFP algorithm was described in detail by Rogers and Hahn in 2010 [15]. Finally, duplicate hashes are removed, as well as different hashes representing the same substructure, yielding a set of 32-bit hashes that constitutes a molecule’s fingerprint.
Using ECFP fingerprinting, PIKAChU can calculate Jaccard/Tanimoto distance and/or similarity between any two molecules. Furthermore, PIKAChU can convert molecule libraries into bit vectors of varying lengths (default = 1024) and an accompanying list of substructures represented by those bit vectors that can be used in downstream machine learning algorithms.
Defining reaction targets
In order to facilitate implementation of reactions and reaction pathways, PIKAChU lets users define target bonds or atoms within substructures with a set of dedicated functions. These functions take a SMILES string representing a substructure, and either one or two integers that define an atom or a bond between two atoms respectively. For example, the SMILES string C(=O)NC, accompanied by the integers 0 (pointing to the first C atom) and 2 (pointing to the N atom), represents a peptide bond. The occurrences of these bonds/atoms are then detected within a superstructure and returned as a list of bonds/atoms. Subsequently, the returned bonds/atoms can be used as reaction targets, for instance for bond hydrolysis or atom methylation, using functions in PIKAChU for breaking or creating bonds and adding or removing atoms.
Characterisation and visualisation of the polyketide ketoreduction reaction
We demonstrated the implementation of reactions using PIKAChU by characterising and visualising a polyketide ketoreduction reaction. We built the ketoreduction reaction by first defining a reaction target as described above, in this case a β-keto bond, and detecting its position in a polyketide chain. Next, we wrote a function that reduces the double carbonyl bond to a single bond, which identifies and removes the π-electrons in the double bond, sets the bond type to single, adjusts the hybridisation of the atoms involved and finally updates the structure object through PIKAChU’s refresh functions. To finalize the reaction, two hydrogen atoms were added to the carbon and oxygen atoms of the former carbonyl bond using PIKAChU’s add_atom function. Finally, to visualise the reaction, we highlighted the atoms and bonds of the newly formed hydroxyl group in red and drew the molecule.
Detailed instructions on how to make full use of PIKAChU’s range of functionalities, as well as the script used to implement the ketoreduction reaction, can be found in the online documentation.
Speed assessment
For PIKAChU and RDKit, speed was assessed by timing the drawing of all molecules in NP Atlas. We used the Linux ‘time’ command with the calls to the Python scripts performing the visualisation. To assess SmilesDrawer’s speed, we used the developers’ online drawing portal, which also reports computational time for generating a drawing. For ChemDraw, drawing speed could not be assessed accurately, but for the molecules we tested image generation seemed instant.