Algorithm
The sugar removal algorithm is implemented in Java 11 using the Chemistry Development Kit (CDK) [19] version 2.3. The application and its freely available source code can be downloaded from GitHub (https://github.com/JonasSchaub/SugarRemoval). The SRU offers multiple functionalities to detect and remove sugar moieties from submitted molecules, along with a range of options to configure these processes for a specific application. For greater modularity, the detection of sugar moieties (Figure 2) and their removal (Figure 3) are performed in separate, sequential steps, described below.
Detection of candidate structures for sugar moieties
Glycosidic substructures in a query molecule are detected separately for circular and linear sugar moieties, so that an approach tailored to each of these structurally different substructure types can be used and detection is as precise as possible.
Detection of circular sugar candidates
The detection of candidate structures for circular sugar moieties is done in three steps. First, using the CDK class RingSearch, all isolated cycles are extracted from the molecule. An isolated cycle has at most one atom in common with another cycle or cyclic system, as opposed to a ‘fused’ cycle, which shares more than one atom with others [20]. The definition of isolated cycles includes spiro ring systems, where two cycles share one atom (Figure 4a). By default, spiro rings are filtered out of the detected isolated rings, but an option can be set to include them in the detection of circular sugars. Next, the detected cycles are matched against the predefined patterns for circular sugars. By default, these are tetrahydrofuran, tetrahydropyran, and oxepane, matching five-, six-, and seven-membered sugar rings (Figure 5). The SRU offers the option to add further rings to this list, such as oxocane to match eight-membered sugar rings, or even to supply a custom collection of circular sugars to be detected. Only candidate moieties that match the given substructures are kept for the next step. Last, all rings that have exocyclic double or triple bonds are discarded.
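The three filtering steps can be illustrated with a minimal, library-free sketch. It does not use the CDK and is not the SRU implementation: the `Ring` class, its fields, and the `filter` method are hypothetical, and pattern matching is approximated by checking the ring size and requiring exactly one ring oxygen (ring saturation is assumed and not checked).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CircularCandidateFilter {

    /** Hypothetical, simplified ring description; not a CDK type. */
    static class Ring {
        final String name;
        final int size;                      // number of atoms in the ring
        final int ringOxygens;               // oxygen atoms that are ring members
        final boolean spiro;                 // shares exactly one atom with another ring
        final boolean exocyclicMultipleBond; // carries an exocyclic double or triple bond

        Ring(String name, int size, int ringOxygens, boolean spiro, boolean exocyclicMultipleBond) {
            this.name = name;
            this.size = size;
            this.ringOxygens = ringOxygens;
            this.spiro = spiro;
            this.exocyclicMultipleBond = exocyclicMultipleBond;
        }
    }

    /** Applies the three filters described in the text to a list of isolated rings. */
    static List<Ring> filter(List<Ring> isolatedRings, Set<Integer> allowedSizes, boolean includeSpiro) {
        List<Ring> kept = new ArrayList<>();
        for (Ring ring : isolatedRings) {
            // Step 1: spiro rings are excluded unless the option is set.
            if (ring.spiro && !includeSpiro) continue;
            // Step 2: stand-in for pattern matching (allowed size, one ring oxygen).
            if (!allowedSizes.contains(ring.size) || ring.ringOxygens != 1) continue;
            // Step 3: rings with exocyclic double or triple bonds are discarded.
            if (ring.exocyclicMultipleBond) continue;
            kept.add(ring);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Ring> rings = List.of(
            new Ring("pyranose", 6, 1, false, false),
            new Ring("cyclohexane", 6, 0, false, false),
            new Ring("lactone-like ring", 6, 1, false, true));
        for (Ring ring : filter(rings, Set.of(5, 6, 7), false)) {
            System.out.println(ring.name);
        }
    }
}
```

Adding 8 to the set of allowed sizes corresponds to adding oxocane to the pattern list, and passing `true` as the last argument corresponds to including spiro rings.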
Two additional options can also be selected for an even more specific circular sugar detection: detection of glycosidic bonds and counting of connected exocyclic oxygen atoms. The glycosidic bond option should be selected if only sugars attached to the parent structure or to another sugar moiety by an O-glycosidic bond are to be removed. Sugar moieties connected to other substructures in the molecule by a carbon-carbon bond or by an S-, C-, or N-glycosidic bond instead of an O-glycosidic bond are then preserved. Note, however, that molecules that are themselves single-cycle circular sugars are not discarded even with this option selected. They are still treated as sugar candidates to be removed because there is no other structure in the molecule to bind to via an O-glycosidic bond. It is also important to note that the algorithm detects a glycosidic bond as an oxygen atom connected by single bonds to the sugar ring, at any position, and to another non-hydrogen atom. This definition is not very strict and includes non-classical glycosidic linkages such as ester bonds (compare Figure 1b, in blue on the left-hand side of the circular sugar moiety).
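The permissive definition of a glycosidic bond given above can be written down as a small check on a toy molecular graph. This is only a sketch under stated assumptions: the `Bond` class, the array of element symbols, and `hasGlycosidicBond` are hypothetical and not part of the SRU or the CDK, and the second neighbour of the oxygen is assumed to lie outside the ring.

```java
import java.util.List;
import java.util.Set;

class GlycosidicBondSketch {

    /** Hypothetical minimal bond type: indices of the two atoms and the bond order. */
    static class Bond {
        final int a;
        final int b;
        final int order;

        Bond(int a, int b, int order) {
            this.a = a;
            this.b = b;
            this.order = order;
        }
    }

    /**
     * Returns true if an exocyclic oxygen is singly bonded to a ring atom
     * and singly bonded to another non-hydrogen atom outside the ring.
     */
    static boolean hasGlycosidicBond(String[] symbols, List<Bond> bonds, Set<Integer> ringAtoms) {
        for (int o = 0; o < symbols.length; o++) {
            if (!symbols[o].equals("O") || ringAtoms.contains(o)) continue;
            boolean bondedToRing = false;
            boolean bondedToOtherHeavyAtom = false;
            for (Bond bond : bonds) {
                if (bond.a != o && bond.b != o) continue; // bond does not involve this oxygen
                if (bond.order != 1) continue;            // only single bonds count
                int neighbour = (bond.a == o) ? bond.b : bond.a;
                if (ringAtoms.contains(neighbour)) {
                    bondedToRing = true;
                } else if (!symbols[neighbour].equals("H")) {
                    bondedToOtherHeavyAtom = true;
                }
            }
            if (bondedToRing && bondedToOtherHeavyAtom) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Ring atoms 0-5 (atom 5 is the ring oxygen), exocyclic O (6) linked to a carbon (7).
        String[] symbols = {"C", "C", "C", "C", "C", "O", "O", "C"};
        List<Bond> bonds = List.of(
            new Bond(0, 1, 1), new Bond(1, 2, 1), new Bond(2, 3, 1),
            new Bond(3, 4, 1), new Bond(4, 5, 1), new Bond(5, 0, 1),
            new Bond(0, 6, 1), new Bond(6, 7, 1));
        System.out.println(hasGlycosidicBond(symbols, bonds, Set.of(0, 1, 2, 3, 4, 5)));
    }
}
```

Because only the two single bonds around the oxygen are inspected, an ester oxygen attached to the ring also satisfies the check, which mirrors the non-classical linkages mentioned above.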
The second optional circular sugar detection step consists of counting connected exocyclic oxygen atoms and discarding substructures that do not have a sufficient number of them. This number is defined by a ratio of connected exocyclic oxygen atoms to the number of atoms in the ring, which can be configured in the SRU. A ratio of 0.5, for example, means that a six-membered suspected sugar ring needs at least three connected exocyclic oxygen atoms to be regarded as a sugar moiety that should be removed. All candidates that do not reach this threshold are discarded and therefore not treated as removal-worthy sugar moieties. In the web application, only the default threshold is available.
All circular sugar candidates that have passed these steps are then handed over to the sugar removal step.
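The ratio test amounts to a single comparison. The following sketch uses a hypothetical method name and signature, not the SRU API, and reproduces the example from the text (ratio 0.5, six-membered ring, at least three exocyclic oxygens).

```java
class ExocyclicOxygenRatio {

    /**
     * Returns true if the number of exocyclic oxygen atoms attached to the ring,
     * divided by the number of ring atoms, reaches the configured threshold.
     */
    static boolean hasEnoughExocyclicOxygens(int exocyclicOxygens, int ringSize, double threshold) {
        if (ringSize <= 0) {
            throw new IllegalArgumentException("ringSize must be positive");
        }
        return (double) exocyclicOxygens / ringSize >= threshold;
    }

    public static void main(String[] args) {
        // Six-membered ring with three and with two attached exocyclic oxygens.
        System.out.println(hasEnoughExocyclicOxygens(3, 6, 0.5));
        System.out.println(hasEnoughExocyclicOxygens(2, 6, 0.5));
    }
}
```

With the same ratio, a five-membered ring also needs three exocyclic oxygens, since two out of five falls below 0.5.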
Detection of linear sugar candidates
The detection of candidate structures for linear sugar moieties (single-bonded, simple carbon chains where nearly all carbon atoms have one hydroxy or keto group) is performed by substructure matching against the whole molecule in five steps. First, a predefined set of linear sugar structures is matched to the query molecule using the CDK class DfPattern, and all matching substructures are treated as primary linear sugar candidates. This predefined set contains multiple aldoses, ketoses, and sugar alcohols with 3 to 7 carbon atoms (Figure 6). It has been compiled with special regard to the occurrence of linear sugars in NPs and can be modified to suit specific needs. One possible modification of the set is the addition of five sugar-acid structures that are not included by default (Figure 6).
The substructures extracted by pattern matching in this first step may overlap, which can lead to ambiguities in the following steps. Therefore, in the second step, all overlapping candidates are combined into one single candidate structure. The output of this step is a set of distinct, non-overlapping sugar-like substructures of the query molecule. However, this step may also combine substructures into one linear sugar candidate when they should be regarded as multiple, inter-linked sugar units. To separate these, in the third step, candidates are split at ether, ester, and peroxide bonds (Figure 7a, b and c), resulting in clean, distinct candidates (Figure 7d and Figure 1c). Only bonds that are located in a cycle are left intact, to facilitate the detection of circular sugars among the linear sugar candidates in the following step. For example, the six-carbon sugar alcohol hexitol (Figure 8a), which is part of the linear sugar pattern set, matches an α-glucose sugar ring (Figure 8b), and through the combination of overlapping matches, the whole sugar ring gets extracted as a linear sugar candidate. Therefore, in the fourth step, to detect linear and circular sugar moieties separately, all atoms that are part of circular sugar moieties (i.e. isolated, non-spiro cycles that match the circular sugar patterns and have only exocyclic single bonds) are discarded. However, this does not guarantee that the remaining candidates for removal contain no bigger cycles or parts of them. For instance, in NPs, linear sugar moieties may be substructures of macrocycles, as is the case in ossamycin (Figure 9a). Pseudosugars (Figure 9b), molecules that differ from true circular sugars only by the absence of an oxygen atom in their ring [6], must also be taken into account. They are undetectable by the presented circular sugar detection algorithm, but they can still be among the detected linear sugar candidates at this stage.
The removal of these linear sugars would, therefore, break the macrocycles or pseudosugars. To avoid this, an optional step excludes linear sugars that are part of cycles from removal. When it is selected, all ring atoms are removed from the candidate substructures. Finally, the last step of the linear sugar detection checks the size of the detected candidate substructures. By default, all linear sugars that have fewer than four or more than seven carbon atoms are discarded, but these thresholds can be manually configured in the standalone application. The algorithm returns all candidate structures for linear sugar moieties that have been selected as substructures that should be removed.
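Two of the steps described above, the combination of overlapping matches and the final size check, lend themselves to a compact sketch. It is library-free and not the SRU implementation: candidates are represented simply as sets of atom indices, and the class and method names are hypothetical. The splitting at ether, ester, and peroxide bonds and the handling of ring atoms are not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class LinearCandidateSketch {

    /** Combines all matches that share at least one atom index into one candidate. */
    static List<Set<Integer>> mergeOverlapping(List<Set<Integer>> matches) {
        List<Set<Integer>> merged = new ArrayList<>();
        for (Set<Integer> match : matches) {
            Set<Integer> current = new TreeSet<>(match);
            List<Set<Integer>> untouched = new ArrayList<>();
            for (Set<Integer> group : merged) {
                if (Collections.disjoint(group, current)) {
                    untouched.add(group);   // no shared atom, keep as separate candidate
                } else {
                    current.addAll(group);  // shared atom, absorb into the current candidate
                }
            }
            untouched.add(current);
            merged = untouched;
        }
        return merged;
    }

    /** Size check: keeps candidates whose carbon count lies within [minCarbons, maxCarbons]. */
    static boolean hasAllowedSize(int carbonCount, int minCarbons, int maxCarbons) {
        return carbonCount >= minCarbons && carbonCount <= maxCarbons;
    }

    public static void main(String[] args) {
        List<Set<Integer>> matches = List.of(Set.of(0, 1, 2), Set.of(2, 3, 4), Set.of(7, 8, 9));
        System.out.println(mergeOverlapping(matches).size() + " candidates after merging");
        System.out.println(hasAllowedSize(3, 4, 7)); // a three-carbon chain with the default limits
    }
}
```

The values 4 and 7 passed to `hasAllowedSize` are the default limits named in the text; in the standalone application they would be replaced by the configured thresholds.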
Removal of detected sugar moieties
The removal of sugar moieties follows the same steps for both linear and circular sugars. It is possible to remove all detected sugar moieties or only the terminal ones (Figure 1a and b, Figure 10, and Figure 11). In the first case, the returned deglycosylated molecule may consist of two or more disconnected structures. In the latter case, a recursive algorithm picks one candidate at a time and removes it if it is terminal, until no further terminal candidate can be found. The deglycosylated molecule therefore always consists of one connected structure.
The determination of terminal and non-terminal moieties heavily depends on an option named “preservation mode”. This option determines whether a substructure that gets disconnected from the molecule by the removal of a sugar moiety is worth keeping or can be removed along with the sugar. The best example of where this is relevant is the hydroxy groups of circular sugars. Following the algorithm presented above for the detection of circular sugars, these groups are not handled as part of the sugar candidate structure, even though their occurrence may be taken into account when deciding whether to remove a sugar ring (see the optional step above). When the ring is removed in this step, the hydroxy groups and all other structures formerly attached to the cycle get disconnected from the remaining structure. One by one, they are then evaluated according to the set preservation mode and either removed or kept as disconnected structures. If they are all removed, the removed sugar ring qualifies as terminal. If any of them is kept, the ring does not qualify as terminal and is therefore not removed when only terminal sugar moieties are removed. For a sugar moiety to be terminal, it is also a necessary condition that no structure belonging to another sugar candidate gets disconnected by the removal of the candidate in question.
The “preservation mode” has three different settings available:
- Keep all structures. If only terminal moieties are removed, no sugar ring that has any hydroxy groups gets removed.
- Judge by a heavy atom count threshold.
- Judge by a molecular weight threshold.
These settings are mutually exclusive, and the default threshold values of the second and third settings (five heavy atoms and 60 Da, respectively) can be altered.
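The three settings can be pictured as an enumeration with one decision function. The sketch below is illustrative only: the enum constants, the method name, and its parameters are hypothetical, and it assumes that a fragment is kept when it reaches the threshold (greater than or equal), which the text does not state explicitly.

```java
class PreservationSketch {

    /** Hypothetical names for the three settings of the preservation mode. */
    enum PreservationMode { ALL, HEAVY_ATOM_COUNT, MOLECULAR_WEIGHT }

    /**
     * Decides whether a disconnected fragment is kept.
     * Assumption: fragments that reach the threshold are kept.
     */
    static boolean isWorthPreserving(PreservationMode mode, int heavyAtomCount,
                                     double molecularWeightDa, double threshold) {
        switch (mode) {
            case ALL:
                return true; // every disconnected structure is kept
            case HEAVY_ATOM_COUNT:
                return heavyAtomCount >= threshold;
            case MOLECULAR_WEIGHT:
                return molecularWeightDa >= threshold;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // A hydroxy group: one heavy atom, about 17 Da; default thresholds from the text.
        System.out.println(isWorthPreserving(PreservationMode.ALL, 1, 17.0, 0));
        System.out.println(isWorthPreserving(PreservationMode.HEAVY_ATOM_COUNT, 1, 17.0, 5));
        System.out.println(isWorthPreserving(PreservationMode.MOLECULAR_WEIGHT, 1, 17.0, 60));
    }
}
```

Under the first setting the hydroxy group is kept, which is why a sugar ring carrying hydroxy groups never qualifies as terminal in that mode.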
If only terminal sugar moieties are to be removed from the molecule, every disconnected structure resulting from a removal step is, by definition, too small to preserve according to the preservation mode and is cleared away immediately. If all candidate sugars are to be removed from the query molecule, the disconnected structures that are too small are cleared only once, at the end of the routine. If multiple disconnected structures remain, routines of the SRU can be used either to select the biggest remaining substructure or to split them into separate entities and sort them. Note again that when all circular and linear sugars are removed, the routine is run only once, whereas when only terminal sugars are removed, the routine is iterated several times to keep the parent structure in one piece. The iteration also makes it possible to detect and remove, for example, a linear sugar moiety that only becomes terminal after the removal of a circular moiety, and vice versa (Figure 1a). This is the reason why the detection and removal of all terminal sugars at once may, in some particular cases, produce a slightly different deglycosylated parent structure than a sequential detection and removal of circular sugars followed by linear sugars, which is also possible with the present implementation.
If the option to also detect spiro rings as possible sugar rings was chosen for the detection of circular sugar moieties, the atom that such a ring shares with another ring is not removed, so that the adjacent cycle is not broken up (Figure 4b).
A molecule composed only of sugars (Figure 12) will be completely removed, and an empty object returned. However, if a molecule is composed of several sugar units that are not linked by O-glycosidic bonds, and the detection of O-glycosidic bonds is set, the query molecule will be returned unaltered. As mentioned before, only molecules that are themselves single-cycle sugars are exempt from this option. In the case of commonly known sugars, like lactose (Figure 12b) or sucrose, which are disaccharides linked by glycosidic bonds, both sugar moieties are detected and removed by the SRU.
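The repeated removal of terminal moieties can be illustrated on a coarse graph in which each node stands for a whole structural unit (a sugar candidate or a non-sugar part). This is a simplified sketch, not the SRU code: it is written iteratively rather than recursively, it ignores the preservation mode, and it treats a sugar unit as terminal when its removal leaves the remaining units connected. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class TerminalRemovalSketch {

    /** Returns true if all units still present form one connected structure (or none is left). */
    static boolean isConnected(List<List<Integer>> adjacency, boolean[] present) {
        int start = -1;
        int total = 0;
        for (int i = 0; i < present.length; i++) {
            if (present[i]) {
                total++;
                if (start < 0) start = i;
            }
        }
        if (total == 0) return true;
        boolean[] seen = new boolean[present.length];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        seen[start] = true;
        int reached = 0;
        while (!stack.isEmpty()) {
            int node = stack.pop();
            reached++;
            for (int next : adjacency.get(node)) {
                if (present[next] && !seen[next]) {
                    seen[next] = true;
                    stack.push(next);
                }
            }
        }
        return reached == total;
    }

    /** Removes terminal sugar units until no further terminal unit is found. */
    static boolean[] removeTerminalSugars(List<List<Integer>> adjacency, boolean[] isSugar) {
        boolean[] present = new boolean[isSugar.length];
        Arrays.fill(present, true);
        boolean removedSomething = true;
        while (removedSomething) {
            removedSomething = false;
            for (int i = 0; i < isSugar.length; i++) {
                if (!present[i] || !isSugar[i]) continue;
                present[i] = false;              // tentatively remove the unit
                if (isConnected(adjacency, present)) {
                    removedSomething = true;     // terminal: removal is kept
                } else {
                    present[i] = true;           // non-terminal: restore the unit
                }
            }
        }
        return present;
    }

    public static void main(String[] args) {
        // Aglycon (0) linked to sugar (1), which is linked to a second sugar (2).
        List<List<Integer>> adjacency = List.of(List.of(1), List.of(0, 2), List.of(1));
        System.out.println(Arrays.toString(removeTerminalSugars(adjacency, new boolean[]{false, true, true})));
    }
}
```

In the chain used in `main`, the inner sugar only becomes terminal after the outer one has been removed, which is why a single pass over the candidates is not sufficient.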
Molecules that do not contain any of the sugar moieties selected for removal are returned unaltered.
Web application
The single-page web application for removing sugar units is freely available at https://sugar.naturalproducts.net/. It is implemented in Java 11 using Spring Boot MVC and JavaScript. The corresponding code for this web application is available at https://github.com/mSorok/SugarRemovalWeb. The web application implements the functionalities available in the standalone application, such as the removal of both linear and circular sugars, terminal and non-terminal, with default options. For ring sugar removal, it is also possible to use the O-glycosidic bond option to remove only sugars attached to the rest of the molecule by such a bond. The size of linear sugars to be removed is set between four and seven carbon atoms, and that of ring sugars between five and seven ring atoms. Linear sugars that are part of bigger cyclic structures are not removed. Only deglycosylated substructures of more than four heavy atoms are returned. A query molecule can be submitted in three ways: by uploading a file (SDF, MOL or SMILES), by directly pasting a SMILES string, or by drawing the molecular structure. The result of the deglycosylation is displayed in a table containing structures and SMILES representations of the submitted molecule(s) together with the produced deglycosylated moieties. The result table can be exported in CSV format or copied to the clipboard.