TFinder: a Python web tool for predicting Transcription Factor Binding Sites

doi:10.21203/rs.3.rs-3782387/v1

This work is licensed under a CC BY 4.0 License

Version 1

Background: Transcription is a key cellular process that consists of synthesizing multiple RNA copies from the DNA sequence of a gene. This process is highly regulated and closely linked to the ability of transcription factors to bind specifically to DNA. TFinder is an easy-to-use Python web portal for identifying Individual Motifs (IM) such as Transcription Factor Binding Sites (TFBS).

Results: Using the NCBI API, TFinder extracts either promoter or gene terminal regulatory regions, through a simple query of NCBI gene name or ID. It enables simultaneous analysis across five different species for an unlimited number of genes. TFinder searches for Individual Motifs in different formats, including IUPAC codes and JASPAR entries. Moreover, TFinder also allows de novo generation of a Position Weight Matrix (PWM) and the use of already established PWM. Finally, the data are provided in a tabular and a graph format showing the relevance and the P-value of the Individual Motifs found as well as their location relative to the Transcription Start Site (TSS) or the terminal region of the gene. The results are then sent by email to users facilitating the subsequent data analysis and sharing.

Conclusion: TFinder is written in Python and freely available on GitHub under the MIT license: https://github.com/Jumitti/TFinder. It can be accessed as a web application implemented in Streamlit at https://tfinder-ipmc.streamlit.app. Resources are available in the “Resources” tab of the Streamlit application. The strength of TFinder is that it is an intuitive all-in-one tool that allows users inexperienced with bioinformatics tools to retrieve gene regulatory region sequences in multiple species and to search for individual motifs in a large number of genes.

Keywords: software, prediction, transcription factor binding sites, promoter

A DNA Individual Motif (IM) is a short nucleotide sequence conserved between species to which proteins such as Transcription Factors (TFs) can specifically bind and trigger gene regulation. TFs specifically recognize a nucleotide IM called a Transcription Factor Binding Site (TFBS), mostly in gene promoter or terminator regions. TFBS characterization is an empirical discipline of genomics and is a key step prior to the functional validation of a TFBS by either gel shift assays (EMSA) or chromatin immunoprecipitation (ChIP). Both techniques examine the physical interaction between a TF and a DNA sequence (1).

The in-silico search for IMs can be tedious and time-consuming, especially for academics or biologists not familiar with bioinformatics. First, it is necessary to retrieve the sequences of regulatory regions (regulatory DNA elements, promoter and terminator). Several databases such as NCBI, UCSC or Ensembl can be used, but they are neither intuitive nor user-friendly for a novice. Second, after recovering the regulatory region sequences, one can use TF databases such as JASPAR (2) and TRANSFAC (3) to identify the IM/TFBS of interest, but these have their limitations. For instance, these platforms do not permit the search of TFBSs for an unreferenced TF and may be subject to fees. Other tools such as PROMO (4) or TFBIND (5) allow searching for multiple TFBSs in a single nucleotide sequence, but they all rely on the JASPAR and TRANSFAC databases and cannot search for a custom IM. Some tools, like PWMScan (6), can search for custom IMs, but only across an entire genome; they therefore require significant data processing and expertise in bioinformatics. Finally, a few web tools such as FiMO, a module of the MEME Suite (7, 8), or RSAT Metazoa (9) enable queries with an unreferenced IM/TFBS. FiMO and RSAT are very powerful and allow very in-depth analyses. However, they can generate a large amount of data, and scientists not familiar with bioinformatics may struggle to interpret it.

TFinder is an intuitive, easy-to-use, fast and open-source software that permits both the retrieval of sequences and the search for IMs in a single web application. We wanted to create software that is easy to use and accessible mainly to academic biologists and scientists not familiar with bioinformatics or heavy data processing. TFinder (1) quickly analyzes an unlimited number of genes; (2) allows the selection of up to five different species (human, mouse, rat, drosophila, zebrafish); (3) screens the promoter and/or terminator regions of genes; (4) searches for an IM/TFBS given as an IUPAC code, a JASPAR ID or a Position Weight Matrix (PWM); (5) searches for the four “written forms” that an IM/TFBS sequence can take (sense, antisense, complementary and reverse-complementary); and (6) exports the resulting analyses by email.

TFinder is written in Python 3. The source code is available at https://github.com/Jumitti/TFinder and the tool can be accessed as a web application implemented in Streamlit at https://tfinder-ipmc.streamlit.app. More details are available in Fig. 2 and in the “Resources” tab of TFinder.

TFinder makes it easy to recover the regulatory sequences of genes from gene names, ENTREZ_GENE_IDs or transcript variants (XM, NM, XR and NR accession codes) thanks to the NCBI API (10). By configuring the species, the gene region and the number of upstream/downstream base pairs, TFinder extracts an unlimited number of sequences in FASTA format, oriented from the 5′ end to the 3′ end of the gene. The coordinates of the Transcription Start Site (TSS) and of the gene end are found in GenBank in the ACCESSION category and refer to the chromosomal location. We use these coordinates to directly extract the surrounding region via the NCBI API. TFinder displays the sequences with the corresponding chromosomal coordinates (see “Resources” on TFinder for more details). It is also possible to paste any nucleotide sequence recovered from other databases.
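To make the extraction step more concrete, the following Python sketch computes the chromosomal window around a TSS and builds the corresponding NCBI E-utilities `efetch` request. It is a minimal illustration and not the TFinder source code: the function names, the default window sizes and the example coordinate are assumptions, and the way TFinder actually resolves a gene name to its coordinates is not shown.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def regulatory_window(tss, strand, upstream, downstream):
    """Return (start, stop) chromosomal coordinates around a TSS.

    Coordinates are 1-based and inclusive; strand is "+" or "-".
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        # On the minus strand, "upstream" of the gene lies at higher coordinates.
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def efetch_fasta_url(chrom_accession, tss, strand, upstream=2000, downstream=500):
    """Build an efetch URL returning the window as FASTA (hypothetical helper)."""
    start, stop = regulatory_window(tss, strand, upstream, downstream)
    params = {
        "db": "nuccore",
        "id": chrom_accession,
        "seq_start": start,
        "seq_stop": stop,
        "strand": 1 if strand == "+" else 2,  # 2 requests the reverse complement
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Arbitrary coordinate used only to show the call; it is not a real TSS.
example_url = efetch_fasta_url("NC_000012.12", tss=1000, strand="-",
                               upstream=200, downstream=50)
```

Requesting the minus strand returns the reverse complement, so the sequence reads from the 5′ end to the 3′ end of the gene in both cases, as described above.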

The IM/TFBS used in the query can be a custom IM/TFBS not listed in the databases. TFinder supports the IUPAC code. It uses a PWM to reduce calculation time and generates a sequence logo with Logomaker (11). Similarly, several TFBS sequences, experimentally validated or not, can be used to generate a PWM that then serves as the IM/TFBS. Importantly, TFinder accepts JASPAR IDs thanks to the JASPAR API (10, 11). When the gene region extraction tool is used, TFinder indicates the position of the IM found relative to the TSS or to the end of the gene (Rel Position). By default, it uses the start of the nucleotide sequence as the reference position (Position). For each k-mer of the PWM’s length (a nucleotide sequence as long as the PWM), a score is calculated by summing the frequencies of the corresponding nucleotide (A, T, C or G) at each position. To refine the k-mer score, the Relative Score (Rel Score) is calculated as follows:

$$\text{Relative Score}=\frac{\text{Score}_{k\text{-mer}}-\text{Min Score}_{\text{PWM}}}{\text{Max Score}_{\text{PWM}}-\text{Min Score}_{\text{PWM}}}$$
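As an illustration of this scoring scheme, the following Python sketch builds a frequency matrix from aligned sites, sums the per-position frequencies for each k-mer and normalizes the result into a Rel Score. It is a minimal sketch and not the TFinder source code: the function names, the absence of pseudocounts, the example sites and the example threshold are assumptions made for illustration.

```python
BASES = "ACGT"


def build_pwm(sites):
    """Column-wise nucleotide frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        column = [s[i].upper() for s in sites]
        pwm.append({b: column.count(b) / len(sites) for b in BASES})
    return pwm


def kmer_score(pwm, kmer):
    """Sum of the frequencies of the observed nucleotide at each position."""
    return sum(col[base] for col, base in zip(pwm, kmer.upper()))


def relative_score(pwm, kmer):
    """(score - min) / (max - min), using the extreme scores of the PWM."""
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    if hi == lo:
        raise ValueError("uninformative PWM: max score equals min score")
    return (kmer_score(pwm, kmer) - lo) / (hi - lo)


def scan(pwm, sequence, threshold=0.8):
    """Return (position, k-mer, Rel Score) for windows at or above threshold."""
    k = len(pwm)
    hits = []
    for pos in range(len(sequence) - k + 1):
        kmer = sequence[pos:pos + k].upper()
        if set(kmer) <= set(BASES):  # skip windows containing N or other codes
            rel = relative_score(pwm, kmer)
            if rel >= threshold:
                hits.append((pos, kmer, rel))
    return hits


# Made-up sites for illustration only.
pwm = build_pwm(["GCCGGAG", "GCCGGAG", "GCCGGAT"])
hits = scan(pwm, "TTGCCGGAGTT", threshold=0.9)
```

Positions returned by `scan` are 0-based offsets from the start of the sequence; converting them to a position relative to the TSS depends on how many upstream bases were extracted.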

The Rel Score uses the maximum and minimum scores of the reference PWM. It measures the similarity between the IM identified by TFinder and the reference IM. Thus, the closer the Rel Score is to 1, the more likely it is that a true positive IM is present in the analyzed sequence. TFinder applies an automatic Rel Score threshold to filter out less relevant results but allows users to adjust this parameter. TFinder also provides a statistical assessment of the Rel Score by calculating a P-value according to the following formula:

$$P\text{-value}=\frac{\text{Nb}\left(\text{Rel Score}_{\text{random }k\text{-mer}}\ge \text{Rel Score}_{\text{IM/TFBS}}\right)}{\text{Nb}_{\text{random }k\text{-mer}}}$$

For this purpose, one million random k-mers are generated according to the proportions of A, T, G and C in the analyzed sequence. TFinder also allows the nucleotide frequencies to be set manually (A: 0.275, T: 0.275, G: 0.225, C: 0.225). Since the Streamlit cloud server has resource limits, the nucleotide frequency can be derived from a maximum of ten sequences for the calculation of the P-value. For each randomly generated k-mer, the Rel Score is calculated as described above. The P-value represents the probability that a random sequence has a Rel Score equal to or greater than the Rel Score of the IM/TFBS found.
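The empirical P-value described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the TFinder implementation: the function names, the use of the standard `random` module and the optional seed are choices made here, and the PWM is assumed to be a list of per-position frequency dictionaries.

```python
import random

BASES = "ACGT"


def background_from_sequence(sequence):
    """Proportions of A, C, G and T in the analyzed sequence."""
    seq = sequence.upper()
    counts = {b: seq.count(b) for b in BASES}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in BASES}


def empirical_p_value(pwm, hit_rel_score, background,
                      n_random=1_000_000, seed=None):
    """Fraction of random k-mers whose Rel Score is >= the Rel Score of the hit."""
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    if hi == lo:
        raise ValueError("uninformative PWM: max score equals min score")
    rng = random.Random(seed)
    weights = [background[b] for b in BASES]
    k = len(pwm)
    at_least = 0
    for _ in range(n_random):
        # Draw one random k-mer according to the background frequencies.
        kmer = rng.choices(BASES, weights=weights, k=k)
        score = sum(col[base] for col, base in zip(pwm, kmer))
        if (score - lo) / (hi - lo) >= hit_rel_score:
            at_least += 1
    return at_least / n_random


# Toy PWM with a perfect "GCC" consensus, used only for illustration.
toy_pwm = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
uniform = {b: 0.25 for b in BASES}
p = empirical_p_value(toy_pwm, 1.0, uniform, n_random=20_000, seed=0)
```

Because the random k-mers are regenerated on every call, two runs without a fixed seed give slightly different P-values for the same hit. With a uniform background, the toy example should give a value close to (1/4)^3, the probability of drawing the exact consensus.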

The results are presented as a table and a graph showing the Rel Score of the IM found (IM found = the line "sequence") as a function of its Position or Rel Position. Finally, the results may be exported to facilitate subsequent data analysis and sharing. The TFinder workflow is summarized in Fig. 1.

Fig. 1. The user provides one or several gene names or IDs, then TFinder extracts the regulatory regions of the genes using the NCBI API. Five species are available for screening. TFinder then searches for the Individual Motif (IM) of the user's choice. TFinder accepts TFBSs from JASPAR and Position Weight Matrices (PWM). The results are displayed in a table and an interactive graph. The table provides the position relative to the TSS and/or gene end (Rel Position), the Relative Score (Rel Score), the IM found and the P-value. The graph represents the Rel Score as a function of the position. Data can be downloaded and exported by e-mail.

We provide here the analysis of the promoter region of the human NOS1 gene obtained with TFinder and with other tools so that users can compare them. The extraction parameters are given in the legend of Fig. 2A. To properly evaluate the results given by the three software programs (FiMO, RSAT and TFinder), it is necessary to understand how the P-values are calculated. As stated above, the nucleotide frequencies can either be imposed or derived from the sequence being analyzed. These nucleotide frequencies are used to generate a large number of random sequences of the same size as the pattern searched for. Fig. 2B shows the nucleotide frequencies used by FiMO, RSAT and TFinder. These frequencies were recovered directly from the software in question. FiMO uses imposed frequencies, whereas RSAT and TFinder use nucleotide frequencies based on the sequence of interest. Interestingly, RSAT and TFinder do not report the same frequencies. We have no explanation for this; it may reflect an artificial background noise added by RSAT. The TFinder frequencies sum to 1, unlike those of RSAT. We repeated the analysis with several sequences and obtained the same result. To understand why nucleotide frequencies faithfully based on the sequence of interest are preferable to fixed frequencies, we generated 24,000 random truncations ranging in size from 100 bp to 20,000 bp in the human genome (GRCh38). We calculated their nucleotide frequencies and computed a dispersion coefficient from them. This dispersion coefficient represents the distance between the different nucleotide frequencies: if they are all equal to 0.25, the dispersion coefficient is 0. Fig. 2C shows that the dispersion coefficient of the nucleotide frequencies of the entire human genome is approximately 0.173.
This coefficient is significantly higher than the coefficient calculated with the FiMO nucleotide frequencies. Furthermore, Fig. 2D shows that the shorter the sequences, the higher the coefficient. Even with truncations of 20,000 bp, we never reached the dispersion coefficient of the human genome. This suggests that, when calculating a P-value, it is important to use nucleotide frequencies based on the sequence of interest to avoid overestimating the P-value. This is especially true for sequences between 100 bp and 1500 bp, since the dispersion stabilizes for longer sequences. We then searched for the GCCGGAG motif in the NOS1 promoter sequence using the different software programs. Fig. 3A represents the nomenclature and the patterns searched by FiMO, RSAT and TFinder. All three programs search for the pattern on the + and - strands in the 5' to 3' orientation (in green). TFinder takes the analysis further and also searches for the pattern in the 3' to 5' orientation on the + and - strands (in blue). Fig. 3B-D shows the five best results obtained with FiMO (Fig. 3B), RSAT (Fig. 3C) and TFinder (Fig. 3D). Close analysis of the data shows that the only parameter common to the three programs is the P-value. Interestingly, the P-values of the TFinder hits are comparable to those of FiMO, but significantly different from those of RSAT. Even when TFinder calculates the P-value in the same way as FiMO, the results differ slightly because TFinder randomly generates, for each job, the million sequences needed for the P-value calculation. Interestingly, when we asked TFinder to calculate the P-value with the nucleotide frequencies of the sequence rather than those of FiMO, we observed a 10-fold difference, showing that a P-value calculated with fixed frequencies, as done by FiMO, is overestimated.
Another consequence of this random generation of sequences is that each query performed by TFinder may give a slightly different P-value for the same site (the differences never reach a factor of 10).
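The text does not give the exact formula of the dispersion coefficient, only that it measures the distance between the four nucleotide frequencies and equals 0 when they are all 0.25. The Python sketch below therefore assumes one plausible definition, the coefficient of variation (population standard deviation divided by the mean) of the four frequencies. This definition and the function names are assumptions, and the sketch is not claimed to reproduce the values reported in Fig. 2.

```python
from statistics import pstdev


def nucleotide_frequencies(sequence):
    """Frequencies of A, C, G and T (in that order), ignoring other symbols."""
    seq = sequence.upper()
    counts = [seq.count(b) for b in "ACGT"]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return [c / total for c in counts]


def dispersion_coefficient(freqs):
    """Assumed definition: population standard deviation / mean of the frequencies."""
    mean = sum(freqs) / len(freqs)
    return pstdev(freqs) / mean


# Equal frequencies give 0; the fixed frequencies quoted earlier in the text
# (A: 0.275, T: 0.275, G: 0.225, C: 0.225) give about 0.1 under this assumption.
flat = dispersion_coefficient([0.25, 0.25, 0.25, 0.25])
fixed = dispersion_coefficient([0.275, 0.225, 0.225, 0.275])
```

Applying `nucleotide_frequencies` to sub-sequences of increasing length and plotting `dispersion_coefficient` against length would give a curve of the kind described for Fig. 2D, under the assumed definition.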

The P-value calculation method is not explicitly detailed in RSAT but is based on two scientific articles (14,15). What should especially be noted is that RSAT uses nucleotide frequencies that are not representative of the nucleotide frequencies of the sequence of interest. Thus, ln-pvalue (natural logarithm of the P-value) and sig (-log(pvalue), representing the significance) are only extrapolated representations of a P-value calculation. We chose the NOS1 promoter sequence because we know that it does not contain a full match of the pattern we are looking for. Therefore, the best outputs can only be patterns with a single mismatch, and the scores (weight) and P-values of the different hits should normally be identical. For an inexperienced scientist not familiar with the discipline or with bioinformatics, it is therefore difficult to interpret the weight, P-value, ln-pval or “sig” provided by RSAT. Furthermore, if we analyze the sequences found nucleotide by nucleotide, we quickly see that the five best results from FiMO and TFinder each have only one mismatch compared to the reference motif. Intriguingly, 4 out of 5 results given by RSAT have two or more mismatches, raising concerns about the calculations performed by RSAT.

The data tables provided by the three web tools for the analysis of the NOS1 promoter region, shown in Fig. 3, also illustrate the specificities of each software. Our Rel Score, which is a normalization of the Score (or Weight), is necessarily between 0 and 1. It therefore makes it easier to understand the similarity between the pattern found and the reference pattern. FiMO provides a q-value, defined as the minimal false discovery rate at which a given motif occurrence is deemed significant. This information, although useful, does not allow an untrained user to understand to what extent the pattern found is similar to the reference pattern. RSAT calculates the weight of a sequence segment (Ws), which relies on the log-ratio between the probability of generating the sequence segment given the matrix and the probability of generating it given the background model. The weight is thus an interesting parameter evaluated by RSAT, but it is much more difficult to understand and to use for eliminating false positive IMs, since its logarithmic nature means it can be negative as well as positive.
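To illustrate why a log-ratio weight can be negative while a normalized score stays between 0 and 1, the following Python sketch computes a weight of the form ln(P(segment | matrix) / P(segment | background)). It is only an illustration of the principle: the pseudocount correction and the function name are assumptions, and the exact formula used by RSAT may differ.

```python
import math


def log_ratio_weight(pwm, kmer, background, pseudo=0.01):
    """ln of P(kmer | matrix) / P(kmer | background), summed over positions.

    The pseudocount keeps the matrix probability above zero; its form is an
    assumption made for this sketch.
    """
    weight = 0.0
    for col, base in zip(pwm, kmer.upper()):
        p_matrix = (col[base] + pseudo * background[base]) / (1 + pseudo)
        weight += math.log(p_matrix / background[base])
    return weight


# Toy PWM with a perfect "GCC" consensus and a uniform background.
toy_pwm = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

w_match = log_ratio_weight(toy_pwm, "GCC", uniform)     # positive
w_mismatch = log_ratio_weight(toy_pwm, "ATT", uniform)  # negative
```

A segment that is more probable under the matrix than under the background gets a positive weight, and a less probable one gets a negative weight, which is why the scale has no fixed bounds.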

TFinder is designed to be an easy-to-use and open-source tool. Depending on their needs, users can retrieve an unlimited number of gene regulatory region sequences for rapid mapping of TF motifs. Moreover, TFinder accepts custom or database sequences. Users can impose an IM/TFBS or choose a JASPAR_ID. TFinder extracts around twelve sequences/min and analyzes them at 80 kbp/sec. The results are provided as a table and an interactive graph. The graph represents the Rel Score according to the position on the sequence or relative to the TSS or gene end. In addition, users can zoom in or select a single population. Hovering over the points displays all the characteristics of the pattern in question. The graph can also be exported as an image. TFinder easily handles the in-silico analysis of different omics data (gene lists extracted from RNA-seq and proteomic studies…) and thus provides a reliable and fast screening of putative targets directly regulated by a new or validated TF. TFinder thereby reduces the number of TF-regulated targets identified from gene batches or RNA-seq that must be experimentally validated as functional by EMSA, promoter activity assays and ChIP approaches. Moreover, ChIP-seq big data may be screened for the identification of putative direct TF target genes. Note that several additional techniques can improve the relevance of the prediction, phylogenetic footprinting being the most common (16). In addition, the Rel Score calculated by TFinder improves the relevance of the results because it reduces the variability of score calculation between different TFs. Moreover, the P-value calculation provides an additional criterion enabling robust IM/TFBS identification (13, 14). The comparison of TFinder with other available software (Table 1) shows that TFinder combines several features in a single tool. PWMTools does not actually allow the extraction of a specific sequence.
This remarkable software makes it possible to screen an entire genome for a specific motif. However, this requires knowledge of data processing, which runs counter to the simplicity and speed of execution that TFinder aims for. RSAT and TFinder both extract the promoter and terminal regions of genes. However, TFinder is intended to be user-friendly and aimed at scientists who want a quick answer. TFinder integrates all the components into a single application in a fluid and interconnected manner. Separate modules are also available to work on a case-by-case basis. To achieve this with RSAT, the user must first find where to extract the sequences on the website and then provide a substantial list of parameters, which can discourage novices not familiar with all the technical terms. FiMO and RSAT run on their own servers, which implies that an analysis may be queued depending on the number of users connected to the website. TFinder does not have this problem because Streamlit provides a single session per user and therefore no queuing. FiMO, RSAT and TFinder are similar in many ways. Although we want TFinder to be a user-friendly and convenient tool for scientists not familiar with bioinformatics, we attach great importance to the relevance of the calculations carried out. Thus, as mentioned previously, the P-value is calculated by generating many random sequences based on nucleotide frequencies. There are two ways to do this: either the nucleotide frequencies are imposed (FiMO), or they are determined from the sequence of interest (RSAT and TFinder). A P-value calculation using fixed nucleotide frequencies remains convenient for sequences longer than 2000 bp, but when nucleotide frequencies representative of the organism studied are needed, it is better to generate a substantial number of random sequences.
Thus, by using nucleotide frequencies based on the sequence of interest, TFinder overcomes the heterogeneity of the genome and avoids overestimating the P-value. Moreover, certain regions of the genome are extremely rich in G and C, which can bias the calculation of the P-value. The way RSAT and TFinder calculate their P-values overcomes this problem. However, it is important to note that RSAT does not use the exact nucleotide frequencies, which reduces the robustness of its P-value calculation. The weight score is also a useful parameter when studying an IM/TFBS, but it is sometimes difficult to interpret. The weight score depends on the PWM used: if the PWM of one TF is longer than that of another TF, the score will be drastically higher for the longer PWM. It is therefore difficult to compare these two TFs with each other. This problem can be overcome in two ways. The first is the P-value calculation (9, 17), which gives a better intuition of the risk associated with each prediction. The second is simply to normalize the weight score obtained, as TFinder and JASPAR do.

Several studies have shown that transcription factors can attach to a pattern in one direction or the other (18). One of the assets that distinguishes TFinder from most software programs is therefore that it can search for TFBS motifs in both the 5' to 3' and the 3' to 5' directions. Importantly, this specificity of TFinder can be used in other applications such as restriction enzyme site searching, bipartite TFBS finding or even PCR primer verification.
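The four written forms of a motif can be derived with a few string operations, as in the following Python sketch. The function names and the labels given to each form are assumptions chosen for illustration; the search shown here is an exact string match on a single strand, not the PWM-based scan used by TFinder.

```python
# IUPAC complement table (A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W, N unchanged).
COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def written_forms(motif):
    """Return the four written forms of a motif given in IUPAC code."""
    m = motif.upper()
    comp = m.translate(COMPLEMENT)
    return {
        "original": m,
        "reverse": m[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def find_forms(sequence, motif):
    """List (position, form name) for every exact occurrence of any form."""
    seq = sequence.upper()
    hits = []
    for name, form in written_forms(motif).items():
        start = seq.find(form)
        while start != -1:
            hits.append((start, name))
            start = seq.find(form, start + 1)  # allow overlapping matches
    return sorted(hits)


forms = written_forms("GCCGGAG")
# Made-up sequence containing the motif and its reverse complement.
example = "AA" + "GCCGGAG" + "TT" + "CTCCGGC" + "AA"
hits = find_forms(example, "GCCGGAG")
```

Searching the reverse complement on the given strand is equivalent to searching the original motif on the opposite strand, whereas the reverse and complement forms correspond to the 3' to 5' reading direction discussed above.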

As far as databases are concerned, TFinder uses the JASPAR database. We have focused exclusively on JASPAR because it is a rich database, regularly updated and easily accessible. The PWMTools, FiMO and RSAT databases are very rich, but they are stored directly on their servers and are not systematically updated with the latest versions.

Finally, we believe that the accessibility of software requires a clear, fun and sufficiently structured interface that guides users easily. These characteristics also apply to the display of the results. Thus, TFinder provides the results not only as tables (structured Excel and CSV tables) but also as an interactive graph. Importantly, TFinder allows export by email, facilitating data sharing and storage. FiMO and RSAT offer GFF/GFF3 exports, which are complicated to use for scientists not familiar with bioinformatics.

TFinder offers a user-friendly web tool that enables a broad scientific community, from the novice to the most expert, to easily perform in-silico analysis of transcription factor potential binding sites on regulatory regions from multiple species and, consequently, accelerate OMICS screening and investigation of transcription factor binding properties.

ChIP	Chromatin Immunoprecipitation
EMSA	Electrophoretic Mobility Shift Assay
IM	Individual Motif
PWM	Position Weight Matrix
Rel Position	Relative Position
Rel Score	Relative Score
TF	Transcription Factor
TFBS	Transcription Factor Binding Site
TSS	Transcription Start Site

Availability and requirements:

Project name: TFinder

Project home page: https://github.com/Jumitti/TFinder

Operating system(s): Web https://tfinder-ipmc.streamlit.app/

Programming language: Python

Other requirements: None

License: MIT

Ethics approval and consent to participate: not applicable

Consent for publication: not applicable

Availability of data and materials: the source code of TFinder is publicly available at the following GitHub repository: https://github.com/Jumitti/TFinder

Competing interests: the authors declare no conflict of interest.

Funding: This research was supported by the Agence Nationale de la Recherche (ANR) under project ANR-20-CE16-0008 and the Labex Distalz.

Authors' contributions: Conceptualization: all authors; Investigation: JM; Funding acquisition: CAC, FC; Project administration: CAC, ED; Supervision: CAC, ED; Writing-original draft: JM; Writing-review and editing: all authors.

Acknowledgements: We would like to thank Drs. Romain Gautier, Kevin Lebrigand, and Marin Truchi for critical reading of this paper and Ms. Pauline Minniti for artwork.

1. Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics. Dec 2016;17(1):547.
2. Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Berhanu Lemma R, Turchi L, Blanc-Mathieu R, et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Research. 7 Jan 2022;50(D1):D165‑73.
3. Matys V. TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes. Nucleic Acids Research. 1 Jan 2006;34(90001):D108‑10.
4. Farre D. Identification of patterns in biological sequences at the ALGGEN server: PROMO and MALGEN. Nucleic Acids Research. 1 Jul 2003;31(13):3651‑3.
5. Tsunoda T, Takagi T. Estimating transcription factor bindability on DNA. Bioinformatics. 1 Jul 1999;15(7):622‑30.
6. Ambrosini G, Groux R, Bucher P. PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix. Hancock J, editor. Bioinformatics. 15 Jul 2018;34(14):2483‑4.
7. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 1 Apr 2011;27(7):1017‑8.
8. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME Suite. Nucleic Acids Res. 1 Jul 2015;43(W1):W39‑49.
9. Turatsinze JV, Thomas-Chollier M, Defrance M, Van Helden J. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc. Oct 2008;3(10):1578‑88.
10. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, et al. Database resources of the national center for biotechnology information. Nucleic Acids Research. 7 Jan 2022;50(D1):D20‑6.
11. Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Valencia A, editor. Bioinformatics. 1 Apr 2020;36(7):2272‑4.
12. Khan A, Mathelier A. JASPAR RESTful API: accessing JASPAR data from any programming language. Wren J, editor. Bioinformatics. 1 May 2018;34(9):1612‑4.
13. Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Research. 4 Jan 2018;46(D1):D260‑6.
14. Staden R. Methods for calculating the probabilities of finding patterns in sequences. Bioinformatics. 1989;5(2):89‑96.
15. Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1 Jan 1998;14(1):48‑54.
16. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. Apr 2004;5(4):276‑87.
17. Touzet H, Varré JS. Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol. Dec 2007;2(1):15.
18. Lis M, Walther D. The orientation of transcription factor binding site motifs in gene promoter regions: does it matter? BMC Genomics. Dec 2016;17(1):185.

Table 1 Comparison of the features given by TFinder to the ones given by existing software.

IM: Individual Motif; TFBS: Transcription Factor Binding Site; TF: Transcription Factor; TSS: Transcription Start Site; PWM: Position Weight Matrix; P: analysis done on the entire genome
