TFinder is written in Python 3. The source code is available at https://github.com/Jumitti/TFinder, and TFinder can be accessed as a web application implemented in Streamlit at https://tfinder-ipmc.streamlit.app. More details are available in Fig. 2 and in the “Resources” tab of TFinder.
TFinder makes it easy to recover the regulatory sequences of genes using gene names, ENTREZ_GENE_IDs or transcript variants identified by their XM, NM, XR or NR accession codes, thanks to the NCBI API (10). By configuring the species, the region of a gene and the number of upstream/downstream base pairs, TFinder extracts an unlimited number of sequences in FASTA format in the 5′ to 3′ direction of the gene. The coordinates of the Transcription Start Site (TSS) and of the gene end are found in GenBank under the ACCESSION category and refer to the chromosomal location. We use these coordinates to directly extract the surrounding region via the NCBI API. TFinder displays the sequences with the corresponding chromosomal coordinates (see “Resources” on TFinder for more details). It is also possible to paste any nucleotide sequence recovered from other databases.
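The coordinate arithmetic behind such an extraction can be sketched as follows. This is only an illustration under stated assumptions, not TFinder's actual code: the function names `extraction_window` and `reverse_complement` are hypothetical, coordinates are treated as 1-based chromosomal positions, and both boundary positions are assumed to be included. For a gene on the minus strand, the 5′ coordinate is numerically larger than the 3′ coordinate, and the sequence must be reverse-complemented to be read in the 5′ to 3′ direction of the gene.

```python
def extraction_window(tss, strand, upstream, downstream):
    """Return (five_prime, three_prime) chromosomal coordinates around a TSS.

    Hypothetical helper: `tss` is a chromosomal coordinate, `strand` is "+" or "-",
    `upstream` and `downstream` are numbers of base pairs relative to the TSS.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if upstream < 0 or downstream < 0:
        raise ValueError("upstream and downstream must be non-negative")
    if strand == "+":
        # Upstream lies at smaller coordinates on the plus strand.
        return tss - upstream, tss + downstream
    # On the minus strand, upstream lies at larger coordinates.
    return tss + upstream, tss - downstream


def reverse_complement(sequence):
    """Reverse-complement a DNA sequence so a minus-strand gene reads 5' to 3'."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return sequence.translate(table)[::-1]


if __name__ == "__main__":
    print(extraction_window(1000, "+", 200, 50))
    print(extraction_window(1000, "-", 200, 50))
    print(reverse_complement("ATGC"))
```

The same helper could be applied to the gene-end coordinate instead of the TSS, since only the reference position changes.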
The IM/TFBS used in the query can be a custom IM/TFBS not listed in the databases. TFinder supports the IUPAC code. It uses a PWM to reduce calculation time and generates a Weblogo with LogoMaker (11). Similarly, several TFBS sequences, experimentally validated or not, can be used to generate a PWM that then serves as the IM/TFBS. Importantly, TFinder accepts JASPAR IDs thanks to the JASPAR API (10.11). When the gene region extraction tool is used, TFinder indicates the position of the IM found relative to the TSS or relative to the end of the gene (Rel Position). By default, it uses the start of the nucleotide sequence as the reference position (Position). For each k-mer (a nucleotide sequence of the same length as the PWM), a score is calculated by summing the PWM frequencies corresponding to the nucleotide (A, T, C or G) found at each position. To refine the k-mer score, the Relative Score (Rel Score) is calculated as follows:
$$\text{Relative Score}=\frac{\text{Score k-mer}-\text{Min Score PWM}}{\text{Max Score PWM}-\text{Min Score PWM}}$$
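As an illustration of this calculation, the following minimal sketch scans a sequence with a frequency matrix and computes the Rel Score of every k-mer. It is not TFinder's implementation: the function name `relative_scores`, the row order A, C, G, T, the 0-based positions and the toy matrix are all assumptions made for the example. The minimum and maximum scores of the PWM are taken as the sums of the per-column minima and maxima.

```python
import numpy as np

BASES = "ACGT"  # assumed row order of the matrix


def relative_scores(sequence, pwm):
    """Return a list of (position, kmer, rel_score) for each k-mer of `sequence`.

    `pwm` has shape (4, k): one row per base (A, C, G, T), one column per
    motif position, holding nucleotide frequencies.
    """
    pwm = np.asarray(pwm, dtype=float)
    k = pwm.shape[1]
    min_score = pwm.min(axis=0).sum()  # worst possible k-mer score
    max_score = pwm.max(axis=0).sum()  # best possible k-mer score
    if max_score == min_score:
        raise ValueError("uninformative matrix: max and min scores are equal")
    index = {base: row for row, base in enumerate(BASES)}
    results = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k].upper()
        if any(base not in index for base in kmer):
            continue  # skip k-mers containing N or other ambiguous symbols
        score = sum(pwm[index[base], col] for col, base in enumerate(kmer))
        rel = (score - min_score) / (max_score - min_score)
        results.append((start, kmer, float(rel)))
    return results


if __name__ == "__main__":
    # Toy 3-column matrix whose consensus is ACG (illustrative values only).
    toy_pwm = [
        [0.80, 0.10, 0.10],  # A
        [0.10, 0.70, 0.10],  # C
        [0.05, 0.10, 0.70],  # G
        [0.05, 0.10, 0.10],  # T
    ]
    for position, kmer, rel in relative_scores("TTACGA", toy_pwm):
        print(position, kmer, round(rel, 3))
```

With this toy matrix, the consensus k-mer ACG reaches a Rel Score of 1, while a k-mer built from the least frequent base at every column reaches 0.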
The Rel Score uses the maximum and the minimum scores of the reference PWM. The Rel Score measures the similarity between the IM identified by TFinder and the reference IM. Thus, the closer the Rel Score is to 1, the more likely a true positive IM is present in the analyzed sequence. TFinder employs an automatic Rel Score Threshold to filter out less relevant results but allows users to tailor this parameter according to their preferences. TFinder also provides a statistical analysis of the Rel Score by calculating a P-value, according to the following formula:
$$P\text{-value}=\frac{\text{Nb}\left(\text{Rel Score random k-mer}\ge \text{Rel Score IM/TFBS}\right)}{\text{Nb random k-mer}}$$
For this purpose, one million random k-mer sequences are generated according to the proportion of A, T, G and C in the analyzed sequence. TFinder also allows the nucleotide frequencies to be set manually (A: 0.275, T: 0.275, G: 0.225, C: 0.225). Because the Streamlit cloud server has limited resources, the use of sequence-specific nucleotide frequencies for the calculation of the P-value is limited to a maximum of ten sequences. For each randomly generated k-mer, the Rel Score is calculated as described above. The P-value represents the probability that a random sequence has a Rel Score equal to or greater than the Rel Score of the IM/TFBS found.
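The empirical P-value described above can be sketched as a simple Monte Carlo estimate. The code below is a hedged illustration, not TFinder's implementation: the function name `empirical_p_value`, the A, C, G, T row order, the fixed random seed and the small numerical tolerance are assumptions introduced for the example.

```python
import numpy as np


def empirical_p_value(observed_rel_score, pwm, base_freqs,
                      n_random=1_000_000, seed=0):
    """Estimate the fraction of random k-mers scoring at least `observed_rel_score`.

    `pwm` has shape (4, k) with rows ordered A, C, G, T (assumed convention).
    `base_freqs` maps each base to its background proportion.
    """
    pwm = np.asarray(pwm, dtype=float)
    k = pwm.shape[1]
    probs = np.array([base_freqs[base] for base in "ACGT"], dtype=float)
    probs = probs / probs.sum()  # normalise in case of rounding

    rng = np.random.default_rng(seed)
    # Each random k-mer is a row of k base indices (0=A, 1=C, 2=G, 3=T).
    codes = rng.choice(4, size=(n_random, k), p=probs)
    # Look up the frequency of each drawn base at its column, then sum per k-mer.
    scores = pwm[codes, np.arange(k)].sum(axis=1)

    min_score = pwm.min(axis=0).sum()
    max_score = pwm.max(axis=0).sum()
    rel = (scores - min_score) / (max_score - min_score)
    # Small tolerance so that floating-point rounding does not exclude ties.
    hits = np.count_nonzero(rel >= observed_rel_score - 1e-9)
    return hits / n_random


if __name__ == "__main__":
    toy_pwm = [
        [0.80, 0.10, 0.10],  # A
        [0.10, 0.70, 0.10],  # C
        [0.05, 0.10, 0.70],  # G
        [0.05, 0.10, 0.10],  # T
    ]
    background = {"A": 0.275, "T": 0.275, "G": 0.225, "C": 0.225}
    print(empirical_p_value(1.0, toy_pwm, background, n_random=100_000))
```

Because the estimate is a ratio of counts, its resolution is bounded by the number of random k-mers: with one million draws, the smallest non-zero P-value that can be reported is 10⁻⁶.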
The results are presented in table and graph form, showing the Rel Score of the IM found (IM found = the line "sequence") as a function of its Position or Rel Position. Finally, the results may be exported to facilitate subsequent data analysis and sharing. The TFinder workflow is summarized in Fig. 1.
The user provides one or multiple gene names or IDs, then TFinder extracts the genes’ regulatory regions using the NCBI API. Five species are available for screening. TFinder then searches for the Individual Motif (IM) of the user’s choice. TFinder accepts TFBS from JASPAR and Position Weight Matrices (PWM). The results are displayed in a table and an interactive graph. The table provides the position relative to the TSS and/or gene end (Rel Position), the Relative Score (Rel Score), the IM found and the P-value. The graph represents the Rel Score as a function of the position. Data can be downloaded and exported by e-mail.