1. About
TRGdb is a database of taxonomically restricted genes (TRGs) in Bacteria. The website allows users to browse and search for TRGs, at the genus and species levels, across different taxonomic units of Bacteria. It provides full taxonomic information on bacterial species as well as the Isolation Index of Organism (IIO), which quantifies how far a given species or genus is separated from its closest bacterial relatives. Finally, the website provides information on every TRG protein sequence, including its properties (e.g., level of disorder, sequence complexity, and aggregation propensity).
2. Methods
2.1. Sequence data and taxonomy of Bacteria
We obtained sequence data, taxonomic classification, and a phylogenetic tree of Bacteria from the Genome Taxonomy Database (GTDB) 08-RS214 (28th April 2023). These data include 80,789 representative genomes, one per species. We chose GTDB because it provides high-quality data on bacterial taxonomy and carefully selects the best representative genome for each species. The representative genomes in GTDB have the highest assembly quality, the lowest sequence contamination, and the most complete set of genes.
2.2. TRG identification
TRG identification was carried out in three steps:
- We used DIAMOND v2.0.15 to perform an all-versus-all comparison of protein sequences from 80,789 bacterial species (n = 247,617,414 proteins). We then removed any query protein that had homologous sequences (E-value ≤ 10⁻³) in bacterial species outside the genus of the query species. The remaining sequences were classified as candidate TRGs at the genus level.
- Next, we verified the candidate TRGs using BLAST+ v2.13.1. Specifically, the candidate TRG sequences that did not show significant similarity (E-value ≤ 10⁻³) to any sequence outside the query genus were identified as genus-specific TRGs.
- Finally, we extracted a list of species-specific genes from the list of genus-specific genes. Species-specific genes were defined as those that have no homologs outside the query species and whose genus comprises at least two species according to the GTDB taxonomy.
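The filtering logic of the three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used to build TRGdb: it collapses the DIAMOND search and the BLAST+ verification into a single pass over one list of hits, and the function name, the input structures (`hits`, `protein_species`, `species_genus`), and the toy data are all hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3  # significance threshold stated in the text


def classify_trgs(hits, protein_species, species_genus):
    """Return (genus_specific, species_specific) sets of protein IDs.

    hits            -- iterable of (query_id, subject_id, evalue) tuples
                       (hypothetical, e.g. parsed from tabular search output)
    protein_species -- dict: protein ID -> species name
    species_genus   -- dict: species name -> genus name
    """
    # Collect the species of each genus; needed for the species-level rule.
    species_per_genus = defaultdict(set)
    for species, genus in species_genus.items():
        species_per_genus[genus].add(species)

    outside_species = set()  # queries with a significant hit in another species
    outside_genus = set()    # queries with a significant hit in another genus
    for query, subject, evalue in hits:
        if evalue > E_VALUE_CUTOFF:
            continue  # not a significant hit
        q_species = protein_species[query]
        s_species = protein_species[subject]
        if s_species != q_species:
            outside_species.add(query)
            if species_genus[s_species] != species_genus[q_species]:
                outside_genus.add(query)

    # Genus-specific: no significant hit outside the query genus.
    genus_specific = set(protein_species) - outside_genus
    # Species-specific: additionally no hit outside the query species,
    # and the genus must contain at least two species.
    species_specific = {
        p for p in genus_specific
        if p not in outside_species
        and len(species_per_genus[species_genus[protein_species[p]]]) >= 2
    }
    return genus_specific, species_specific


# Toy data (entirely made up for illustration).
protein_species = {"p1": "A a", "p2": "A a", "p3": "A b",
                   "p4": "B c", "p5": "A b", "p6": "B c"}
species_genus = {"A a": "A", "A b": "A", "B c": "B"}
hits = [("p1", "p3", 1e-10), ("p3", "p1", 1e-10),
        ("p2", "p4", 1e-5), ("p4", "p2", 1e-5),
        ("p4", "p1", 0.5)]  # last hit is above the cutoff and is ignored

genus_specific, species_specific = classify_trgs(hits, protein_species, species_genus)
print(sorted(genus_specific), sorted(species_specific))
```

In the toy data, p6 has no homologs at all but is not counted as species-specific because its genus contains only one species.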
2.3. Clustering TRG proteins
We grouped TRG proteins into clusters using MMseqs2 v14-7e284 with the cluster option. We set the maximum E-value to 10⁻³ and used the bidirectional coverage mode (--cov-mode 0) with a minimum coverage of 80% (-c 0.8).
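For readers who want to reproduce a comparable clustering, the parameters above translate into an MMseqs2 command line roughly as follows. This sketch only assembles and prints the command; the database, output, and temporary directory names are placeholders, and it assumes the protein sequences have already been converted to an MMseqs2 database (e.g., with `mmseqs createdb`).

```python
import shlex


def mmseqs_cluster_command(seq_db, out_db, tmp_dir,
                           evalue=1e-3, cov_mode=0, min_cov=0.8):
    """Build the argument list for `mmseqs cluster` with the stated settings.

    seq_db, out_db and tmp_dir are placeholder names chosen by the caller.
    """
    return [
        "mmseqs", "cluster", seq_db, out_db, tmp_dir,
        "-e", str(evalue),            # maximum E-value
        "--cov-mode", str(cov_mode),  # 0 = coverage of both query and target
        "-c", str(min_cov),           # minimum coverage fraction
    ]


cmd = mmseqs_cluster_command("trg_db", "trg_clusters", "tmp")
print(shlex.join(cmd))
# To actually run it (requires MMseqs2 on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```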
2.4. Isolation Index of Organism
To determine the degree of phylogenetic isolation of individual bacterial taxa, we calculated a measure called the Isolation Index of Organism (IIO). The IIO measures how far a given species or genus is from its nearest neighboring species or genus in a phylogenetic tree. The greater the IIO value, the more evolutionarily distant the genus or species is from its closest relatives. The IIO was calculated from the phylogenetic tree of Bacteria in the GTDB database.
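The text does not give the exact IIO formula, so the following is only one plausible reading, offered as an illustration: the smallest patristic (branch-length) distance between any leaf of the taxon and any leaf outside it. The tree encoding (a `parent` dictionary), the function names, and the toy tree are assumptions made for this sketch and are not taken from TRGdb.

```python
def path_to_root(node, parent):
    """Map each ancestor of `node` (including itself) to its distance from `node`."""
    dist = 0.0
    path = {node: 0.0}
    while node in parent:
        node, length = parent[node]
        dist += length
        path[node] = dist
    return path


def patristic_distance(a, b, parent):
    """Sum of branch lengths on the path between leaves a and b."""
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    # With non-negative branch lengths, the lowest common ancestor
    # gives the smallest summed distance among shared ancestors.
    return min(pa[n] + pb[n] for n in pa if n in pb)


def isolation_sketch(taxon_leaves, all_leaves, parent):
    """Assumed isolation measure: distance to the nearest leaf outside the taxon."""
    others = [leaf for leaf in all_leaves if leaf not in taxon_leaves]
    return min(patristic_distance(a, b, parent)
               for a in taxon_leaves for b in others)


# Toy tree (made up): child -> (parent, branch length).
parent = {
    "n1": ("root", 0.5), "n2": ("root", 0.2),
    "s1": ("n1", 0.1), "s2": ("n1", 0.3), "s3": ("n2", 0.4),
}
leaves = ["s1", "s2", "s3"]

genus_value = isolation_sketch({"s1", "s2"}, leaves, parent)  # genus {s1, s2}
species_value = isolation_sketch({"s1"}, leaves, parent)      # species s1
print(round(genus_value, 3), round(species_value, 3))
```

Under this reading, a genus whose members all sit on a long branch away from the rest of the tree receives a large value, which matches the qualitative description above.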
2.5. Protein sequence properties
- Disorder. We calculated the disorder parameter to assess how much of a TRG protein consists of intrinsically disordered regions. For this purpose, we used the IUPred2 tool, which assigns each amino acid of the protein a score between 0 and 1 indicating how likely the residue is to be part of a disordered region. The number of residues with a disorder score above the threshold of 0.5 was divided by the sequence length to give a disorder parameter for each protein in the datasets.
- Aggregation. We calculated the average aggregation propensity of TRG protein sequences to assess the extent of their aggregation-prone regions. We used the statistical mechanics algorithm TANGO v. 2.3.1 and expressed the result as the frequency of potential aggregating segments (hexapeptides with an aggregation score above 5%) over all amino acid residues.
- Protein complexity. We calculated the Shannon entropy of TRG protein sequences to assess their sequence complexity, using the SciPy Python package.
3. RESTful (web) API
The TRGdb RESTful API allows users to query the database using a clean URL syntax.
4. Citation
If you find TRGdb useful in your work, please cite: