SwissTree (swisstree)

Dataset description

The reference dataset consists of protein sequences collected from SwissTree release 2017.9. SwissTree is a collection of large, high-confidence gene family phylogenies that pose different types of challenges for sequence comparison and include species from all domains.

The dataset is a mixture of sequences from 11 SwissTree gene families.

#    Gene family short name    Gene family name                                 Number of sequences
1    ST001                     Popeye domain-containing protein family          49
2    ST002                     NOX 'ancestral-type' subfamily NADPH oxidases    54
3    ST003                     V-type ATPase beta subunit                       49
4    ST004                     Serine incorporator family                       115
5    ST005                     SUMF family                                      29
6    ST007                     Ribosomal protein S10/S20                        60
7    ST008                     Bambi family                                     42
8    ST009                     Asterix family                                   39
9    ST010                     Cited family                                     34
10   ST011                     Glycosyl hydrolase 14 family                     159
11   ST012                     Ant transformer protein                          21

Dataset file structure

The dataset has one directory containing 651 FASTA files:

├── ST001_001.fasta
├── ST001_002.fasta
├── ST001_003.fasta
├── ST001_004.fasta
├── ST001_005.fasta
├── ST001_006.fasta
├── ST001_007.fasta
├── ST001_008.fasta
├── ST001_009.fasta
├── ST001_010.fasta
├── ...

Benchmark Protocol

The test evaluates the accuracy of alignment-free distance measures in the phylogenetic reconstruction of gene families.

Specifically, the benchmark procedure takes as input the user's file containing the distances between all sequence pairs present in the dataset. Only distances between protein pairs from the same family are extracted.
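The extraction step can be illustrated with a minimal Python sketch. It is not the server's code. It assumes that sequence identifiers follow the file-name pattern shown above (for example `ST001_001`), so that the family is the part before the underscore; the helper names `family_of` and `within_family_distances` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def family_of(seq_id):
    # Assumption: identifiers look like "ST001_001"; the family is the prefix.
    return seq_id.split("_", 1)[0]


def within_family_distances(handle):
    """Group tab-separated pairwise distances by family, dropping cross-family pairs."""
    by_family = defaultdict(dict)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        id_a, id_b, dist = row[0], row[1], float(row[2])
        if family_of(id_a) == family_of(id_b):
            by_family[family_of(id_a)][(id_a, id_b)] = dist
    return dict(by_family)


# Toy input: two within-family pairs and one cross-family pair.
example = io.StringIO(
    "ST001_001\tST001_002\t0.25\n"
    "ST001_001\tST002_001\t0.90\n"
    "ST002_001\tST002_002\t0.40\n"
)
grouped = within_family_distances(example)
```

In this toy input the pair `ST001_001`/`ST002_001` spans two families and is therefore discarded.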

For each family, the distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of the method tree, we compute the Robinson-Foulds distance between the tree produced by that method (the "test tree") and the corresponding reference tree in SwissTree, using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69).
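To make the Robinson-Foulds distance concrete, here is a small Python sketch that counts the bipartitions present in one unrooted topology but not in the other. It is an illustration of the definition only and does not reproduce ftreedist. Trees are represented as nested tuples of leaf names, which is a simplification chosen for this example, and all function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, each stored as the side
    that does not contain a fixed anchor leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_leaves - clade if anchor in clade else clade
        # Trivial splits (a single leaf against the rest) are ignored.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


reference = (("A", "B"), "C", ("D", "E"))
test_tree = (("A", "C"), "B", ("D", "E"))
rf = robinson_foulds(reference, test_tree)
```

Here the two trees share the split DE|ABC but disagree on the other one, so two bipartitions are unmatched.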

To facilitate comparison of results across sequence sets (and hence trees) of different sizes N, we normalise this distance by the maximum possible distance between two unrooted trees, 2(N − 3). We denote this normalised Robinson-Foulds distance as nRF; it takes a value from 0 to 1 that can be interpreted as the proportion of false or missing bipartitions in the test-tree topology compared to the reference topology. When nRF = 0, the test and reference topologies are identical, implying high accuracy for the method. Conversely, at nRF = 1, no bipartition in the reference is recovered. The overall assessment of method accuracy is the weighted average of nRFs across all 11 gene families.
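The normalisation can be written out as a short Python sketch. The text does not state which weights enter the weighted average, so the example below weights each family by its number of sequences purely as an assumption; the RF values are made up for illustration, and the function names are hypothetical.

```python
def normalized_rf(rf, n_leaves):
    """Divide an RF distance by its maximum for unrooted trees, 2(N - 3)."""
    if n_leaves < 4:
        raise ValueError("nRF is undefined for fewer than 4 leaves")
    return rf / (2 * (n_leaves - 3))


def weighted_mean_nrf(results):
    """Weighted average over (nrf, weight) pairs."""
    results = list(results)
    total_weight = sum(weight for _, weight in results)
    return sum(nrf * weight for nrf, weight in results) / total_weight


# Illustrative RF values only; family sizes are taken from the table above.
nrf_st001 = normalized_rf(46, 49)  # maximum is 2 * (49 - 3) = 92
nrf_st012 = normalized_rf(9, 21)   # maximum is 2 * (21 - 3) = 36

# Assumption: weights are the family sizes (not confirmed by the text).
overall = weighted_mean_nrf([(nrf_st001, 49), (nrf_st012, 21)])
```

With these made-up inputs the two families score 0.5 and 0.25, and the size-weighted average is 0.425.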

Testing your own method

Download the dataset file from the server and unzip it.
Use the unzipped FASTA files as input to your method and calculate either the distances between every pair of sequences or a phylogenetic tree (see File Formats below).
Save the results to a text file.
Submit your predictions to the web server.
The web service benchmarks the uploaded results and presents a report with the submitted method's performance and comparison to other available methods. Additionally, you can choose to make the report publicly available.
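The distance-calculation and saving steps can be sketched in Python as follows. The distance used here, a Jaccard distance on k-mer sets, is only a stand-in for your own method, and the identifiers and sequences are toy data. Loading the FASTA files is omitted; the sketch starts from a dictionary mapping identifiers to sequences.

```python
import itertools


def kmer_set(seq, k=2):
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=2):
    """Placeholder alignment-free distance: 1 - |A & B| / |A | B| on k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_tsv(sequences, k=2):
    """Format every unordered pair as 'id_a<TAB>id_b<TAB>distance'."""
    lines = []
    pairs = itertools.combinations(sorted(sequences.items()), 2)
    for (id_a, seq_a), (id_b, seq_b) in pairs:
        dist = jaccard_distance(seq_a, seq_b, k)
        lines.append(f"{id_a}\t{id_b}\t{dist:.3f}")
    return "\n".join(lines) + "\n"


# Toy sequences, not taken from the dataset.
toy = {"ST001_001": "MKVLA", "ST001_002": "MKVLG", "ST001_003": "GGGGG"}
tsv_text = pairwise_tsv(toy)
```

The resulting text can be written to a file with `open(path, "w")` and uploaded as the results file.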

Download dataset

File name:

File size: 298.2 KB

MD5sum: 813d5f7f9c50fdd57001592c9da9e5f1
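After downloading, the checksum can be verified with a few lines of Python. This is a minimal sketch; the path passed to `md5_of_file` is whatever name the downloaded archive has on your system.

```python
import hashlib

# MD5 sum listed above for the dataset archive.
EXPECTED_MD5 = "813d5f7f9c50fdd57001592c9da9e5f1"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path):
    """True if the file's MD5 sum matches the published value."""
    return md5_of_file(path) == EXPECTED_MD5
```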

File formats supported in benchmark

The benchmark accepts results in one of the following file formats:

Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value of this comparison.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663

Square distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
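Rows like those in the example above can be produced from a dictionary of pairwise distances with the following Python sketch. The name column width of 9 characters mirrors the example and is an assumption about layout. Standard PHYLIP distance files also begin with a line giving the number of sequences; the example shows no such line, and whether the server expects one is not stated here, so the sketch emits only the rows.

```python
def square_matrix_rows(ids, pair_dist, name_width=9):
    """Build the rows of a square distance matrix from unordered pair distances."""

    def lookup(a, b):
        if a == b:
            return 0.0
        # Pairs may be stored in either order.
        return pair_dist[(a, b)] if (a, b) in pair_dist else pair_dist[(b, a)]

    rows = []
    for a in ids:
        values = " ".join(f"{lookup(a, b):.3f}" for b in ids)
        rows.append(f"{a.ljust(name_width)}{values}")
    return rows


# Distances from the 4-sequence matrix example above.
pair_dist = {
    ("A", "B"): 8.876, ("A", "C"): 6.120, ("A", "D"): 9.321,
    ("B", "C"): 2.231, ("B", "D"): 3.983, ("C", "D"): 0.663,
}
square_rows = square_matrix_rows(["A", "B", "C", "D"], pair_dist)
```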

Lower-triangle distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
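A square matrix can be reduced to this lower-triangle layout with a short Python sketch. It mirrors the example above, which starts at the second sequence; the 9-character name column is again an assumption taken from the example, and the function name is hypothetical.

```python
def lower_triangle_rows(ids, matrix, name_width=9):
    """Emit one row per sequence from the second onward, listing its
    distances to all earlier sequences."""
    rows = []
    for i in range(1, len(ids)):
        values = " ".join(f"{matrix[i][j]:.3f}" for j in range(i))
        rows.append(f"{ids[i].ljust(name_width)}{values}")
    return rows


ids = ["A", "B", "C", "D"]
matrix = [
    [0.000, 8.876, 6.120, 9.321],
    [8.876, 0.000, 2.231, 3.983],
    [6.120, 2.231, 0.000, 0.663],
    [9.321, 3.983, 0.663, 0.000],
]
lower_rows = lower_triangle_rows(ids, matrix)
```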