E. coli strains (ecoli)

Dataset description

The data set comprises the assembled genomes of 29 E. coli/Shigella strains. It was originally compiled by Yin and Jin (2013) and has since been used by other groups to evaluate alignment-free (AF) tools.

Dataset file structure

The dataset consists of one directory containing 29 FASTA files:

assembled-ecoli
├── 536.fasta
├── APEC01.fasta
├── ATCC8739.fasta
├── B18BS512.fasta
├── B4Sb227.fasta
├── BW2952.fasta
├── CB9615.fasta
├── CFT073.fasta
├── D1Sd197.fasta
├── DH10B.fasta
├── E234869.fasta
├── E24377A.fasta
├── ED1a.fasta
├── EDL933.fasta
├── F2a2457T.fasta
├── F2a301.fasta
├── F5b8401.fasta
├── HS.fasta
├── IAI1.fasta
├── IAI39.fasta
├── MG1655.fasta
├── S88.fasta
├── Sakai.fasta
├── SE11.fasta
├── SMS35.fasta
├── SSSs046.fasta
├── UMN026.fasta
├── UTI89.fasta
└── W3110.fasta

Benchmark Protocol

The test evaluates the accuracy of alignment-free distance measures in reconstructing a species phylogeny from whole-genome sequences.

Specifically, the benchmark procedure takes as input a user's file containing either all-versus-all distances or a phylogenetic tree in Newick format. Distances are passed to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of the method tree (the “test tree”), the Robinson-Foulds distance between it and the corresponding species tree is computed using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69).
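
The Robinson-Foulds distance counts the bipartitions (splits of the leaf set induced by internal edges) that occur in exactly one of two trees. The sketch below only illustrates the metric on toy trees; it is not the server's implementation, which relies on ftreedist. The nested-tuple tree representation and the function names are assumptions made for this example, and both trees are assumed to share the same leaf set.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    everything = leaf_set(tree)
    reference = min(everything)  # fixed leaf used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(below) < len(everything) - 1:
            found.add(everything - below if reference in below else below)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical toy trees; the first mirrors the topology (B,(C,D),A);
test_tree = ("B", ("C", "D"), "A")
species_tree = ("B", ("A", "D"), "C")
print(robinson_foulds(test_tree, species_tree))
```

A distance of 0 means the two topologies agree on every internal split; larger values indicate more disagreement.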

Testing your own method

Download the dataset file assembled-ecoli.zip from the server and unzip it.
Use the unzipped FASTA files as input to your method and calculate either the distances between every pair of sequences or a phylogenetic tree (see File formats supported in benchmark below).
Save the results to a text file.
Submit your predictions to the web server.
The web service benchmarks the uploaded results and presents a report with the submitted method's performance and a comparison with other available methods. Additionally, you can choose to make the report publicly available.
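
The steps above can be sketched as follows. This is a minimal illustration, not a recommended method: the Jaccard distance on k-mer sets is a stand-in for your own measure, the function names are hypothetical, and using the FASTA file name without its extension (for example MG1655) as the sequence identifier is an assumption. For simplicity, all records in a file are concatenated into one string.

```python
from itertools import combinations
from pathlib import Path


def read_fasta(path):
    """Join all sequence lines of a FASTA file into one uppercase string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


def load_directory(directory):
    """Map each file stem (assumed identifier) to its sequence."""
    return {p.stem: read_fasta(p) for p in sorted(Path(directory).glob("*.fasta"))}


def kmer_set(seq, k=4):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(sequences, k=4):
    """Return (id1, id2, distance) for every unordered pair of sequences."""
    profiles = {name: kmer_set(seq, k) for name, seq in sequences.items()}
    return [
        (x, y, jaccard_distance(profiles[x], profiles[y]))
        for x, y in combinations(sorted(profiles), 2)
    ]
```

With the unzipped data, `pairwise_distances(load_directory("assembled-ecoli"))` would yield one entry for each of the 406 strain pairs; whole genomes call for a much larger k than the default used here.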

Download dataset

File name: assembled-ecoli.zip

File size: 42.1 MB

MD5sum: de88729e76a47c1de7f06a6c59298cb8
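
One way to confirm that the download is intact is to compare its MD5 checksum with the value listed above. The sketch below uses Python's standard hashlib module; the helper names are hypothetical and the file is assumed to sit in the current directory.

```python
import hashlib

EXPECTED_MD5 = "de88729e76a47c1de7f06a6c59298cb8"  # value listed above


def md5_of_chunks(chunks):
    """Hex MD5 digest of an iterable of byte strings."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def is_intact(path, expected=EXPECTED_MD5):
    """True when the file's checksum matches the expected value."""
    return md5_of_file(path) == expected
```

Calling `is_intact("assembled-ecoli.zip")` should return True for an uncorrupted download.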

File formats supported in benchmark

The benchmark accepts a file in any one of the following formats:

Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance between them.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
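
A file in this format could be produced with a few lines of Python. The sketch assumes the distances are already available as (id1, id2, distance) tuples; the function name and the choice of three decimal places are illustrative, not requirements of the benchmark.

```python
def format_tsv(pairs):
    """Render (id1, id2, distance) tuples as tab-separated lines."""
    return "".join(f"{a}\t{b}\t{d:.3f}\n" for a, b, d in pairs)


# Hypothetical distances for three sequences.
text = format_tsv([("A", "B", 8.876), ("A", "C", 6.12), ("B", "C", 5.231)])
print(text, end="")
```

The resulting string can be written to disk with `open("distances.tsv", "w").write(text)`, where the output file name is again only an example.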

Square distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000

Lower-triangle distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
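
Both Phylip layouts can be generated from the same pairwise distances. The sketch below is one possible way to do it, under stated assumptions: the function name and the frozenset-keyed dictionary are hypothetical, names are padded to 10 characters following the classic Phylip convention, and the server is assumed to accept whitespace-separated values as in the examples above.

```python
def phylip_matrix(names, dist, lower=False):
    """Format distances as a square or lower-triangle Phylip matrix.

    names: ordered list of sequence identifiers
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    lines = [f"{len(names):>4}"]  # first line: number of sequences
    for i, a in enumerate(names):
        # Lower-triangle rows list only the sequences that came before.
        others = names[:i] if lower else names
        values = [0.0 if a == b else dist[frozenset((a, b))] for b in others]
        row = a.ljust(10) + " ".join(f"{v:.3f}" for v in values)
        lines.append(row.rstrip())
    return "\n".join(lines) + "\n"


# The distances from the 4-sequence example above.
dist = {
    frozenset(("A", "B")): 8.876, frozenset(("A", "C")): 6.120,
    frozenset(("A", "D")): 9.321, frozenset(("B", "C")): 2.231,
    frozenset(("B", "D")): 3.983, frozenset(("C", "D")): 0.663,
}
print(phylip_matrix(["A", "B", "C", "D"], dist, lower=True), end="")
```

Keying the dictionary by frozenset makes the lookup symmetric, so each pair needs to be stored only once.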

Tree in Newick format

Example of Newick Format (4 sequences)

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
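
Before submitting a tree, it can be worth checking that every sequence appears as a leaf. The sketch below extracts leaf names from a Newick string with a regular expression. It is a simplification that assumes unquoted labels without spaces, and the helper names are hypothetical; that leaf names must match the sequence identifiers is also an assumption.

```python
import re

# A leaf label directly follows "(" or ","; labels of internal nodes
# follow ")" and are therefore not matched.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def newick_leaves(newick):
    """Return leaf names in the order they appear in the Newick string."""
    return LEAF.findall(newick)


def missing_leaves(newick, expected):
    """Sorted list of expected names that do not occur as leaves."""
    return sorted(set(expected) - set(newick_leaves(newick)))


tree = "(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);"
print(newick_leaves(tree))
print(missing_leaves(tree, ["A", "B", "C", "D", "E"]))
```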