E. coli strains (ecoli)

Dataset description

The dataset consists of simulated unassembled reads from 29 E. coli/Shigella strains. The set of genomes was originally compiled by Yin and Jin (2013). Reads were simulated from the complete genome sequences with the ART software, separately for 7 different sequencing coverages: 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, and 5.0.

Dataset file structure

The dataset has 7 directories, one per sequencing coverage. Each directory contains 29 FASTA files, for 203 files in total.

unassembled-ecoli
├── coverage_0.03125
│   ├── 536.fasta
│   ├── APEC01.fasta
│   ├── ATCC8739.fasta
│   ├── B18BS512.fasta
│   ├── B4Sb227.fasta
│   ├── BW2952.fasta
│   ├── CB9615.fasta
│   ├── CFT073.fasta
│   ├── D1Sd197.fasta
│   ├── DH10B.fasta
│   ├── E234869.fasta
│   ├── E24377A.fasta
│   ├── ED1a.fasta
│   ├── EDL933.fasta
│   ├── F2a2457T.fasta
│   ├── F2a301.fasta
│   ├── F5b8401.fasta
│   ├── HS.fasta
│   ├── IAI1.fasta
│   ├── IAI39.fasta
│   ├── MG1655.fasta
│   ├── S88.fasta
│   ├── Sakai.fasta
│   ├── SE11.fasta
│   ├── SMS35.fasta
│   ├── SSSs046.fasta
│   ├── UMN026.fasta
│   ├── UTI89.fasta
│   └── W3110.fasta
├── coverage_0.0625
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.125
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.25
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.5
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_1
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
└── coverage_5
    ├── 536.fasta
    ├── APEC01.fasta
    └── ...
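
After unzipping, the layout described above can be sanity-checked with a short script. The following is a minimal sketch, assuming the directory names start with "coverage_" and the files end in ".fasta" as in the listing; the function name check_layout is hypothetical and not part of the benchmark.

```python
from pathlib import Path

EXPECTED_DIRS = 7            # one directory per sequencing coverage
EXPECTED_FILES_PER_DIR = 29  # one FASTA file per strain


def check_layout(root):
    """Count .fasta files per coverage directory and verify the 7 x 29 layout.

    Returns a dict mapping directory name to its number of FASTA files.
    Raises ValueError if the counts differ from the dataset description.
    """
    root = Path(root)
    counts = {
        d.name: len(list(d.glob("*.fasta")))
        for d in sorted(root.iterdir())
        if d.is_dir() and d.name.startswith("coverage_")
    }
    if len(counts) != EXPECTED_DIRS:
        raise ValueError(f"expected {EXPECTED_DIRS} directories, found {len(counts)}")
    wrong = {name: n for name, n in counts.items() if n != EXPECTED_FILES_PER_DIR}
    if wrong:
        raise ValueError(f"unexpected file counts: {wrong}")
    return counts
```

Calling check_layout("unassembled-ecoli") should return 7 entries whose values sum to 203 if the archive was extracted completely.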

Benchmark Protocol

The test evaluates the accuracy of alignment-free distance measures in reconstructing a species phylogeny from unassembled genome sequences at varying sequencing coverages.

Specifically, the benchmark procedure takes as input 7 user files, each containing either all-versus-all distances or a phylogenetic tree in Newick format. For each file of distances, the distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of each method tree, the Robinson-Foulds (RF) distance between the tree computed using that method (the “test tree”) and the corresponding species tree is computed using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69). The overall assessment of method accuracy is the average of the RF distance values across the 7 trees.
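
To make the scoring step concrete, here is a minimal pure-Python sketch of the Robinson-Foulds distance between two unrooted trees, followed by the average over several tree pairs. It is an illustration only, not the ftreedist implementation: it assumes simple Newick strings without quoted labels, counts the unnormalized symmetric difference of non-trivial splits, and all function names are hypothetical. Whether the server normalizes the values is not stated here.

```python
def leaf_sets(newick):
    """Return (all leaf names, list of leaf sets below each closing parenthesis)."""
    text = newick.strip().rstrip(";")
    clades, stack, leaves = [], [], set()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            stack.append(set())      # start collecting leaves of a new clade
            i += 1
        elif ch == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done    # leaves also belong to the parent clade
            i += 1
            # Skip an optional internal label and branch length after ')'.
            while i < len(text) and text[i] not in ",()":
                i += 1
        elif ch == ",":
            i += 1
        else:
            j = i
            while j < len(text) and text[j] not in ",()":
                j += 1
            name = text[i:j].split(":")[0].strip()  # drop the branch length
            if name:
                stack[-1].add(name)
                leaves.add(name)
            i = j
    return leaves, clades


def splits(newick):
    """Return (leaf names, set of non-trivial splits) for an unrooted tree."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)  # each split is stored as the side without this leaf
    result = set()
    for clade in clades:
        side = clade if ref not in clade else frozenset(leaves - clade)
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(side)
    return leaves, result


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have identical leaf sets")
    return len(splits_a ^ splits_b)


def mean_rf(tree_pairs):
    """Average RF distance over (test tree, species tree) pairs."""
    return sum(rf_distance(t, s) for t, s in tree_pairs) / len(tree_pairs)
```

For the benchmark, tree_pairs would hold 7 pairs, one per submitted file. A lower average means the method trees are closer to the species trees.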

Testing your own method

Download the dataset file unassembled-ecoli.zip from the server and unzip it.
For each of the 7 coverage directories, use its FASTA files as input to your method and calculate either the distances between every pair of strains or a phylogenetic tree (see File Formats below).
Save the results to 7 text files, one per coverage.
Submit your predictions (upload the 7 output files) to the web server.
The web service benchmarks the uploaded results and presents a report with the submitted method's performance and comparison to other available methods. Additionally, you can choose to make the report publicly available.
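
The steps above can be sketched as a small driver that loops over the coverage directories and writes one tab-separated result file per directory. This is only an outline under several assumptions: the distance argument stands for your own method (a callable taking two FASTA paths and returning a number), the file stem (for example "MG1655") is used as the sequence identifier, and the output file names are arbitrary. None of these names are prescribed by the server.

```python
from itertools import combinations
from pathlib import Path


def run_all_coverages(dataset_root, out_dir, distance):
    """Write one pairwise-distance file per coverage directory.

    distance: hypothetical callable (path_a, path_b) -> float, i.e. your method.
    Returns the list of written output paths.
    """
    dataset_root, out_dir = Path(dataset_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for cov_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        fastas = sorted(cov_dir.glob("*.fasta"))
        out_path = out_dir / f"{cov_dir.name}.tsv"  # assumed naming scheme
        with out_path.open("w") as handle:
            # Every unordered pair of files is compared exactly once.
            for a, b in combinations(fastas, 2):
                handle.write(f"{a.stem}\t{b.stem}\t{distance(a, b):.6f}\n")
        written.append(out_path)
    return written
```

With the full dataset this produces 7 files, each holding 29 × 28 / 2 = 406 pairwise distances.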

Download dataset

File name: unassembled-ecoli.zip

File size: 308.6 MB

MD5sum: 22fa806b474670aecc97ca74b7e1c843
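
The MD5 checksum above can be used to confirm that the download completed without corruption. A minimal sketch using Python's standard hashlib module follows; the function name md5_of_file is hypothetical.

```python
import hashlib

# Checksum published for unassembled-ecoli.zip.
EXPECTED_MD5 = "22fa806b474670aecc97ca74b7e1c843"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If md5_of_file("unassembled-ecoli.zip") differs from EXPECTED_MD5, the file should be downloaded again before unzipping.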

File formats supported in benchmark

The benchmark supports the following file formats:

Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value of this comparison.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
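
Before uploading a file in this format, it may help to check that every line has three tab-separated columns and that no pair is missing. The sketch below is an illustration with hypothetical function names; it assumes each unordered pair appears once and is not a description of the server's own validation.

```python
from itertools import combinations


def read_pairwise(text):
    """Parse three-column, tab-separated text into {frozenset({a, b}): distance}."""
    distances = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns")
        a, b, value = fields
        distances[frozenset((a, b))] = float(value)
    return distances


def missing_pairs(distances):
    """List the identifier pairs that have no distance in the parsed data."""
    names = sorted(set().union(*distances)) if distances else []
    return [pair for pair in combinations(names, 2)
            if frozenset(pair) not in distances]
```

For the 4-sequence example above, read_pairwise yields 6 entries and missing_pairs returns an empty list.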

Square Distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
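
A square matrix like the one above can be produced from a list of names and a nested list of distances. The following is a minimal sketch with hypothetical names; the name_width default of 10 follows the classic PHYLIP name field and is an assumption here, since the padding the server accepts is not specified.

```python
def format_phylip_square(names, matrix, name_width=10):
    """Format a symmetric distance matrix as square PHYLIP text.

    names:  list of sequence identifiers.
    matrix: list of rows, matrix[i][j] = distance between names[i] and names[j].
    """
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be n x n for n names")
    for i in range(n):
        for j in range(i):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    lines = [f"   {n}"]  # header line: number of sequences
    for name, row in zip(names, matrix):
        if len(name) >= name_width:
            raise ValueError(f"name too long for the name field: {name}")
        values = " ".join(f"{value:.3f}" for value in row)
        lines.append(name.ljust(name_width) + values)
    return "\n".join(lines) + "\n"
```

Passing the 4 names and the values from the example reproduces the same rows, apart from the amount of padding after each name.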

Lower-triangle Distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
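
The lower-triangle layout stores each distance once: row i lists the distances to the i sequences before it. A minimal sketch of reading such a file back into a full symmetric matrix is shown below; it assumes identifiers contain no whitespace, and the function name is hypothetical.

```python
def read_phylip_lower(text):
    """Parse lower-triangle PHYLIP text into (names, full symmetric matrix)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])  # header line holds the number of sequences
    rows = lines[1:n + 1]
    if len(rows) != n:
        raise ValueError(f"expected {n} rows, found {len(rows)}")
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, line in enumerate(rows):
        fields = line.split()
        names.append(fields[0])
        values = [float(v) for v in fields[1:]]
        if len(values) != i:
            raise ValueError(f"row {i + 1}: expected {i} values, found {len(values)}")
        for j, value in enumerate(values):
            # Mirror each value across the diagonal.
            matrix[i][j] = matrix[j][i] = value
    return names, matrix
```

Applied to the example above, this gives the same values as the square matrix in the previous section, with zeros on the diagonal.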

Tree in Newick format

Example of Newick Format (4 sequences)

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);