The dataset consists of unassembled reads from 29 E. coli/Shigella strains. The set of genomes was originally compiled by Yin and Jin (2013). ART software was used to simulate unassembled reads from the complete genome sequences, separately for 7 different sequencing coverages: 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, 5.0.
The dataset has 7 directories (one per coverage), each containing 29 files, for 203 files in total.
unassembled-ecoli
├── coverage_0.03125
│   ├── 536.fasta
│   ├── APEC01.fasta
│   ├── ATCC8739.fasta
│   ├── B18BS512.fasta
│   ├── B4Sb227.fasta
│   ├── BW2952.fasta
│   ├── CB9615.fasta
│   ├── CFT073.fasta
│   ├── D1Sd197.fasta
│   ├── DH10B.fasta
│   ├── E234869.fasta
│   ├── E24377A.fasta
│   ├── ED1a.fasta
│   ├── EDL933.fasta
│   ├── F2a2457T.fasta
│   ├── F2a301.fasta
│   ├── F5b8401.fasta
│   ├── HS.fasta
│   ├── IAI1.fasta
│   ├── IAI39.fasta
│   ├── MG1655.fasta
│   ├── S88.fasta
│   ├── Sakai.fasta
│   ├── SE11.fasta
│   ├── SMS35.fasta
│   ├── SSSs046.fasta
│   ├── UMN026.fasta
│   ├── UTI89.fasta
│   └── W3110.fasta
├── coverage_0.0625
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.125
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.25
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_0.5
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
├── coverage_1
│   ├── 536.fasta
│   ├── APEC01.fasta
│   └── ...
└── coverage_5
    ├── 536.fasta
    ├── APEC01.fasta
    └── ...
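After unpacking the archive, the layout can be sanity-checked with a few lines of Python. This is only a sketch: the function name `check_layout` is hypothetical, and it assumes the seven directory names exactly as they appear in the listing (including `coverage_1` and `coverage_5`).

```python
from pathlib import Path

# Directory suffixes as they appear in the listing ("1" and "5", not "1.0" and "5.0").
COVERAGES = ["0.03125", "0.0625", "0.125", "0.25", "0.5", "1", "5"]


def check_layout(root, n_strains=29):
    """Count .fasta files per coverage directory; raise if any count differs from n_strains."""
    root = Path(root)
    counts = {}
    for cov in COVERAGES:
        directory = root / f"coverage_{cov}"
        counts[directory.name] = len(list(directory.glob("*.fasta")))
    bad = {name: n for name, n in counts.items() if n != n_strains}
    if bad:
        raise ValueError(f"unexpected file counts: {bad}")
    return counts
```

Calling `sum(check_layout("unassembled-ecoli").values())` on a complete copy should give 203.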
The test evaluates the accuracy of alignment-free distance measures in reconstructing the species phylogeny from unassembled genome sequences at varying sequencing coverages.
Specifically, the benchmark procedure takes as input 7 user files, each containing either all-versus-all distances or a phylogenetic tree in Newick format. For each file with distances, the distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of each method tree, the Robinson-Foulds (RF) distance between that tree (the “test tree”) and the corresponding species tree is computed using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69). The overall assessment of method accuracy is the average of the RF distance values across the 7 trees.
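As a rough illustration of the scoring step, the sketch below computes the Robinson-Foulds distance as the number of bipartitions present in exactly one of two unrooted trees, then averages it over several method trees. It is not the ftreedist implementation. The function names (`leaf_sets`, `splits`, `rf_distance`, `mean_rf`) are hypothetical, the parser assumes plain Newick with unquoted labels, and the use of a single reference tree without normalization is an assumption that the text does not confirm.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, leaf set below each internal node) for a simple Newick string."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", newick))
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            # A leaf token such as "C" or "C:0.90675"; drop the branch length.
            name = tok.split(":")[0]
            leaves.add(name)
            stack[-1].add(name)
        # Tokens directly after ")" are internal labels or branch lengths: ignored.
        prev = tok
    return frozenset(leaves), clades


def splits(newick):
    """Non-trivial bipartitions, each stored as the side that excludes a reference leaf."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)
    result = set()
    for clade in clades:
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Symmetric difference of the two bipartition sets (unnormalized)."""
    return len(splits(tree_a) ^ splits(tree_b))


def mean_rf(method_trees, reference_tree):
    """Average RF distance of several method trees to one reference tree (assumed setup)."""
    return sum(rf_distance(t, reference_tree) for t in method_trees) / len(method_trees)
```

Branch lengths are ignored on purpose, since the RF distance depends on topology only.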
File name: unassembled-ecoli.zip
File size: 308.6 MB
MD5sum: 22fa806b474670aecc97ca74b7e1c843
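The checksum can be verified after download with Python's standard `hashlib` module. A minimal sketch, assuming the archive was saved under the file name given above; the helper name `md5sum` is illustrative.

```python
import hashlib

EXPECTED_MD5 = "22fa806b474670aecc97ca74b7e1c843"  # value listed for unassembled-ecoli.zip


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if `md5sum("unassembled-ecoli.zip") == EXPECTED_MD5` evaluates to `True`.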
The benchmark supports the following file formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance for that comparison.
Example of Text File Format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
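A file in this three-column format can be read into a dictionary keyed by unordered sequence pairs. The sketch below is one possible reader, not the benchmark's own parser; the name `read_pairs` is hypothetical.

```python
def read_pairs(lines):
    """Parse 'id1<TAB>id2<TAB>distance' lines into {frozenset({id1, id2}): distance}."""
    distances = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 tab-separated columns, got {len(fields)}"
            )
        a, b, value = fields
        # frozenset makes the key independent of column order (A-B equals B-A).
        distances[frozenset((a, b))] = float(value)
    return distances
```

For n sequences, a complete all-versus-all file yields n(n-1)/2 entries, so 4 sequences give 6 and the 29 strains give 406.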
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
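Pairwise distances can be written out as a square matrix like the one above with a short helper. This is a sketch only: `phylip_square` is a hypothetical name, the input is assumed to be a dictionary keyed by unordered pairs, and the single-space layout copies the example here. Whether names must be padded to a fixed width is not stated in the text.

```python
def phylip_square(names, dist):
    """Format distances as a square matrix: a count line, then one row per sequence.

    dist maps frozenset({a, b}) to a distance; the diagonal is written as 0.
    """
    lines = [str(len(names))]
    for a in names:
        row = [0.0 if a == b else dist[frozenset((a, b))] for b in names]
        lines.append(a + " " + " ".join(f"{value:.3f}" for value in row))
    return "\n".join(lines) + "\n"
```

Because each pair is looked up through an unordered key, the output is symmetric by construction.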
Lower-triangle distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
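The lower-triangle form stores each pair once: row i carries the distances to the i sequences listed before it. The sketch below expands such a file into a full symmetric matrix. The name `read_lower_triangle` is hypothetical and whitespace-separated fields are assumed.

```python
def read_lower_triangle(text):
    """Expand a lower-triangle matrix into (names, full symmetric matrix)."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    n = int(rows[0][0])
    body = rows[1:]
    if len(body) != n:
        raise ValueError(f"expected {n} rows, found {len(body)}")
    names = [row[0] for row in body]
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(body):
        values = [float(v) for v in row[1:]]
        if len(values) != i:
            raise ValueError(f"row {i + 1}: expected {i} distances, found {len(values)}")
        for j, value in enumerate(values):
            # Mirror each value across the diagonal.
            matrix[i][j] = matrix[j][i] = value
    return names, matrix
```

Applied to the 4-sequence example, this reproduces the square matrix shown for the previous format.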
Tree in Newick format
Example of Newick Format (4 sequences)
(B,(C,D),A);
Branch lengths can be incorporated, but are not required.
(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
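Before submitting a tree, it can help to confirm that its leaf labels match the expected sequence identifiers. A minimal sketch, assuming unquoted labels without spaces or parentheses; `newick_leaves` is a hypothetical helper, not part of the benchmark.

```python
import re


def newick_leaves(newick):
    """Return leaf labels in order of appearance; branch lengths and internal labels are skipped."""
    # A leaf label directly follows "(" or ","; internal labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
```

Both example trees above give the same label set, `{"A", "B", "C", "D"}`, with or without branch lengths.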