The dataset comprises 27 full genome sequences of E. coli/Shigella strains, taken from Bernard et al. (2016).
The dataset has 1 directory containing 27 FASTA files.
unsimulated-ecoli_shigella
├── NC_000913.fasta
├── NC_002655.fasta
├── NC_002695.fasta
├── NC_004337.fasta
├── NC_004431.fasta
├── NC_004741.fasta
├── NC_007384.fasta
├── NC_007606.fasta
├── NC_007613.fasta
├── NC_007779.fasta
├── NC_007946.fasta
├── NC_008253.fasta
├── NC_008258.fasta
├── NC_008563.fasta
├── NC_009800.fasta
├── NC_009801.fasta
├── NC_010468.fasta
├── NC_010498.fasta
├── NC_010658.fasta
├── NC_011415.fasta
├── NC_011601.fasta
├── NC_011741.fasta
├── NC_011742.fasta
├── NC_011745.fasta
├── NC_011748.fasta
├── NC_011750.fasta
└── NC_011751.fasta
The test evaluates the accuracy of alignment-free distance measures in reconstructing a species phylogeny from whole genome sequences.
Specifically, the benchmark procedure takes as input a user's file containing either all-versus-all distances or a phylogenetic tree in Newick format. Distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of the method tree (the “test tree”), we compute the Robinson-Foulds distance between it and the corresponding species tree, using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69).
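To illustrate what the Robinson-Foulds comparison measures, the following is a minimal pure-Python sketch of the unnormalised symmetric-difference definition for unrooted trees. It is not ftreedist and is not guaranteed to reproduce its output. The function names (clades, splits, robinson_foulds) are hypothetical, and the small parser assumes unquoted leaf names without spaces.

```python
import re

def clades(newick):
    """Return (all leaf names, leaf set of every parenthesised group)."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, groups, leaves = [], [], set()
    previous = None
    for token in tokens:
        if token == "(":
            stack.append(set())
        elif token == ")":
            finished = stack.pop()
            groups.append(frozenset(finished))
            if stack:
                stack[-1] |= finished
        elif token not in ",;" and previous != ")":
            # A leaf token such as "B" or "B:2.13125"; drop the branch length.
            name = token.split(":")[0]
            leaves.add(name)
            stack[-1].add(name)
        # Tokens that directly follow ")" are internal labels or branch
        # lengths and are ignored.
        previous = token
    return frozenset(leaves), groups

def splits(newick):
    """Return the non-trivial bipartitions of an unrooted tree."""
    leaves, groups = clades(newick)
    anchor = min(leaves)  # fixed reference leaf to normalise each split
    result = set()
    for group in groups:
        side = leaves - group if anchor in group else group
        if 1 < len(side) < len(leaves) - 1:
            result.add(side)
    return result

def robinson_foulds(tree_a, tree_b):
    """Count the bipartitions present in exactly one of the two trees."""
    if clades(tree_a)[0] != clades(tree_b)[0]:
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))
```

A distance of 0 means the two trees have the same unrooted topology; branch lengths do not affect this count.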
File name: unsimulated-ecoli_shigella.zip
File size: 39.2 MB
MD5sum: e4282d59f4dae2fd6e4914cb747e5566
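A downloaded archive can be checked against the MD5 sum listed above. The sketch below uses only Python's standard hashlib module; the helper names md5sum and verify_download are hypothetical, and the default path assumes the archive was saved under its listed file name in the current directory.

```python
import hashlib

# Checksum as listed for unsimulated-ecoli_shigella.zip.
EXPECTED_MD5 = "e4282d59f4dae2fd6e4914cb747e5566"

def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path="unsimulated-ecoli_shigella.zip"):
    """Return True if the file's MD5 digest matches the listed value."""
    return md5sum(path) == EXPECTED_MD5
```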
The benchmark supports one of the following file formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance for that comparison.
Example of Text File Format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
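As an illustration of how the three-column format maps onto a full distance matrix, here is a minimal sketch that reads such lines and builds a symmetric matrix. The function names read_pairwise_tsv and to_square_matrix are hypothetical. The sketch assumes each unordered pair appears once and that self-distances are zero.

```python
def read_pairwise_tsv(lines):
    """Parse 'id1<TAB>id2<TAB>distance' lines into a dict keyed by pair."""
    distances = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        first, second, value = line.split("\t")
        # frozenset makes the key independent of the order of the two ids.
        distances[frozenset((first, second))] = float(value)
    return distances

def to_square_matrix(distances):
    """Return (sorted names, symmetric matrix) with zeros on the diagonal."""
    names = sorted({name for pair in distances for name in pair})
    matrix = [
        [0.0 if a == b else distances[frozenset((a, b))] for b in names]
        for a in names
    ]
    return names, matrix

# The four-sequence example from the text.
example = [
    "A\tB\t8.876", "A\tC\t6.120", "A\tD\t4.321",
    "B\tC\t5.231", "B\tD\t3.983", "C\tD\t0.663",
]
names, matrix = to_square_matrix(read_pairwise_tsv(example))
```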
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
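A square matrix in this layout could be written with a few lines of Python. The sketch below is an assumption-laden illustration: the function name format_phylip_square is hypothetical, values are printed with three decimals to mirror the example, and names are separated from values by a single space. Strict PHYLIP programs traditionally expect names padded to ten characters, so adjust the padding if a tool requires it.

```python
def format_phylip_square(names, matrix):
    """Render names and a square distance matrix as Phylip-style text."""
    lines = [str(len(names))]  # first line: number of sequences
    for name, row in zip(names, matrix):
        values = " ".join(f"{value:.3f}" for value in row)
        lines.append(f"{name} {values}")
    return "\n".join(lines) + "\n"

# The four-sequence example matrix from the text.
names = ["A", "B", "C", "D"]
matrix = [
    [0.000, 8.876, 6.120, 9.321],
    [8.876, 0.000, 2.231, 3.983],
    [6.120, 2.231, 0.000, 0.663],
    [9.321, 3.983, 0.663, 0.000],
]
phylip_text = format_phylip_square(names, matrix)
```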
Lower-triangle distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
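The lower-triangle layout stores each pair once: row i lists the distances from sequence i to the i sequences before it. A minimal sketch of expanding it into a full symmetric matrix follows. The function name read_phylip_lower is hypothetical, and the parser assumes whitespace-separated fields and names without spaces.

```python
def read_phylip_lower(text):
    """Expand a lower-triangle Phylip matrix into (names, full matrix)."""
    rows = [line.split() for line in text.strip().splitlines()]
    count = int(rows[0][0])  # first line: number of sequences
    names = []
    matrix = [[0.0] * count for _ in range(count)]
    for i, fields in enumerate(rows[1:count + 1]):
        names.append(fields[0])
        # Row i carries distances to sequences 0 .. i-1; mirror each value.
        for j, value in enumerate(fields[1:]):
            matrix[i][j] = matrix[j][i] = float(value)
    return names, matrix

# The four-sequence example from the text.
example = """4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
"""
names, matrix = read_phylip_lower(example)
```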
Tree in Newick format
Example of Newick Format (4 sequences)
(B,(C,D),A);
Branch lengths can be incorporated, but are not required.
(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);