The dataset consists of the full genome sequences of 8 Yersinia strains, taken from Bernard et al. (2016).
The dataset has 1 directory containing 8 FASTA files.
unsimulated-yersinia
├── AAKT020000.fasta
├── BX936398.fasta
├── NC_003143.fasta
├── NC_004088.fasta
├── NC_005810.fasta
├── NC_008149.fasta
├── NC_008150.fasta
└── NC_009381.fasta
The test evaluates the accuracy of alignment-free distance measures in reconstructing a species phylogeny from whole-genome sequences.
Specifically, the benchmark procedure takes as input a user-supplied file containing either all-versus-all distances or a phylogenetic tree in Newick format. When distances are supplied, they are passed to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree (the “test tree”). To assess the accuracy of the method tree, we compute the Robinson-Foulds distance between it and the corresponding species tree, using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69).
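The Robinson-Foulds comparison described above can be illustrated with a minimal sketch. It is not ftreedist and makes no claim to reproduce its output format. It only counts the non-trivial splits (bipartitions of the leaf set) found in one unrooted tree but not the other. The nested-tuple tree representation and the function names are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of an unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always gets the same representation.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

For instance, `robinson_foulds(("B", ("C", "D"), "A"), (("A", "C"), "B", "D"))` compares a tree that groups C with D against one that groups A with C. A distance of 0 means the two topologies are identical.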
File name: unsimulated-yersinia.zip
File size: 10.9 MB
MD5sum: 694d8bc9fe7dba6de7fe6f1ff5ee9a95
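A downloaded archive can be checked against the MD5 sum listed above. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the local path passed in is whatever location the file was saved to.

```python
import hashlib

# MD5 sum listed above for unsimulated-yersinia.zip
EXPECTED_MD5 = "694d8bc9fe7dba6de7fe6f1ff5ee9a95"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("unsimulated-yersinia.zip")` returns `False` if the download was truncated or corrupted.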
The benchmark accepts any one of the following file formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance between them.
Example of Text File Format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
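The three-column format is straightforward to read back into a symmetric matrix. The sketch below assumes one comparison per line, tab-separated, with each unordered pair listed once. The function names are hypothetical and not part of the benchmark.

```python
def read_pairwise_tsv(lines):
    """Parse 'id1<TAB>id2<TAB>distance' lines.

    Returns the identifiers in order of first appearance and a dict
    mapping each unordered pair (as a frozenset) to its distance.
    """
    ids = []
    distances = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first, second, value = line.split("\t")
        for name in (first, second):
            if name not in ids:
                ids.append(name)
        distances[frozenset((first, second))] = float(value)
    return ids, distances


def to_square_matrix(ids, distances):
    """Expand pairwise distances into a full symmetric matrix with a zero diagonal."""
    return [
        [0.0 if a == b else distances[frozenset((a, b))] for b in ids]
        for a in ids
    ]
```

`to_square_matrix` raises `KeyError` if any pair is missing, which is a cheap way to detect an incomplete all-versus-all listing before submitting it.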
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
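A square matrix like the one in this example can be produced with a few lines of code. The sketch below assumes that whitespace-separated fields are accepted, as the example suggests. Strict PHYLIP readers may instead expect names padded to ten characters, so treat the separator as an assumption. The function name is hypothetical.

```python
def format_phylip_square(ids, matrix, decimals=3):
    """Render a full distance matrix in the square Phylip layout.

    The first line holds the number of sequences. Each following line
    holds a sequence name and its distances to all sequences.
    """
    if len(matrix) != len(ids) or any(len(row) != len(ids) for row in matrix):
        raise ValueError("matrix must be square and match the number of ids")
    lines = [str(len(ids))]
    for name, row in zip(ids, matrix):
        values = " ".join(f"{value:.{decimals}f}" for value in row)
        lines.append(f"{name} {values}")
    return "\n".join(lines)
```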
Lower-triangle distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
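In the lower-triangle layout, row i carries only the distances to the i sequences listed before it, so the full matrix is recovered by mirroring each value across the diagonal. The sketch below illustrates that expansion. It assumes whitespace-separated fields, and the function name is hypothetical.

```python
def read_phylip_lower(text):
    """Parse a lower-triangle Phylip matrix into (ids, full symmetric matrix)."""
    rows = [line.split() for line in text.strip().splitlines()]
    count = int(rows[0][0])
    body = rows[1:count + 1]
    if len(body) != count:
        raise ValueError("fewer rows than the declared number of sequences")
    ids = [row[0] for row in body]
    matrix = [[0.0] * count for _ in range(count)]
    for i, row in enumerate(body):
        values = [float(field) for field in row[1:]]
        if len(values) != i:
            raise ValueError(f"row {i + 1} should hold {i} distances")
        for j, value in enumerate(values):
            # Mirror each value so the matrix is symmetric.
            matrix[i][j] = value
            matrix[j][i] = value
    return ids, matrix
```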
Tree in Newick format
Example of Newick Format (4 sequences)
(B,(C,D),A);
Branch lengths can be incorporated, but are not required.
(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
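To check a tree file before submission, a small parser is enough to confirm that the string is well-formed and lists the expected leaves. The following is a minimal sketch, not a complete Newick implementation: it assumes unquoted labels and no bracketed comments, and the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples.

    'length' is None when no branch length is given.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if text[pos] == ":":
            pos += 1  # consume ':'
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List leaf names in the order they appear in the Newick string."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

Comparing `sorted(leaf_names(parse_newick(tree_string)))` with the expected sequence identifiers catches missing or misspelled leaves early.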