The dataset consists of unassembled reads from 14 plant species. The set of genomes was originally compiled by Hatje and Kollmar (2012). It was prepared separately for 7 different sequencing coverages: 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5, and 1.0. The ART software was used to simulate unassembled reads from the complete genome sequences.
The dataset has 7 directories, and each directory contains 14 files, for 98 files in total.
unassembled-plants
├── coverage_0.015625
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   ├── clementin.fasta
│   ├── grandis.fasta
│   ├── halophilu.fasta
│   ├── lyrata.fasta
│   ├── papaya.fasta
│   ├── parvulum.fasta
│   ├── raimondii.fasta
│   ├── rapa.fasta
│   ├── rubella.fasta
│   ├── sinensis.fasta
│   ├── thalian.fasta
│   └── vinifera.fasta
├── coverage_0.03125
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   └── ...
├── coverage_0.0625
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   └── ...
├── coverage_0.125
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   └── ...
├── coverage_0.25
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   └── ...
├── coverage_0.5
│   ├── cacao.fasta
│   ├── camaldule.fasta
│   └── ...
└── coverage_1
    ├── cacao.fasta
    ├── camaldule.fasta
    └── ...
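After unpacking the archive, it can be useful to confirm that all 98 files are present. The following is a minimal sketch, assuming the archive extracts to a directory named unassembled-plants and that every coverage directory holds the same 14 file names shown for coverage_0.015625. The function names are illustrative and not part of the benchmark.

```python
from pathlib import Path

# Coverage suffixes and file stems copied from the directory tree above.
COVERAGES = ["0.015625", "0.03125", "0.0625", "0.125", "0.25", "0.5", "1"]
SPECIES = [
    "cacao", "camaldule", "clementin", "grandis", "halophilu", "lyrata",
    "papaya", "parvulum", "raimondii", "rapa", "rubella", "sinensis",
    "thalian", "vinifera",
]


def expected_files(root="unassembled-plants"):
    """Return the 7 x 14 = 98 paths implied by the layout (assumed root name)."""
    return [
        Path(root) / f"coverage_{cov}" / f"{name}.fasta"
        for cov in COVERAGES
        for name in SPECIES
    ]


def missing_files(root="unassembled-plants"):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_files(root) if not path.is_file()]
```

Calling missing_files() from the directory that contains unassembled-plants should return an empty list when the download is complete.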
The test evaluates the accuracy of alignment-free distance measures in reconstructing the species phylogeny from unassembled genome sequences at varying sequencing coverages.
Specifically, the benchmark procedure takes as input 7 user files, one per sequencing coverage, each containing either all-versus-all distances or a phylogenetic tree in Newick format. For each file, the distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of the method tree, the Robinson-Foulds (RF) distance between the tree computed using that method (the “test tree”) and the corresponding species tree is computed using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69). The overall assessment of method accuracy is the average of the RF distance values across the 7 trees.
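To make the scoring step concrete, here is a rough Python sketch of an unnormalised Robinson-Foulds distance and its average over several tree pairs. It is not the ftreedist implementation the benchmark uses. It assumes both trees carry the same leaf names, that names contain no parentheses, commas, colons or semicolons, and it ignores branch lengths. The function names splits, rf_distance and mean_rf are illustrative.

```python
import re
from statistics import mean


def splits(newick):
    """Return the non-trivial leaf bipartitions of a Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, groups, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            group = frozenset(stack.pop())
            groups.append(group)
            if stack:
                stack[-1] |= group
        elif tok not in (",", ";") and prev != ")":
            # A leaf token such as "C" or "C:0.90675"; drop the branch length.
            name = tok.split(":")[0].strip()
            if name and stack:
                leaves.add(name)
                stack[-1].add(name)
        # Tokens that directly follow ")" are internal labels or lengths: skipped.
        prev = tok
    all_leaves = frozenset(leaves)
    ref = min(all_leaves)  # fixed reference leaf so each split has one canonical side
    result = set()
    for group in groups:
        side = group if ref not in group else all_leaves - group
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Count splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def mean_rf(pairs):
    """Average RF distance over (test_tree, species_tree) Newick pairs."""
    return mean(rf_distance(test, ref) for test, ref in pairs)
```

With 7 (test tree, species tree) pairs, one per coverage, mean_rf gives a single summary value in the spirit of the averaging described above. Whether the benchmark reports raw or normalised RF values is not stated here, so treat the scale of this sketch as an assumption.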
File name: unassembled-plants.zip
File size: 2.7 GB
MD5sum: 90ae0609e1c124e15461e49197c8d4c6
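The checksum above can be used to verify the download. A minimal sketch using Python's standard hashlib module follows. The helper name md5_of_file is illustrative, and the file is read in chunks because the archive is large.

```python
import hashlib

# Checksum published above for unassembled-plants.zip.
EXPECTED_MD5 = "90ae0609e1c124e15461e49197c8d4c6"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing md5_of_file("unassembled-plants.zip") with EXPECTED_MD5 indicates whether the archive was downloaded intact.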
The benchmark supports the following file formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value for that comparison.
Example of Text File Format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
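One way to produce this three-column format from a symmetric distance matrix is sketched below. The function name to_tsv and the three-decimal formatting are assumptions made for illustration. The description above fixes only the tab-separated column layout.

```python
from itertools import combinations


def to_tsv(names, matrix):
    """Write each unordered pair once as 'id1<TAB>id2<TAB>distance'."""
    lines = []
    for i, j in combinations(range(len(names)), 2):
        lines.append(f"{names[i]}\t{names[j]}\t{matrix[i][j]:.3f}")
    return "\n".join(lines) + "\n"


# Distances taken from the 4-sequence text example above.
names = ["A", "B", "C", "D"]
matrix = [
    [0.0, 8.876, 6.120, 4.321],
    [8.876, 0.0, 5.231, 3.983],
    [6.120, 5.231, 0.0, 0.663],
    [4.321, 3.983, 0.663, 0.0],
]
tsv_text = to_tsv(names, matrix)
```

Here tsv_text holds the six pairwise lines of the example, ready to be written to a file.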
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
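A square matrix in this layout could be written with a few lines of Python. This is a sketch that mirrors the whitespace-separated example above. It does not pad names to a fixed width as some strict Phylip readers expect, and the function name to_phylip_square is illustrative.

```python
def to_phylip_square(names, matrix):
    """Return a square distance matrix: sequence count, then one row per sequence."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        values = [f"{value:.3f}" for value in row]
        lines.append(" ".join([name] + values))
    return "\n".join(lines) + "\n"


# Distances taken from the 4-sequence square-matrix example above.
names = ["A", "B", "C", "D"]
matrix = [
    [0.000, 8.876, 6.120, 9.321],
    [8.876, 0.000, 2.231, 3.983],
    [6.120, 2.231, 0.000, 0.663],
    [9.321, 3.983, 0.663, 0.000],
]
phylip_text = to_phylip_square(names, matrix)
```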
Lower-triangle distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
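The lower-triangle layout stores each distance once, so row i lists only the distances to the i sequences before it. The sketch below expands such a file into a full symmetric matrix. It assumes one sequence per line and names without whitespace, and the function name read_phylip_lower is illustrative.

```python
def read_phylip_lower(text):
    """Parse a lower-triangle Phylip matrix into (names, full symmetric matrix)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split()
        names.append(fields[0])
        # Row i carries distances to sequences 0 .. i-1 only.
        for j, value in enumerate(fields[1:i + 1]):
            matrix[i][j] = matrix[j][i] = float(value)
    return names, matrix


example = "4\nA\nB 8.876\nC 6.120 2.231\nD 9.321 3.983 0.663\n"
names, matrix = read_phylip_lower(example)
```

The diagonal is filled with zeros and every off-diagonal value is mirrored, so matrix[0][3] and matrix[3][0] both hold the A to D distance.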
Tree in Newick format
Example of Newick Format (4 sequences)
(B,(C,D),A);
Branch lengths can be incorporated, but are not required.
(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);