Simulated genomes (sim_hgt)

Dataset description

The dataset comprises 33 full genome sequences simulated with EvolSimulator, following the approach of Bernard et al. (2016). The set of genomes was generated separately for each extent of HGT, as determined by the mean number of HGT events per iteration.

Specifically, EvolSimulator was used to simulate horizontal gene transfer in microbial genomes. Each set of genomes was simulated under a birth-and-death model at speciation rate = extinction rate = 0.5. The number of genomes in each set was allowed to vary from 25 to 35, with each genome containing 2000–3000 genes of length 240–1500 nucleotides. HGT receptivity was set at minimum 0.2, mean 0.5 and maximum 0.8, mutation rate m = 0.4–0.6 and number of generations i = 5000. The varying extent of HGT is simulated using the mean number of HGT events attempted per iteration l = 0, 250, 500, 750 and 1000, and divergence factor d = 2000 (transfers of genes with high sequence divergence, i.e. >2000 iterations apart, will not be successful). All other parameters in this simulation follow Beiko et al. (2007).

Dataset file structure

The dataset has 5 directories, each containing 33 FASTA files.

simulated-sim_hgt
├── hgt_0
│   ├── Species1.fasta
│   ├── Species2.fasta
│   ├── Species3.fasta
│   ├── Species4.fasta
│   ├── Species5.fasta
│   ├── Species6.fasta
│   ├── Species7.fasta
│   ├── Species8.fasta
│   ├── Species9.fasta
│   ├── Species10.fasta
│   ├── Species11.fasta
│   ├── Species12.fasta
│   ├── Species13.fasta
│   ├── Species14.fasta
│   ├── Species15.fasta
│   ├── Species16.fasta
│   ├── Species17.fasta
│   ├── Species18.fasta
│   ├── Species19.fasta
│   ├── Species20.fasta
│   ├── Species21.fasta
│   ├── Species22.fasta
│   ├── Species23.fasta
│   ├── Species24.fasta
│   ├── Species25.fasta
│   ├── Species26.fasta
│   ├── Species27.fasta
│   ├── Species28.fasta
│   ├── Species29.fasta
│   ├── Species30.fasta
│   ├── Species31.fasta
│   ├── Species32.fasta
│   └── Species33.fasta
├── hgt_250
│   ├── Species1.fasta
│   ├── Species2.fasta
│   └── ..
├── hgt_500
│   ├── Species1.fasta
│   ├── Species2.fasta
│   └── ..
├── hgt_750
│   ├── Species1.fasta
│   ├── Species2.fasta
│   └── ..
└── hgt_1000
    ├── Species1.fasta
    ├── Species2.fasta
    └── ..
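
The layout above can be walked programmatically. The following is a minimal sketch, assuming the archive was unzipped into a directory named simulated-sim_hgt as shown in the tree; the function name list_genomes is hypothetical and not part of the benchmark.

```python
from pathlib import Path

# Directory suffixes taken from the tree above (mean HGT events per iteration).
HGT_LEVELS = (0, 250, 500, 750, 1000)

def list_genomes(root):
    """Map each hgt_* directory name to its FASTA files, ordered by species number."""
    root = Path(root)
    layout = {}
    for level in HGT_LEVELS:
        directory = root / f"hgt_{level}"
        # Sort numerically so that Species10 comes after Species9, not after Species1.
        files = sorted(
            directory.glob("Species*.fasta"),
            key=lambda p: int(p.stem[len("Species"):]),
        )
        layout[directory.name] = files
    return layout
```

For example, list_genomes("simulated-sim_hgt")["hgt_250"] would be expected to hold 33 paths, from Species1.fasta to Species33.fasta.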

Benchmark Protocol

The test evaluates the accuracy of alignment-free distance measures in reconstructing species phylogeny from simulated genome sequences generated at varying extents of HGT.

Specifically, the benchmark procedure takes as input 5 user files, each containing either all-versus-all distances or a phylogenetic tree in Newick format. For each distance file, the distances are used as input to the neighbour-joining algorithm (fneighbor from EMBOSS:6.3.1 PHYLIPNEW:3.69) to generate the corresponding method tree. To assess the accuracy of a method tree, the Robinson-Foulds (RF) distance is computed between the tree obtained with that method (the “test tree”) and the corresponding species tree, using ftreedist (EMBOSS:6.3.1 PHYLIPNEW:3.69). The overall assessment of method accuracy is the average of the RF distance values across the 5 trees.
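
To illustrate the scoring step, the sketch below computes the unnormalised Robinson-Foulds distance (the number of bipartitions present in one tree but not the other) between two Newick trees and averages it over several tree pairs. It is a pure-Python illustration and not the benchmark's implementation, which relies on ftreedist; the function names are hypothetical, trees are compared as unrooted topologies, and leaf labels are assumed to contain no whitespace, parentheses, commas, colons or semicolons.

```python
import re
from statistics import mean

def bipartitions(newick):
    """Return (leaf set, non-trivial bipartitions) of a Newick tree, ignoring branch lengths."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, clades, leaves, prev = [], [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            clade = stack.pop()
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade
        elif tok not in (",", ";") and prev != ")":
            # A label that does not directly follow ')' is a leaf name.
            name = tok.split(":")[0]
            leaves.add(name)
            stack[-1].add(name)
        prev = tok
    ref = min(leaves)  # canonical side of each split: the one without this leaf
    splits = set()
    for clade in clades:
        side = frozenset(leaves - clade) if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:  # skip trivial splits
            splits.add(side)
    return frozenset(leaves), splits

def rf_distance(test_tree, reference_tree):
    """Unnormalised RF distance: size of the symmetric difference of the split sets."""
    leaves_a, splits_a = bipartitions(test_tree)
    leaves_b, splits_b = bipartitions(reference_tree)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf labels")
    return len(splits_a ^ splits_b)

def mean_rf(tree_pairs):
    """Average RF distance over (test tree, reference tree) pairs."""
    return mean(rf_distance(test, ref) for test, ref in tree_pairs)
```

Calling mean_rf with five (test tree, species tree) pairs, one per HGT level, mirrors the averaging described above under these assumptions.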

Testing your own method

Download the dataset file simulated-sim_hgt.zip from the server and unzip it.
Use the FASTA files in each of the 5 unzipped directories as input to your method and calculate either the distances between every pair of sequences or a phylogenetic tree (see File Formats below).
Save the results to 5 text files, one per directory.
Submit your predictions (upload the 5 output files) to the web server.
The web service benchmarks the uploaded results and presents a report with the submitted method's performance and comparison to other available methods. Additionally, you can choose to make the report publicly available.
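
As a rough illustration of the second step, the sketch below reads FASTA files and computes all pairwise distances for one directory. The distance used here, a Jaccard distance between k-mer sets, is only a stand-in for your own method and is not part of the benchmark. The function names, the choice k = 8, and the pooling of all records in a file into one genome are assumptions made for the example.

```python
from itertools import combinations

def read_fasta_sequences(path):
    """Return the sequences of all records in a FASTA file, upper-cased."""
    sequences, chunks = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    sequences.append("".join(chunks))
                chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))
    return sequences

def kmer_set(sequences, k=8):
    """Collect the distinct k-mers found in a list of sequences."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def pairwise_distances(genomes, k=8):
    """genomes maps a genome name to its sequences; returns {(name1, name2): distance}."""
    profiles = {name: kmer_set(seqs, k) for name, seqs in genomes.items()}
    return {
        (x, y): jaccard_distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }
```

Running pairwise_distances once per hgt_* directory yields the 5 result sets to be saved and uploaded. Which identifiers the server expects for each genome (for instance the file name without extension) is not stated here and should be checked against the server's instructions.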

Download dataset

File name: simulated-sim_hgt.zip

File size: 109.3 MB

MD5sum: 967b15c79b44524a8f3c389e66344279
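
The downloaded archive can be checked against the MD5 sum above. A minimal sketch using Python's standard hashlib module follows; the function name md5sum is chosen for illustration and the archive is assumed to be in the current directory.

```python
import hashlib

EXPECTED_MD5 = "967b15c79b44524a8f3c389e66344279"  # value listed above

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If md5sum("simulated-sim_hgt.zip") differs from EXPECTED_MD5, the download is likely incomplete or corrupted and should be repeated.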

File formats supported in benchmark

Each submitted file must be in one of the following formats:

Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value for this comparison.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
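
A file in this format can be produced with a few lines of code. The sketch below assumes the distances are held in a dictionary keyed by pairs of identifiers; the function name format_pairs and the three-decimal precision are choices made for the example, not requirements of the benchmark.

```python
def format_pairs(distances):
    """Render {(id1, id2): distance} as tab-separated lines, one pair per line."""
    lines = [f"{a}\t{b}\t{d:.3f}" for (a, b), d in distances.items()]
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to one of the 5 output text files.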

Square distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000

Lower-triangle distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
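
Both Phylip layouts can be generated from the same pairwise distances. The sketch below is one possible way to do so; it assumes names are padded to 10 characters, as in the classic Phylip convention, which may be stricter than what the server requires. The function name format_phylip and its lower flag are hypothetical.

```python
def format_phylip(names, dist, lower=False):
    """Render distances as a square or lower-triangle Phylip matrix.

    names: ordered list of sequence identifiers.
    dist:  {(id1, id2): distance}, with each pair stored in either order.
    """
    def lookup(a, b):
        if a == b:
            return 0.0
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    lines = [f"{len(names):>4}"]  # first line: number of sequences
    for i, a in enumerate(names):
        # Lower-triangle rows list only distances to the preceding names.
        others = names[:i] if lower else names
        row = " ".join(f"{lookup(a, b):.3f}" for b in others)
        lines.append(f"{a:<10}{row}".rstrip())
    return "\n".join(lines) + "\n"
```

With lower=True the first row carries only the identifier, matching the lower-triangle example above.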

Tree in Newick format

Example of Newick Format (4 sequences)

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);