About

About service

Background

Service Goal

The AFproject service aims to simplify and standardize the benchmarking of alignment-free (AF) methods. For users, the benchmarks provide a way to identify the most effective methods for the problem at hand.

Service Features

Characterize the performance of all well-established AF programs under different evolutionary scenarios.
Create a catalogue of the most effective methods for the problem at hand.
Support developers during method implementation by letting them test their tools at different stages of progress and offering the opportunity to disseminate the results publicly.
Provide a platform for defining novel datasets as technology develops: users and developers can request changes or new datasets.

Service Content

The server benchmarks AF tools against 12 reference datasets, which can be classified into 5 application categories.

#  Research application             Reference data set             Sequence type
1  Regulatory Sequences             Cis-regulatory modules (CRM)   non-coding DNA
2  Protein Sequence Classification  Low sequence identity (<40%)   protein
                                    High sequence identity (≥40%)  protein
3  Gene Tree Inference              SwissTree                      protein
4  Genome-based Phylogeny           29 E. coli/Shigella strains    unassembled reads
                                    29 E. coli/Shigella strains    full genomes
                                    25 fish mitochondrial genomes  full genomes
                                    14 plant species               unassembled reads
                                    14 plant species               full genomes
5  Horizontal Gene Transfer         27 E. coli/Shigella strains    full genomes
                                    7 Yersinia species             full genomes
                                    33 simulated genomes           full artificial genomes

How does it work?

The AF method developer downloads the FASTA dataset for one research category from the server.
The developer uses the downloaded dataset as input to their alignment-free program. The output file should contain all-versus-all pairwise sequence distances in either TSV or Phylip format.
The developer uploads the output file to the server.
The server benchmarks the uploaded predictions and presents a report showing the submitted method's performance and how it compares to other available methods. The developer can choose to make the report publicly available.
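
The second step above can be sketched in Python. This is only an illustration of producing all-versus-all distances from FASTA input: the Euclidean distance between k-mer count vectors is a toy stand-in for a real alignment-free method, and the function names (`read_fasta`, `toy_distance`, `all_vs_all_tsv`) and the three short sequences are hypothetical, not part of the service.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier = first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def kmer_counts(seq, k):
    """Count overlapping words of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def toy_distance(a, b, k=2):
    """Euclidean distance between k-mer count vectors (illustrative only)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))


def all_vs_all_tsv(records, k=2):
    """Return TSV text with one line per unordered pair of sequences."""
    lines = []
    for x, y in combinations(sorted(records), 2):
        d = toy_distance(records[x], records[y], k)
        lines.append(f"{x}\t{y}\t{d:.3f}")
    return "\n".join(lines) + "\n"


# Made-up input; a real run would read the FASTA file downloaded from the server.
fasta = ">A\nACGTACGT\n>B\nACGTTCGT\n>C\nTTTTTTTT\n"
tsv = all_vs_all_tsv(read_fasta(fasta))
print(tsv, end="")
```

For n sequences the sketch writes n(n-1)/2 lines, one per unordered pair, which is the layout of the TSV format described under Formats.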

Formats

TSV (tab-separated values) format

A simple text file with three tab-separated columns. The first two columns hold the identifiers of the two sequences being compared; the third holds the numerical distance for that comparison. A TSV file may have more than three columns; the extra columns are ignored.

Example of TSV format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
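
A minimal sketch of how such a file could be read and checked before upload, assuming tab characters as separators and identifiers without tabs. The helper names `parse_tsv_distances` and `missing_pairs` are hypothetical, and the server's own validation may differ.

```python
from itertools import combinations


def parse_tsv_distances(text):
    """Read 'id1<TAB>id2<TAB>distance' lines; columns after the third are ignored."""
    distances = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        a, b = fields[0].strip(), fields[1].strip()
        # frozenset makes the pair order-independent: (A, B) == (B, A)
        distances[frozenset((a, b))] = float(fields[2])
    return distances


def missing_pairs(distances):
    """List unordered pairs of identifiers that have no distance entry."""
    ids = sorted({name for pair in distances for name in pair})
    return [(x, y) for x, y in combinations(ids, 2)
            if frozenset((x, y)) not in distances]


# The example above, with an extra fourth column on the last line.
example = (
    "A\tB\t8.876\n"
    "A\tC\t6.120\n"
    "A\tD\t4.321\n"
    "B\tC\t5.231\n"
    "B\tD\t3.983\n"
    "C\tD\t0.663\textra\n"
)
distances = parse_tsv_distances(example)
print(len(distances), missing_pairs(distances))
```

An empty list from `missing_pairs` indicates that every pair of sequences named in the file has a distance, which is what an all-versus-all output requires.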

Phylip format (symmetric distance matrix)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000

Phylip format (lower-triangle distance matrix)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
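
The two Phylip layouts carry the same information, which the following sketch illustrates by parsing both into a full square matrix. It assumes whitespace-separated fields and identifiers without spaces; `parse_phylip_distances` is a hypothetical helper, not the server's parser.

```python
def parse_phylip_distances(text):
    """Parse a square or lower-triangle Phylip matrix into (names, matrix)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of sequences
    if len(lines) < n + 1:
        raise ValueError("fewer rows than the declared number of sequences")
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, ln in enumerate(lines[1:n + 1]):
        fields = ln.split()
        names.append(fields[0])
        values = [float(v) for v in fields[1:]]
        if len(values) == n:      # square layout: the row is complete
            matrix[i] = values
        elif len(values) == i:    # lower triangle: mirror across the diagonal
            for j, v in enumerate(values):
                matrix[i][j] = matrix[j][i] = v
        else:
            raise ValueError(f"unexpected number of values for {fields[0]}")
    return names, matrix


square = """   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
"""
lower = """   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
"""
names_sq, matrix_sq = parse_phylip_distances(square)
names_lo, matrix_lo = parse_phylip_distances(lower)
print(names_sq == names_lo and matrix_sq == matrix_lo)
```

In the lower-triangle layout, row i lists only the distances to the i sequences above it, so the first row contains just an identifier.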

Newick format

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
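
As a small illustration that the two Newick strings above describe the same tree topology, the sketch below strips branch lengths with a regular expression and lists the leaf names. It assumes unquoted labels that contain no colons, parentheses, or commas; `strip_branch_lengths` and `leaf_names` are hypothetical helpers, and this is not a full Newick parser.

```python
import re

# A branch length is a colon followed by a number, e.g. ":0.64425".
BRANCH_LENGTH = re.compile(r":[0-9.eE+-]+")


def strip_branch_lengths(newick):
    """Remove every ':<number>' annotation from a Newick string."""
    return BRANCH_LENGTH.sub("", newick)


def leaf_names(newick):
    """Return leaf labels in order; a leaf directly follows '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", strip_branch_lengths(newick))


with_lengths = "(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);"
topology_only = "(B,(C,D),A);"

print(strip_branch_lengths(with_lengths) == topology_only)
print(leaf_names(with_lengths))
```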