
Background

Alignment-free (AF) sequence analysis tools have spread rapidly through biological research. Because these programs run many hundreds of times faster than comparable alignment-based approaches, they have been applied to problems such as NGS analysis, whole-genome phylogeny, and identification of recombined and horizontally transferred genes, among many others. Given this wide range of applications, benchmarking alignment-free predictions remains a difficult challenge for method developers and users.

Service Goal

The AFproject service aims to simplify and standardize alignment-free benchmarking. For users, the benchmarks provide a way to identify the most effective methods for the problem at hand.

Service Features

Characterize the performance of all well-established AF programs under different evolutionary scenarios.

Create a catalogue of the most effective methods for each type of problem.

Support developers during method implementation by letting them test their tools at different stages of development and offering the opportunity to publish the results.

Provide a platform for defining new datasets as technology develops: users and developers can request changes or new datasets.

Service Content

The server benchmarks AF tools against 12 reference datasets, which can be classified into 5 application categories.

#	Research application	Reference data set	Sequence type
1	Regulatory Sequences	Cis-regulatory modules (CRM)	non-coding DNA
2	Protein Sequence Classification	Low sequence identity (<40%)	protein
2	Protein Sequence Classification	High sequence identity (≥40%)	protein
3	Gene Tree Inference	SwissTree	protein
4	Genome-based Phylogeny	29 E.coli/Shigella strains	unassembled reads
		29 E.coli/Shigella strains	full genomes
		25 fish mitochondrial genomes	full genomes
		14 plant species	unassembled reads
		14 plant species	full genomes
5	Horizontal Gene Transfer	27 E.coli/Shigella strains	full genomes
		7 Yersinia species	full genomes
		33 simulated genomes	full artificial genomes

How does it work?

The developer of an AF method downloads the FASTA dataset for one research category from the server.

The developer runs their alignment-free program on the downloaded dataset. The output file should contain all-versus-all pairwise sequence distances in either TSV or Phylip format.

The developer uploads the output file to the server.

The server benchmarks the uploaded predictions and presents a report with the submitted method's performance and comparison to other available methods. The developer can choose to make the report publicly available.
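
The second step above (turning a FASTA dataset into all-versus-all pairwise distances) could look roughly like the following sketch. The distance used here, Euclidean distance between k-mer count vectors, is only an illustrative stand-in for a real alignment-free method and is not prescribed by the service; the function names and the choice of k are likewise assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import math


def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header line.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a, b):
    """Euclidean distance between two k-mer count profiles."""
    keys = set(a) | set(b)
    # Counter returns 0 for k-mers missing from one of the profiles.
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in keys))


def pairwise_distances(records, k=2):
    """Return a list of (id1, id2, distance) for every unordered pair."""
    profiles = {name: kmer_counts(seq, k) for name, seq in records.items()}
    return [
        (x, y, euclidean_distance(profiles[x], profiles[y]))
        for x, y in combinations(profiles, 2)
    ]
```

The resulting list of triples can then be written out in one of the accepted formats described under Formats.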

Formats

Benchmarks of all 12 datasets accept pairwise sequence distances in TSV or Phylip format.

TSV (Tab-separated value) format

A simple text file with three tab-separated columns. The first two columns hold the identifiers of the two sequences being compared; the third holds the numerical distance between them. A TSV file may have more than three columns (the extra columns are ignored).

Example of TSV format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
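
As a rough illustration of the TSV layout, the sketch below writes a list of pairwise distances to a tab-separated file and reads one back, ignoring any columns after the third. The function names are hypothetical, and the three-decimal formatting simply mirrors the example above; the service is not known to require it.

```python
import csv
import io


def write_tsv(pairs, handle):
    """Write (id1, id2, distance) triples as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for id1, id2, dist in pairs:
        writer.writerow([id1, id2, f"{dist:.3f}"])


def read_tsv(handle):
    """Read tab-separated distances into {(id1, id2): distance}."""
    distances = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        id1, id2, dist = row[:3]  # columns after the third are ignored
        distances[(id1, id2)] = float(dist)
    return distances


# Usage: write two comparisons to an in-memory buffer.
buffer = io.StringIO()
write_tsv([("A", "B", 8.876), ("A", "C", 6.12)], buffer)
```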

Phylip format (symmetric distance matrix)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000

Phylip format (lower-triangle distance matrix)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
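
Both Phylip layouts above can be produced from the same list of pairwise distances. The following is a minimal sketch under a few assumptions: the function name is hypothetical, the name column is padded to nine characters only because that matches the examples above, and the exact whitespace the server accepts is not stated in this document.

```python
def format_phylip(names, pairs, lower_triangle=False, name_width=9):
    """Format pairwise distances as a Phylip distance matrix.

    names: sequence identifiers in the desired row order.
    pairs: iterable of (id1, id2, distance), one entry per unordered pair.
    """
    # Make the lookup symmetric so either order of identifiers works.
    lookup = {}
    for a, b, dist in pairs:
        lookup[(a, b)] = dist
        lookup[(b, a)] = dist

    lines = [f"{len(names):>4}"]  # first line: number of sequences
    for i, row_name in enumerate(names):
        # Lower-triangle rows list only the sequences that came before.
        columns = names[:i] if lower_triangle else names
        values = [
            0.0 if col == row_name else lookup[(row_name, col)]
            for col in columns
        ]
        fields = " ".join(f"{v:.3f}" for v in values)
        lines.append((row_name.ljust(name_width) + fields).rstrip())
    return "\n".join(lines) + "\n"


# Usage with the distances from the Phylip examples above.
example_pairs = [
    ("A", "B", 8.876), ("A", "C", 6.120), ("A", "D", 9.321),
    ("B", "C", 2.231), ("B", "D", 3.983), ("C", "D", 0.663),
]
square = format_phylip(["A", "B", "C", "D"], example_pairs)
lower = format_phylip(["A", "B", "C", "D"], example_pairs, lower_triangle=True)
```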

Newick format

Note: The benchmark procedures for two categories (Genome-based Phylogeny and Horizontal Gene Transfer) also support Newick format, because the analysis is based on comparing tree topologies. For these categories you can therefore optionally provide a tree in Newick format instead of TSV or Phylip.

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
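
If a tool holds its tree in memory rather than as text, a Newick string like the ones above can be generated with a short recursive function. This is only a sketch: the nested-list representation (a list for an internal node, a string for a leaf, and a `(node, length)` tuple to attach a branch length) is an assumption made for this example, not a structure defined by the service.

```python
def _format_node(node):
    """Recursively render one node of the tree."""
    if isinstance(node, tuple):
        # (child, branch_length): render the child, then append its length.
        child, length = node
        return f"{_format_node(child)}:{length}"
    if isinstance(node, list):
        # Internal node: comma-separated children inside parentheses.
        return "(" + ",".join(_format_node(c) for c in node) + ")"
    return str(node)  # leaf name


def to_newick(tree):
    """Render a nested-list tree as a Newick string ending in ';'."""
    return _format_node(tree) + ";"


# Topology only, then the same tree with branch lengths.
topology = to_newick(["B", ["C", "D"], "A"])
with_lengths = to_newick([
    ("B", 2.13125),
    ([("C", 0.90675), ("D", 1.56975)], 0.64425),
    ("A", 6.74475),
])
```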
