high identity (high-ident)

Dataset description

The dataset comprises 2128 protein sequences belonging to 1064 protein families. The sequences were collected from the SCOPe database, which consists of Protein Data Bank (PDB) entries and provides a hierarchical classification of proteins at four structural levels.

#  Structural level   Short description
1  family (fa)        Clear evolutionary relationship
2  superfamily (sf)   Most probable common evolutionary origin
3  fold (cf)          Major structural similarity
4  class (cl)         Overall structural similarity
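
The four levels are nested, so a pair that shares a family also shares a superfamily, fold, and class. The sketch below illustrates this, assuming each protein is annotated with a SCOPe-style concise classification string such as "a.1.1.2" (class.fold.superfamily.family). That annotation is not part of the dataset files described here, and the function names are hypothetical.

```python
# Depth of each level inside a dotted classification string such as "a.1.1.2".
# Assumption: the string is ordered class.fold.superfamily.family.
_DEPTH = {"class": 1, "fold": 2, "superfamily": 3, "family": 4}


def group_at(sccs, level):
    """Return the group label of a classification string at the given level."""
    return ".".join(sccs.split(".")[:_DEPTH[level]])


def same_group(sccs_a, sccs_b, level):
    """Return 1 if both proteins share the group at this level, else 0."""
    return int(group_at(sccs_a, level) == group_at(sccs_b, level))
```

For example, two proteins labelled "a.1.1.2" and "a.1.1.3" would differ at the family level but match at the superfamily, fold, and class levels.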

Dataset preparation

  • exclude sequences with unknown amino acids
  • exclude families with fewer than 5 proteins
  • include only two members for each family (randomly chosen)
  • include only the four major classes
    • α class: constituted mainly by proteins with α helix
    • β class: essentially formed by β-sheet structures
    • α/β class: proteins with mixtures of α-helices and β-strands
    • α + β class: those where α-helices and β-strands are largely segregated
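
The filtering steps above can be sketched roughly as follows. This is only an illustration, not the procedure actually used to build the dataset: it assumes that records arrive as (identifier, classification string, sequence) tuples, that the four major classes carry the SCOPe letters a, b, c and d, that "unknown amino acids" means any character outside the 20 standard one-letter codes, and that family size is counted after the unknown-residue filter. The function name and the fixed random seed are hypothetical.

```python
import random
from collections import defaultdict

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MAJOR_CLASSES = {"a", "b", "c", "d"}  # alpha, beta, alpha/beta, alpha+beta


def prepare(records, min_family_size=5, per_family=2, seed=0):
    """Return {family: [chosen sequence ids]} after applying the filters.

    records: iterable of (seq_id, sccs, sequence), with sccs like "a.1.1.2".
    """
    families = defaultdict(list)
    for seq_id, sccs, sequence in records:
        if sccs.split(".")[0] not in MAJOR_CLASSES:
            continue  # keep only the four major classes
        if not set(sequence.upper()) <= STANDARD_RESIDUES:
            continue  # drop sequences containing unknown amino acids
        families[sccs].append(seq_id)

    rng = random.Random(seed)
    chosen = {}
    for family, members in families.items():
        if len(members) >= min_family_size:
            # two randomly chosen members per remaining family
            chosen[family] = sorted(rng.sample(members, per_family))
    return chosen
```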

Dataset file structure

The dataset consists of one directory containing 2128 FASTA files:

protein-high-ident/
├── d19hca_.fasta
├── d1a15b_.fasta
├── d1a17a_.fasta
├── d1a25a_.fasta
├── d1a3qa1.fasta
├── d1a44a_.fasta
├── d1a68a_.fasta
├── d1a7ge_.fasta
├── d1a7sa_.fasta
├── d1a8oa_.fasta
├── ...

Benchmark Protocol

The test evaluates the capacity of alignment-free distance measures to recognize SCOPe relationships (i.e., family, superfamily, fold, class).

Specifically, the benchmark procedure takes as input a user-supplied file containing the distances between all sequence pairs in the dataset. The distances between all protein pairs are then sorted from maximum to minimum similarity (i.e., from the closest to the farthest pair).

The comparative test procedure is based on a binary classification of each protein pair: 1 if the two proteins share the same group in the SCOPe database, 0 otherwise. A perfect metric would completely separate positive from negative relationships, i.e., the most similar pairs would always belong to the same group, and the binary classification obtained after sorting by distance would be the vector (1, ... ,1,1,0,0, ... ,0). Since the group can be defined at any one of the four levels of the database (family, superfamily, fold, class), each protein pair is associated with four binary classifications (one for each level). At each SCOPe level, a ROC curve and its AUC value are computed to summarize in a single number the accuracy of the metric at that level, according to the SCOPe classification scheme. The overall assessment of method accuracy is the average of the AUC values across the 4 SCOPe levels.
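
To make the scoring concrete, here is a minimal sketch of an AUC computed from distances and binary labels. It is not the server's implementation, which is not described here. It uses the rank-based formulation in which the AUC equals the probability that a randomly chosen positive pair has a smaller distance than a randomly chosen negative pair, with ties counted as one half. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def auc_from_distances(distances, labels):
    """AUC for one SCOPe level; labels are 1 (same group) or 0 (different)."""
    positives = [d for d, y in zip(distances, labels) if y == 1]
    negatives = sorted(d for d, y in zip(distances, labels) if y == 0)
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")

    score = 0.0
    for p in positives:
        lo = bisect_left(negatives, p)   # negatives strictly smaller than p
        hi = bisect_right(negatives, p)  # negatives smaller than or equal to p
        # negatives farther than p count fully, ties count as one half
        score += (len(negatives) - hi) + 0.5 * (hi - lo)
    return score / (len(positives) * len(negatives))


def mean_auc(distances, labels_by_level):
    """Average the per-level AUC values (e.g. family, superfamily, fold, class)."""
    aucs = [auc_from_distances(distances, y) for y in labels_by_level.values()]
    return sum(aucs) / len(aucs)
```

With this definition, a perfect metric (all same-group pairs closer than all different-group pairs) scores 1.0 and a random one scores about 0.5.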

Testing your own method

1. Download the dataset file protein-high-ident.zip from the server and unzip it.
2. Use the unzipped FASTA files as input to your method and calculate either the distances between every pair of sequences or a phylogenetic tree (see "File formats supported in benchmark" below).
3. Save the results to a text file.
4. Submit your predictions to the web server.
The web service benchmarks the uploaded results and presents a report with the submitted method's performance and comparison to other available methods. Additionally, you can choose to make the report publicly available.
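
As an illustration of the workflow above, the following sketch computes a toy alignment-free distance (Euclidean distance between k-mer count profiles) for every pair of sequences. The measure is only a placeholder for your own method, and the function names are hypothetical. It also assumes that the sequence identifier is the first whitespace-delimited token of the FASTA header; check the downloaded files to confirm which identifier the server expects.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def parse_fasta(text):
    """Return (identifier, sequence) for the first record in FASTA text."""
    ident, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                break  # stop at the second record
            parts = line[1:].split()
            ident = parts[0] if parts else ""
        elif ident is not None:
            chunks.append(line)
    if ident is None:
        raise ValueError("no FASTA header found")
    return ident, "".join(chunks)


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def euclidean(a, b):
    """Euclidean distance between two k-mer count profiles."""
    return sqrt(sum((a[key] - b[key]) ** 2 for key in set(a) | set(b)))


def pairwise_distances(sequences, k=2):
    """Map each unordered pair of identifiers to its distance."""
    profiles = {name: kmer_counts(seq, k) for name, seq in sequences.items()}
    return {
        (a, b): euclidean(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }
```

In practice, each file in the unzipped directory would be read and passed through parse_fasta to build the dictionary of sequences before calling pairwise_distances.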

Download dataset

File name: protein-high-ident.zip

File size: 750.4 KB

MD5sum: 98447d367856559bde47e8aea9aae5bc
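
The checksum can be used to confirm that the download is intact. A minimal sketch using Python's standard hashlib module follows; the helper names are hypothetical, and the local file name is assumed to match the one listed above.

```python
import hashlib

EXPECTED_MD5 = "98447d367856559bde47e8aea9aae5bc"  # value listed above


def md5_of_chunks(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte strings."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def is_intact(path="protein-high-ident.zip"):
    """Return True if the file's MD5 digest matches the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```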

File formats supported in benchmark

The benchmark supports the following file formats:

Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance for that pair.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
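
A minimal sketch of writing and reading this three-column format is shown below. It assumes distances are held in a dictionary keyed by identifier pairs and printed with three decimals as in the example; the function names are hypothetical, and the number of decimals the server accepts is not specified here.

```python
def format_pairs(distances):
    """Render {(id_a, id_b): distance} as tab-separated lines."""
    lines = [f"{a}\t{b}\t{d:.3f}" for (a, b), d in distances.items()]
    return "\n".join(lines) + "\n"


def parse_pairs(text):
    """Parse tab-separated lines back into {(id_a, id_b): distance}."""
    distances = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        a, b, value = line.split("\t")
        distances[(a, b)] = float(value)
    return distances
```

Saving the result is then a matter of writing the string returned by format_pairs to a text file.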

Square distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
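
A square matrix like the one above could be produced with a sketch along these lines. It assumes the classic Phylip layout (a first line with the number of sequences, then one row per sequence with the name padded to 10 characters); whether the server requires this exact padding is not stated here, and the function name is hypothetical.

```python
def format_phylip_square(names, distances):
    """Render a full symmetric matrix from {(id_a, id_b): distance}."""

    def lookup(a, b):
        if a == b:
            return 0.0  # zero diagonal
        # pairs may be stored in either order
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    lines = [f"{len(names):>4}"]
    for a in names:
        row = " ".join(f"{lookup(a, b):.3f}" for b in names)
        lines.append(f"{a:<10}{row}")
    return "\n".join(lines) + "\n"
```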

Lower-triangle distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
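
The lower-triangle variant stores each pair only once: row i lists the distances to the i sequences that precede it, so the first row holds just a name. A minimal sketch follows, under the same assumptions as for the square layout (10-character name field, three decimals, hypothetical function name).

```python
def format_phylip_lower(names, distances):
    """Render a lower-triangle matrix from {(id_a, id_b): distance}."""

    def lookup(a, b):
        # pairs may be stored in either order
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    lines = [f"{len(names):>4}"]
    for i, a in enumerate(names):
        # only the sequences listed before the current one
        row = " ".join(f"{lookup(a, b):.3f}" for b in names[:i])
        lines.append(f"{a:<10}{row}".rstrip())
    return "\n".join(lines) + "\n"
```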