The dataset comprises 1066 protein sequences belonging to 533 protein families. The sequences were collected from the SCOPe database, which consists of Protein Data Bank (PDB) entries and provides a hierarchical classification of proteins at four structural levels.
# | Structural Level | Short description |
---|---|---|
1 | family (fa) | Clear evolutionary relationship |
2 | superfamily (sf) | Most probable common evolutionary origin |
3 | fold (cf) | Major structural similarity |
4 | class (cl) | Overall structural similarity |
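To illustrate how the four levels nest, SCOPe labels each domain with a concise classification string (sccs) such as `a.4.5.1`, whose dot-separated fields name the class, fold, superfamily and family in that order. The sketch below is a minimal illustration only: the helper names `scope_levels` and `shared_levels` and the example strings are assumptions, not part of the dataset or the benchmark.

```python
def scope_levels(sccs):
    """Split an sccs string such as 'a.4.5.1' into one label per SCOPe level.

    Each label is cumulative, so two domains share a level only if all
    higher levels match as well.
    """
    parts = sccs.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {sccs!r}")
    return {
        "class": parts[0],
        "fold": ".".join(parts[:2]),
        "superfamily": ".".join(parts[:3]),
        "family": sccs,
    }


def shared_levels(sccs_a, sccs_b):
    """Return 1/0 per level, telling whether two domains fall in the same group."""
    a, b = scope_levels(sccs_a), scope_levels(sccs_b)
    return {level: int(a[level] == b[level]) for level in a}
```

For example, two domains labelled `a.4.5.1` and `a.4.5.2` share class, fold and superfamily but not family.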
The reference dataset was constructed from a section of the SCOPe database, ASTRAL SCOPe 2.07. Sequences with low identity were retrieved from the ASTRAL40 file, which includes all SCOPe sequences that share less than 40% identity with each other.
The ASTRAL dataset was subsequently trimmed to:
The dataset has 1 directory containing 1066 FASTA files:

    protein-low-ident/
    ├── d1a0aa_.fasta
    ├── d1a0ia1.fasta
    ├── d1a0pa2.fasta
    ├── d1a1va1.fasta
    ├── d1a7ja_.fasta
    ├── d1a7ta_.fasta
    ├── d1a9xb2.fasta
    ├── d1ae9a_.fasta
    ├── d1akoa_.fasta
    ├── d1alua_.fasta
    ├── ...
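A minimal sketch of loading the directory into memory, keyed by the file name without its extension (for example `d1a0aa_`). It assumes each file holds a single FASTA record, which is not stated above; the function names `read_fasta` and `load_dataset` are illustrative.

```python
from pathlib import Path


def read_fasta(path):
    """Return (header, sequence) for the first record of a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # assumption: one record per file, ignore the rest
                header = line[1:]
            elif header is not None:
                chunks.append(line)
    return header, "".join(chunks)


def load_dataset(directory):
    """Map each file stem (e.g. 'd1a0aa_') to its sequence string."""
    files = sorted(Path(directory).glob("*.fasta"))
    return {path.stem: read_fasta(path)[1] for path in files}
```

Calling `load_dataset("protein-low-ident")` on the unpacked archive should then give one entry per FASTA file.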
The test evaluates the capacity of alignment-free distance measures to recognize SCOPe relationships (i.e., family, superfamily, fold, class).
Specifically, the benchmark procedure takes as input a user's file containing the distances between all sequence pairs in the dataset. The distances between all protein pairs are then sorted from maximum to minimum similarity (i.e., from the closest to the farthest pair).
The comparative test procedure is based on a binary classification of each protein pair: 1 if the two proteins share the same group in the SCOPe database, 0 otherwise. A perfect metric would completely separate positive from negative relationships; that is, the most similar pairs would always belong to the same group, and the binary classification obtained after sorting by distance would be the vector (1, ..., 1, 1, 0, 0, ..., 0). Since the group can be defined at any of the four levels of the database (family, superfamily, fold, class), each protein pair is associated with four binary classifications, one for each level. At each SCOPe level, ROC curves and AUC values are computed to give a single number summarizing the accuracy of each metric at that level, according to the SCOPe classification scheme. The overall assessment of method accuracy is the average of the AUC values across the four SCOPe levels.
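The scoring idea can be sketched as follows. This is not the benchmark's own code: it computes the AUC with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve, and the names `auc_from_distances` and `mean_auc` are illustrative. Labels are assumed to be 1 for a same-group pair and 0 otherwise, in the same order as the distances.

```python
def auc_from_distances(distances, labels):
    """AUC for pair labels when a smaller distance should mean 'same group'.

    Equals the probability that a random positive pair has a smaller
    distance than a random negative pair, with ties counted as 0.5.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0.0] * len(distances)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and distances[order[j + 1]] == distances[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative pair")
    rank_sum_neg = sum(r for r, y in zip(ranks, labels) if y == 0)
    return (rank_sum_neg - n_neg * (n_neg + 1) / 2) / (n_pos * n_neg)


def mean_auc(distances, labels_by_level):
    """Return per-level AUC values and their average over all given levels."""
    aucs = {
        level: auc_from_distances(distances, labels)
        for level, labels in labels_by_level.items()
    }
    return aucs, sum(aucs.values()) / len(aucs)
```

A perfectly separating measure gives an AUC of 1.0 at a level, while a measure that assigns the same distance to every pair gives 0.5.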
File name: protein-low-ident.zip
File size: 371.8 KB
MD5sum: 77d7cc30d0e0651e4a25cdd9770f1c84
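A downloaded archive can be checked against the MD5 sum listed above. The sketch below uses Python's standard `hashlib`; the function names and the local path passed to them are assumptions.

```python
import hashlib

# MD5 sum listed for protein-low-ident.zip
EXPECTED_MD5 = "77d7cc30d0e0651e4a25cdd9770f1c84"


def md5sum(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5sum(path) == expected
```

For instance, `verify_download("protein-low-ident.zip")` returns `False` if the file was truncated or altered during download.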
The benchmark accepts any one of the following file formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value for that comparison.
Example of Text File Format (4 sequences)

    A	B	8.876
    A	C	6.120
    A	D	4.321
    B	C	5.231
    B	D	3.983
    C	D	0.663
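A minimal sketch of producing and reading back this three-column format. The `distance` argument is a hypothetical callable standing in for whatever alignment-free measure is being tested, and writing three decimal places simply mirrors the example rather than a stated requirement.

```python
from itertools import combinations


def write_tsv(path, ids, distance):
    """Write one 'id1<TAB>id2<TAB>distance' line per unordered sequence pair."""
    with open(path, "w") as out:
        for a, b in combinations(ids, 2):
            out.write(f"{a}\t{b}\t{distance(a, b):.3f}\n")


def read_tsv(path):
    """Read the three-column file back into a {(id1, id2): distance} dict."""
    pairs = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            a, b, value = line.split("\t")
            pairs[(a, b)] = float(value)
    return pairs
```

With the 1066 sequences of this dataset, such a file would contain 1066 × 1065 / 2 = 567,645 lines.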
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)

    4
    A 0.000 8.876 6.120 9.321
    B 8.876 0.000 2.231 3.983
    C 6.120 2.231 0.000 0.663
    D 9.321 3.983 0.663 0.000
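A minimal sketch of writing a square matrix in this layout: the number of sequences on the first line, then one row per sequence with its identifier followed by its distances. Separating fields with single spaces follows the example; whether the benchmark also requires strict fixed-width Phylip names is not stated here, so treat that as an assumption. The function name `write_phylip_square` is illustrative.

```python
def write_phylip_square(path, ids, matrix):
    """Write a full (square) distance matrix; matrix[i][j] is dist(ids[i], ids[j])."""
    if len(matrix) != len(ids) or any(len(row) != len(ids) for row in matrix):
        raise ValueError("matrix must be square and match the number of ids")
    with open(path, "w") as out:
        out.write(f"{len(ids)}\n")
        for name, row in zip(ids, matrix):
            values = " ".join(f"{value:.3f}" for value in row)
            out.write(f"{name} {values}\n")
```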
Lower-triangle distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)

    4
    A
    B 8.876
    C 6.120 2.231
    D 9.321 3.983 0.663
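In the lower-triangle layout, row k lists only the distances to the k-1 sequences above it. The sketch below reads such a file into pairwise distances under that reading of the example; the function name `read_phylip_lower` is illustrative and whitespace-separated fields are assumed.

```python
def read_phylip_lower(path):
    """Read a lower-triangle Phylip matrix into (ids, {(earlier, later): distance})."""
    with open(path) as handle:
        n = int(handle.readline().split()[0])
        ids, pairs = [], {}
        for _ in range(n):
            fields = handle.readline().split()
            if not fields:
                raise ValueError("file ended before all rows were read")
            name, values = fields[0], [float(v) for v in fields[1:]]
            if len(values) != len(ids):
                raise ValueError(f"row {name!r} should hold {len(ids)} distances")
            # The k-th value on a row is the distance to the k-th earlier sequence.
            for other, value in zip(ids, values):
                pairs[(other, name)] = value
            ids.append(name)
    return ids, pairs
```

Applied to the four-sequence example, this yields six pairs, including `("A", "B")` at 8.876 and `("C", "D")` at 0.663.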