The dataset pertains to cis-regulatory sequence modules (CRMs) that are known to regulate expression in the same tissue and/or developmental stage in fly or human. A CRM can be loosely defined as a contiguous non-coding sequence that contains multiple transcription factor binding sites and drives some aspect of a gene's expression profile.
The dataset was collected by Kantorovitz et al. (2007) to test how well alignment-free measures identify functional relationships between regulatory sequences (e.g. enhancers or promoters).
# | Subset | No. of CRM sequences (positive) | No. of random sequences (negative) |
---|---|---|---|
1 | fly_blastoderm | 82 | 82 |
2 | fly_eye | 17 | 17 |
3 | fly_pns | 23 | 23 |
4 | fly_tracheal_system | 9 | 9 |
5 | human_HBB | 17 | 17 |
6 | human_liver | 9 | 9 |
7 | human_muscle | 28 | 28 |
The dataset consists of one directory containing 370 FASTA files:
    crm/
    ├── FB.001.1.fasta
    ├── FB.002.1.fasta
    ├── FB.003.1.fasta
    ├── FB.004.1.fasta
    ├── FB.005.1.fasta
    ├── FB.006.1.fasta
    ├── FB.007.1.fasta
    ├── FB.008.1.fasta
    ├── FB.009.1.fasta
    ├── FB.010.1.fasta
    ├── ...
The test evaluates whether functionally related CRM sequence pairs (from the positive half) are scored better by a given alignment-free tool (i.e., have lower distance values) than pairs of unrelated sequences (from the negative half).
Specifically, the benchmark procedure takes as input the user's file containing the distances between all sequence pairs present in the dataset. The procedure starts by extracting from this file the sequence pairs within the "positive" and "negative" halves of each of the 7 subsets. For each subset, the pairs are ranked from the lowest distance upward and the top half (or the top 300 pairs, whichever is smaller) is examined. The number of "positive" pairs in this top half is reported. The overall assessment of method accuracy is the weighted average of positive pairs across all 7 subsets.
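The procedure described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the `frozenset`-keyed distance dictionary, the handling of tied distances, and the choice to weight each subset by its number of examined pairs are all assumptions, since the text does not specify them.

```python
from itertools import combinations


def subset_score(positive_ids, negative_ids, distance, cap=300):
    """Count positive pairs among the best-ranked pairs of one subset.

    `distance` maps frozenset({id_a, id_b}) to a float (lower = more similar).
    Returns (n_positive_in_top, n_examined).
    """
    # Pairs are formed only within the positive half and within the negative half.
    labelled = [(distance[frozenset(p)], True) for p in combinations(positive_ids, 2)]
    labelled += [(distance[frozenset(p)], False) for p in combinations(negative_ids, 2)]

    # Rank by increasing distance; tie order is left to the stable sort (assumption).
    labelled.sort(key=lambda item: item[0])

    # Examine the top half of the pairs, or `cap` pairs, whichever is smaller.
    n_examined = min(len(labelled) // 2, cap)
    n_positive = sum(is_positive for _, is_positive in labelled[:n_examined])
    return n_positive, n_examined


def overall_score(subsets, distance, cap=300):
    """Weighted average of the per-subset fractions of positive pairs.

    ASSUMPTION: each subset is weighted by its number of examined pairs,
    which reduces to total positives divided by total examined pairs.
    """
    results = [subset_score(pos, neg, distance, cap) for pos, neg in subsets]
    total_examined = sum(n for _, n in results)
    return sum(k for k, _ in results) / total_examined


# Toy subset (hypothetical identifiers and distances, not taken from the dataset).
toy_distance = {
    frozenset(("p1", "p2")): 0.1,
    frozenset(("p1", "p3")): 0.2,
    frozenset(("p2", "p3")): 0.9,
    frozenset(("n1", "n2")): 0.3,
    frozenset(("n1", "n3")): 0.5,
    frozenset(("n2", "n3")): 0.8,
}
toy_subsets = [(["p1", "p2", "p3"], ["n1", "n2", "n3"])]
toy_result = overall_score(toy_subsets, toy_distance)
```

In the toy subset there are six pairs, so the top three are examined; two of them are positive pairs, giving a score of 2/3.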
The benchmark accepts a file in any one of the following formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance for that comparison.
Example of Text File Format (4 sequences)
    A	B	8.876
    A	C	6.120
    A	D	4.321
    B	C	5.231
    B	D	3.983
    C	D	0.663
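A file in this three-column format could be read with Python's standard `csv` module, as in the sketch below. The function name `read_pairwise_tsv` and the choice to key the result by an unordered pair (`frozenset`) are illustrative assumptions, not part of the benchmark.

```python
import csv
import io


def read_pairwise_tsv(handle):
    """Parse tab-separated 'id_a, id_b, distance' rows into a dictionary.

    Keys are frozensets so that (A, B) and (B, A) refer to the same pair.
    """
    distances = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        id_a, id_b, value = row
        distances[frozenset((id_a, id_b))] = float(value)
    return distances


# The 4-sequence example from the text, held in memory instead of a file.
example_tsv = (
    "A\tB\t8.876\n"
    "A\tC\t6.120\n"
    "A\tD\t4.321\n"
    "B\tC\t5.231\n"
    "B\tD\t3.983\n"
    "C\tD\t0.663\n"
)
pair_distances = read_pairwise_tsv(io.StringIO(example_tsv))
```

For a file on disk, the same function can be given an open file handle in place of the `io.StringIO` object.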
Square distance matrix in Phylip format
Example of Phylip distance matrix (for 4 sequences)
    4
    A  0.000  8.876  6.120  9.321
    B  8.876  0.000  2.231  3.983
    C  6.120  2.231  0.000  0.663
    D  9.321  3.983  0.663  0.000
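A square matrix laid out like the example above could be parsed as sketched below. The sketch assumes whitespace-separated fields, one matrix row per line, and identifiers without embedded spaces; stricter Phylip variants (for instance fixed-width names) are not handled. The function name `read_phylip_square` is hypothetical.

```python
def read_phylip_square(text):
    """Parse a square Phylip-style distance matrix.

    Returns (ids, rows), where rows[i][j] is the distance between ids[i] and ids[j].
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of sequences

    ids, rows = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        ids.append(fields[0])
        rows.append([float(x) for x in fields[1:]])

    # Every row of a square matrix must hold exactly n distances.
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError("matrix is not square or does not match the declared size")
    return ids, rows


example_square = """4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
"""
square_ids, square_rows = read_phylip_square(example_square)
```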
Lower-triangle distance matrix in Phylip format
Example of lower-triangle Phylip distance matrix (for 4 sequences)
    4
    A
    B  8.876
    C  6.120  2.231
    D  9.321  3.983  0.663
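In the lower-triangle layout, row i lists only the distances to the i sequences that precede it, so the full symmetric matrix can be rebuilt by mirroring each value across the diagonal. The sketch below assumes whitespace-separated fields, one row per line, and a zero diagonal; the function name `read_phylip_lower` is hypothetical.

```python
def read_phylip_lower(text):
    """Parse a lower-triangle Phylip-style matrix into a full symmetric matrix.

    Returns (ids, matrix), where matrix[i][j] == matrix[j][i].
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of sequences
    if len(lines) - 1 < n:
        raise ValueError("fewer rows than the declared number of sequences")

    ids = []
    matrix = [[0.0] * n for _ in range(n)]  # diagonal assumed to be zero
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split()
        ids.append(fields[0])
        values = [float(x) for x in fields[1:]]
        # Row i carries distances to the i earlier sequences only.
        if len(values) != i:
            raise ValueError(f"row {i} should hold {i} distances, found {len(values)}")
        for j, value in enumerate(values):
            matrix[i][j] = value
            matrix[j][i] = value  # mirror across the diagonal
    return ids, matrix


example_lower = """4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
"""
lower_ids, lower_matrix = read_phylip_lower(example_lower)
```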