AF | Dataset: genreg-crm

Dataset description

The dataset pertains to cis-regulatory sequence modules (CRMs) that are known to regulate expression in the same tissue and/or development stage in fly or human. A CRM can be loosely defined as a contiguous non-coding sequence that contains multiple transcription factor binding sites and drives some aspect of a gene's expression profile.

The dataset was collected by Kantorovitz et al. (2007) to test how well alignment-free measures identify functional relationships between regulatory sequences (e.g., enhancers or promoters).

Dataset Structure

The dataset is a mixture of 7 subsets of CRM sequences, each taken from a different tissue of D. melanogaster or Homo sapiens. Each of the 7 subsets has 2n sequences, where the first n sequences are CRMs ("positive half") and the next n sequences are random non-coding sequences of matching lengths, chosen from the respective genome ("negative half").

#	Subset	n CRM seqs (positive)	n random seqs (negative)
1	fly_blastoderm	82	82
2	fly_eye	17	17
3	fly_pns	23	23
4	fly_tracheal_system	9	9
5	human_HBB	17	17
6	human_liver	9	9
7	human_muscle	28	28
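
As a quick sanity check, the subset sizes in the table can be summed in a few lines of Python. This is only an illustrative sketch; the names `subset_sizes`, `total_sequences` and `pairs_per_half` are not part of the benchmark. The total comes to 370 sequences, which matches the number of FASTA files in the dataset directory.

```python
# Number of CRM (positive) sequences per subset, copied from the table above.
subset_sizes = {
    "fly_blastoderm": 82,
    "fly_eye": 17,
    "fly_pns": 23,
    "fly_tracheal_system": 9,
    "human_HBB": 17,
    "human_liver": 9,
    "human_muscle": 28,
}

# Each subset holds n positive and n negative sequences, i.e. 2n in total.
total_sequences = sum(2 * n for n in subset_sizes.values())

# Number of sequence pairs inside one half of a subset: n choose 2.
pairs_per_half = {name: n * (n - 1) // 2 for name, n in subset_sizes.items()}
```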

Dataset file structure

The dataset consists of one directory containing 370 FASTA files:

crm/
├── FB.001.1.fasta
├── FB.002.1.fasta
├── FB.003.1.fasta
├── FB.004.1.fasta
├── FB.005.1.fasta
├── FB.006.1.fasta
├── FB.007.1.fasta
├── FB.008.1.fasta
├── FB.009.1.fasta
├── FB.010.1.fasta
├── ...

Benchmark Protocol

The test evaluates whether functionally related CRM sequence pairs (from the positive half) are scored better by a given alignment-free tool (i.e., have lower distance values) than unrelated pairs of sequences (from the negative half).

Specifically, the benchmark procedure takes as input the user's file containing the distances between all sequence pairs present in the dataset. The procedure first extracts from this file the sequence pairs within the "positive" and "negative" halves of each of the 7 subsets. For each subset, the pairs are ranked from lowest to highest distance, and the top half of the pairs (or the top 300, whichever is smaller) is examined. The number of "positive" pairs in this top portion is reported. The overall assessment of method accuracy is the weighted average of positive pairs across all 7 subsets.
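
The scoring of one subset can be sketched in Python as follows. This is a rough illustration rather than the server's actual implementation. It assumes that pairs are ranked by increasing distance (consistent with lower distances indicating related pairs) and that ties are resolved by the stable sort. The weighting scheme is not specified in this description, so `weighted_accuracy` assumes each subset is weighted by its number of examined pairs. Both function names are hypothetical.

```python
def score_subset(pos_distances, neg_distances, cap=300):
    """Count positive pairs among the lowest-distance pairs of one subset.

    pos_distances: distances of pairs taken within the positive half
    neg_distances: distances of pairs taken within the negative half
    Returns (number of positive pairs found, number of pairs examined).
    """
    labelled = [(d, True) for d in pos_distances]
    labelled += [(d, False) for d in neg_distances]
    # Smallest distance first; the sort is stable, so tied pairs keep input order.
    labelled.sort(key=lambda item: item[0])
    # Examine the top half of all pairs, or `cap` pairs, whichever is smaller.
    examined = min(len(labelled) // 2, cap)
    hits = sum(1 for _, is_positive in labelled[:examined] if is_positive)
    return hits, examined


def weighted_accuracy(per_subset):
    """Combine (hits, examined) tuples from all subsets.

    ASSUMPTION: weights are proportional to the number of examined pairs.
    """
    total_hits = sum(hits for hits, _ in per_subset)
    total_examined = sum(examined for _, examined in per_subset)
    return total_hits / total_examined if total_examined else 0.0
```

Under these assumptions a method that ranks every positive pair ahead of every negative pair scores 1.0, while a method that cannot tell the two apart is expected to score around 0.5.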

Testing your own method

Download the dataset file crm.zip from the server and unzip it.

Use the unzipped FASTA files as input to your method and calculate either the distances between every pair of sequences or a phylogenetic tree (see File Formats below).

Save the results to a text file.

Submit your predictions to the web server.

The web service benchmarks the uploaded results and presents a report with the submitted method's performance and comparison to other available methods. Additionally, you can choose to make the report publicly available.
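
For orientation, the distance-calculation step above might look like the Python sketch below. It uses a deliberately simple alignment-free measure (Euclidean distance between k-mer frequency vectors) purely as a stand-in for your own method; it is not a method recommended by the benchmark. The function names and the default k=3 are assumptions made for illustration.

```python
import math
from collections import Counter


def read_fasta(text):
    """Return {identifier: sequence} parsed from FASTA-formatted text."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def kmer_frequencies(sequence, k=3):
    """Relative frequency of every overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two k-mer frequency profiles."""
    kmers = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2 for x in kmers)
    )
```

Each of the unzipped FASTA files would be read with `read_fasta`, profiled with `kmer_frequencies`, and compared pairwise with `euclidean_distance` before the results are written out in one of the supported formats.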

Download dataset

File name: crm.zip

File size: 176.9 KB

MD5sum: d576c477b92343fed45fbc6bce2b5cac
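
The download can be checked against the MD5 sum listed above. The following Python sketch assumes the archive was saved as crm.zip in the current directory; the helper names `md5_of_file` and `is_intact` are illustrative.

```python
import hashlib

# MD5 sum of crm.zip as listed on this page.
EXPECTED_MD5 = "d576c477b92343fed45fbc6bce2b5cac"


def md5_of_file(path, chunk_size=65536):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path="crm.zip"):
    """Return True if the file's MD5 digest matches the listed value."""
    return md5_of_file(path) == EXPECTED_MD5
```

If `is_intact()` returns False, the download is probably incomplete or corrupted and should be repeated.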

File formats supported in benchmark

The benchmark supports the following file formats:

Tab-Separated Value Format (TSV)

A simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value of this comparison.

Example of Text File Format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
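
A file in this layout could be produced with Python's csv module, as in the minimal sketch below. The function name `pairwise_tsv` and the three-decimal formatting are assumptions; `distance` stands for whatever function your method provides for a pair of sequence identifiers.

```python
import csv
import io
from itertools import combinations


def pairwise_tsv(ids, distance):
    """Return TSV text with one line per unordered pair of identifiers.

    ids: list of sequence identifiers
    distance: callable taking two identifiers and returning a number
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for id_a, id_b in combinations(ids, 2):
        writer.writerow([id_a, id_b, f"{distance(id_a, id_b):.3f}"])
    return buffer.getvalue()
```

The returned string can then be written to a text file for submission, for example with `open("distances.tsv", "w").write(...)`, where the file name is arbitrary.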

Phylip Distance Matrix

Square distance matrix in Phylip format

Example of Phylip distance matrix (for 4 sequences)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
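
A square matrix like the one above could be written with a short Python helper. This sketch assumes a left-justified name field of 10 characters (the traditional Phylip convention) and three decimal places; the server's exact whitespace requirements are not stated here, and the function name `phylip_square` is hypothetical.

```python
def phylip_square(ids, matrix):
    """Format a full symmetric distance matrix as Phylip text.

    ids: list of sequence identifiers
    matrix: list of rows, where matrix[i][j] is the distance between
            ids[i] and ids[j]
    """
    # First line: number of sequences, right-aligned as in the example above.
    lines = [f"{len(ids):>4}"]
    for name, row in zip(ids, matrix):
        values = " ".join(f"{value:.3f}" for value in row)
        # Pad the name, then always keep one separating space before the values.
        lines.append(f"{name:<9} {values}")
    return "\n".join(lines) + "\n"
```

Identifiers longer than 9 characters push the values to the right instead of being truncated, which strict Phylip parsers may not accept.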

Lower-triangle distance matrix in Phylip format