Using PromoSer

Introduction
Researchers studying the mechanisms of transcription regulation are commonly interested in the proximal promoter regions of genes. Often one wants to obtain these regions for a large number of related genes. Methods for computational prediction of promoters and genes in the entire genome also rely heavily on predetermined data sets used to train their models. This requires collecting a large set of highly accurate promoter sequences.

PromoSer is a web-based service aimed specifically at the extraction of a large number of promoter sequences from mammalian genomes. To identify the transcription start site (TSS) of a gene, we map all available mRNA and EST sequence data onto the genome and track the overlapping alignments (denoted as a cluster) to determine the furthest possible extension to these sequences and hence determine the TSS. In many cases, our data set is enriched with full-length mRNA sequences produced by cap-trapping and oligo capping methods, providing higher confidence in our predictions.

PromoSer is easy to use. Provide a list of GenBank accession IDs to identify the genes of interest and enter the required range flanking the TSS. PromoSer processes the input and returns the requested regions as a multi-FASTA text file.

Please use the following citation for PromoSer:

The current release of PromoSer is 3.0 and is based on the following genomic releases:

PromoSer has analyzed all relevant GenBank accessions up to Feb 4, 2004. For a summary of the contents and overall performance of PromoSer, see this page.

Output

PromoSer reports the number of accessions for which a promoter could be identified. This may differ from the number of accessions requested. The output report explains why certain promoters could not be returned. Possible reasons include: a quality threshold set too high, which excludes lower-quality predictions; an accession number that PromoSer did not analyze (probably because it is newer than the current release or belongs to an unsupported organism); or an accession that PromoSer analyzed but either excluded (e.g., synthetic probes) or could not assign confidently to any cluster.

PromoSer's output consists of a summary report displayed in the browser, and a FASTA file that is downloadable by clicking the displayed result ID (that ID can be used to retrieve the results again later from the main screen, within 24 hours). The summary report contains the software version, the genomic releases used to construct the promoters, the options chosen and a table with the following fields:

The FASTA file contains the actual promoters. Each promoter is annotated with a header line that gives the accession number, the sequential ID, the coordinates of the sequence relative to the TSS (the TSS is at position +1 and there is no position 0), the coordinate of the TSS on the chromosome (always referring to the + strand), the organism, the chromosome, the start and end coordinates of the extracted promoter (always in terms of the + strand and starting from 0) and the strand on which the gene lies. A * and/or ! flag may follow if applicable (see "Excluding some results"). For example:

>NM_000399|0|Promoter: -1000 to 50|TSS: 63920729|Region: Human chr10:63920680-63921730 (-)|
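The header layout above can be read back programmatically. The following is a minimal sketch, not part of PromoSer: the function name `parse_promoter_header` and the field names are hypothetical, and it assumes that any * or ! flags appear after the final | character, which the description above does not state explicitly.

```python
import re
from typing import NamedTuple


class PromoterHeader(NamedTuple):
    accession: str
    seq_id: int
    rel_start: int   # first position relative to the TSS (e.g. -1000)
    rel_end: int     # last position relative to the TSS (e.g. 50)
    tss: int         # TSS coordinate on the + strand
    organism: str
    chromosome: str
    start: int       # region start on the + strand, counted from 0
    end: int         # region end on the + strand
    strand: str
    flags: str       # "", "*", "!" or a combination


# Assumption: flags (if any) follow the last "|" of the header.
HEADER_RE = re.compile(
    r"^>(?P<accession>[^|]+)\|(?P<seq_id>\d+)\|"
    r"Promoter: (?P<rel_start>-?\d+) to (?P<rel_end>-?\d+)\|"
    r"TSS: (?P<tss>\d+)\|"
    r"Region: (?P<organism>\S+) (?P<chromosome>chr\w+):(?P<start>\d+)-(?P<end>\d+) "
    r"\((?P<strand>[+-])\)\|(?P<flags>[*!]*)\s*$"
)

INT_FIELDS = ("seq_id", "rel_start", "rel_end", "tss", "start", "end")


def parse_promoter_header(line: str) -> PromoterHeader:
    """Split one PromoSer FASTA header line into typed fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    fields = match.groupdict()
    for name in INT_FIELDS:
        fields[name] = int(fields[name])
    return PromoterHeader(**fields)


example = (">NM_000399|0|Promoter: -1000 to 50|TSS: 63920729|"
           "Region: Human chr10:63920680-63921730 (-)|")
header = parse_promoter_header(example)
print(header.accession, header.chromosome, header.end - header.start)
```

For the example header, the region spans 1050 bases, which matches 1000 upstream bases plus 50 bases starting at the TSS.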

List of GenBank accession IDs
Enter a list of the accession IDs for which you would like to look for promoters. Separate the IDs with commas or enter each on a separate line. Please do not include the version number in the accession (i.e., use NM_091834 and not NM_091834.2). We ask you to omit the version number to emphasize that PromoSer attempts to localize transcriptional units rather than specific sequences. If you do include a version number, it is ignored and the latest version of the sequence is always used. Please limit your query to 2000 accessions per request.

Note: PromoSer currently supports the following organisms only: Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat). mRNA sequences from other organisms are not considered by PromoSer (this is particularly relevant for Rattus rattus sequences). Accession IDs for genomic gene records are recognized by PromoSer and can be searched for. Such records sometimes contain several genes and may end up being associated with several clusters, some of which might not be the gene given in the record's annotation.

Tip: If you are using for example the NCBI Entrez service to select a certain set of sequences you may copy the NCBI summary listing, e.g.

1: NM_025221 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 1, mRNA
gi|28373063|ref|NM_025221.4|[28373063]


2: NM_147183 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 4, mRNA
gi|28373062|ref|NM_147183.2|[28373062]


3: NM_147182 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 3, mRNA
gi|28373061|ref|NM_147182.2|[28373061]


4: NM_147181 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 2, mRNA
gi|28373060|ref|NM_147181.2|[28373060]
and paste it directly into PromoSer. As long as the accessions are clearly distinct and have no version, they will be identified and retrieved.
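If you prefer to clean such a listing yourself before submitting it, the following is a minimal sketch. The function name `extract_accessions` is hypothetical, and the regular expression covers only a simplified set of accession shapes (RefSeq-style `NM_025221`, two letters plus six digits, one letter plus five digits); treat that pattern as an assumption rather than a complete definition of GenBank accessions.

```python
import re

# Assumption: simplified accession shapes; an optional ".N" version is dropped.
ACCESSION_RE = re.compile(
    r"\b([A-Z]{2}_\d{6,9}|[A-Z]{2}\d{6}|[A-Z]\d{5})(?:\.\d+)?\b"
)

MAX_ACCESSIONS = 2000  # per-request limit stated above


def extract_accessions(text: str) -> list[str]:
    """Return unique, version-free accession IDs in order of first appearance."""
    seen = set()
    result = []
    for accession in ACCESSION_RE.findall(text):
        if accession not in seen:
            seen.add(accession)
            result.append(accession)
    if len(result) > MAX_ACCESSIONS:
        raise ValueError(f"{len(result)} accessions exceed the limit of {MAX_ACCESSIONS}")
    return result


listing = """1: NM_025221 Links
gi|28373063|ref|NM_025221.4|[28373063]
2: NM_147183 Links
gi|28373062|ref|NM_147183.2|[28373062]
"""
print(",".join(extract_accessions(listing)))
```

The comma-separated output can be pasted into the accession list field.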

List of FASTA records
Note: Using sequence searches to locate promoters is not the preferred way to use PromoSer. If you use this feature please be patient as sequence searching can be slow (sometimes very slow).

PromoSer can accept a multi-FASTA format list of sequences as input. Those sequences are mapped onto the genome and matched to clusters that overlap their alignments by 95% of their length or more. This is a best-effort guess and is not as rigorous as the clustering process PromoSer uses for the fully processed accessions. In particular, if a sequence is much longer than any of the homologous sequences already processed by PromoSer, it will be associated either with the wrong cluster or with no cluster at all.

The header line of FASTA sequences must be formatted in a special way:

  • The organism to which the sequence belongs must be the first part of the annotation. Common names (Human, Mouse, Rat), Latin names and their abbreviations (hs for Homo sapiens, mm for Mus musculus, rn for Rattus norvegicus) are all recognized as valid. If the organism for a sequence cannot be determined, or if the header is missing entirely, the sequence is assumed to be human.
  • The identifier must consist only of alphanumeric characters and the underscore. All other characters will be stripped off.
  • Only the first 20 characters of an identifier are considered (not counting the organism's name).
  • If multiple sequences are given, their identifiers must be distinct.

    For example, all of the following header lines are valid:

    >Homo sapiens early growth res 2
    acacac...
    >mm seq1
    gatata....
    >rat
    ctaaag....
    

    There is a limit of 25 sequences per request and 50,000 bases of total sequence. An individual sequence is limited to be 25,000 bases long.
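To check a multi-FASTA input against these rules before submitting it, something like the sketch below could be used. It is an illustration only: the function names are hypothetical, organism matching is assumed to be case-insensitive (the examples above use both "Homo sapiens" and "rat"), and when no organism is recognized the whole header is treated as the identifier.

```python
import re

ORGANISM_ALIASES = {
    "homo sapiens": "hs", "human": "hs", "hs": "hs",
    "mus musculus": "mm", "mouse": "mm", "mm": "mm",
    "rattus norvegicus": "rn", "rat": "rn", "rn": "rn",
}
MAX_SEQUENCES = 25
MAX_TOTAL_BASES = 50_000
MAX_SEQUENCE_BASES = 25_000


def parse_fasta_header(header: str) -> tuple[str, str]:
    """Return (organism abbreviation, cleaned identifier) for one header line."""
    text = header.lstrip(">").strip()
    lowered = text.lower()
    organism, rest = "hs", text  # default: assumed human
    for alias in sorted(ORGANISM_ALIASES, key=len, reverse=True):
        boundary = lowered[len(alias):len(alias) + 1]
        if lowered.startswith(alias) and boundary in ("", " ", "\t"):
            organism, rest = ORGANISM_ALIASES[alias], text[len(alias):]
            break
    # Keep only alphanumerics and underscore, then the first 20 characters.
    identifier = re.sub(r"[^A-Za-z0-9_]", "", rest)[:20]
    return organism, identifier


def check_fasta_request(fasta_text: str) -> list[tuple[str, str, int]]:
    """Validate the request limits; return (organism, identifier, length) per record."""
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([*parse_fasta_header(line), 0])
        else:
            if not records:  # sequence with no header at all
                records.append(["hs", "", 0])
            records[-1][2] += len(line)
    if len(records) > MAX_SEQUENCES:
        raise ValueError("more than 25 sequences in one request")
    identifiers = [record[1] for record in records]
    if len(set(identifiers)) != len(identifiers):
        raise ValueError("identifiers must be distinct")
    if any(record[2] > MAX_SEQUENCE_BASES for record in records):
        raise ValueError("a sequence exceeds 25,000 bases")
    if sum(record[2] for record in records) > MAX_TOTAL_BASES:
        raise ValueError("total sequence exceeds 50,000 bases")
    return [tuple(record) for record in records]
```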

    List of genomic loci
    As a high throughput service, PromoSer can be used to extract user specified regions directly from the genome. In this case, no promoter related processing is performed and PromoSer simply acts as a sequence server. There is a limit of 2000 requests and 2Mb of total sequence. The requests must be formatted in a special way. For example:

    hs chr5 4000 5000 - My sequence 1
    mm chrY 200 4000 +
    

    The list contains one request per line. Each request has 5 required columns and a sixth optional one, separated by spaces or tabs. The columns are: organism abbreviation (same as used in FASTA requests), chromosome (as chrXX, e.g. chr1, chrY, chr20 ...), start position for extraction (first position is 0), end position for extraction (end itself is not included), strand (for - strands the coordinates are in terms of the + strand but the results are reverse complemented) and an optional annotation column (only alphanumeric characters and the underscore are allowed). You can check the sizes of the chromosomes used in the current builds in this page.
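The coordinate convention described above (0-based start, end not included, reverse complement for the - strand) can be illustrated with a short sketch. The function names `parse_locus_request` and `extract_region` are hypothetical, and the chromosome is represented here by a plain string rather than a real genome build.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_locus_request(line: str) -> tuple[str, str, int, int, str, str]:
    """Split one request line into its five required columns and optional annotation."""
    fields = line.strip().split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns: {line!r}")
    organism, chromosome, start, end, strand = fields[:5]
    annotation = fields[5] if len(fields) == 6 else ""
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be + or -: {strand!r}")
    return organism, chromosome, int(start), int(end), strand, annotation


def extract_region(chromosome_seq: str, start: int, end: int, strand: str) -> str:
    """Return bases start..end-1 (+ strand coordinates, 0-based), reverse
    complemented when the request is for the - strand."""
    region = chromosome_seq[start:end]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region


print(parse_locus_request("hs chr5 4000 5000 - My sequence 1"))
print(extract_region("ACGTTGCA", 1, 4, "+"))  # CGT
print(extract_region("ACGTTGCA", 1, 4, "-"))  # ACG
```

With this convention, the length of the returned sequence is always end minus start, regardless of strand.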

    Repeat elements and tandem repeats can be masked in the returned sequence by choosing the appropriate option on the main page.

    Upstream region
    Enter the number of bases required upstream down to but not including the TSS. Allowed range is 1..10000 bases.

    Downstream region
    Enter the number of bases required starting from the TSS and downstream. Allowed range is 0..1000 bases.
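As a rough illustration of how the upstream and downstream values translate into genomic coordinates, consider the sketch below. It assumes that the TSS coordinate is 0-based on the + strand and that the reported region is half-open (end not included); these assumptions are not stated for promoter output, but they reproduce the example header in the Output section (TSS 63920729, - strand, 1000 upstream, 50 downstream, region chr10:63920680-63921730). The function names are hypothetical.

```python
def promoter_region(tss: int, strand: str, upstream: int, downstream: int) -> tuple[int, int]:
    """Return (start, end) on the + strand for the requested flanks.

    Assumes 0-based, half-open coordinates. The downstream count includes
    the TSS itself; the upstream count does not.
    """
    if not 1 <= upstream <= 10000:
        raise ValueError("upstream must be in 1..10000")
    if not 0 <= downstream <= 1000:
        raise ValueError("downstream must be in 0..1000")
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be + or -")
    return max(start, 0), end


def relative_label(offset: int) -> int:
    """Convert an offset from the TSS (0 = the TSS) into the TSS-relative
    numbering, where the TSS is +1 and there is no position 0."""
    return offset + 1 if offset >= 0 else offset


print(promoter_region(63920729, "-", 1000, 50))    # (63920680, 63921730)
print(relative_label(-1000), relative_label(49))   # -1000 50
```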

    TSS quality and support
    A TSS is identified when the 5' ends of a number of sequences map to the same genomic position (within a window of 20 bp) with a high-quality alignment score. The number of those sequences is the support for the TSS. If the cluster does not contain any high-scoring alignments, the 5'-most position of the entire cluster is taken as the TSS and assigned quality 0 and support 0. The TSS quality score indicates the composition of the sequences that support this TSS, as follows:

    Note that the support score gives the number of EST sequences for quality level 1 and the number of non-EST supporting sequences for higher levels.
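The idea of counting support within a 20 bp window can be illustrated with a small sketch. This is not PromoSer's actual procedure: the greedy grouping, the function name `group_five_prime_ends` and the use of the 5'-most position in each group as the candidate TSS are all assumptions, and positions are taken to be on the + strand so that smaller values lie further 5'.

```python
def group_five_prime_ends(positions: list[int], window: int = 20) -> list[tuple[int, int]]:
    """Group 5'-end positions lying within `window` bases of the 5'-most
    member of a group; return (candidate TSS position, support) pairs."""
    groups = []  # each entry: [anchor position, number of supporting sequences]
    for position in sorted(positions):
        if groups and position - groups[-1][0] < window:
            groups[-1][1] += 1
        else:
            groups.append([position, 1])
    return [(anchor, support) for anchor, support in groups]


# Five aligned 5' ends forming two candidate TSSs.
print(group_five_prime_ends([118, 100, 105, 300, 310]))
```

In this toy input, three 5' ends fall near position 100 and two near position 300, giving supports of 3 and 2.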

    Repeat elements
    Repeats in the retrieved sequence identified by both RepeatMasker and Tandem Repeat Finder can be masked. You can choose to mask them with an N, mark their sequence using the lower case alphabet or ignore them completely and show all sequence using capital letters.
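The three display choices can be sketched as follows. The function name `mask_repeats`, the mode names and the representation of repeats as 0-based, half-open (start, end) intervals within the returned sequence are assumptions made for illustration.

```python
def mask_repeats(sequence: str, repeats: list[tuple[int, int]], mode: str = "lower") -> str:
    """Mask repeat intervals in a sequence.

    mode "n":      replace repeat bases with N
    mode "lower":  show repeat bases in lower case
    mode "ignore": show the whole sequence in capital letters
    """
    if mode not in ("n", "lower", "ignore"):
        raise ValueError(f"unknown mode: {mode!r}")
    bases = list(sequence.upper())
    if mode != "ignore":
        for start, end in repeats:
            for i in range(max(start, 0), min(end, len(bases))):
                bases[i] = "N" if mode == "n" else bases[i].lower()
    return "".join(bases)


print(mask_repeats("ACGTACGT", [(2, 5)], "n"))       # ACNNNCGT
print(mask_repeats("ACGTACGT", [(2, 5)], "lower"))   # ACgtaCGT
print(mask_repeats("ACGTACGT", [(2, 5)], "ignore"))  # ACGTACGT
```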

    Alternative promoters
    Predicted TSSs from non-EST sequences that are more than 20 bases apart are considered cases of alternative promoters. You can choose to return all of these promoters, the best supported one, the one nearest upstream of an accession, only the 5'-most, only the 3'-most, or to ignore all extension information. The promoter of the 5'-most TSS is the furthest possible extension of the given accession. The promoter of the 3'-most TSS is the shortest necessary extension of the given accession. These can be considered the most aggressive and the most conservative extensions, respectively. Each TSS is identified from a group of sequences that share approximately the same 5'-most genomic position. We call the number of these sequences the support for the TSS. If you choose to ignore all extension information, PromoSer will consider the aligned position of the 5' end as the only TSS.
    The best supported TSS is the one with the highest support score. If several TSSs have the same highest score, only the 5'-most of them is returned. The nearest upstream TSS is the one that maps nearest but upstream to wherever the identified 5'-end of an accession aligns. This is a useful option if the promoter associated with a specific variant of some transcript is required.
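The selection rules can be summarized in a short sketch. The names `Tss` and `select_tss` and the mode strings are hypothetical. Positions are expressed in transcript orientation (smaller means further 5'), which sidesteps strand handling, and a TSS lying exactly at the accession's aligned 5' end is counted as "upstream"; both are assumptions.

```python
from typing import NamedTuple, Optional


class Tss(NamedTuple):
    position: int  # transcript-oriented coordinate: smaller = further 5'
    support: int


def select_tss(candidates: list[Tss], mode: str,
               accession_5p: Optional[int] = None) -> list[Tss]:
    """Pick TSSs according to the alternative promoter option chosen."""
    ordered = sorted(candidates)  # 5'-most first
    if mode == "ignore_extension":
        return [Tss(accession_5p, 0)]  # aligned 5' end is the only TSS
    if not ordered:
        return []
    if mode == "all":
        return ordered
    if mode == "five_prime_most":
        return [ordered[0]]
    if mode == "three_prime_most":
        return [ordered[-1]]
    if mode == "best_supported":
        top = max(tss.support for tss in ordered)
        # Ties on support are resolved in favour of the 5'-most TSS.
        return [next(tss for tss in ordered if tss.support == top)]
    if mode == "nearest_upstream":
        upstream = [tss for tss in ordered if tss.position <= accession_5p]
        return [upstream[-1]] if upstream else []
    raise ValueError(f"unknown mode: {mode!r}")


candidates = [Tss(100, 3), Tss(250, 5), Tss(400, 5)]
print(select_tss(candidates, "best_supported"))         # TSS at 250 (tie broken 5')
print(select_tss(candidates, "nearest_upstream", 300))  # TSS at 250
```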

    Excluding some results
    In some cases, a given sequence will map to more than one genomic locus with nearly equal high scores. PromoSer keeps all alignments within 1% of the score of the best alignment as possible alternatives. In both the summary table and the FASTA file, those alignments are flagged with an exclamation mark (!). If needed, all such results can be excluded from the output.
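The 1% rule can be sketched as below. The function name `near_best_alignments` and the (locus, score) tuples are hypothetical, and "within 1%" is interpreted here as a score of at least 99% of the best score, which is an assumption about the exact definition.

```python
def near_best_alignments(alignments: list[tuple[str, float]],
                         tolerance: float = 0.01) -> list[tuple[str, float]]:
    """Keep alignments whose score is within `tolerance` of the best score."""
    if not alignments:
        return []
    best = max(score for _, score in alignments)
    return [(locus, score) for locus, score in alignments
            if score >= best * (1 - tolerance)]


hits = [("chr1:100", 1000), ("chr7:500", 995), ("chr9:1", 900)]
kept = near_best_alignments(hits)
flag = "!" if len(kept) > 1 else ""  # more than one locus kept: ambiguous mapping
print(kept, flag)
```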

    In some other cases, a sequence can only be partially aligned to the genome, or not align well enough to be considered reliable. PromoSer will not use such alignments in either clustering or TSS prediction. It will however try to guess the correct cluster this alignment may belong to. Those guessed alignments are flagged with a * in both the summary table and the FASTA file. These can also be excluded from the results if desired.

    Note: Gene records that are genomic (given as a DNA sequence rather than an mRNA sequence) will always be included in the results, regardless of the above exclude settings. This is because genomic records are associated with any cluster they overlap on either strand of the chromosome.

    Finally, it may be the case that among the many accession IDs given as input, some map to the same cluster. Those will have identical promoters returned for each of them. If unique promoters are desired the option to exclude duplicate promoters should be selected. In this case, promoters returned will not be associated with any specific accession, but rather with the whole cluster.

    Overlap
    If the range specified for the upstream region happens to overlap with the 3' end of a gene upstream of the gene of interest then you may wish not to retrieve the sequence of the upstream gene as part of the promoter. Choose whether to ignore this case or to consider genes upstream only on the same strand or on both strands. (If the gene were on the opposite strand, the overlap would be with its 5' region.)

    Assembly gaps
    Because of the draft nature of the genome assemblies currently in use, certain spans of chromosomes have not yet been sequenced and are noted in the assembly by a run of N's whose length depends on the reason the gap exists. If such gaps are found in the specified range, you may choose to stop the extracted sequence at the boundary of the nearest upstream gap. This is potentially useful if you are specifically interested in the spatial distribution of signals in the sequence and the uncertainty in the length of the gap would affect the analysis. Stopping at a gap's boundary takes precedence over upstream overlapping clusters.
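The effect of stopping at a gap can be sketched on an already extracted sequence. This is an illustration only: the function name `trim_at_upstream_gap` is hypothetical, the sequence is assumed to be in transcript orientation with the upstream bases first, and any run of N's is treated as an assembly gap, which would not hold if repeats had been masked with N.

```python
import re

GAP_RE = re.compile(r"N+")


def trim_at_upstream_gap(promoter: str, upstream_length: int) -> str:
    """Drop everything up to and including the gap nearest to the TSS.

    `upstream_length` is the number of bases preceding the TSS in `promoter`.
    Only gaps in the upstream part are considered.
    """
    upstream = promoter[:upstream_length].upper()
    gaps = list(GAP_RE.finditer(upstream))
    if not gaps:
        return promoter
    nearest = gaps[-1]  # last gap in the upstream part is the closest to the TSS
    return promoter[nearest.end():]


print(trim_at_upstream_gap("AANNNCCGGTT", 8))  # CCGGTT
```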

    Retrieval of previous results
    If PromoSer successfully processes your request, it gives an identifier for the results. You can use this identifier to retrieve your results again within one day without rerunning the query. Results will not be retained indefinitely and will be removed periodically.