Query organism
Single organism
All the query genes belong to the same organism.
When this option is checked, the query organism must be selected from the
supported organisms listed in the Query organism pop-up menu.
Multiple organisms
Retrieves genes from other species that are homologous to the query genes, using the EnsEMBL Compara database.
All the query genes belong to the same organism.
When this option is checked, the query organism must be selected from the
supported organisms listed in the Query organism pop-up menu.
Optional filters
Taxon
Restricts the results to a given taxonomic level (example: Mammalia).
Homology type
Restricts the results to a Compara homology type, such as
ortholog_one2one, ortholog_one2many, etc. Use orthologs to get all types of
orthologs but not paralogs.
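The effect of the orthologs shortcut can be pictured with a minimal sketch. It assumes, purely for illustration, that each homology is a (gene ID, type) pair and that every orthology type name starts with 'ortholog'. The function name, the record layout, the placeholder gene IDs and the within_species_paralog entry are assumptions, not the tool's actual code.

```python
def filter_homologs(homologs, homology_type):
    """Keep the (gene_id, type) pairs matching the requested homology type.

    'orthologs' is treated as a shortcut for every type whose name starts
    with 'ortholog' (assumption); any other value must match exactly.
    """
    if homology_type == "orthologs":
        return [h for h in homologs if h[1].startswith("ortholog")]
    return [h for h in homologs if h[1] == homology_type]


# Hypothetical records; the gene IDs are placeholders.
records = [
    ("geneA", "ortholog_one2one"),
    ("geneB", "ortholog_one2many"),
    ("geneC", "within_species_paralog"),
]
all_orthologs = filter_homologs(records, "orthologs")
one2one_only = filter_homologs(records, "ortholog_one2one")
```

With these records, the orthologs shortcut keeps geneA and geneB and drops the paralog, while ortholog_one2one keeps geneA only.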
Query IDs
Two options are available:
- Type the list of IDs directly in the text area. Several queries can be entered at once, separated by carriage returns.
- Click on the Browse... button to select a text file on your
computer that contains the list of EnsEMBL IDs.
These IDs can be EnsEMBL gene (ex: ENSG00000139618), transcript (ex: ENST00000380152)
or protein (ex: ENSP00000369497) IDs, or IDs from other databases (Uniprot,
RefSeq, Flybase, SGD...). EnsEMBL transcript and protein IDs are
automatically converted to gene IDs and then treated as a gene query.
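If you prepare the ID list with a script, the following sketch may help to check it before submission. It splits a block of text into one ID per line and guesses the ID type from its prefix. The regular expression reflects the usual layout of EnsEMBL stable IDs (ENS, optional species prefix, a letter G, T or P, 11 digits, optional version). The function names are hypothetical, and the sketch does not reproduce the conversion performed by the tool.

```python
import re

# Assumed pattern: 'ENS', an optional species prefix (e.g. 'MUS'),
# a feature letter (G, T or P), then 11 digits and an optional version.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*([GTP])\d{11}(\.\d+)?$")
KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_queries(text):
    """Split the text-area content into one ID per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def classify(query_id):
    """Return 'gene', 'transcript' or 'protein', else 'other database'."""
    match = ENSEMBL_ID.match(query_id)
    return KINDS[match.group(1)] if match else "other database"


ids = parse_queries("ENSG00000139618\r\nENST00000380152\n\nENSP00000369497\n")
kinds = [classify(query_id) for query_id in ids]
```

Any ID that does not match the pattern is simply reported as coming from another database; the sketch makes no attempt to recognise Uniprot, RefSeq, Flybase or SGD formats.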
Type of sequence
Choose the type of sequence to retrieve, either a feature sequence (ex: gene,
intron, utr...) or a sequence surrounding a feature (upstream/downstream). If
you choose 'upstream/downstream', you can specify some options in the right
column. Otherwise default values will be applied.
Options for upstream or downstream sequences
Feature type
Refer to the Sequence position section below for more detail.
Sequence position
- Upstream: sequences located upstream of the selected feature. The
origin is the most 3' TSS for mRNA, the most 5' TSS for gene, and the
start codon of each alternative transcript for CDS. Note that, with the mRNA feature, if you wish to retrieve sequences relative to
each alternative transcript, you should check the box 'Retrieve sequence
relative to each alternative transcript (with mRNA feature)'.
- Downstream: sequences located downstream of the selected
feature. The origin is the most 5' terminator for mRNA, the most 3'
terminator for gene, and the stop codon of each alternative transcript for CDS. Note that, with the mRNA feature, if you wish to retrieve sequences relative to
each alternative transcript, you should check the box 'Retrieve sequence
relative to each alternative transcript (with mRNA feature)'.
Sequence limits (from, to)
Limits of the region to retrieve. Coordinates are calculated relative to the
selected feature. Refer to the Sequence position section above for more detail.
Sign
- negative values return sequences located upstream of the origin
- positive values return sequences located downstream of the origin
(The origin itself depends on the feature type, see above.)
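The sign convention can be illustrated by converting relative limits into genomic coordinates. This is a minimal sketch under stated assumptions: the origin is given relative coordinate 0, offsets are added to the origin on the forward strand and subtracted on the reverse strand, and the function name and the example origin (10000) are hypothetical. It is not the tool's implementation.

```python
def relative_to_genomic(origin, strand, start, end):
    """Convert limits relative to the origin into genomic coordinates.

    origin: genomic position of the origin (assumed relative coordinate 0)
    strand: +1 for the forward strand, -1 for the reverse strand
    start, end: relative limits (negative = upstream, positive = downstream)
    Returns (low, high) with low <= high in genomic coordinates.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    a = origin + strand * start
    b = origin + strand * end
    return (min(a, b), max(a, b))


# Same relative limits (-2000 to -1) on both strands.
forward = relative_to_genomic(origin=10_000, strand=1, start=-2000, end=-1)
reverse = relative_to_genomic(origin=10_000, strand=-1, start=-2000, end=-1)
```

On the forward strand the upstream region lies at lower genomic coordinates than the origin; on the reverse strand it lies at higher coordinates, which is why the strand has to be taken into account.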
Default values for upstream sequence retrieval
The default is from -2000 to -1 relative to the most 3' TSS.
Prevent overlap with neighbouring genes or ORFs:
A predicted gene is quite frequently found in close proximity
upstream of a query gene. To exclude such neighbouring gene sequences from
your analysis, make sure this option is active.
When the option is selected, upstream sequences are automatically
clipped when a predicted gene or ORF is located within the range defined by
the 'from' limit. The actual size retained for the upstream
sequence is indicated in the sequence header.
Note that in some cases a known regulatory element is located
upstream of or within a predicted gene. This means either that the
predicted gene is an artifact, or that the same sequence is bifunctional
(coding and regulatory).
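The clipping described above can be sketched as follows. The sketch assumes that the position of the nearest upstream gene is known as a relative coordinate (the base of that gene closest to the origin); the function name and this parameter are hypothetical, and the actual clipping rules of the tool may differ in detail.

```python
def clip_upstream(start, end, neighbour_limit=None):
    """Clip the upstream range [start, end] at a neighbouring gene.

    start, end: relative limits, e.g. -2000 and -1
    neighbour_limit: relative coordinate of the base of the nearest
        upstream gene that is closest to the origin, or None if no
        neighbouring gene is known (assumed input)
    Returns the clipped (start, end), or None if nothing is left.
    """
    if neighbour_limit is None or neighbour_limit < start:
        return (start, end)  # no gene inside the requested range
    clipped_start = neighbour_limit + 1
    if clipped_start > end:
        return None  # the neighbouring gene covers the whole range
    return (clipped_start, end)


# A neighbouring gene ends 750 bases upstream of the origin.
clipped = clip_upstream(-2000, -1, neighbour_limit=-750)
retained_size = clipped[1] - clipped[0] + 1
```

In this example only the 749 bases between the neighbouring gene and the origin are retained, instead of the 2000 bases requested.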
Retrieve sequence relative to each alternative transcript:
If this option is checked together with the mRNA feature, the upstream or
downstream sequences relative to each alternative transcript (if any) are retrieved.
Mask repeats
This option uses the genome version in which repeats are masked (i.e. replaced by 'N' characters).
The presence of repetitive elements hampers the detection of motifs, especially in vertebrate genomes, because these repetitive sequences have a composition very distinct from that of the rest of the genome.
This option is only valid for organisms with annotated repeats.
Mask coding sequences
This option masks all coding portions of the retrieved sequence (i.e. replaces them with 'N' characters).
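Both masking options replace parts of the sequence with 'N' characters. The sketch below shows what this means for a short sequence, assuming the regions to mask (repeats or coding portions) are given as 0-based, end-exclusive intervals. The function name and the interval convention are illustrative assumptions, not the tool's internals.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace the bases of each 0-based, end-exclusive interval with mask_char."""
    bases = list(sequence)
    for start, end in intervals:
        # Ignore the parts of an interval that fall outside the sequence.
        start = max(start, 0)
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = mask_char
    return "".join(bases)


# Mask positions 2-4 and 8 onwards in a 10-base sequence.
masked = mask_intervals("ACGTACGTAC", [(2, 5), (8, 12)])
```

The masked sequence keeps its original length, so coordinates remain valid after masking.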
Organism name in sequence header
The sequence format is fasta. To facilitate multiple-organism studies, the
organism name is included at the beginning of the sequence header. You can choose between the scientific name and the
common name of the organism.
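As an illustration only, a fasta record starting with the organism name could be built as shown below. The exact header layout produced by the tool is not specified here, so the separator, the replacement of spaces by underscores and the function name are hypothetical.

```python
def fasta_record(organism, seq_id, sequence, width=60):
    """Build a fasta record whose header starts with the organism name.

    The header layout '>Organism_name-ID' is an assumption made for
    this sketch, not the layout guaranteed by the tool.
    """
    header = f">{organism.replace(' ', '_')}-{seq_id}"
    # Wrap the sequence into lines of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines)


# Scientific name versus common name for the same gene.
scientific = fasta_record("Homo sapiens", "ENSG00000139618", "ACGT" * 20)
common = fasta_record("human", "ENSG00000139618", "ACGT" * 20)
```

Placing the organism name first makes it easy to group or sort sequences by species when results from several organisms are combined in one file.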