RSA-tools - Tutorials - retrieve-ensembl-seq
In the left frame, under the title Sequence retrieval, click on the tool "retrieve EnsEMBL sequence". A form appears, allowing you to select the parameters of sequence retrieval.
Retrieving upstream sequences for a selected set of genes
We will first retrieve upstream sequences for a set of 4 human genes involved in breast cancer.
- In the pop-up menu "Query organism", select Homo sapiens
- In the text area under "Gene, transcript or protein IDs", type the following list of IDs
ENSG00000139618
ENSG00000100916
ENSG00000110934
ENSG00000131507
- In the "From" box, under "Options for upstream or downstream sequence", replace -2000 by -5000
- Leave all other options unchanged and click on the button "GO"
The result page will appear with a message stating that your request is being processed, and showing you the Job ID and time of submission. After a few seconds, the result itself is displayed. Since the fourth gene has two alternative transcripts, you get five sequences.
Note that by default, the output mode is set to display. But if you do not need to view the sequences before submitting them to another tool, you can use the server output mode. Even if the server mode has been used, you still have the possibility to check the sequences a posteriori. In the result page, click on the link, in order to see the upstream sequence of the selected genes.
The third output mode, email, is useful for large requests (several tens of IDs). Just fill in your email address in the provided field, and the link to your sequences will be sent to you when they are ready.
Check the size of the sequences you obtained.
Below the sequences, you can see a series of buttons, which will allow you to send the retrieved sequences to the next task (as we will see in the next tutorials). For the time being, do not click these buttons.
Remarks
- Request IDs must be separated by carriage returns.
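If you prepare the ID list in a script before pasting it into the form, the one-ID-per-line rule can be checked with a minimal sketch like the one below. The function name `parse_ids` and the regular expression are assumptions made for illustration (the pattern covers the common "ENS" + optional species prefix + G/T/P + 11 digits layout); they are not part of RSAT.

```python
import re

# Assumed layout of an Ensembl stable ID: "ENS", an optional species
# prefix (empty for human), a feature letter (G, T or P), then 11 digits.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*[GTP]\d{11}$")


def parse_ids(text):
    """Split the text on line breaks and sort tokens into valid and rejected IDs."""
    ids, rejected = [], []
    for line in text.splitlines():
        token = line.strip()
        if not token:
            continue  # skip empty lines
        if ENSEMBL_ID.match(token):
            ids.append(token)
        else:
            rejected.append(token)
    return ids, rejected


query = "ENSG00000139618\nENSG00000100916\nENSG00000110934\nENSG00000131507\n"
ids, rejected = parse_ids(query)
print(len(ids), "valid IDs;", len(rejected), "rejected")
```

A line holding several IDs separated by spaces ends up in `rejected`, which mirrors the requirement that each ID sits on its own line.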
Preventing overlap with neighbour gene
- Come back to the retrieve EnsEMBL sequence form.
- Perform the same operations as above, but this time, select the option "Prevent overlap with neighbouring gene".
- Compare the size of the sequences with the previous result.
Interpretation of the results
The gene with ID ENSG00000110934 has a neighbour gene closer than 5000 bp. The retrieved sequence is thus only 4278 bp for that gene.
If you don't mind retrieving a sequence belonging to a neighbouring gene as long as it is not coding, use the option "Prevent overlap with neighbouring ORF" instead.
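The effect of the "Prevent overlap" option on sequence length can be pictured with a small sketch. It assumes that the upstream region ends right next to the gene start and that the intergenic space upstream of ENSG00000110934 is 4278 bp (inferred from the result above). The function `upstream_length` is hypothetical and only illustrates the clipping rule, not the tool's actual implementation.

```python
def upstream_length(requested, distance_to_neighbour, prevent_overlap=True):
    """Return the number of upstream bp retrieved.

    requested: requested upstream length in bp (5000 for From = -5000).
    distance_to_neighbour: intergenic space (bp) up to the nearest
        upstream neighbour, or None when there is no known neighbour.
    prevent_overlap: whether the sequence is clipped at the neighbour.
    """
    if not prevent_overlap or distance_to_neighbour is None:
        return requested
    # Clip the sequence where the neighbouring gene begins.
    return min(requested, distance_to_neighbour)


# Illustrative values: the neighbour lies 4278 bp upstream.
print(upstream_length(5000, 4278))                         # 4278
print(upstream_length(5000, 4278, prevent_overlap=False))  # 5000
```

With "Prevent overlap with neighbouring ORF", the same rule would apply, but with the distance measured to the nearest coding region rather than to the nearest gene boundary.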
Retrieving 5'UTR sequence of a gene
- In the pop-up menu "Query organism", select Homo sapiens
- In the text area under "Gene, transcript or protein IDs", type the following ID
ENSG00000131507
- In the "Type of sequence to retrieve" section, in the "Sequence type" pop-up menu, choose 5prime UTR.
- Leave all other options unchanged and click on the button "GO"
Since the query gene has two alternative transcripts, you get two sequences. But since the 5'UTRs for the two alternative transcripts partly overlap, a portion of the retrieved sequences is redundant, which is not suitable for some analyses, like motif discovery. If you want to avoid redundant sequences for a given query gene, use the same settings but make sure to check "Avoid redundant sequences due to alternative transcripts". With our example, you will then get just one sequence spanning over both 5'UTRs.
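The idea behind the non-redundant option can be sketched as a merge of overlapping intervals. The coordinates below are made up for illustration and `merge_intervals` is a hypothetical helper; the tool itself may proceed differently.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals with inclusive coordinates."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical 5'UTR coordinates of two alternative transcripts.
utrs = [(1000, 1200), (1100, 1350)]
print(merge_intervals(utrs))  # [(1000, 1350)]
```

The two partly overlapping 5'UTRs collapse into a single interval spanning both, so each nucleotide is counted once in downstream analyses such as motif discovery.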
Retrieving the sequence of a gene
We now want to retrieve the whole sequence of a gene. As it is defined in EnsEMBL, it begins at the TSS of the first transcript (transcripts ordered according to their 5' end) and ends at the termination of the last transcript (transcripts ordered according to their 3' end).
- In the pop-up menu "Query organism", select Homo sapiens
- In the text area under "Gene, transcript or protein IDs", type the following ID
ENSG00000139618
- In the "Type of sequence to retrieve" section, in the "Sequence type" pop-up menu, choose gene.
- Leave all other options unchanged and click on the button "GO"
You get one long sequence of around 85 kb, which spans the whole transcribed region of that gene.
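The gene boundaries defined at the start of this section can be expressed as a short sketch. It assumes that each transcript is given as genomic (start, end) coordinates with start <= end; the function `gene_span` and the coordinates are hypothetical.

```python
def gene_span(transcripts):
    """Return (start, end, length) of the region covering all transcripts.

    transcripts: list of (start, end) genomic coordinates, inclusive,
    with start <= end for every transcript.
    """
    start = min(s for s, _ in transcripts)  # leftmost transcript boundary
    end = max(e for _, e in transcripts)    # rightmost transcript boundary
    return start, end, end - start + 1


# Hypothetical coordinates of two alternative transcripts.
print(gene_span([(100, 900), (250, 1200)]))  # (100, 1200, 1101)
```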
Now do the same query with "Mask coding sequences" checked. You will get a sequence of the same length, but with all nucleotides in coding exons replaced by the 'N' character.
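Masking keeps the sequence length unchanged, as the following minimal sketch shows. The function `mask_coding` is hypothetical, and the exon positions are assumed to be 0-based, end-exclusive offsets within the retrieved sequence.

```python
def mask_coding(sequence, coding_exons):
    """Replace every nucleotide inside the coding exons by 'N'.

    coding_exons: list of (start, end) offsets within the sequence,
    0-based and end-exclusive (an assumption made for this sketch).
    """
    chars = list(sequence)
    for start, end in coding_exons:
        start = max(start, 0)
        end = min(end, len(chars))  # never write past the sequence end
        for position in range(start, end):
            chars[position] = "N"
    return "".join(chars)


masked = mask_coding("ACGTACGTAC", [(2, 5)])
print(masked)  # ACNNNCGTAC
```

Because positions are preserved, any motif found in the masked sequence can be mapped back to the same coordinates in the unmasked gene sequence.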
Retrieving homologous sequences for a gene
We will now retrieve all mammalian sequences orthologous to a human gene involved in breast cancer.
- In the pop-up menu "Query organism", select Homo sapiens
- Select the "Multiple organisms" option
- In the pop-up menu "Homology type", select orthologs
- In the text area "Taxon", type Mammalia
- In the text area under "Gene, transcript or protein IDs", type the following ID
ENSG00000139618
- Leave all other options unchanged and click on the button "GO"
This takes more time (about 40 seconds) than the previous examples. You can, of course, use all the options detailed in the previous examples in this "Multiple organisms" mode.