RSA-tools - Tutorials - local-word-analysis


  1. Introduction
  2. Running the example


Discovering overrepresented motifs

Starting from the observation that a set of genes is co-regulated, one can suppose that their upstream regions share some elements, and one would like to detect such elements. A simple and fast way to do so is to detect over-represented oligonucleotides. This method is implemented in local-word-analysis and oligo-analysis and has been described in detail in J. Mol. Biol. (1998) 281, 827-842.
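The idea of over-representation can be illustrated with a minimal Python sketch. It is not the implementation used by oligo-analysis: it assumes a binomial test against oligonucleotide frequencies estimated from a second (background) sequence set, with a pseudo-count of 1, and all function names are hypothetical.

```python
import math
from collections import Counter


def count_oligos(sequences, k):
    """Count every overlapping k-mer in a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def binomial_upper_tail(n, p, x):
    """P(X >= x) for X ~ Binomial(n, p), with each term computed in log space."""
    if x <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(x, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def overrepresented(sequences, background, k, max_pvalue=0.01):
    """Return (oligo, observed, expected, p-value) tuples, most significant first."""
    obs = count_oligos(sequences, k)
    bg = count_oligos(background, k)
    n_obs = sum(obs.values())
    n_bg = sum(bg.values())
    results = []
    for oligo, x in obs.items():
        # Assumption: a pseudo-count of 1 avoids a zero expected frequency
        # for oligos that are absent from the background set.
        p = (bg[oligo] + 1) / (n_bg + 4 ** k)
        pval = binomial_upper_tail(n_obs, p, x)
        if pval <= max_pvalue:
            results.append((oligo, x, n_obs * p, pval))
    return sorted(results, key=lambda r: r[3])
```

Calling overrepresented(promoters, background, 6) on two lists of sequence strings returns the hexanucleotides that occur more often in the promoters than the background frequencies predict.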

Discovering position-specific motifs

Yet another criterion for detecting exceptional motifs is to detect oligonucleotides that are over-represented at specific positions relative to some reference position (position 0). Typically, the analysis should be performed with a large number of sequences (e.g. all promoters, or all 3' UTRs of a given organism) aligned with respect to some reference position (e.g. the start codon, the transcription start site, or the stop codon). The goal is to detect motifs that are significantly 'concentrated' in some specific regions relative to the reference position. The detection of position-specific motifs can be achieved with the program local-word-analysis, which applies the same statistics as oligo-analysis to a series of windows of variable widths in the input sequence set. The reference position can be either the start or the end of the sequences. The program returns a list of significant oligonucleotides, with the boundaries of the window where they are over-represented. As a case study, we will detect position-specific oligonucleotides in all the promoters of the bacterium Bacillus subtilis.
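The windowing principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of local-word-analysis: windows have a fixed width, the reference position is the end of each sequence (last base = position -1), each oligonucleotide is assigned to the window containing its first base, and the function names are hypothetical. The sketch only tabulates counts per window; the significance test is left out.

```python
from collections import Counter, defaultdict


def window_counts(sequences, k, width):
    """Count k-mers per fixed-width window, with coordinates measured
    from the right end of each sequence (last base = position -1)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        length = len(seq)
        for i in range(length - k + 1):
            rel_start = i - length          # negative coordinate of the oligo start
            w = (-rel_start - 1) // width   # 0 = window closest to the reference
            window = (-(w + 1) * width, -w * width - 1)
            counts[window][seq[i:i + k]] += 1
    return counts


def best_window(counts, oligo):
    """Return (window, count) for the window where the oligo is most frequent."""
    return max(((win, c[oligo]) for win, c in counts.items()),
               key=lambda item: item[1])
```

With a window width of 20, the windows are (-20, -1), (-40, -21), and so on; an oligonucleotide that is concentrated near the reference position shows a much higher count in one window than in the others.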

Running the example

  1. In the left menu, click on retrieve sequence (section Sequence retrieval). This should open the retrieve sequence form.
  2. In the retrieve sequence form, check the option Single organism and select the species of your choice in the Organism pop-up menu (for the study case, select Bacillus subtilis).
  3. For the Genes radio button, select the option all in order to retrieve all the annotated gene promoters.
  4. Choose the appropriate Feature type according to the selected organism (for the study case, select CDS) and specify the region to retrieve by selecting the Sequence type and entering the From and To location (for the study case, select upstream and set From to -400 and To to -1).
  5. Make sure that the option Prevent overlap with neighbour genes (noorf) is checked.
    WARNING! For detecting position-specific motifs, it is particularly important to restrict the retrieval to non-coding sequences, because oligonucleotide frequencies differ strongly between non-coding and coding sequences.
  6. Leave all other options unchanged and click the GO button to retrieve the sequences. After a few seconds, a result page with a link to the sequence file should appear (you can optionally check them by following this link).
  7. Below the link to the sequence file, the Next step box shows a list of programs. Click the local-word-analysis button in order to transmit your promoter sequences to the local-word-analysis form.
  8. The local-word-analysis form contains a large number of options, whose detailed description is available in the manual. We will just check below some of the most important options and explain why they are critical.
  9. Make sure that the option Purge sequences is checked.
    WARNING! Sequence purging is essential to discard redundancy in the sequence set.
  10. In the next section of the form, select fixed window width and specify the width and the Align option (for the study case, set the fixed Window width to 20 and Align to right, since we are using upstream sequences).
  11. In the section Background model, the option upstream-noorf (Genome subset) should be checked and the organism selected according to your input sequences.
    WARNING! The choice of the background model is one of the most crucial parameters for motif discovery. An inappropriate background model produces noisy results that can lead to erroneous interpretations.
  12. In the table Thresholds, set the upper threshold on rank to 100. This will restrict the output to the 100 most significant oligonucleotide windows.
  13. Check that the lower threshold on significance is set to 0, in order to return motifs with positive significance. You can optionally increase this threshold to perform a more stringent selection.
  14. Check that the upper threshold on window rank is set to 1, in order to return only the most significant window for each oligonucleotide. Alternatively, you can increase this value if you suspect that some motifs will be concentrated at several positions. However, the result may then contain mutually embedded windows for the same oligonucleotide.
  15. In the last part of the form, select the email Output, enter a valid email address and click the GO button. After a couple of minutes, you should receive an email containing a link to the result page.
    WARNING! Depending on the parameters chosen and on the length and number of sequences, this step can take from a few minutes to several hours. For this reason, the email output is strongly recommended.
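The effect of the three thresholds set in steps 12 to 14 can be pictured with a short Python sketch. It is a hypothetical illustration only: the row keys ('oligo', 'window', 'sig'), the function name, and the order in which the thresholds are applied are assumptions, and the program itself may proceed differently.

```python
def apply_thresholds(rows, max_rank=100, min_sig=0.0, max_window_rank=1):
    """Filter result rows given as dicts with hypothetical keys
    'oligo', 'window' and 'sig' (significance)."""
    # Lower threshold on significance: drop rows below min_sig.
    kept = [r for r in rows if r["sig"] >= min_sig]
    # Sort by decreasing significance so that ranks can be assigned.
    kept.sort(key=lambda r: r["sig"], reverse=True)
    # Upper threshold on window rank: keep at most max_window_rank
    # windows per oligonucleotide, the most significant first.
    seen = {}
    selected = []
    for r in kept:
        seen[r["oligo"]] = seen.get(r["oligo"], 0) + 1
        if seen[r["oligo"]] <= max_window_rank:
            selected.append(r)
    # Upper threshold on rank: keep only the top max_rank rows.
    return selected[:max_rank]
```

With the default values, each oligonucleotide appears at most once in the output, in its most significant window. Raising max_window_rank to 2 lets a second window through for the same oligonucleotide, which is how mutually embedded windows can end up in the result.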