RSA-tools - Tutorials - Word counts

Prerequisite

The theoretical background required for this tutorial can be found in the RSAT course.

In particular, we recommend reading the relevant course slides before starting this tutorial.

Introduction

In this tutorial, we will become familiar with the concept of word occurrences (i.e. the number of instances of a given oligonucleotide) in DNA sequences.

Exercise

Assuming a 5th-order Markov background model calibrated on all non-coding upstream sequences of the yeast Saccharomyces cerevisiae, how many occurrences of the word GATACA would you expect by chance in a 5 kb sequence?
Using the same background model, generate 1,000 random sequences of length L = 5,000 bp and compute the frequency distribution of the word GATACA. Does the observed mean correspond to your expectation?
What fraction of the sequences contains at least 3 occurrences of the word?

Tips

By default, the program dna-pattern returns the matching positions of the query patterns in the input sequences, but the options can be changed to obtain a count table indicating the number of occurrences of each pattern in each input sequence.

Solution

Open the toolset Build control sets of the RSAT toolbox, and click random sequence. Adapt the options to get 1,000 sequences of 5,000 bp each.
Choice of the background model: select the organism Saccharomyces cerevisiae, check the option "DNA sequences calibrated on non-coding upstream sequences" with an oligonucleotide size of 6 (this corresponds to a Markov model of order m = k-1 = 5), and click GO.
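
To make the notion of a Markov background model more concrete, the Python sketch below draws random sequences from a chain of order m. It is only an illustration and not the RSAT implementation: the function names (train_markov, random_sequence) and the toy training sequence are assumptions, whereas RSAT relies on oligonucleotide frequencies precomputed on all yeast upstream sequences.

```python
import random
from collections import defaultdict

def train_markov(training_seq, order):
    """Count transitions: prefix of length `order` -> next letter."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(training_seq) - order):
        prefix = training_seq[i:i + order]
        counts[prefix][training_seq[i + order]] += 1
    return counts

def random_sequence(counts, order, length, rng):
    """Draw one sequence of `length` letters from the transition counts."""
    # Simplification (assumption): start from a prefix seen in the training data.
    seq = list(rng.choice(sorted(counts)))
    while len(seq) < length:
        prefix = "".join(seq[-order:]) if order > 0 else ""
        next_counts = counts.get(prefix)
        if not next_counts:
            # Prefix never seen in training: fall back to a uniform draw.
            seq.append(rng.choice("ACGT"))
            continue
        letters = sorted(next_counts)
        weights = [next_counts[letter] for letter in letters]
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq[:length])

# Toy example: the training sequence is random, NOT yeast upstream DNA.
rng = random.Random(42)
toy_training = "".join(rng.choice("ACGT") for _ in range(20000))
model = train_markov(toy_training, order=5)
toy_sequences = [random_sequence(model, 5, 5000, rng) for _ in range(10)]
```

With an order of 5, each new nucleotide depends on the 5 preceding ones, which is why the web form asks for an oligonucleotide size of 6.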

After a few seconds, the result page appears. The sequences are not displayed, to avoid a massive transfer (there is no need to download 5 Mb before submitting the resulting sequences to the next analysis steps). If you want, you can check the sequences by clicking on the link under Result file(s).
In the result page, click the button dna-pattern.
- Enter GATACA in the box Query pattern(s).
- For the Search strands option, select direct only, and deactivate the option prevent overlapping matches.
- Deactivate the default return options match positions and sequence limits.
- Activate the options match count table, match rank and sort.
Justification for the chosen options
- The option match count table will return a table with one row per sequence and one column per pattern (in this case we submitted a single pattern, but the tool also allows you to analyze several patterns in a single run).
- The option sort will sort the sequences by decreasing number of occurrences, so you will immediately see the maximum number of occurrences in the random trial.
- The option match rank will indicate the rank of each sequence according to its number of pattern occurrences, and thus allow us to check the number of sequences having 0, 1, 2, ... occurrences.
At this stage, you can already count the number of sequences having 6, 5, 4, ... 0 occurrences of the pattern by browsing the result table from top to bottom. We will, however, use another program to compute it automatically and display the result graphically.
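
The counting step can also be mimicked in a few lines of Python. The sketch below is an illustration, not the dna-pattern code: the function names and the toy sequences are hypothetical. It counts matches on the direct strand only, keeps overlapping matches, and sorts the sequences by decreasing count.

```python
def count_occurrences(sequence, word):
    """Count matches of `word` on the direct strand, allowing overlaps."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping matches are kept.
        start = sequence.find(word, start + 1)
    return count

def count_table(sequences, word):
    """Return (sequence id, count) rows sorted by decreasing count."""
    rows = [(seq_id, count_occurrences(seq, word))
            for seq_id, seq in sequences.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy input, for illustration only.
toy = {"seq1": "GATACAGATACA", "seq2": "ACGT", "seq3": "TTGATACATT"}
table = count_table(toy, "GATACA")
for seq_id, count in table:
    print(seq_id, count)
```

Note that Python's built-in str.count skips overlapping matches, which is why the loop advances one position at a time instead.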
At the bottom of the dna-pattern result page, click the button Frequency distribution. Set the class interval to 1 and the Data column to 3 (this column contains the counts of GATACA), then click GO. The result table shows the number of sequences containing 0, 1, 2, ... occurrences, respectively. Read the header to understand the content of the columns. At the bottom of the frequency table, you will find some statistics, including the mean number of occurrences per sequence (when I ran this test, I obtained 1.331, but this number is expected to fluctuate between trials). Compare this observed mean to the expected number of occurrences.
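
As a rough guide for this comparison, a word of length k can start at L - k + 1 positions in a sequence of length L, so its expected number of occurrences is (L - k + 1) multiplied by the word probability. Under the 5th-order Markov model, this probability is P(GATAC) x P(A | GATAC) and must be taken from the yeast background frequencies, which are not reproduced here. The sketch below therefore uses equiprobable, independent nucleotides as a baseline assumption, and the toy counts are hypothetical.

```python
from collections import Counter

def expected_occurrences(seq_len, word_len, word_prob):
    """Expected count = number of possible positions x word probability."""
    return (seq_len - word_len + 1) * word_prob

def frequency_distribution(counts):
    """Map each occurrence number to the number of sequences showing it."""
    return dict(sorted(Counter(counts).items()))

# Baseline assumption: equiprobable, independent nucleotides (NOT the yeast model).
baseline = expected_occurrences(5000, 6, 0.25 ** 6)

# Hypothetical counts for 8 sequences, for illustration only.
toy_counts = [0, 1, 1, 2, 0, 3, 1, 0]
distribution = frequency_distribution(toy_counts)
observed_mean = sum(toy_counts) / len(toy_counts)

print(round(baseline, 3), distribution, observed_mean)
```

The baseline comes out close to 1.22 occurrences per 5 kb sequence. The value expected under the yeast-specific model will differ to the extent that GATACA is more or less frequent than average in yeast upstream sequences.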
We will now generate a graphical representation of the frequency distribution. For this, click XYgraph at the bottom of the frequency distribution result page. Set the Data column for X axis to 1, leave all other parameters unchanged and click GO.

On the resulting graph,
- the blue curve displays the number n of sequences (ordinate) presenting X occurrences (0, 1, 2, ...) of GATACA;
- the green curve (n_cum) indicates the cumulative distribution, i.e. the number of sequences containing at most X occurrences;
- the pink curve (n_dcum) indicates the decreasing cumulative distribution, i.e. the number of sequences containing at least X occurrences.
The distribution graph above indicates the absolute frequencies (i.e. the number of sequences). In order to display the distributions of relative frequencies, go back to the XYgraph form and type 7,8,9 in the box Data columns for Y axis. Optionally, you can also specify a log base of 10 for the Y axis. This will better highlight the lower frequencies associated with large occurrence numbers.
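
The three curves can be derived from the frequency distribution alone. The Python sketch below is a minimal illustration with a hypothetical toy distribution; it reuses the names n, n_cum and n_dcum from the graph legend, and shows how the decreasing cumulative distribution answers the question about sequences with at least 3 occurrences.

```python
def cumulative_distributions(distribution):
    """Return rows (x, n, n_cum, n_dcum) for occurrence numbers 0..max."""
    total = sum(distribution.values())
    rows, n_cum = [], 0
    for x in range(max(distribution) + 1):
        n = distribution.get(x, 0)
        n_dcum = total - n_cum   # sequences with at least x occurrences
        n_cum += n               # sequences with at most x occurrences
        rows.append((x, n, n_cum, n_dcum))
    return rows

# Hypothetical distribution: occurrence number -> number of sequences.
toy_distribution = {0: 3, 1: 3, 2: 1, 3: 1}
rows = cumulative_distributions(toy_distribution)
total = sum(toy_distribution.values())

# Relative frequency of sequences with at least 3 occurrences.
fraction_at_least_3 = rows[3][3] / total
print(rows, fraction_at_least_3)
```

Dividing each absolute frequency by the total number of sequences gives the corresponding relative frequency, which is what the relative-frequency curves display.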
