Open the toolset Build control sets of the RSAT
toolbox, and click random sequence. Adapt
the options to get 1,000 sequences of 5,000 bp each.
Choice of the background model: select the
organism Saccharomyces cerevisiae, check the
option "DNA sequences calibrated on non-coding upstream
sequences" with an oligonucleotide size
of 6 (this corresponds to a Markov model of
order m = k-1 = 5), and click GO.
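
The relation between the oligonucleotide size k and the Markov order m = k-1 can be illustrated with a small sketch. This is not the RSAT implementation: the function names `train_markov` and `random_sequence` are hypothetical, the training sequence is a toy string rather than yeast upstream sequences, and the fallback to equiprobable nucleotides for unseen prefixes is an assumption made for simplicity.

```python
import random
from collections import Counter, defaultdict


def train_markov(training_seq, k):
    """Count k-mers and store them as transitions of an order m = k - 1 model."""
    m = k - 1
    transitions = defaultdict(Counter)
    for i in range(len(training_seq) - k + 1):
        kmer = training_seq[i:i + k]
        # The first m letters are the prefix, the last letter is the suffix.
        transitions[kmer[:m]][kmer[m]] += 1
    return transitions


def random_sequence(transitions, m, length, rng):
    """Generate a sequence where each letter depends on the m preceding ones."""
    seq = list(rng.choice(sorted(transitions)))  # start from a known prefix
    while len(seq) < length:
        prefix = "".join(seq[-m:]) if m > 0 else ""
        counts = transitions.get(prefix)
        if not counts:
            # Assumption: unseen prefix, fall back to equiprobable nucleotides.
            seq.append(rng.choice("ACGT"))
            continue
        letters = sorted(counts)
        weights = [counts[letter] for letter in letters]
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq[:length])


# Toy example: trinucleotide counts (k = 3) give a Markov model of order 2.
rng = random.Random(0)
model = train_markov("ACGT" * 30, k=3)
seq = random_sequence(model, m=2, length=50, rng=rng)
```

With hexanucleotide frequencies (k = 6), the same logic yields a model of order 5, as selected in the form.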
After a few seconds, the result page appears. The sequences
are not displayed, to avoid a massive transfer (you do not
need to download 5 Mb before submitting the
resulting sequences to the next analysis steps). If you
want, you can check the sequences by clicking on the link
under Result file(s).
In the result page, click the
button dna-pattern.
- Enter GATACA in
the box Query pattern(s).
- For the Search strands option,
select direct only, and deactivate the
option prevent overlapping matches.
- Deactivate the default return
options match positions and sequence
limits.
- Activate the options match count
table, match rank and sort.
Justification for the chosen options
- The option match count table will return a
table with one row per sequence, and one column per
pattern (in this case we submitted a single pattern,
but the tool would also allow you to analyze different
patterns in a single run),
- The option sort will sort the sequences by
decreasing number of occurrences, so you will immediately see the
maximum number of occurrences in the random trial.
- The option match rank will indicate the rank
of each sequence according to the number of pattern
occurrences, and thus allow us to check the number of
sequences having 0, 1, 2, ... occurrences.
At this stage, you can already count the number of
sequences having 6, 5, 4, ... 0 occurrences of the
pattern by browsing the result table from top to
bottom. We will however use another program to compute
it automatically, and display the result
graphically.
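
The options chosen above (direct strand only, overlapping matches allowed, count table, sort, rank) can be mimicked with a minimal sketch. It is not the dna-pattern code: the function `count_matches`, the toy sequences and the ranking rule (position in the sorted table, ties broken by sequence name) are assumptions made for illustration.

```python
def count_matches(sequence, pattern, allow_overlap=True):
    """Count occurrences of pattern on the direct strand of sequence."""
    count = 0
    start = 0
    while True:
        pos = sequence.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        # Allowing overlaps restarts the search one position further;
        # preventing them skips the whole matched segment.
        start = pos + 1 if allow_overlap else pos + len(pattern)


# Toy sequences (hypothetical), standing in for the 1,000 random sequences.
sequences = {"seq1": "GATACAGATACA", "seq2": "TTTT", "seq3": "AAGATACATT"}

# Match count table: one row per sequence, a single pattern column.
counts = {name: count_matches(seq, "GATACA") for name, seq in sequences.items()}

# Sort by decreasing count, then attach a rank to each row.
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
table = [(name, count, rank) for rank, (name, count) in enumerate(ordered, start=1)]

for row in table:
    print(*row, sep="\t")
```

Note that GATACA cannot overlap with itself, so the overlap option makes no difference for this particular pattern; it would matter for a self-overlapping pattern such as AAAAAA.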
At the bottom of the dna-pattern result page,
click the button Frequency distribution. Set
the class interval to 1, the Data column to
3 (this column contains the counts of GATACA) and
click GO. The result table shows you the number of
sequences containing 0, 1, 2, ... occurrences,
respectively. Read the header to understand the content of
the columns. At the bottom of the frequency table, you have
some statistics, including the mean number of occurrences
per sequence (when I ran this test, I obtained 1.331, but
this number is expected to fluctuate between
trials). Compare this observed mean to the expected number
of occurrences.
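
To make the comparison concrete, here is a minimal sketch of a frequency distribution with a class interval of 1, together with a rough expected number of occurrences. The expectation assumes independent and equiprobable nucleotides, which is a simplification and not the Markov model selected above, so the value is only an order of magnitude. The function names and the toy counts are hypothetical.

```python
from collections import Counter


def frequency_distribution(counts):
    """Number of sequences with 0, 1, 2, ... occurrences (class interval 1)."""
    freq = Counter(counts)
    return [(x, freq.get(x, 0)) for x in range(max(counts) + 1)]


def expected_occurrences(seq_len, pattern, nucleotide_prob=0.25):
    """Expected count on one strand, assuming equiprobable nucleotides."""
    positions = seq_len - len(pattern) + 1  # possible start positions
    return positions * nucleotide_prob ** len(pattern)


# Toy occurrence counts (hypothetical), one value per sequence.
toy_counts = [0, 1, 1, 2, 0, 3, 1, 0]
distribution = frequency_distribution(toy_counts)
mean_count = sum(toy_counts) / len(toy_counts)

# For a 6-letter pattern in a 5,000 bp sequence: 4995 / 4096, about 1.22.
expected = expected_occurrences(5000, "GATACA")
```

Under this simplified model the expectation is close to 1.22 occurrences per sequence; a background model calibrated on yeast upstream sequences would assign GATACA a different probability, which is one reason why the observed mean can deviate from this value.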
We will now generate a graphical representation of the
frequency distribution. For this,
click XYgraph at the bottom of the frequency
distribution result page. Set the Data column for X
axis to 1, leave all other parameters unchanged and
click GO.
On the resulting graph,
- the blue curve displays the
number n of sequences (ordinate) presenting X
occurrences (0, 1, 2, ...) of GATACA;
- the green curve
(n_cum) indicates the cumulative distribution,
i.e. number of sequences containing at most X
occurrences;
- the pink curve
(n_dcum) indicates the decreasing cumulative
distribution, i.e. the number of sequences
containing at least X occurrences.
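
The three curves can be derived from one another, as in the following sketch. The names n, n_cum and n_dcum follow the graph legend; the function name and the toy values are hypothetical.

```python
def cumulative_distributions(n):
    """Return (n_cum, n_dcum) for a list n indexed by occurrence number X."""
    n_cum = []
    total = 0
    for value in n:
        total += value
        n_cum.append(total)  # sequences with at most X occurrences
    # Sequences with at least X occurrences: all sequences minus those with
    # fewer than X occurrences.
    n_dcum = [total - cum + value for cum, value in zip(n_cum, n)]
    return n_cum, n_dcum


# Toy distribution: 3 sequences with 0 occurrences, 3 with 1, 1 with 2, 1 with 3.
n = [3, 3, 1, 1]
n_cum, n_dcum = cumulative_distributions(n)
```

The last value of n_cum and the first value of n_dcum both equal the total number of sequences.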
The distribution graph above indicates the absolute
frequencies (i.e. number of sequences). In order to
display the distributions of relative frequencies, go
back to the XYgraph form, and type 7,8,9 in the
box Data columns for Y axis. Optionally, you can
also choose to specify a log base of 10 for the Y
axis. This will better highlight the lower frequencies
associated with large occurrence numbers.
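
As a minimal sketch of what the relative frequencies and the logarithmic axis represent, assuming that relative frequencies are simply the absolute frequencies divided by the number of sequences (the function name and toy values are hypothetical):

```python
import math


def relative_frequencies(n):
    """Convert absolute frequencies (numbers of sequences) to proportions."""
    total = sum(n)
    return [value / total for value in n]


# Toy absolute frequencies for X = 0, 1, 2, 3 occurrences.
n = [3, 3, 1, 1]
f = relative_frequencies(n)

# A log10 scale spreads out the small frequencies; a zero frequency has no
# logarithm, so it is left out (None) here.
log_f = [math.log10(value) if value > 0 else None for value in f]
```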