footprint-discovery
$program_version
Detect phylogenetic footprints by applying dyad-analysis in promoters of a set of orthologous genes.
Adapted from the procedure described in Janky & van Helden (2008).
Sequences
Motif discovery
footprint-discovery [-i inputfile] -o [output_prefix] \ -org query_organism -taxon ref_taxon \ -q query_gene [-q query_gene2 ...] \ [-v #] [...]
Discover conserved motifs in the promoters of the orthologs of lexA in Enterobacteriaceae.
footprint-discovery -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
-lth occ 1 -lth occ_sig 0 -uth rank 50 \
-return occ,proba,rank -filter \
-bg_model taxfreq -q lexA
Discover conserved motifs in the promoters of the orthologs of lexA, recA and uvrA in Enterobacteriaceae, analyzing each gene separately.
footprint-discovery -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
-lth occ 1 -lth occ_sig 0 -uth rank 50 \
-return occ,proba,rank -filter \
-bg_model taxfreq \
-sep_genes -q lexA -q recA -q uvrA
Note the option -sep_genes indicating that the genes have to be analyzed separately rather than grouped.
The genes can also be specified in a file with the option -genes.
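As a minimal sketch of the option -genes, the commands below write a gene list to a file and pass it to the program. The file name my_genes.txt and the one-gene-per-line layout are assumptions made for illustration; all other options are copied from the examples above.

```shell
# Hypothetical gene list file; one gene name per line is an assumption.
cat > my_genes.txt <<'EOF'
lexA
recA
uvrA
EOF

# Run the analysis only if footprint-discovery is installed on this machine.
if command -v footprint-discovery >/dev/null 2>&1; then
  footprint-discovery -v 1 \
    -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 \
    -taxon Enterobacteriaceae \
    -lth occ 1 -lth occ_sig 0 -uth rank 50 \
    -return occ,proba,rank -filter \
    -bg_model taxfreq \
    -sep_genes -genes my_genes.txt
else
  echo "footprint-discovery not found in PATH" >&2
fi
```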
Iterate footprint discovery over all the genes of the query genome, treating each gene separately.
footprint-discovery -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
-lth occ 1 -lth occ_sig 0 -uth rank 50 \
-return occ,proba,rank -filter \
-bg_model taxfreq -all_genes -sep_genes
The program takes as input a taxon of interest and one or several query genes.
The output consists of a set of files containing the results of the different steps of the analysis.
Log file listing the analysis parameters + output file names;
List of query genes (one or several genes can be entered)
List of orthologous genes
Promoter sequences of the orthologous genes
Purged promoter sequences (for motif discovery)
[prefix]_ortho_filter_dyads.tab
Dyads found in the query genes (for dyad filtering)
Significant dyads found in the promoters of orthologous genes (the footprints)
Assembled dyads
Feature-map
Janky, R. and van Helden, J. (2008). Evaluation of phylogenetic footprint discovery for the prediction of bacterial cis-regulatory elements. BMC Bioinformatics 9:37. [Pubmed 18215291]
Brohee, S., Janky, R., Abdel-Sater, F., Vanderstocken, G., Andre, B. and van Helden, J. (2011). Unraveling networks of co-regulated genes on the sole basis of genome sequences. Nucleic Acids Res. [Pubmed 21572103] [Open access]
The following options are not yet implemented, but this should be done soon.
Specify a file containing a list of taxa, each of which will be analyzed separately. The results are stored in a separate folder for each taxon. The folder name is defined automatically.
Automatically analyze all the taxa, and store each result in a separate folder (the folder name is defined automatically).
Lower threshold for dyad-analysis.
See the manual of dyad-analysis for a description of the fields on which a threshold can be imposed.
Upper threshold for dyad-analysis.
See the manual of dyad-analysis for a description of the fields on which a threshold can be imposed.
Return fields for dyad-analysis. This argument is passed to dyad-analysis for the discovery of dyads in promoters of orthologous genes.
Multiple fields can be entered either by calling this argument iteratively or by entering several fields separated by commas.
Type dyad-analysis -help to obtain the list of supported return fields.
Allow the user to choose among alternative background models (see Janky & van Helden, 2008).
Supported background model types:
Expected dyad frequencies are estimated by taking the product of the monad frequencies observed in the input sequence set. Example:
F_exp(CAGn{10}GTA) = F_obs(CAG) * F_obs(GTA)
Only valid in combination with the option -taxon.
Expected dyad frequencies are computed by summing the frequencies of all dyads in the non-coding upstream sequences of all genes for all the organisms of the reference taxon.
Only valid in combination with the option -org_list.
Expected dyad frequencies are computed by summing the frequencies of all dyads in the non-coding upstream sequences of all genes for each organism of the user-specified list.
Only valid in combination with the option -bgfile.
Specifies that the background model used for dyad-analysis is read from a file given as argument (with the option -bgfile, see below).
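To make the monads model described above concrete, the following sketch computes the expected frequency of the dyad CAGn{10}GTA as the product of two monad frequencies. The numeric values are hypothetical and were chosen only to illustrate the calculation.

```shell
# Hypothetical observed monad frequencies (illustrative values only).
f_cag=0.012
f_gta=0.009

# F_exp(CAGn{10}GTA) = F_obs(CAG) * F_obs(GTA)
f_exp=$(awk -v a="$f_cag" -v b="$f_gta" 'BEGIN { printf "%.6f", a * b }')
echo "F_exp(CAGn{10}GTA) = $f_exp"
```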
File containing the word frequencies to be used as the background model for dyad-analysis. This option must be used in combination with the option -bg_model file.
Only accept dyads found in the promoter of the query gene, in the query organism. (option selected by default)
Accept all dyads, even if they are not found in the promoter of the query gene, in the query organism. (will cancel -filter option if selected)
Maximal dyad degree for network inference. Default: 20.
Some dyads are found significant in a very large number of genes, for various reasons (binding motifs of global factors, low-complexity motifs). These "ubiquitous" dyads create many links in the network, which makes it difficult to extract clusters of putatively co-regulated genes. To circumvent this problem, we discard "hub" dyads, i.e. dyads found in the footprints of too many query genes.
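The filtering idea can be sketched with standard tools. The example below counts in how many query genes each dyad occurs and discards the dyads whose degree exceeds the maximum. The file footprints.tab, its two-column gene/dyad layout and the low threshold of 2 are hypothetical; they do not reflect the actual file formats used by footprint-discovery.

```shell
# Hypothetical table: query gene <TAB> significant dyad.
printf 'lexA\tctgn{10}cag\nrecA\tctgn{10}cag\nuvrA\taaan{5}ttt\nlexA\taaan{5}ttt\nrecA\taaan{5}ttt\n' > footprints.tab

max_degree=2   # illustrative value; the documented default is 20

# Count distinct genes per dyad, then keep dyads with degree <= max_degree.
kept=$(sort -u footprints.tab |
  awk -F '\t' -v max="$max_degree" '
    { degree[$2]++ }
    END { for (d in degree) if (degree[d] <= max) print d }' | sort)
echo "$kept"
```

Here the dyad found in three genes is treated as a hub and dropped, while the dyad found in two genes is kept.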
Level of verbosity (detail in the warning messages during execution)
Display full help message.
Same as -h
Query organism, to which the query genes belong.
Reference taxon, in which orthologous genes have to be collected.
Alternatively, reference organisms can be specified with the option -org_list.
This option makes it possible to analyse a user-specified set of reference organisms rather than a full taxon.
File format: the first word of each line is used as organism ID. Any subsequent text is ignored. The comment char is ";".
This option is incompatible with the option "-taxon".
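The file format can be illustrated as follows. The sketch writes a small organism list and extracts the IDs according to the rule stated above (first word of each line, remaining text ignored). The file name, the placeholder organism names and the assumption that comments start at the beginning of a line are all hypothetical.

```shell
# Hypothetical organism list; only the first word of each line is the organism ID.
cat > ref_organisms.txt <<'EOF'
; reference organisms (comment line)
Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 query organism
Hypothetical_organism_A   any trailing text is ignored
Hypothetical_organism_B
EOF

# Skip comment lines (assumed to start with ";") and empty lines, keep the first word.
org_ids=$(awk '!/^;/ && NF > 0 { print $1 }' ref_organisms.txt)
echo "$org_ids"
```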
This option can only be used in combination with the option -org_list. It makes it possible to analyse a given set of sequences while managing sequence redundancy by means of a list of non-redundant organisms.
The file format is one organism per line. The comment char is ";".
This option makes it possible to analyse a user-specified set of orthologs for specific reference organisms instead of using the BBH set of orthologs provided by RSAT.
The query genes included here will be the ones analyzed by the program.
File format: Tab delimited file with three columns.
ID of the query gene (in the query organism)
ID of the reference gene
ID of the reference organism
Further columns will be ignored. The comment char is ";".
This option is incompatible with the options "-taxon" and "-bg_model taxfreq".
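A minimal sketch of such a file is given below, together with a sanity check of the three-column layout. All gene and organism identifiers, as well as the file name my_orthologs.tab, are hypothetical placeholders.

```shell
# Hypothetical orthologs table: query gene ID, reference gene ID, reference organism ID.
{
  printf '; query_gene\tref_gene\tref_organism\n'
  printf 'query_gene_1\tref_gene_1a\tHypothetical_organism_A\n'
  printf 'query_gene_1\tref_gene_1b\tHypothetical_organism_B\n'
  printf 'query_gene_2\tref_gene_2a\tHypothetical_organism_A\n'
} > my_orthologs.tab

# Count non-comment lines that have fewer than three tab-separated columns.
bad_lines=$(awk -F '\t' '!/^;/ && NF > 0 && NF < 3' my_orthologs.tab | wc -l | tr -d ' ')
echo "lines with fewer than 3 columns: $bad_lines"

# Query genes listed in the file (column 1, duplicates removed).
query_genes=$(awk -F '\t' '!/^;/ && NF >= 3 { print $1 }' my_orthologs.tab | sort -u)
echo "$query_genes"
```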
Query gene.
This option can be used iteratively on the command line to specify multiple genes.
Specify a file containing a list of genes. Multiple genes can also be specified by using the option -q iteratively.
Automatically analyze all the genes of a query genome, and store each result in a separate folder (the folder name is defined automatically).
Maximal number of genes to analyze.
Skip the first # genes (useful for quick testing and for resuming interrupted tasks).
Stop after having treated the first # genes (useful for quick testing).
Main output directory. The results will be dispatched in sub-directories, defined according to the taxon, query organism and query gene name(s).
If the main output dir is not specified, the program automatically sets it to "footprints".
Generate one command per query gene, and post it on the queue of a PC cluster.
Dry run: print the commands but do not execute them.
Do not die in case a sub-program returns an error.
The option -nodie allows you to circumvent problems with specific sub-tasks, but this is not recommended because the results may be incomplete.
Search footprints for each query gene separately. The results are stored in a separate folder for each gene. The folder name is defined automatically.
By default, when several query genes are specified, the program collects their orthologs and analyzes all the promoters together. The option -sep_genes automates footprint detection in a set of genes that are treated separately.
Infer operons in order to retrieve the promoters of the predicted operon leader genes rather than those located immediately upstream of the orthologs. This method uses a threshold on the intergenic distance.
Intergenic distance threshold, in base pairs. Pairs of adjacent genes separated by an intergenic distance less than or equal to this value are predicted to belong to the same operon. Default: 55.
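The distance rule can be sketched as below. The gene names and intergenic distances are hypothetical, and the sketch assumes that all listed pairs are adjacent genes on the same strand; the actual inference is performed by infer-operons.

```shell
# Hypothetical table: upstream gene, downstream gene, intergenic distance (bp).
printf 'geneA\tgeneB\t12\ngeneB\tgeneC\t55\ngeneC\tgeneD\t230\n' > adjacent_pairs.tab

dist_thr=55   # default threshold stated in the documentation

# Pairs with a distance <= threshold are predicted to belong to the same operon.
predictions=$(awk -F '\t' -v thr="$dist_thr" '
  { status = ($3 <= thr) ? "same_operon" : "separate"
    print $1 "-" $2 ": " status }' adjacent_pairs.tab)
echo "$predictions"
```

With these values, geneA, geneB and geneC are grouped, so the promoter of geneA (the predicted leader gene) would be the one retrieved for all three.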
Specify a subset of tasks to be executed.
By default, the program runs all necessary tasks. However, in some cases, it can be useful to select one or several tasks to be executed separately. For instance, after having collected all the promoter sequences of ortholog genes, one might desire to run the pattern detection with various parameter values without having to retrieve the same sequences each time.
Beware: task selection requires expertise, because most tasks depend on the prior execution of other tasks in the workflow. Selecting a task before its prerequisite tasks have been completed will cause fatal errors.
Supported tasks:
Run all supported tasks. If no task is specified, all tasks are performed.
Infer operons (using infer-operons). This option should be used only for Bacteria.
Retrieve upstream sequence of the query genes (using retrieve-seq).
Identify the orthologs of the query genes in the selected taxon (using get-orthologs).
Retrieve upstream sequences of the orthologs (using retrieve-seq-multigenome).
Purge upstream sequences of the orthologs (using purge-seq).
Generate an HTML index with links to the result files. This option is used for the web interface, but can also be convenient to index results, especially when several genes or taxa are analyzed (options -genes, -all_genes, -all_taxa).
With the option -sep_genes, one index is generated for each gene separately. An index summarizing the results for all genes can be generated using the option -synthesis.
Generate an HTML table with links to the individual result files. The table contains one row per query gene and one column per output type (sequences, dyads, maps, ...) for footprint-discovery, and one line per TF-gene interaction for footprint-scan.
Detect all dyads present with at least one occurrence in the upstream sequence of the query gene (using dyad-analysis). Those dyads will be used as a filter if the option -filter has been specified.
Detect dyads significantly over-represented in upstream sequences of orthologs (using dyad-analysis).
Draw feature maps showing the location of over-represented dyads in upstream sequences of promoters (using feature-map).
Infer a co-regulation network from the footprints, as described in Brohee et al. (2011).
Generate an index file for each gene separately. The index file is stored in the gene-specific directory and is complementary to the general index file generated with the task "synthesis".
Orthologous genes will be obtained for the genes related to the specified transcription factors. This task should be executed before the option -orthologs when a TF is specified. See the description of the option -tf for more information.
Compute the significance of number of matrix hit occurrences as a function of the weight score (using matrix-scan and matrix-scan-quick).
Generate graphs showing the distributions of occurrences and their significances, as a function of the weight score (using XYgraph).
Scan upstream sequences to detect hits above a given threshold (using matrix-scan).
Draw the feature map of the hits (using feature-map).
When the option -rand is activated, the program replaces each ortholog by a gene selected at random in the genome where this ortholog was found.
This option is used (for example by footprint-scan and footprint-discovery) to perform negative controls, i.e. to check the rate of false positives in randomly selected promoters of the reference taxon.
Format for the feature map.
Supported: any format supported by the program feature-map.
Deprecated, replaced by the task "index".
This option generated synthetic tables (in tab-delimited text and html) for all the results. It should be combined with the option -sep_genes. The synthetic tables contain one row per gene, and one column per parameter. They summarize the results (maximal significance, top-ranking motifs) and give pointers to the detailed result files.