infer-operons
Given a list of input genes, infer the operon to which each of these genes belongs.
The inference is based on a very simplistic distance-based method, inspired by the Salgado-Moreno method (Proc Natl Acad Sci U S A. 2000;97:6652-7). The Salgado-Moreno method classifies intergenic distances as TUB (transcription unit border) or OP (inside operon), and infers operons by iteratively collecting genes until a TUB is found. In the original method, the TUB or OP assignment relies on a log-likelihood score calculated from a training set.
The difference is that we do not use the log-likelihood (which presents risks of over-fitting), but a simple threshold on distance. Thus, we infer that the region upstream of a gene is a TUB if its size is larger than a given distance threshold, and OP otherwise. Our validations (Rekin's Janky and Jacques van Helden, unpublished results) show that a simple threshold on distance achieves an accuracy similar to that of the log-likelihood score (Acc ~ 78% for a threshold t=55).
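The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name `classify_intergenic` is hypothetical, and the default of 55 is simply the value quoted in the validation above, which is not necessarily the program's default.

```python
def classify_intergenic(distance, threshold=55):
    """Label an intergenic region as 'TUB' or 'OP' (hypothetical helper).

    distance  -- size of the intergenic region, in base pairs
                 (may be negative when the two genes overlap)
    threshold -- assumed cutoff; 55 is the value quoted in the text
    """
    # Strictly larger than the threshold: transcription unit border.
    # Otherwise the region is considered to lie inside an operon.
    return "TUB" if distance > threshold else "OP"
```

With this convention, a distance exactly equal to the threshold is still classified as OP.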
The algorithm is based on three simple rules, depending on the relative orientation of the adjacent genes.
If the gene found upstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon leaders. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.
If the gene found downstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon trailers. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.
If two adjacent genes are on the same strand, then a distance threshold (option -dist) is applied to decide whether they belong to the same operon (dist <= threshold) or not (dist > threshold). If they are predicted to be in distinct operons, the upstream gene is labelled as operon trailer, and the downstream gene as leader of the next operon.
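The three rules above can be combined into a single pass over the genes sorted by position. The following Python sketch is an illustration under stated assumptions, not the program's implementation: the `Gene` tuple, the function `infer_operons`, the definition of distance as the number of bases between two genes, the default threshold of 55, and the treatment of the chromosome as linear are all assumptions made for the example. The gene names and coordinates are toy values.

```python
from collections import namedtuple

# Hypothetical gene record: coordinates are 1-based, start < end,
# strand is "+" or "-".
Gene = namedtuple("Gene", "name start end strand")

def infer_operons(genes, threshold=55):
    """Group genes into operons; each operon is listed leader first."""
    ordered = sorted(genes, key=lambda g: g.start)
    groups = []
    current = []
    for gene in ordered:
        if current:
            prev = current[-1]
            # Assumed distance: number of bases between the two genes.
            distance = gene.start - prev.end - 1
            # Rules 1 and 2: a strand change is always a border (TUB).
            # Rule 3: same strand, but distance above the threshold.
            if gene.strand != prev.strand or distance > threshold:
                groups.append(current)
                current = []
        current.append(gene)
    if current:
        groups.append(current)
    operons = []
    for group in groups:
        # On the reverse strand, transcription runs from high to low
        # coordinates, so the leader is the rightmost gene.
        if group[0].strand == "-":
            group = group[::-1]
        operons.append([g.name for g in group])
    return operons

# Toy annotation (invented coordinates, for illustration only).
toy_genes = [
    Gene("a", 1, 100, "+"),
    Gene("b", 120, 300, "+"),
    Gene("c", 400, 500, "+"),
    Gene("d", 520, 600, "-"),
    Gene("e", 610, 700, "-"),
]
toy_operons = infer_operons(toy_genes)
```

In each returned list, the first element plays the role of the operon leader and the last element that of the trailer; a single-gene operon is its own leader and trailer.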
Jacques.van-Helden@univ-amu.fr
genomics
infer-operons [-i inputfile] [-o outputfile] [-v] [options]
With the following command, we infer the operon for a set of input genes.
infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q hisD -q mhpR -q mhpA -q mhpD
We now specify different return fields.
infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q hisD -q lacI -return leader,trailer,up_info,down_info,operon
Infer operons for all the genes of an organism.
infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -all -return up_info,leader,operon
Infer operons from a set of query genes, and retrieve the upstream sequence of each inferred leader gene. Note that two of the input genes (lacZ, lacY) belong to the same operon. To avoid including their leader twice, we use the Unix command sort -u (unique).
infer-operons -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -return leader,operon -q lacI -q lacZ -q lacY | sort -u | retrieve-seq -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -noorf
Note that operons can contain non-coding genes. For example, the metT operon contains a series of tRNA genes for methionine, leucine and glutamine, respectively.
infer-operons -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q glnV -q metU -q ileV -return q_info,up_info,operon
Each row of the input file specifies one query gene. The first word of each row is the query; the rest of the row is ignored.
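The input format described above can be read with a few lines of Python. This is a minimal sketch rather than the program's own parser: the function name `read_queries` is hypothetical, and skipping empty rows is an assumption that the text does not state.

```python
def read_queries(lines):
    """Return the first word of each non-empty row (hypothetical helper)."""
    queries = []
    for line in lines:
        words = line.split()  # split on any run of whitespace
        if words:             # assumption: blank rows are skipped
            queries.append(words[0])
    return queries

example_rows = [
    "lacZ\tfree text that is ignored",
    "",
    "hisD another ignored comment",
]
example_queries = read_queries(example_rows)
```

The function accepts any iterable of strings, so an open file handle or the standard input stream can be passed to it directly.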
Level of verbosity (detail in the warning messages during execution)
Display full help message
Same as -h
If no input file is specified, the standard input is used. This allows the command to be used within a pipe.
Organism name.
Infer operons for all the genes of the query organism.
Query gene. This option can be used repeatedly on the same command line to specify several query genes. Example:
infer-operons -org Escherichia_coli_K12 -q LACZ -q hisA
If no output file is specified, the standard output is used. This allows the command to be used within a pipe.
Distance threshold.
Specify the separator for multi-value fields (e.g.: genes) in the output table. By default, multi-value fields are exported in a single column with a semicolon (";") as separator.
Specify a threshold on the number of genes in the operon. This option is generally used when predicting all operons (option -all), in order to only return predicted polycistronic transcription units (-min_gene_nb 2) or to restrict the output to operons predicted to contain at least a given number of genes (e.g. -min_gene_nb 4).
List of fields to return.
Supported fields: leader,trailer,operon,query,q_info,up_info,down_info
Predicted operon leader.
Predicted operon trailer.
Full composition of the operon. The names of member genes are separated by a semicolon ";" (note that the gene separator can be changed using the option -sep).
Detailed info on the query gene(s).
Detailed info on the upstream gene.
Detailed info on the downstream gene.
Number of genes in the predicted operon.
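The effect of the minimal gene number and of the multi-value separator on the output can be illustrated with a short Python sketch. It is not the program's code: the function `format_operons`, its parameter names, and the toy gene names are hypothetical; only the default separator (";") and the meaning of the threshold come from the descriptions above.

```python
def format_operons(operons, min_gene_nb=1, sep=";"):
    """Return one (gene number, joined members) row per retained operon."""
    rows = []
    for members in operons:
        # Keep only operons with at least min_gene_nb genes.
        if len(members) >= min_gene_nb:
            rows.append((len(members), sep.join(members)))
    return rows

# Hand-written toy input, not actual program output.
toy_predictions = [["geneA", "geneB", "geneC"], ["geneD"]]
polycistronic_rows = format_operons(toy_predictions, min_gene_nb=2)
```

Here the single-gene operon is discarded because the threshold is set to 2, and the members of the remaining operon are exported in a single semicolon-separated field.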