For this reason, in NeAT we developped programs allowing to randomize and to add some specified levels of noise to networks. This allows the user to apply the techniques used to find relevant results on networks where there is less or no signal and thus were no interesting result should emerge.
NeAT programs are able to generate randomized networks according to three methods.
In this demonstration, we will use the approach developped in [2] where we evaluated the performances of different graph clustering algorithms. Graph clustering algorithms allow to retrieve in a graph the groups of nodes that contain more connections between them than with the rest of the nodes of the graph. Clustering algorithms are often used in biology in order to extract coherent groups of nodes from networks (complexes detection (e.g. see [29,19,2,24]), protein families detection [7], co-expressed genes detection in co-expression networks (e.g see [20]), ...). The NeAT web server proposes the MCL (Markov Cluster algorithm) clustering algorithm developped by Stijn van Dongen [33,7]. To follow the command-line tools instructions, you should have MCL installed on your computer (available at http://micans.org/mcl/).
MCL simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. At each iteration, an inflation step is applied to enhance the contrast between regions of strong or weak flow in the graph. The process converges towards a partition of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. The value of the inflation parameter strongly influences the number of clusters. According to [2], the optimal inflation value for clustering protein interaction networks is 1.8.
We will use an artificial interaction network created from the complexes annotated in the MIPS database by creating an edge between all the nodes belonging to the same complex [23]. This network contains 12262 edges between 1095 nodes. We will then use the MCL clustering algorithm on this network, on a little altered network, on a highly altered network and finally on a randomized network.
We will then compare these clusters to the MIPS complexes and estimate how well MCL can retrieve protein complexes from a protein-protein interaction and the influence of the noise on the results.
In this example, we will only use random alteration, i.e., the edges that are removed are randomly chosen. This is done to mimick what happens really in biological experiments where some inter-relationships between the nodes (genes, proteins, metabolites, ...) may not be discovered (false negatives) or are erroneously discovered (false positives). However the alter-graph program also allows to alterate the network with targeted attack on user-selected nodes. In their study, Spirin and Mirny [31] showed the affect of node targeted attacks on clustering results.
Re-do the this alteration procedure using 50% of edges removal and 100% of edges addition. Save the resulting file with name complexes_rm_50_ad_100.tab.
Repeat these steps for complexes_rm_10_ad_10.tab, complexes_rm_50_ad_100.tab and complexes_rm_00_ad_00_random.tab and save the resulting files under the names contigency_stats_ad_10_rm_10.tab, contigency_stats_ad_50_rm_100.tab, contigency_stats_ad_00_rm_00_random.tab, respectively.
If you have installed a stand-alone version of the NeAT distribution, you can use the programs random-graph and alter-graph on the command-line. This requires to be familiar with the Unix shell interface. If you don't have the stand-alone tools, you can skip this section and read the next section (Interpretation of the results).
We will now describe the use of random-graph, alter-graph, compare-classes and contingency-stats as command line tools. For this tutorial, you need to have the MCL program installed.
Start by going on the demo dataset download web page.http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/ and downloading the MIPS complex network file (complexes_rm_00_ad_00.tab) and the complexes (mips_complexes.tab).
alter-graph -v 1 -i complexes_rm_00_ad_00.tab \ -rm_edges 10% -add_edges 10% \ | cut -f 1,2 | grep -v ';' > complexes_rm_10_ad_10.tabRe-use this command, but modify the percentage of removed (-rm_edges 50%) and added edges (-add_edges 100%). Save the resulting file with name complexes_rm_50_ad_100.tab.
random-graph -v 1 -i complexes_rm_00_ad_00.tab \ -random_type node_degree \ | cut -f 1,2 | grep -v ';' > complexes_rm_00_ad_00_random.tab
mcl complexes_rm_00_ad_00.tab \ --abc -I 1.8 -o complexes_rm_00_ad_00_clusters.mcl
convert-classes -i complexes_rm_00_ad_00_clusters.mcl -from mcl -to tab -o complexes_rm_00_ad_00_clusters.tab
compare-classes -q complexes_rm_00_ad_00_clusters.tab \ -r mips_complexes.tab \ -matrix QR \ -o complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab
contingency-stats -i complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab \ -o contigency_stats_ad_00_rm_00.tab
Repeat these steps for complexes_rm_10_ad_10.tab, complexes_rm_50_ad_100.tab and complexes_rm_00_ad_00_random.tab and save the resulting files under the names contigency_stats_ad_10_rm_10.tab, contigency_stats_ad_50_rm_100.tab, contigency_stats_ad_00_rm_00_random.tab, respectively.
We will now compare the performances of MCL when applied to networks containing an increasing proportion of noise or no signal at all.
See the previous chapter (Graph clustering) for a complete description of a contigency table.
The table summarizes the kind of values that should be obtained for the metrics described in the previous section. As the alteration and the randomization procedure are random processes, you should not obtain exactly the same results.
# | true | ad10 / rm10 | ad100 / rm50 | random |
---|---|---|---|---|
ncol | 125 | 114 | 713 | 361 |
nrow | 220 | 220 | 220 | 220 |
mean | 0.0569 | 0.0624 | 0.00998 | 0.0197 |
Sn | 0.998 | 0.985 | 0.418 | 0.291 |
PPV | 0.884 | 0.836 | 0.867 | 0.459 |
acc | 0.941 | 0.91 | 0.642 | 0.375 |
acc_g | 0.939 | 0.907 | 0.602 | 0.365 |
Sn_w | 0.997 | 0.992 | 0.502 | 0.157 |
PPV_w | 0.621 | 0.62 | 0.688 | 0.244 |
acc_g_w | 0.787 | 0.785 | 0.588 | 0.196 |
sep_r | 0.567 | 0.507 | 0.676 | 0.192 |
sep_c | 0.998 | 0.979 | 0.208 | 0.117 |
sep | 0.752 | 0.704 | 0.375 | 0.15 |
As expected, the value of the global parameters, the geometric accuracy (row acc_g), the weighted geometric accuracy (row acc_g_w) and the separation (row sep) decrease drastically as the network contain less and less relevant information.
We can observe that the sensitivity is more affected than the PPV and that the complex wise separation (sep_r) is more affected than the cluster wise separation. This is due to the fact that by increasing the noise, MCL increases the number of small sized clusters (ncol) too and, as we saw in previous section, this has an impact on the sensitivity.
Note that with a random graph, we would have a separation of 0.15 but an unweighted geometric accuracy of 0.365 which is far from being 0. The relatively good performances of MCL on the highly altered graph must thus be taken with caution as the gain in performances is only of 23%. This illustrates the interest of using negative controls.