random walk method to find causal pathway

This paper described an algorithm to find the paths of casual interactions between two genes. Lets say we found a locus that is associated with the changes in expression level of a gene, but the region might contains multiple genes and even if we know the right gene, we don’t know how it affect the target gene. They use a random walk method to find a path from a possible source to the target gene in a biological network.

They create a biological network using protein-protein interaction (represented as undirected edges) and protein phosphorylation and TF-DNA binding (represented as directed edges). For a target gene g_t, we have the list of TFs T = (t₁, t₂, …, t_n) and candidate causal genes are C = (g_C₁, …, g_{C_m}). For each t_k in T, they initiate a search. In their search, when they are at node g, they want to select a gene g_i from neighborhood of g that is more likely to have a causal relation with the target gene g_t, therefore they use the correlation between g_i and g_t as the probability of selecting g_i (they set an upper limit to the correlation value, so that other genes that might have causal relation with the target gene but have lower correlation due to post-translational regulation mechanisms, have a chance of selection). note that they don’t use the correlation between g_i and g but between g_i and the target gene. using this method, they start at t_k and randomly visit one of its neighbors, and resume the search until they reach a gene in causal set C, or when entering a dead end (reaching a gene that they have visited all of its neighbors before) or reach maximum number of transitions.

If the search reach a gene g_c in C, they will have a path from g_c to g_t. They repeat this procedure N time for each t_k. They approximate the probability of reaching g_c from t_k, p(g_c , t_k) using the number of times that g_c was visited in a search initiated from t_k. they calculate the probability of each causal gene g_c as weighted sum of p(g_c , t_k) (weighted using the probability of t_k causing g_t). From the set of all possible causal genes C, they can select the gene g_c with highest probability.

To test their method, they used a knock-out experiment to create a simulation QTL. For each knockout experiment, they select genes with significant change in expression level as target gene, and then create a QTL region by putting the knocked-out gene and 9 other genes together. then they run their procedure to select the causal gene. If the algorithm select the knocked-out gene, it is correct prediction, if it select another gene it is incorrect prediction. They reached 46% accuracy; they expect 10% success by randomly selecting a gene as causal gene, so their method is more than 4 times better than random. (They also ran their method on yeast QTL data from Brem and Kruglyak and reported their findings).

Just Compiling

A space to share thoughts on computational systems biology

random walk method to find causal pathway