random walk method to find causal pathway

This paper described an algorithm to find the paths of casual interactions between two genes. Lets say we found a locus that is associated with the changes in expression level of a gene, but the region might contains multiple genes and even if we know the right gene, we don’t know how it affect the target gene. They use a random walk method to find a path from a possible source to the target gene in a biological network.

They create a biological network using protein-protein interaction (represented as undirected edges) and protein phosphorylation and TF-DNA binding (represented as directed edges). For a target gene gt, we have the list of TFs T = (t1, t2, …, tn) and candidate causal genes are C = (gC1, …, gCm). For each tk in T, they initiate a search. In their search, when they are at node g, they want to select a gene gi from neighborhood of g that is more likely to have a causal relation with the target gene gt, therefore they use the correlation between gi and gt as the probability of selecting gi (they set an upper limit to the correlation value, so that other genes that might have causal relation with the target gene but have lower correlation due to post-translational regulation mechanisms, have a chance of selection). note that they don’t use the correlation between gi and g but between gi and the target gene. using this method, they start at tk and randomly visit one of its neighbors, and resume the search until they reach a gene in causal set C, or when entering a dead end (reaching a gene that they have visited all of its neighbors before) or reach maximum number of transitions.

If the search reach a gene gc in C, they will have a path from gc to gt. They repeat this procedure N time for each tk. They approximate the probability of reaching gc from tk, p(gc , tk) using the number of times that gc was visited in a search initiated from tk. they calculate the probability of each causal gene gc as weighted sum of p(gc , tk) (weighted using the probability of tk causing gt). From the set of all possible causal genes C, they can select the gene gc with highest probability.

To test their method, they used a knock-out experiment to create a simulation QTL. For each knockout experiment, they select genes with significant change in expression level as target gene, and then create a QTL region by putting the knocked-out gene and 9 other genes together. then they run their procedure to select the causal gene. If the algorithm select the knocked-out gene, it is correct prediction, if it select another gene it is incorrect prediction. They reached 46% accuracy; they expect 10% success by randomly selecting a gene as causal gene, so their method is more than 4 times better than random. (They also ran their method on yeast QTL data from Brem and Kruglyak and reported their findings).