network_inference


Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome

Our next meeting will be at 2pm on Mar 12th, in room 4160 of the Discovery building. Our selected paper is Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome.
The abstract is as follows.

Motivation: Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific.

Results: We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that use transcription factor sequence preferences in the form of position weight matrices, predicting binding for transcription factors (accuracy > 0.99; Matthews correlation coefficient > 0.3). In at least one validation cell type, performance of Virtual ChIP-seq is higher than all participants of the DREAM Challenge for in vivo transcription factor binding site prediction in 4 of 9 transcription factors that we could compare to.
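As a discussion aid: the abstract reports both accuracy and the Matthews correlation coefficient (MCC). The sketch below illustrates why the two can diverge when bound sites are rare. The confusion-matrix counts are hypothetical and are not taken from the paper; the function name is ours.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, MCC) computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Hypothetical example: 1,000,000 genomic bins, of which 1,000 are truly bound.
acc_model, mcc_model = confusion_metrics(tp=300, fp=700, tn=998_300, fn=700)

# A trivial predictor that calls every bin "unbound".
acc_trivial, mcc_trivial = confusion_metrics(tp=0, fp=0, tn=999_000, fn=1_000)

print(f"model:   accuracy={acc_model:.4f}  MCC={mcc_model:.3f}")
print(f"trivial: accuracy={acc_trivial:.4f}  MCC={mcc_trivial:.3f}")
```

With these made-up counts the trivial predictor has the higher accuracy but an MCC of zero, which is one reason to read the two numbers together.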

 

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.


Vicus: Exploiting local structures to improve network-based analysis of biological data

Our next meeting will be at 11:00 on Oct 24th, in room 4160 of the Discovery building. Our selected paper is Vicus: Exploiting local structures to improve network-based analysis of biological data.
The abstract is as follows.

Biological networks entail important topological features and patterns critical to understanding interactions within complicated biological systems. Despite a great progress in understanding their structure, much more can be done to improve our inference and network analysis. Spectral methods play a key role in many network-based applications. Fundamental to spectral methods is the Laplacian, a matrix that captures the global structure of the network. Unfortunately, the Laplacian does not take into account intricacies of the network’s local structure and is sensitive to noise in the network. These two properties are fundamental to biological networks and cannot be ignored. We propose an alternative matrix Vicus. The Vicus matrix captures the local neighborhood structure of the network and thus is more effective at modeling biological interactions. We demonstrate the advantages of Vicus in the context of spectral methods by extensive empirical benchmarking on tasks such as single cell dimensionality reduction, protein module discovery and ranking genes for cancer subtyping. Our experiments show that using Vicus, spectral methods result in more accurate and robust performance in all of these tasks.
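As background for the discussion, here is a minimal sketch of the standard (unnormalized) graph Laplacian that the abstract contrasts with Vicus. It does not implement the Vicus matrix itself. The toy graph and the function names are our own illustration, and the edge list is assumed to contain each undirected edge once.

```python
def graph_laplacian(n, edges):
    """Unnormalized Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1
        L[j][i] -= 1
        L[i][i] += 1
        L[j][j] += 1
    return L

def quadratic_form(L, x):
    """Compute x^T L x, which equals the sum of (x_i - x_j)^2 over all edges."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
L = graph_laplacian(6, edges)

# For a 0/1 indicator vector, x^T L x counts the edges cut by the partition.
indicator = [1, 1, 1, 0, 0, 0]
cut_size = quadratic_form(L, indicator)
print("edges cut by splitting the two triangles:", cut_size)
```

Spectral methods look for vectors that make this quadratic form small, which is why the Laplacian reflects global cut structure.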

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.


Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance

Our next meeting will be at 11:00 on Oct 10th, in room 4160 of the Discovery building. Our selected paper is Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance.
The abstract is as follows.

Background: Identification of genes whose basal mRNA expression predicts the sensitivity of tumor cells to cytotoxic treatments can play an important role in individualized cancer medicine. It enables detailed characterization of the mechanism of action of drugs. Furthermore, screening the expression of these genes in the tumor tissue may suggest the best course of chemotherapy or a combination of drugs to overcome drug resistance.

Results: We developed a computational method called ProGENI to identify genes most associated with the variation of drug response across different individuals, based on gene expression data. In contrast to existing methods, ProGENI also utilizes prior knowledge of protein–protein and genetic interactions, using random walk techniques. Analysis of two relatively new and large datasets including gene expression data on hundreds of cell lines and their cytotoxic responses to a large compendium of drugs reveals a significant improvement in prediction of drug sensitivity using genes identified by ProGENI compared to other methods. Our siRNA knockdown experiments on ProGENI-identified genes confirmed the role of many new genes in sensitivity to three chemotherapy drugs: cisplatin, docetaxel, and doxorubicin. Based on such experiments and extensive literature survey, we demonstrate that about 73% of our top predicted genes modulate drug response in selected cancer cell lines. In addition, global analysis of genes associated with groups of drugs uncovered pathways of cytotoxic response shared by each group.
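The abstract mentions random walk techniques on an interaction network. For readers new to the idea, the following is a generic random walk with restart on a toy graph. It is a sketch only and does not reproduce ProGENI's actual formulation; the path graph, the restart probability of 0.5, and the function name are assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adj   : symmetric adjacency matrix (list of lists); every node needs an edge
    seeds : non-negative seed weights, one per node (normalized internally)
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    seed_total = sum(seeds)
    p0 = [s / seed_total for s in seeds]
    p = p0[:]
    for _ in range(max_iter):
        new = [
            (1 - restart) * sum(adj[i][j] / col_sums[j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adj, seeds=[1, 0, 0, 0])
print([round(s, 4) for s in scores])
```

Nodes closer to the seed receive higher scores, which is the intuition behind using network proximity to prioritize genes.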

Conclusions: Our results suggest that knowledge-guided prioritization of genes using ProGENI gives new insight into mechanisms of drug resistance and identifies genes that may be targeted to overcome this phenomenon.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.


LASSIM—A network inference toolbox for genome-wide mechanistic modeling

Our next meeting will be at 2:30 on August 18th, in room 4160 of the Discovery building. Our selected paper is LASSIM—A network inference toolbox for genome-wide mechanistic modeling.
The abstract is as follows.

Recent technological advancements have made time-resolved, quantitative, multi-omics data available for many model systems, which could be integrated for systems pharmacokinetic use. Here, we present large-scale simulation modeling (LASSIM), which is a novel mathematical tool for performing large-scale inference using mechanistically defined ordinary differential equations (ODE) for gene regulatory networks (GRNs). LASSIM integrates structural knowledge about regulatory interactions and non-linear equations with multiple steady state and dynamic response expression datasets. The rationale behind LASSIM is that biological GRNs can be simplified using a limited subset of core genes that are assumed to regulate all other gene transcription events in the network. The LASSIM method is implemented as a general-purpose toolbox using the PyGMO Python package to make the most of multicore computers and high performance clusters, and is available at https://gitlab.com/Gustafsson-lab/lassim. As a method, LASSIM works in two steps, where it first infers a non-linear ODE system of the pre-specified core gene expression. Second, LASSIM in parallel optimizes the parameters that model the regulation of peripheral genes by core system genes. We showed the usefulness of this method by applying LASSIM to infer a large-scale non-linear model of naïve Th2 cell differentiation, made possible by integrating Th2 specific bindings, time-series together with six public and six novel siRNA-mediated knock-down experiments. ChIP-seq showed significant overlap for all tested transcription factors. Next, we performed novel time-series measurements of total T-cells during differentiation towards Th2 and verified that our LASSIM model could monitor those data significantly better than comparable models that used the same Th2 bindings. 
In summary, the LASSIM toolbox opens the door to a new type of model-based data analysis that combines the strengths of reliable mechanistic models with truly systems-level data. We demonstrate the power of this approach by inferring a mechanistically motivated, genome-wide model of the Th2 transcription regulatory system, which plays an important role in several immune related diseases.
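To make the idea of an ODE model with core and peripheral genes concrete for the discussion, here is a toy simulation. The sigmoid rate law, the parameter values, and the forward-Euler integrator are all our assumptions; this is not the LASSIM toolbox or its API, and no parameter fitting is shown.

```python
import math

def simulate(weights, k, lam, x0, dt=0.01, steps=5000):
    """Forward-Euler integration of
    dx_i/dt = k_i * sigmoid(sum_j w_ij * x_j) - lam_i * x_i.
    """
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        drive = [sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
        dx = [k[i] / (1.0 + math.exp(-drive[i])) - lam[i] * x[i] for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x

# Hypothetical two-gene system: gene 0 is a "core" gene with no regulators,
# gene 1 is a "peripheral" gene activated by gene 0 with weight 2.0.
weights = [
    [0.0, 0.0],
    [2.0, 0.0],
]
k = [1.0, 1.0]      # maximal production rates (assumed)
lam = [0.5, 0.5]    # degradation rates (assumed)

steady = simulate(weights, k, lam, x0=[0.0, 0.0])
print("approximate steady state:", [round(v, 4) for v in steady])
```

In this toy setting the peripheral gene's equation depends only on the core gene's trajectory, so once the core system is fixed, peripheral genes could be fitted independently of one another, which is consistent with the parallel second step the abstract describes.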

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.


Predicting tissue specific transcription factor binding sites

Our selection for our meeting on the 16th of May is Predicting tissue specific transcription factor binding sites. We will meet as usual in room 3160 of the Discovery building at 12:30 PM. The abstract is as follows.

Background

Studies of gene regulation often utilize genome-wide predictions of transcription factor (TF) binding sites. Most existing prediction methods are based on sequence information alone, ignoring biological contexts such as developmental stages and tissue types. Experimental methods to study in vivo binding, including ChIP-chip and ChIP-seq, can only study one transcription factor in a single cell type and under a specific condition in each experiment, and therefore cannot scale to determine the full set of regulatory interactions in mammalian transcriptional regulatory networks.

Results

We developed a new computational approach, PIPES, for predicting tissue-specific TF binding. PIPES integrates in vitro protein binding microarrays (PBMs), sequence conservation and tissue-specific epigenetic (DNase I hypersensitivity) information. We demonstrate that PIPES improves over existing methods on distinguishing between in vivo bound and unbound sequences using ChIP-seq data for 11 mouse TFs. In addition, our predictions are in good agreement with current knowledge of tissue-specific TF regulation.

Conclusions

We provide a systematic map of computationally predicted tissue-specific binding targets for 284 mouse TFs across 55 tissue/cell types. Such comprehensive resource is useful for researchers studying gene regulation.

We look forward to seeing all who can attend. Feel free to begin our discussion in the comments section below.


Inferring causal molecular networks: empirical assessment through a community-based method.

For our next Journal Club meeting we will read Inferring causal molecular networks: empirical assessment through a community-based method. The abstract is as follows:

It remains unclear whether causal, rather than merely correlational, relationships in molecular networks can be inferred in complex biological settings. Here we describe the HPN-DREAM network inference challenge, which focused on learning causal influences in signaling networks. We used phosphoprotein data from cancer cell lines as well as in silico data from a nonlinear dynamical model. Using the phosphoprotein data, we scored more than 2,000 networks submitted by challenge participants. The networks spanned 32 biological contexts and were scored in terms of causal validity with respect to unseen interventional data. A number of approaches were effective, and incorporating known biology was generally advantageous. Additional sub-challenges considered time-course prediction and visualization. Our results suggest that learning causal relationships may be feasible in complex settings such as disease states. Furthermore, our scoring approach provides a practical way to empirically assess inferred molecular networks in a causal sense.

We look forward to seeing all who can attend. Feel free to extend our discussion into the comments section below.


Analysis of computational footprinting methods for DNase sequencing experiments

Our paper for the next journal club meeting on 4/18/2016 is Analysis of computational footprinting methods for DNase sequencing experiments by Gusmao et al. (Nature Methods, 2016). The abstract is as follows.

DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. Frequently in high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods—HINT, DNase2TF and PIQ—consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.

Feel free to begin the discussion in the Comments section below.


Multitask matrix completion for learning protein interactions across diseases

For our next meeting on 3/28/2016 we have selected Multitask matrix completion for learning protein interactions across diseases by Kshirsagar et al. The abstract is as follows.

Disease causing pathogens such as viruses, introduce their proteins into the host cells where they interact with the host’s proteins enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different, but related viruses: Hepatitis C, Ebola virus and Influenza A. Our multitask matrix completion based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain up to a 39% improvement in predictive performance over prior state-of-the-art models. We show how our model’s parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code and data is available at: http://www.cs.cmu.edu/~mkshirsa/bsl_mtl.tgz

We look forward to seeing all who can come. Feel free to begin our discussion in the comments section below.


Factor graphs and the sum-product algorithm

Dear Journal Club members,

Our next meeting will be on March 14th, at noon in room 3160 of the Discovery Building. For this meeting we have selected a paper by Kschischang et al., Factor graphs and the sum-product algorithm, from IEEE. The abstract is presented below.

Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of “local” functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph. In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes, either exactly or approximately, various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative “turbo” decoding algorithm, Pearl’s (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.
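As a warm-up for the meeting, here is a small worked instance of the sum-product rule on a chain-shaped factor graph, x1 -- f_a -- x2 -- f_b -- x3, with binary variables. The factor tables are made up for illustration. The sketch computes the unnormalized marginal of x2 from two messages and checks it against brute-force summation of the global function.

```python
from itertools import product

states = (0, 1)

# Hypothetical local functions (factor tables).
f_a = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}  # f_a(x1, x2)
f_b = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # f_b(x2, x3)

def marginal_x2_sum_product():
    """Marginal of x2 as the product of the messages arriving from f_a and f_b."""
    # Leaf variables x1 and x3 send the unit message, so each factor-to-variable
    # message is the factor summed over its other argument.
    msg_a_to_x2 = {x2: sum(f_a[(x1, x2)] for x1 in states) for x2 in states}
    msg_b_to_x2 = {x2: sum(f_b[(x2, x3)] for x3 in states) for x2 in states}
    return {x2: msg_a_to_x2[x2] * msg_b_to_x2[x2] for x2 in states}

def marginal_x2_brute_force():
    """Marginal of x2 by summing the global function over every configuration."""
    out = {x2: 0.0 for x2 in states}
    for x1, x2, x3 in product(states, repeat=3):
        out[x2] += f_a[(x1, x2)] * f_b[(x2, x3)]
    return out

by_messages = marginal_x2_sum_product()
by_enumeration = marginal_x2_brute_force()
print("sum-product:", by_messages)
print("brute force:", by_enumeration)
```

On a tree-structured graph such as this chain the two results agree exactly; the savings come from summing each local function separately rather than enumerating all joint configurations.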

Please feel free to start the discussion in the comments section below.


GIM3E: condition-specific models of cellular metabolism developed from metabolomics and expression data.

For our next meeting we have selected the GIM3E paper for our discussion. The meeting will be in our usual location at noon on Monday Feb 8th. The paper is available at PubMed. The abstract summary is as follows:

MOTIVATION:
Genome-scale metabolic models have been used extensively to investigate alterations in cellular metabolism. The accuracy of these models to represent cellular metabolism in specific conditions has been improved by constraining the model with omics data sources. However, few practical methods for integrating metabolomics data with other omics data sources into genome-scale models of metabolism have been developed.
RESULTS:
GIM(3)E (Gene Inactivation Moderated by Metabolism, Metabolomics and Expression) is an algorithm that enables the development of condition-specific models based on an objective function, transcriptomics and cellular metabolomics data. GIM(3)E establishes metabolite use requirements with metabolomics data, uses model-paired transcriptomics data to find experimentally supported solutions and provides calculations of the turnover (production/consumption) flux of metabolites. GIM(3)E was used to investigate the effects of integrating additional omics datasets to create increasingly constrained solution spaces of Salmonella Typhimurium metabolism during growth in both rich and virulence media. This integration proved to be informative and resulted in a requirement of additional active reactions (12 in each case) or metabolites (26 or 29, respectively). The addition of constraints from transcriptomics also impacted the allowed solution space, and the cellular metabolites with turnover fluxes that were necessarily altered by the change in conditions increased from 118 to 271 of 1397.

Please feel free to begin the discussion in the comments section below.