Yearly Archives: 2016

Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

Our next meeting will be at 12:30 on June 6th, in room 3160 of the Discovery building. Our selected paper is Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. The abstract is as follows.

Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a “gain-of-target” for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
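To make the idea of comparing the predicted activity of a mutated region with the reference sequence concrete before the meeting, here is a minimal sketch. It is not the authors' method: the paper trains Random Forest models, whereas this example swaps in a simple position weight matrix (PWM) scorer as a stand-in predictor. The motif, the sequences, and the function names `predicted_activity` and `mutation_impact` are all hypothetical.

```python
import math

# Hypothetical PWM for a made-up 4-bp motif (ACGT); the probabilities are
# invented for illustration and do not describe any real transcription factor.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def predicted_activity(seq):
    """Stand-in for a trained model: best log-odds PWM match anywhere in seq."""
    if len(seq) < len(PWM):
        raise ValueError("sequence is shorter than the motif")
    best = -math.inf
    for start in range(len(seq) - len(PWM) + 1):
        score = sum(
            math.log2(PWM[i][seq[start + i]] / BACKGROUND)
            for i in range(len(PWM))
        )
        best = max(best, score)
    return best


def mutation_impact(reference, position, alt_base):
    """Predicted activity of the mutated sequence minus that of the reference."""
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return predicted_activity(mutant) - predicted_activity(reference)


# A T>G substitution at index 4 creates an ACGT site in this toy sequence,
# so the impact is positive (a gain of predicted binding).
gain = mutation_impact("TTACTTTT", 4, "G")
# The reverse substitution destroys the site, so the impact is negative.
loss = mutation_impact("TTACGTTT", 4, "T")
```

A positive difference corresponds loosely to the "gain-of-target" case mentioned in the abstract, and a negative one to a loss. Any sequence-based predictor could replace the PWM scorer here.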

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Predicting tissue specific transcription factor binding sites

Our selection for our meeting on the 16th of May is Predicting tissue specific transcription factor binding sites. We will meet as usual in room 3160 of the Discovery building at 12:30 PM. The abstract is as follows.


Studies of gene regulation often utilize genome-wide predictions of transcription factor (TF) binding sites. Most existing prediction methods are based on sequence information alone, ignoring biological contexts such as developmental stages and tissue types. Experimental methods to study in vivo binding, including ChIP-chip and ChIP-seq, can only study one transcription factor in a single cell type and under a specific condition in each experiment, and therefore cannot scale to determine the full set of regulatory interactions in mammalian transcriptional regulatory networks.


We developed a new computational approach, PIPES, for predicting tissue-specific TF binding. PIPES integrates in vitro protein binding microarrays (PBMs), sequence conservation and tissue-specific epigenetic (DNase I hypersensitivity) information. We demonstrate that PIPES improves over existing methods on distinguishing between in vivo bound and unbound sequences using ChIP-seq data for 11 mouse TFs. In addition, our predictions are in good agreement with current knowledge of tissue-specific TF regulation.


We provide a systematic map of computationally predicted tissue-specific binding targets for 284 mouse TFs across 55 tissue/cell types. Such comprehensive resource is useful for researchers studying gene regulation.

We look forward to seeing all who can attend. Feel free to begin our discussion in the comments section below.

Inferring causal molecular networks: empirical assessment through a community-based method.

For our next journal club meeting we will read Inferring causal molecular networks: empirical assessment through a community-based method. The abstract is as follows:

It remains unclear whether causal, rather than merely correlational, relationships in molecular networks can be inferred in complex biological settings. Here we describe the HPN-DREAM network inference challenge, which focused on learning causal influences in signaling networks. We used phosphoprotein data from cancer cell lines as well as in silico data from a nonlinear dynamical model. Using the phosphoprotein data, we scored more than 2,000 networks submitted by challenge participants. The networks spanned 32 biological contexts and were scored in terms of causal validity with respect to unseen interventional data. A number of approaches were effective, and incorporating known biology was generally advantageous. Additional sub-challenges considered time-course prediction and visualization. Our results suggest that learning causal relationships may be feasible in complex settings such as disease states. Furthermore, our scoring approach provides a practical way to empirically assess inferred molecular networks in a causal sense.

We look forward to seeing all who can attend. Feel free to extend our discussion into the comments section below.

Analysis of computational footprinting methods for DNase sequencing experiments

Our paper for the next journal club meeting on 4/18/2016 is Analysis of computational footprinting methods for DNase sequencing experiments by Gusmao et al. (Nature Methods, 2016). The abstract is as follows.

DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. Frequently in high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods—HINT, DNase2TF and PIQ—consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.

Feel free to begin the discussion in the Comments section below.

Multitask matrix completion for learning protein interactions across diseases

For our next meeting on 3/28/2016 we have selected Multitask matrix completion for learning protein interactions across diseases by Kshirsagar et al. The abstract is as follows.

Disease causing pathogens such as viruses, introduce their proteins into the host cells where they interact with the host’s proteins enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different, but related viruses: Hepatitis C, Ebola virus and Influenza A. Our multitask matrix completion based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain up to a 39% improvement in predictive performance over prior state-of-the-art models. We show how our model’s parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code and data is available at:

We look forward to seeing all who can come. Feel free to begin our discussion in the comments section below.

Factor graphs and the sum-product algorithm

Dear Journal Club members,

Our next meeting will be on March 14th, at noon in room 3160 of the Discovery Building. For this meeting we have selected a paper by Kschischang et al., Factor graphs and the sum-product algorithm, from IEEE. The abstract is presented below.

Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of “local” functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph. In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes, either exactly or approximately, various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative “turbo” decoding algorithm, Pearl’s (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.
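As a warm-up for the discussion, here is a minimal sketch of the sum-product rule on a tiny chain-shaped factor graph, checked against brute-force summation. The graph (three binary variables x1, x2, x3 joined by factors f_a(x1), f_b(x1, x2), f_c(x2, x3)) and all factor values are made up for illustration and are not taken from the paper.

```python
from itertools import product

STATES = (0, 1)

# Hypothetical local functions; the numbers are arbitrary non-negative weights.
f_a = {0: 0.6, 1: 0.4}                                        # f_a(x1)
f_b = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}    # f_b(x1, x2)
f_c = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}    # f_c(x2, x3)


def marginal_x2_sum_product():
    """Marginal of x2 computed by passing messages inward from the leaves."""
    # Leaf factor f_a sends itself to x1, and x1 forwards that message to f_b.
    msg_x1_to_fb = {x1: f_a[x1] for x1 in STATES}
    # Factor-to-variable message: multiply by the incoming message, sum out x1.
    msg_fb_to_x2 = {
        x2: sum(f_b[(x1, x2)] * msg_x1_to_fb[x1] for x1 in STATES)
        for x2 in STATES
    }
    # Leaf variable x3 sends the unit message to f_c.
    msg_x3_to_fc = {x3: 1.0 for x3 in STATES}
    msg_fc_to_x2 = {
        x2: sum(f_c[(x2, x3)] * msg_x3_to_fc[x3] for x3 in STATES)
        for x2 in STATES
    }
    # The marginal is the normalized product of all messages arriving at x2.
    unnormalized = {x2: msg_fb_to_x2[x2] * msg_fc_to_x2[x2] for x2 in STATES}
    z = sum(unnormalized.values())
    return {x2: value / z for x2, value in unnormalized.items()}


def marginal_x2_brute_force():
    """Same marginal, obtained by summing the global function over all states."""
    totals = {x2: 0.0 for x2 in STATES}
    for x1, x2, x3 in product(STATES, repeat=3):
        totals[x2] += f_a[x1] * f_b[(x1, x2)] * f_c[(x2, x3)]
    z = sum(totals.values())
    return {x2: value / z for x2, value in totals.items()}


by_messages = marginal_x2_sum_product()
by_enumeration = marginal_x2_brute_force()
```

Because this graph is a tree, the two results agree up to floating-point error. On graphs with cycles the same message updates give only approximate marginals, which is the "exactly or approximately" distinction in the abstract.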

Please feel free to start the discussion in the comments section below.

GIM3E: condition-specific models of cellular metabolism developed from metabolomics and expression data.

For our next meeting we have selected the GIM3E paper for our discussion. The meeting will be in our usual location at noon on Monday Feb 8th. The paper is available at PubMed. The abstract is as follows:

Genome-scale metabolic models have been used extensively to investigate alterations in cellular metabolism. The accuracy of these models to represent cellular metabolism in specific conditions has been improved by constraining the model with omics data sources. However, few practical methods for integrating metabolomics data with other omics data sources into genome-scale models of metabolism have been developed.
GIM(3)E (Gene Inactivation Moderated by Metabolism, Metabolomics and Expression) is an algorithm that enables the development of condition-specific models based on an objective function, transcriptomics and cellular metabolomics data. GIM(3)E establishes metabolite use requirements with metabolomics data, uses model-paired transcriptomics data to find experimentally supported solutions and provides calculations of the turnover (production/consumption) flux of metabolites. GIM(3)E was used to investigate the effects of integrating additional omics datasets to create increasingly constrained solution spaces of Salmonella Typhimurium metabolism during growth in both rich and virulence media. This integration proved to be informative and resulted in a requirement of additional active reactions (12 in each case) or metabolites (26 or 29, respectively). The addition of constraints from transcriptomics also impacted the allowed solution space, and the cellular metabolites with turnover fluxes that were necessarily altered by the change in conditions increased from 118 to 271 of 1397.

Please feel free to begin the discussion in the comments section below.

Enhancer Evolution across 20 Mammalian Species

Our first meeting of 2016 is scheduled for 12:00 on the 25th of January in room 3160 in the Discovery building. The room may be subject to change. The paper selection is Enhancer Evolution across 20 Mammalian Species, available online at the link. We will allot some time at the beginning of our meeting to discuss paper suggestions and themes we would like to cover this semester.

The abstract of the paper is as follows. Please feel free to begin our discussion in the comments section below.

The mammalian radiation has corresponded with rapid changes in noncoding regions of the genome, but we lack a comprehensive understanding of regulatory evolution in mammals. Here, we track the evolution of promoters and enhancers active in liver across 20 mammalian species from six diverse orders by profiling genomic enrichment of H3K27 acetylation and H3K4 trimethylation. We report that rapid evolution of enhancers is a universal feature of mammalian genomes. Most of the recently evolved enhancers arise from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements. In contrast, almost all liver promoters are partially or fully conserved across these species. Our data further reveal that recently evolved enhancers can be associated with genes under positive selection, demonstrating the power of this approach for annotating regulatory adaptations in genomic sequences. These results provide important insight into the functional genetics underpinning mammalian regulatory evolution.

We look forward to seeing those who can attend.