Reproducibility of computational workflows is automated using continuous analysis

Our next meeting will be at 3:00 on March 24th, in room 4160 of the Discovery building. Our selected paper is Reproducibility of computational workflows is automated using continuous analysis.
The abstract is as follows.

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

Our next meeting will be at 12:30 on June 6th, in room 3160 of the Discovery building. Our selected paper is Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. The abstract is as follows.

Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a “gain-of-target” for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
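The core PRIME idea, comparing a model's predicted regulatory activity for the mutated sequence against the reference, can be sketched with a toy random forest. Everything below (the dinucleotide features, the synthetic "active"/"inactive" training sequences, and the helper names) is invented for illustration and is not the authors' implementation.

```python
# Toy PRIME-style scoring sketch; not the paper's code or features.
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides

def featurize(seq):
    """Represent a DNA sequence by its dinucleotide counts."""
    return np.array([sum(seq[i:i + 2] == k for i in range(len(seq) - 1))
                     for k in DINUCS], dtype=float)

# Synthetic training set: "active" regions GC-rich, "inactive" regions AT-rich.
rng = np.random.default_rng(0)
active = ["".join(rng.choice(list("GGCCACGT"), 50)) for _ in range(100)]
inactive = ["".join(rng.choice(list("AATTACGT"), 50)) for _ in range(100)]
X = np.array([featurize(s) for s in active + inactive])
y = np.array([1] * 100 + [0] * 100)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def prime_like_score(ref, alt):
    """PRIME-style score: predicted activity of the mutant minus the reference."""
    p_ref, p_alt = model.predict_proba(
        np.vstack([featurize(ref), featurize(alt)]))[:, 1]
    return p_alt - p_ref
```

On this toy model, a mutation that makes a region more GC-rich scores positive (a candidate gain of regulatory activity), while the reverse change scores negative, mirroring how the paper flags "gain-of-target" events.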

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Inferring causal molecular networks: empirical assessment through a community-based method

For our next journal club meeting we will read Inferring causal molecular networks: empirical assessment through a community-based method. The abstract is as follows:

It remains unclear whether causal, rather than merely correlational, relationships in molecular networks can be inferred in complex biological settings. Here we describe the HPN-DREAM network inference challenge, which focused on learning causal influences in signaling networks. We used phosphoprotein data from cancer cell lines as well as in silico data from a nonlinear dynamical model. Using the phosphoprotein data, we scored more than 2,000 networks submitted by challenge participants. The networks spanned 32 biological contexts and were scored in terms of causal validity with respect to unseen interventional data. A number of approaches were effective, and incorporating known biology was generally advantageous. Additional sub-challenges considered time-course prediction and visualization. Our results suggest that learning causal relationships may be feasible in complex settings such as disease states. Furthermore, our scoring approach provides a practical way to empirically assess inferred molecular networks in a causal sense.

We look forward to seeing all who can attend and feel free to extend our discussion into the comments section below.

Analysis of computational footprinting methods for DNase sequencing experiments

Our paper for the next journal club meeting on 4/18/2016 is Analysis of computational footprinting methods for DNase sequencing experiments by Gusmao et al. (Nature Methods, 2016). The abstract is as follows.

DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. Frequently in high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods—HINT, DNase2TF and PIQ—consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.
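The footprint concept these methods build on, a dip in DNase I cleavage where a bound factor protects the DNA, can be illustrated with a toy depletion score. This is a hypothetical score for intuition only, not HINT, DNase2TF, or PIQ, and the signal below is simulated.

```python
# Toy footprint score: a bound transcription factor protects its site from
# DNase I, so cleavage dips in the core while staying high in the flanks.
import numpy as np

def footprint_score(cleavage, core_start, core_end, flank=10):
    """Mean flanking cleavage minus mean core cleavage; higher = deeper dip."""
    core = cleavage[core_start:core_end]
    flanks = np.concatenate([cleavage[max(0, core_start - flank):core_start],
                             cleavage[core_end:core_end + flank]])
    return flanks.mean() - core.mean()

# Simulated per-base cleavage counts: high flanks around a depleted 10-bp core.
signal = np.array([8.0] * 10 + [1.0] * 10 + [8.0] * 10)
print(footprint_score(signal, 10, 20))  # → 7.0
```

The paper's point about cleavage bias is that real signal is not this clean: sequence-dependent cutting preferences create dips that mimic footprints, which is why bias correction improves accuracy.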

Feel free to begin the discussion in the Comments section below.

Factor graphs and the sum-product algorithm

Dear Journal Club members,

Our next meeting will be on March 14th, at noon in room 3160 of the Discovery Building. For this meeting we have selected a paper by Kschischang et al., Factor graphs and the sum-product algorithm, published in the IEEE Transactions on Information Theory. The abstract is presented below.

Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of “local” functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph. In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes, either exactly or approximately, various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative “turbo” decoding algorithm, Pearl’s (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.
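As a warm-up for the discussion, here is the sum-product rule on the smallest interesting factor graph, g(x1, x2) = f_a(x1) f_b(x1, x2): the marginal of x2 falls out of a single message pass. The factor values below are arbitrary toy numbers.

```python
# Minimal sum-product example on a two-variable chain factor graph.
import numpy as np

f_a = np.array([0.6, 0.4])      # unary factor on x1
f_b = np.array([[0.9, 0.1],     # pairwise factor f_b[x1, x2]
                [0.2, 0.8]])

# Message from variable x1 to factor f_b: the product of all other incoming
# messages at x1, which here is just the message from f_a (i.e. f_a itself).
msg_x1_to_fb = f_a

# Message from factor f_b to variable x2: multiply in the incoming message
# and sum out x1. For a tree this yields the exact marginal.
marginal_x2 = (msg_x1_to_fb[:, None] * f_b).sum(axis=0)

# Brute-force check: enumerate every assignment of the global function.
brute = np.array([sum(f_a[x1] * f_b[x1, x2] for x1 in range(2))
                  for x2 in range(2)])
assert np.allclose(marginal_x2, brute)
print(marginal_x2)  # → [0.62 0.38]
```

On larger trees the same local rule is applied edge by edge; on graphs with cycles, iterating it gives the approximate ("loopy") belief propagation the abstract alludes to.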

Please feel free to start the discussion in the comments section below.

Quantro: a data-driven approach to guide the appropriate normalization method.

Our next meeting will be held on November 9th at noon in room 3160 of the Discovery Building. The chosen paper is on the Quantro method, a data-driven approach for choosing the best normalization methods. The paper is available from Genome Biology.

The abstract is as follows.

Normalization is an essential step in the analysis of high-throughput data. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation. However, these methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Applying global normalization methods has the potential to remove biologically driven variation. Currently, it is up to the subject matter experts to determine if the stated assumptions are appropriate. Here, we propose a data-driven alternative. We demonstrate the utility of our method (quantro) through examples and simulations. A software implementation is available from
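For context, quantile normalization itself (the multi-sample global method whose assumptions quantro is designed to check) can be sketched in a few lines. This is a generic illustration, not the quantro package; ties are broken arbitrarily here, whereas production implementations average tied ranks.

```python
# Generic quantile normalization sketch: force every column (sample) onto
# the same reference distribution, the per-rank mean of the sorted columns.
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (rows = features, cols = samples)."""
    ranks = X.argsort(axis=0).argsort(axis=0)     # per-column ranks
    reference = np.sort(X, axis=0).mean(axis=1)   # mean of sorted columns
    return reference[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
# Afterwards every column has identical sorted values, so any global
# differences between samples are removed. Whether those differences were
# purely technical is exactly the assumption quantro tests.
```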

We look forward to seeing those who can join and feel free to begin the discussion below.

Elucidating Compound Mechanism of Action by Network Perturbation Analysis

Our next paper selection is a network perturbation paper from Cell, titled Elucidating Compound Mechanism of Action by Network Perturbation Analysis. It is available from ScienceDirect. The abstract is as follows:

Genome-wide identification of the mechanism of action (MoA) of small-molecule compounds characterizing their targets, effectors, and activity modulators represents a highly relevant yet elusive goal, with critical implications for assessment of compound efficacy and toxicity. Current approaches are labor intensive and mostly limited to elucidating high-affinity binding target proteins. We introduce a regulatory network-based approach that elucidates genome-wide MoA proteins based on the assessment of the global dysregulation of their molecular interactions following compound perturbation. Analysis of cellular perturbation profiles identified established MoA proteins for 70% of the tested compounds and elucidated novel proteins that were experimentally validated. Finally, unknown-MoA compound analysis revealed altretamine, an anticancer drug, as an inhibitor of glutathione peroxidase 4 lipid repair activity, which was experimentally confirmed, thus revealing unexpected similarity to the activity of sulfasalazine. This suggests that regulatory network analysis can provide valuable mechanistic insight into the elucidation of small-molecule MoA and compound similarity.

We will meet on Monday, October 12th in room 3160 of the Discovery Building at noon, per our usual schedule. Feel free to start our discussion in the comments section below.

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

On Monday Sept. 14th, we will meet in room 3160 of the Discovery Building to discuss a deep learning method named DeepBind. The paper is available at

The abstract of the paper reads as follows:

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with ‘deep learning’ techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a ‘mutation map’ that indicates how variations affect binding within a specific sequence.
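The scoring step at the heart of this kind of model, one-hot encode a sequence, scan it with a convolutional filter, and max-pool, can be sketched with a hand-written filter. This is a toy illustration of the idea, not the trained DeepBind model: the motif, sequence, and function names are invented, and DeepBind learns many filters jointly from data.

```python
# DeepBind-style scoring sketch with a single hand-written motif filter.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length x 4) one-hot matrix."""
    return np.array([[b == base for base in BASES] for b in seq], dtype=float)

motif = one_hot("TGACTCA")  # illustrative filter; DeepBind learns its filters

def binding_score(seq):
    """Max over positions of the filter/window inner product (max-pooling)."""
    X = one_hot(seq)
    w = len(motif)
    return max((X[i:i + w] * motif).sum() for i in range(len(X) - w + 1))

seq = "GGTGACTCAGG"
print(binding_score(seq))  # → 7.0 (perfect 7-mer match)

# One cell of a "mutation map": effect of mutating position 3 (G to C).
mut = seq[:3] + "C" + seq[4:]
print(binding_score(mut) - binding_score(seq))  # → -1.0
```

Repeating the last step for every position and every substitute base fills in the full mutation map the abstract describes.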

We welcome you to post your questions and ideas here in the Comments section of this blog.


Deep Learning

In the last two months, a couple of groups have published papers applying deep learning to problems related to gene regulation: protein-nucleic acid binding specificity [1] and chromatin state [2]. We will be talking about these soon.

Before discussing these papers, we think it will be useful to give people some time to get familiar with the fundamentals of artificial neural networks and deep learning. So, this coming *Monday* at our new time of 12 noon, we’ll have a meeting to talk about deep learning and work through each other’s questions. Beforehand, please check out some of the following resources and bring questions (or expertise you’d like to share!).

At the meeting, we’ll walk through the topics in this Nature review:

More resources:

Lecture slides from Mark’s machine learning class: ANNs-2.pdf

Intro to neural networks from a programming perspective (just skimmed this one; looks like an interesting presentation):
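To make the basic "layers of weighted sums plus nonlinearities" idea concrete before the meeting, here is a forward pass through a one-hidden-layer network. The weights are arbitrary toy numbers, not from any of the papers above.

```python
# Forward pass through a tiny one-hidden-layer neural network (toy weights).
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])        # input features
W1 = np.array([[0.5, -1.0],     # hidden-layer weights (2 inputs -> 2 units)
               [1.5, 0.5]])
b1 = np.array([0.0, -0.5])
W2 = np.array([1.0, -2.0])      # output weights (2 units -> 1 output)
b2 = 0.5

h = relu(W1 @ x + b1)           # hidden activations
y = sigmoid(W2 @ h + b2)        # predicted probability
print(round(float(y), 3))       # → 0.029
```

Training ("deep learning" proper) is then just adjusting W1, b1, W2, b2 by gradient descent on a loss computed from outputs like y.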

[1] DeepBind (Alipanahi et al, Nature Biotech 2015)
[2] DeepSEA (Zhou & Troyanskaya, Nature Methods 2015)


Wanderlust with special guest Monocle

“Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development”

Bendall et al, Cell 2014


Tissue regeneration is an orchestrated progression of cells from an immature state to a mature one, conventionally represented as distinctive cell subsets. A continuum of transitional cell states exists between these discrete stages. We combine the depth of single-cell mass cytometry and an algorithm developed to leverage this continuum by aligning single cells of a given lineage onto a unified trajectory that accurately predicts the developmental path de novo. Applied to human B cell lymphopoiesis, the algorithm (termed Wanderlust) constructed trajectories spanning from hematopoietic stem cells through to naive B cells. This trajectory revealed nascent fractions of B cell progenitors and aligned them with developmentally cued regulatory signaling including IL-7/STAT5 and cellular events such as immunoglobulin rearrangement, highlighting checkpoints across which regulatory signals are rewired paralleling changes in cellular state. This study provides a comprehensive analysis of human B lymphopoiesis, laying a foundation to apply this approach to other tissues and “corrupted” developmental processes including cancer.


Monocle method

(Trapnell et al, Nature Biotechnology 2014)


Defining the transcriptional dynamics of a temporal process such as cell differentiation is challenging owing to the high variability in gene expression between individual cells. Time-series gene expression analyses of bulk cells have difficulty distinguishing early and late phases of a transcriptional cascade or identifying rare subpopulations of cells, and single-cell proteomic methods rely on a priori knowledge of key distinguishing markers. Here we describe Monocle, an unsupervised algorithm that increases the temporal resolution of transcriptome dynamics using single-cell RNA-Seq data collected at multiple time points. Applied to the differentiation of primary human myoblasts, Monocle revealed switch-like changes in expression of key regulatory factors, sequential waves of gene regulation, and expression of regulators that were not known to act in differentiation. We validated some of these predicted regulators in a loss-of-function screen. Monocle can in principle be used to recover single-cell gene expression kinetics from a wide array of cellular processes, including differentiation, proliferation and oncogenic transformation.
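Monocle's core ordering step, building a minimum spanning tree over cells and reading pseudotime along its longest path, can be sketched on toy data. This is a much-simplified illustration, not the published implementation: it uses plain Euclidean distances on simulated 2-D points in place of Monocle's ICA embedding of real expression profiles.

```python
# Simplified MST-based pseudotime ordering in the spirit of Monocle.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Toy data: 30 "cells" sampled along a 1-D differentiation trajectory,
# embedded in 2-D with a little noise.
t = np.sort(rng.uniform(0, 1, 30))
cells = np.c_[t, t ** 2] + rng.normal(scale=0.01, size=(30, 2))

# Minimum spanning tree over pairwise Euclidean distances between cells.
mst = minimum_spanning_tree(cdist(cells, cells))

# The tree's longest path (its diameter) is taken as the backbone of the
# trajectory; pseudotime is the tree distance from one end of that path.
d = shortest_path(mst, directed=False)
start, end = np.unravel_index(np.argmax(d), d.shape)
pseudotime = d[start]

# Ordering cells by pseudotime should roughly recover the true ordering t.
order = np.argsort(pseudotime)
```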