Enhancer Evolution across 20 Mammalian Species

Our first meeting of 2016 is scheduled for 12:00 on the 25th of January in room 3160 in the Discovery building. The room may be subject to change. The paper selection is Enhancer Evolution across 20 Mammalian Species, available online at the link. We will allot some time at the beginning of our meeting to discuss paper suggestions and themes we would like to cover this semester.

The abstract of the paper is as follows. Please feel free to begin our discussion in the comments section below.

The mammalian radiation has corresponded with rapid changes in noncoding regions of the genome, but we lack a comprehensive understanding of regulatory evolution in mammals. Here, we track the evolution of promoters and enhancers active in liver across 20 mammalian species from six diverse orders by profiling genomic enrichment of H3K27 acetylation and H3K4 trimethylation. We report that rapid evolution of enhancers is a universal feature of mammalian genomes. Most of the recently evolved enhancers arise from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements. In contrast, almost all liver promoters are partially or fully conserved across these species. Our data further reveal that recently evolved enhancers can be associated with genes under positive selection, demonstrating the power of this approach for annotating regulatory adaptations in genomic sequences. These results provide important insight into the functional genetics underpinning mammalian regulatory evolution.

We look forward to seeing those who can attend soon.

Sharing and Specificity of Co-expression Networks across 35 Human Tissues

Our paper selection for Monday, October 26th is about an analysis of RNAseq data from the GTEx collaboration, titled Sharing and Specificity of Co-expression Networks across 35 Human Tissues. It is available at the PLOS Computational Biology website. The abstract reads as follows.

To understand the regulation of tissue-specific gene expression, the GTEx Consortium generated RNA-seq expression data for more than thirty distinct human tissues. This data provides an opportunity for deriving shared and tissue specific gene regulatory networks on the basis of co-expression between genes. However, a small number of samples are available for a majority of the tissues, and therefore statistical inference of networks in this setting is highly underpowered. To address this problem, we infer tissue-specific gene co-expression networks for 35 tissues in the GTEx dataset using a novel algorithm, GNAT, that uses a hierarchy of tissues to share data between related tissues. We show that this transfer learning approach increases the accuracy with which networks are learned. Analysis of these networks reveals that tissue-specific transcription factors are hubs that preferentially connect to genes with tissue specific functions. Additionally, we observe that genes with tissue-specific functions lie at the peripheries of our networks. We identify numerous modules enriched for Gene Ontology functions, and show that modules conserved across tissues are especially likely to have functions common to all tissues, while modules that are upregulated in a particular tissue are often instrumental to tissue-specific function. Finally, we provide a web tool which allows exploration of gene function and regulation in a tissue-specific manner.
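For anyone new to co-expression networks, here is a minimal sketch of the basic idea the abstract builds on: two genes get an edge when their expression levels across samples are strongly correlated. This is plain Pearson thresholding, not GNAT, and it does no sharing between tissues. The gene names, expression values, and the 0.9 cutoff are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expr, cutoff=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= cutoff.

    `expr` maps a gene name to its expression values across samples.
    The cutoff is an arbitrary illustrative choice.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= cutoff:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical toy data: three genes measured in four samples.
toy_expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [5.0, 1.0, 4.0, 2.0],
}
edges = coexpression_edges(toy_expr)
print(edges)
```

With only a handful of samples per tissue, correlations like these are noisy, which is the underpowered-inference problem the paper sets out to address.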

We look forward to seeing those who can attend on the 26th, and please feel free to start the discussion section below.

Elucidating Compound Mechanism of Action by Network Perturbation Analysis

Our next paper selection is a network perturbation paper from Cell, titled Elucidating Compound Mechanism of Action by Network Perturbation Analysis. It is available from ScienceDirect. The abstract is as follows:

Genome-wide identification of the mechanism of action (MoA) of small-molecule compounds, characterizing their targets, effectors, and activity modulators, represents a highly relevant yet elusive goal, with critical implications for assessment of compound efficacy and toxicity. Current approaches are labor intensive and mostly limited to elucidating high-affinity binding target proteins. We introduce a regulatory network-based approach that elucidates genome-wide MoA proteins based on the assessment of the global dysregulation of their molecular interactions following compound perturbation. Analysis of cellular perturbation profiles identified established MoA proteins for 70% of the tested compounds and elucidated novel proteins that were experimentally validated. Finally, unknown-MoA compound analysis revealed altretamine, an anticancer drug, as an inhibitor of glutathione peroxidase 4 lipid repair activity, which was experimentally confirmed, thus revealing unexpected similarity to the activity of sulfasalazine. This suggests that regulatory network analysis can provide valuable mechanistic insight into the elucidation of small-molecule MoA and compound similarity.

We will meet on Monday, October 12th in room 3160 of the Discovery Building at noon, per our usual schedule. Feel free to start our discussion in the comments section below.

Predicting effects of noncoding variants with deep learning–based sequence model

This week we will conclude our Deep Learning Theme with a look at DeepSEA. Our meeting will be at noon Monday, September 28th in room 3160 of the Discovery building.

The title of the paper is Predicting effects of noncoding variants with deep learning–based sequence model. The abstract reads as follows:

Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA, that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.

We look forward to seeing all who can attend next Monday, and please feel free to start the discussion in the comments section below.

Sara and Debbie

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

On Monday, Sept. 14th, we will meet in room 3160 of the Discovery Building to discuss a deep learning method named DeepBind.

The abstract of the paper reads as follows:

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with ‘deep learning’ techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a ‘mutation map’ that indicates how variations affect binding within a specific sequence.
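To make the "mutation map" idea concrete before the meeting, here is a minimal sketch: score a sequence, then re-score every single-base substitution and record the change. The scoring function below is a toy position weight matrix with made-up values, standing in for a trained model. It is not DeepBind's network, and the names `TOY_PWM`, `score`, and `mutation_map` are ours.

```python
BASES = "ACGT"

# Hypothetical 3-position log-odds matrix, one dict per motif position.
# The values are invented; the "motif" it prefers is ACG.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def score(seq, pwm=TOY_PWM):
    """Best PWM match over all windows of seq (a stand-in for a model score)."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def mutation_map(seq, pwm=TOY_PWM):
    """For each position, map each base to (mutant score - reference score)."""
    reference = score(seq, pwm)
    rows = []
    for pos in range(len(seq)):
        row = {}
        for alt in BASES:
            mutant = seq[:pos] + alt + seq[pos + 1:]
            row[alt] = score(mutant, pwm) - reference
        rows.append(row)
    return rows


mmap = mutation_map("TACGT")
for pos, row in enumerate(mmap):
    print(pos, "TACGT"[pos], row)
```

In this toy case, substitutions inside the ACG match lower the score while changes to the flanking bases leave it untouched, which is the kind of pattern a mutation map is meant to show at a glance.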

We welcome you to post your questions and ideas here in the Comments section of this blog.


Deep Learning

In the last two months, a couple of groups have published papers applying deep learning to problems related to gene regulation: protein-nucleic acid binding specificity [1] and chromatin state [2]. We will be talking about these soon.

Before discussing these papers, we think it will be useful to give people some time to get familiar with the fundamentals of artificial neural networks and deep learning. So, this coming *Monday* at our new time of 12 noon, we’ll have a meeting to talk about deep learning and work through each other’s questions. Beforehand, please check out some of the following resources and bring questions (or expertise you’d like to share!).

At the meeting, we’ll walk through the topics in this Nature review:

More resources:

Lecture slides from Mark’s machine learning class: ANNs-2.pdf

Intro to neural networks from a programming perspective (just skimmed this one; looks like an interesting presentation):

[1] DeepBind (Alipanahi et al, Nature Biotech 2015)
[2] DeepSEA (Zhou & Troyanskaya, Nature Methods 2015)


Wanderlust with special guest Monocle

“Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development”

Bendall et al, Cell 2014


Tissue regeneration is an orchestrated progression of cells from an immature state to a mature one, conventionally represented as distinctive cell subsets. A continuum of transitional cell states exists between these discrete stages. We combine the depth of single-cell mass cytometry and an algorithm developed to leverage this continuum by aligning single cells of a given lineage onto a unified trajectory that accurately predicts the developmental path de novo. Applied to human B cell lymphopoiesis, the algorithm (termed Wanderlust) constructed trajectories spanning from hematopoietic stem cells through to naive B cells. This trajectory revealed nascent fractions of B cell progenitors and aligned them with developmentally cued regulatory signaling including IL-7/STAT5 and cellular events such as immunoglobulin rearrangement, highlighting checkpoints across which regulatory signals are rewired paralleling changes in cellular state. This study provides a comprehensive analysis of human B lymphopoiesis, laying a foundation to apply this approach to other tissues and “corrupted” developmental processes including cancer.


Monocle method

(Trapnell et al, Nature 2014)


Defining the transcriptional dynamics of a temporal process such as cell differentiation is challenging owing to the high variability in gene expression between individual cells. Time-series gene expression analyses of bulk cells have difficulty distinguishing early and late phases of a transcriptional cascade or identifying rare subpopulations of cells, and single-cell proteomic methods rely on a priori knowledge of key distinguishing markers. Here we describe Monocle, an unsupervised algorithm that increases the temporal resolution of transcriptome dynamics using single-cell RNA-Seq data collected at multiple time points. Applied to the differentiation of primary human myoblasts, Monocle revealed switch-like changes in expression of key regulatory factors, sequential waves of gene regulation, and expression of regulators that were not known to act in differentiation. We validated some of these predicted regulators in a loss-of-function screen. Monocle can in principle be used to recover single-cell gene expression kinetics from a wide array of cellular processes, including differentiation, proliferation and oncogenic transformation.