Yearly Archives: 2015

Quantro: a data-driven approach to guide the choice of an appropriate normalization method.

Our next meeting will be held on November 9th at noon in room 3160 of the Discovery Building. The chosen paper describes the quantro method, a data-driven approach for choosing an appropriate normalization method. The paper is available from Genome Biology.

The abstract is as follows:

Normalization is an essential step in the analysis of high-throughput data. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation. However, these methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Applying global normalization methods has the potential to remove biologically driven variation. Currently, it is up to the subject matter experts to determine if the stated assumptions are appropriate. Here, we propose a data-driven alternative. We demonstrate the utility of our method (quantro) through examples and simulations. A software implementation is available from
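To make the discussion concrete, here is a minimal sketch of the quantile normalization the abstract refers to: every sample is forced to share the same empirical distribution by replacing each value with the cross-sample mean of its rank. This is our own toy illustration in plain Python, not the quantro package itself.

```python
# Toy quantile normalization (illustrative sketch, not the quantro package).
# Each element of `samples` is one sample's measurements.
def quantile_normalize(samples):
    """Force every sample to share the same empirical distribution."""
    n = len(samples[0])
    # Reference distribution: mean of the k-th smallest value across samples
    reference = [sum(sorted(s)[k] for s in samples) / len(samples) for k in range(n)]
    normalized = []
    for s in samples:
        # Indices of this sample's values, from smallest to largest
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # replace value by reference quantile
        normalized.append(out)
    return normalized

a = [5.0, 2.0, 3.0]
b = [4.0, 1.0, 6.0]
qa, qb = quantile_normalize([a, b])
# Both samples now contain exactly the same set of values; only the
# within-sample ordering (the ranks) is preserved.
```

This is precisely why the method can erase real biology: if the global distributions truly differ between groups, forcing them to match removes that signal, which is the situation quantro is designed to detect.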

We look forward to seeing those who can join and feel free to begin the discussion below.

Sharing and Specificity of Co-expression Networks across 35 Human Tissues

Our paper selection for Monday, October 26th is about an analysis of RNA-seq data from the GTEx collaboration, titled Sharing and Specificity of Co-expression Networks across 35 Human Tissues. It is available at the PLOS Computational Biology website. The abstract reads as follows.

To understand the regulation of tissue-specific gene expression, the GTEx Consortium generated RNA-seq expression data for more than thirty distinct human tissues. This data provides an opportunity for deriving shared and tissue specific gene regulatory networks on the basis of co-expression between genes. However, a small number of samples are available for a majority of the tissues, and therefore statistical inference of networks in this setting is highly underpowered. To address this problem, we infer tissue-specific gene co-expression networks for 35 tissues in the GTEx dataset using a novel algorithm, GNAT, that uses a hierarchy of tissues to share data between related tissues. We show that this transfer learning approach increases the accuracy with which networks are learned. Analysis of these networks reveals that tissue-specific transcription factors are hubs that preferentially connect to genes with tissue specific functions. Additionally, we observe that genes with tissue-specific functions lie at the peripheries of our networks. We identify numerous modules enriched for Gene Ontology functions, and show that modules conserved across tissues are especially likely to have functions common to all tissues, while modules that are upregulated in a particular tissue are often instrumental to tissue-specific function. Finally, we provide a web tool, available at, which allows exploration of gene function and regulation in a tissue-specific manner.
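For readers new to the topic, the simplest notion of a co-expression edge is a thresholded correlation: two genes are connected when their expression profiles are strongly correlated across samples. The sketch below is our own illustration of that baseline idea, not the GNAT algorithm, which fits more sophisticated models that share information across the tissue hierarchy.

```python
# Baseline co-expression edge test: connect two genes when the absolute
# Pearson correlation of their expression profiles exceeds a cutoff.
# (Illustrative only; GNAT itself uses hierarchy-based transfer learning.)
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.1, 3.9, 6.2, 8.0]   # roughly proportional to gene1
gene3 = [5.0, 1.0, 4.0, 2.0]   # unrelated profile
edge_12 = abs(pearson(gene1, gene2)) > 0.8   # edge
edge_13 = abs(pearson(gene1, gene3)) > 0.8   # no edge
```

The paper's point is that with few samples per tissue this kind of per-tissue estimate is badly underpowered, which motivates borrowing strength from related tissues.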

We look forward to seeing those who can attend on the 26th, and please feel free to start the discussion section below.

Elucidating Compound Mechanism of Action by Network Perturbation Analysis

Our next paper selection is a network perturbation paper from Cell, titled Elucidating Compound Mechanism of Action by Network Perturbation Analysis. It is available from ScienceDirect. The abstract is as follows:

Genome-wide identification of the mechanism of action (MoA) of small-molecule compounds characterizing their targets, effectors, and activity modulators represents a highly relevant yet elusive goal, with critical implications for assessment of compound efficacy and toxicity. Current approaches are labor intensive and mostly limited to elucidating high-affinity binding target proteins. We introduce a regulatory network-based approach that elucidates genome-wide MoA proteins based on the assessment of the global dysregulation of their molecular interactions following compound perturbation. Analysis of cellular perturbation profiles identified established MoA proteins for 70% of the tested compounds and elucidated novel proteins that were experimentally validated. Finally, unknown-MoA compound analysis revealed altretamine, an anticancer drug, as an inhibitor of glutathione peroxidase 4 lipid repair activity, which was experimentally confirmed, thus revealing unexpected similarity to the activity of sulfasalazine. This suggests that regulatory network analysis can provide valuable mechanistic insight into the elucidation of small-molecule MoA and compound similarity.

We will meet on Monday, October 12th in room 3160 of the Discovery Building at noon, per our usual schedule. Feel free to start our discussion in the comments section below.

Predicting effects of noncoding variants with deep learning–based sequence model

This week we will conclude our Deep Learning theme with a look at DeepSEA. Our meeting will be at noon Monday, September 28th in room 3160 of the Discovery Building.

The title of the paper is Predicting effects of noncoding variants with deep learning–based sequence model, and it is available from Nature Methods. The abstract reads as follows:

Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA, that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.
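The "single-nucleotide sensitivity" mentioned in the abstract comes from the input representation: each base of the sequence becomes a 4-element indicator vector, so a variant changes exactly one column of the input matrix. A minimal sketch of that encoding (our own illustration, not the published DeepSEA code):

```python
# One-hot encoding of DNA -- the standard input representation for
# sequence-based deep models such as DeepSEA (illustrative sketch).
def one_hot(seq, alphabet="ACGT"):
    """Map a DNA string to a list of 4-element indicator vectors."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:          # ambiguous bases (e.g. 'N') stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded

ref = one_hot("ACGT")
alt = one_hot("ACTT")  # single-nucleotide variant at position 2
```

Because a single-nucleotide variant flips exactly one indicator vector, the model can score the reference and alternate alleles and compare the predicted chromatin effects at base resolution.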

We look forward to seeing all who can attend next Monday, and please feel free to start the discussion in the comments section below.

Sara and Debbie

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

On Monday, Sept. 14th, we will meet in room 3160 of the Discovery Building to discuss a Deep Learning method named DeepBind. The paper is available from Nature Biotechnology.

The abstract of the paper reads as follows:

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with ‘deep learning’ techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a ‘mutation map’ that indicates how variations affect binding within a specific sequence.
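The link between the "position weight matrices" of the abstract and a convolutional network is worth spelling out: DeepBind's first layer acts like a bank of motif scanners, scoring every window of the sequence against a weight matrix and max-pooling over positions. Here is a toy version with a single hand-written 3-mer detector (the matrix values are illustrative assumptions, not learned weights):

```python
# A single "motif detector": position-specific scores for a 3-mer "TAG",
# analogous to one convolutional filter in DeepBind (toy illustration).
MOTIF = {
    0: {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    1: {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    2: {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
}

def scan(seq):
    """Score every window against MOTIF, then max-pool over positions."""
    width = len(MOTIF)
    window_scores = [
        sum(MOTIF[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]
    return max(window_scores)

with_motif = scan("CCTAGCC")   # contains the TAG motif
without_motif = scan("CCCCCCC")
```

A real filter is learned from data rather than hand-written, and DeepBind stacks many of them, but this is the operation behind both the PWM visualizations and the "mutation maps" the abstract describes.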

We welcome you to post your questions and ideas here in the Comments section of this blog.


Deep Learning

In the last two months, a couple of groups have published papers applying deep learning to problems related to gene regulation: protein-nucleic acid binding specificity [1] and chromatin state [2]. We will be talking about these soon.

Before discussing these papers, we think it will be useful to give people some time to get familiar with the fundamentals of artificial neural networks and deep learning. So, this coming *Monday* at our new time of 12 noon, we’ll have a meeting to talk about deep learning and work through each other’s questions. Beforehand, please check out some of the following resources and bring questions (or expertise you’d like to share!).

At the meeting, we’ll walk through the topics in this Nature review:

More resources:

Lecture slides from Mark’s machine learning class: ANNs-2.pdf

Intro to neural networks from a programming perspective (just skimmed this one; looks like an interesting presentation):

[1] DeepBind (Alipanahi et al, Nature Biotech 2015)
[2] DeepSEA (Zhou & Troyanskaya, Nature Methods 2015)


Wanderlust with special guest Monocle

“Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development”

Bendall et al, Cell 2014


Tissue regeneration is an orchestrated progression of cells from an immature state to a mature one, conventionally represented as distinctive cell subsets. A continuum of transitional cell states exists between these discrete stages. We combine the depth of single-cell mass cytometry and an algorithm developed to leverage this continuum by aligning single cells of a given lineage onto a unified trajectory that accurately predicts the developmental path de novo. Applied to human B cell lymphopoiesis, the algorithm (termed Wanderlust) constructed trajectories spanning from hematopoietic stem cells through to naive B cells. This trajectory revealed nascent fractions of B cell progenitors and aligned them with developmentally cued regulatory signaling including IL-7/STAT5 and cellular events such as immunoglobulin rearrangement, highlighting checkpoints across which regulatory signals are rewired paralleling changes in cellular state. This study provides a comprehensive analysis of human B lymphopoiesis, laying a foundation to apply this approach to other tissues and “corrupted” developmental processes including cancer.


Monocle method

(Trapnell et al, Nature Biotechnology 2014)


Defining the transcriptional dynamics of a temporal process such as cell differentiation is challenging owing to the high variability in gene expression between individual cells. Time-series gene expression analyses of bulk cells have difficulty distinguishing early and late phases of a transcriptional cascade or identifying rare subpopulations of cells, and single-cell proteomic methods rely on a priori knowledge of key distinguishing markers. Here we describe Monocle, an unsupervised algorithm that increases the temporal resolution of transcriptome dynamics using single-cell RNA-Seq data collected at multiple time points. Applied to the differentiation of primary human myoblasts, Monocle revealed switch-like changes in expression of key regulatory factors, sequential waves of gene regulation, and expression of regulators that were not known to act in differentiation. We validated some of these predicted regulators in a loss-of-function screen. Monocle can in principle be used to recover single-cell gene expression kinetics from a wide array of cellular processes, including differentiation, proliferation and oncogenic transformation.
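The core ordering step in Monocle is worth sketching: cells are placed in a low-dimensional space, a minimum spanning tree is built over them, and the longest path through the tree is read off as the pseudotime trajectory. The toy below is our own plain-Python illustration (cells are already 2-D points, and "longest" is measured in hops; the real algorithm first reduces dimension with ICA and works with weighted paths):

```python
# Toy Monocle-style pseudotime: MST over cells, then the longest path
# through the tree as the ordering (illustrative sketch, not Monocle code).
cells = {0: (0.0, 0.0), 1: (1.0, 0.2), 2: (2.1, 0.1), 3: (3.0, 0.3), 4: (1.1, 1.5)}

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Prim's algorithm: grow the minimum spanning tree from cell 0
tree = {c: [] for c in cells}
in_tree = {0}
while len(in_tree) < len(cells):
    u, v = min(
        ((u, v) for u in in_tree for v in cells if v not in in_tree),
        key=lambda e: dist(cells[e[0]], cells[e[1]]),
    )
    tree[u].append(v)
    tree[v].append(u)
    in_tree.add(v)

def farthest(start):
    """BFS: return the node farthest (in hops) from `start` and its path."""
    seen, frontier, paths, last = {start}, [start], {start: [start]}, start
    while frontier:
        nxt = []
        for u in frontier:
            for v in tree[u]:
                if v not in seen:
                    seen.add(v)
                    paths[v] = paths[u] + [v]
                    nxt.append(v)
                    last = v
        frontier = nxt
    return last, paths[last]

# Longest path in a tree: farthest node from any start, then farthest from that
end_a, _ = farthest(0)
end_b, trajectory = farthest(end_a)
```

Cell 4 sits off the main axis, so the tree branches at cell 1 and the recovered trajectory follows one arm of the branch, which is exactly how these methods expose bifurcating fates.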


Computational and analytical challenges in single-cell transcriptomics

Oliver Stegle¹, Sarah A. Teichmann¹,² and John C. Marioni¹,²

¹European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. ²Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Correspondence to J.C.M. doi:10.1038/nrg3833. Published online 28 January 2015.



The development of high-throughput RNA sequencing (RNA-seq) at the single-cell level has already led to profound new discoveries in biology, ranging from the identification of novel cell types to the study of global patterns of stochastic gene expression. Alongside the technological breakthroughs that have facilitated the large-scale generation of single-cell transcriptomic data, it is important to consider the specific computational and analytical challenges that still have to be overcome. Although some tools for analysing RNA-seq data from bulk cell populations can be readily applied to single-cell RNA-seq data, many new computational strategies are required to fully exploit this data type and to enable a comprehensive yet detailed study of gene expression at the single-cell level.


Proportionality: A Valid Alternative to Correlation for Relative Data

David Lovell
Queensland University of Technology, Brisbane, Australia
Vera Pawlowsky-Glahn
Dept. d’Informàtica, Matemàtica Aplicada i Estadística. U. de Girona, España
Juan José Egozcue
Dept. Applied Mathematics III, U. Politécnica de Catalunya, Barcelona, Spain
Samuel Marguerat
MRC Clinical Sciences Centre, Imperial College London, United Kingdom
Jürg Bähler
Research Department of Genetics, Evolution and Environment, University College London, United Kingdom


In the life sciences, many measurement methods yield only the relative abundances of different components in a sample. With such relative—or compositional—data, differential expression needs careful interpretation, and correlation—a statistical workhorse for analyzing pairwise relationships—is an inappropriate measure of association. Using yeast gene expression data we show how correlation can be misleading and present proportionality as a valid alternative for relative data. We show how the strength of proportionality between two variables can be meaningfully and interpretably described by a new statistic ϕ which can be used instead of correlation as the basis of familiar analyses and visualisation methods, including co-expression networks and clustered heatmaps. While the main aim of this study is to present proportionality as a means to analyse relative data, it also raises intriguing questions about the molecular mechanisms underlying the proportional regulation of a range of yeast genes.
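The statistic φ mentioned in the abstract is, as defined in the paper, the variance of the log-ratio of two variables scaled by the variance of the log of the reference variable: φ(x, y) = var(log x − log y) / var(log x), which is near zero when y is proportional to x. A small illustration of that formula (our own code, not the authors' implementation):

```python
# Sketch of the proportionality statistic phi:
#   phi(x, y) = var(log x - log y) / var(log x)
# Near zero when y is exactly proportional to x (illustrative code only).
from math import log
from statistics import variance

def phi(x, y):
    lx = [log(v) for v in x]
    ly = [log(v) for v in y]
    return variance([a - b for a, b in zip(lx, ly)]) / variance(lx)

x = [1.0, 2.0, 4.0, 8.0]
proportional = [3.0 * v for v in x]   # y = 3x exactly
noisy = [3.1, 5.2, 13.0, 22.0]        # roughly, but not exactly, 3x
phi_exact = phi(x, proportional)      # ~0 (up to floating-point noise)
phi_noisy = phi(x, noisy)             # clearly positive
```

Unlike correlation, this quantity is unchanged if both variables are divided by a common (per-sample) total, which is why it remains meaningful for the relative, compositional data the paper is concerned with.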


The human transcriptome across tissues and individuals

Melé, Ferreira, Reverter et al, Science 2015

Transcriptional regulation and posttranscriptional processing underlie many cellular and organismal phenotypes. We used RNA sequence data generated by the Genotype-Tissue Expression (GTEx) project to investigate the patterns of transcriptome variation across individuals and tissues. Tissues exhibit characteristic transcriptional signatures that show stability in postmortem samples. These signatures are dominated by a relatively small number of genes—which is most clearly seen in blood—though few are exclusive to a particular tissue and vary more across tissues than individuals. Genes exhibiting high interindividual expression variation include disease candidates associated with sex, ethnicity, and age. Primary transcription is the major driver of cellular specificity, with splicing playing mostly a complementary role; except for the brain, which exhibits a more divergent splicing program. Variation in splicing, despite its stochasticity, may play in contrast a comparatively greater role in defining individual phenotypes.
