Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning

Our next meeting will be at 3:00 on June 23rd, in room 4160 of the Discovery building. Our selected paper is Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning.
The abstract is as follows.

We present single-cell interpretation via multikernel learning (SIMLR), an analytic framework and software which learns a similarity measure from single-cell RNA-seq data in order to perform dimension reduction, clustering and visualization. On seven published data sets, we benchmark SIMLR against state-of-the-art methods. We show that SIMLR is scalable and greatly enhances clustering performance while improving the visualization and interpretability of single-cell sequencing data.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Can We Predict Gene Expression by Understanding Proximal Promoter Architecture?

Our next meeting will be at 3:00 on April 14th, in room 4160 of the Discovery building. Our selected paper is Can We Predict Gene Expression by Understanding Proximal Promoter Architecture?
The abstract is as follows.

We review computational predictions of expression from the promoter architecture – the set of transcription factors that can bind the proximal promoter. We focus on spatial expression patterns in animals with complex body plans and many distinct tissue types. This field is ripe for change as functional genomics datasets accumulate for both expression and protein–DNA interactions. While there has been some success in predicting the breadth of expression (i.e., the fraction of tissue types a gene is expressed in), predicting tissue specificity remains challenging. We discuss how progress can be achieved through either machine learning or complementary combinatorial data mining. The likely impact of single-cell expression data is considered. Finally, we discuss the design of artificial promoters as a practical application.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Mutation effects predicted from sequence co-variation

Our next meeting will be at 3:00 on February 24th, in room 4160 of the Discovery building. Our selected paper is Mutation effects predicted from sequence co-variation.
The abstract is as follows.

Many high-throughput experimental technologies have been developed to assess the effects of large numbers of mutations (variation) on phenotypes. However, designing functional assays for these methods is challenging, and systematic testing of all combinations is impossible, so robust methods to predict the effects of genetic variation are needed. Most prediction methods exploit evolutionary sequence conservation but do not consider the interdependencies of residues or bases. We present EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions. We validate EVmutation by comparing its predictions with outcomes of high-throughput mutagenesis experiments and measurements of human disease mutations and show that it outperforms methods that do not account for epistasis. EVmutation can be used to assess the quantitative effects of mutations in genes of any organism. We provide pre-computed predictions for ~7,000 human proteins at http://evmutation.org/.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.

Analysis of computational footprinting methods for DNase sequencing experiments

Our paper for the next journal club meeting on 4/18/2016 is Analysis of computational footprinting methods for DNase sequencing experiments by Gusmao et al. (Nature Methods, 2016). The abstract is as follows.

DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. Frequently in high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods—HINT, DNase2TF and PIQ—consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.

Feel free to begin the discussion in the Comments section below.

Multitask matrix completion for learning protein interactions across diseases

For our next meeting on 3/28/2016 we have selected Multitask matrix completion for learning protein interactions across diseases by Kshirsagar et al. The abstract is as follows.

Disease-causing pathogens such as viruses introduce their proteins into the host cells, where they interact with the host’s proteins, enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different but related viruses: Hepatitis C, Ebola virus and Influenza A. Our multitask matrix completion based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain up to a 39% improvement in predictive performance over prior state-of-the-art models. We show how our model’s parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code and data are available at: http://www.cs.cmu.edu/~mkshirsa/bsl_mtl.tgz

We look forward to seeing all who can come. Feel free to begin our discussion in the comments section below.

Factor graphs and the sum-product algorithm

Dear Journal Club members,

Our next meeting will be on March 14th, at noon in room 3160 of the Discovery Building. For this meeting we have selected a paper by Kschischang et al., Factor graphs and the sum-product algorithm, from IEEE. The abstract is presented below.

Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of “local” functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph. In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes, either exactly or approximately, various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative “turbo” decoding algorithm, Pearl’s (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.
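For anyone who would like a concrete picture before the meeting, here is a minimal sketch of the idea in Python. It is not taken from the paper: the three-variable chain, the factor names fa, fb and fc, and their numeric values are all made up for illustration. It computes the unnormalized marginal of one variable by passing messages, then checks the result against a brute-force sum over every configuration.

```python
from itertools import product

# Hypothetical toy factor graph (an assumption, not from the paper):
# a chain over three binary variables with global function
#   g(x1, x2, x3) = fa(x1) * fb(x1, x2) * fc(x2, x3)
fa = [0.6, 0.4]                    # fa[x1]
fb = [[0.9, 0.1], [0.2, 0.8]]      # fb[x1][x2]
fc = [[0.7, 0.3], [0.5, 1.5]]      # fc[x2][x3]


def marginal_x2_sum_product():
    """Unnormalized marginal of x2, computed from local messages."""
    # x1 has only one other neighbour (fa), so the message x1 -> fb equals fa.
    # The message fb -> x2 multiplies that in and sums x1 out.
    msg_fb_to_x2 = [
        sum(fa[x1] * fb[x1][x2] for x1 in range(2)) for x2 in range(2)
    ]
    # x3 is a leaf variable, so it sends the unit message to fc.
    # The message fc -> x2 sums x3 out.
    msg_fc_to_x2 = [
        sum(fc[x2][x3] * 1.0 for x3 in range(2)) for x2 in range(2)
    ]
    # The marginal at a variable is the product of its incoming messages.
    return [msg_fb_to_x2[x2] * msg_fc_to_x2[x2] for x2 in range(2)]


def marginal_x2_brute_force():
    """Same marginal, by summing the global function over all configurations."""
    totals = [0.0, 0.0]
    for x1, x2, x3 in product(range(2), repeat=3):
        totals[x2] += fa[x1] * fb[x1][x2] * fc[x2][x3]
    return totals


if __name__ == "__main__":
    print(marginal_x2_sum_product())
    print(marginal_x2_brute_force())
```

Because this toy graph is a tree, the two functions should agree up to floating-point rounding. On graphs with cycles the same message rule only gives an approximation, which is the "exactly or approximately" distinction the abstract draws.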

Please feel free to start the discussion in the comments section below.

Enhancer Evolution across 20 Mammalian Species

Our first meeting of 2016 is scheduled for 12:00 on the 25th of January in room 3160 in the Discovery building. The room may be subject to change. The paper selection is Enhancer Evolution across 20 Mammalian Species, available online at the link. We will allot some time at the beginning of our meeting to discuss paper suggestions and themes we would like to cover this semester.

The abstract of the paper is as follows. Please feel free to begin our discussion in the comments section below.

The mammalian radiation has corresponded with rapid changes in noncoding regions of the genome, but we lack a comprehensive understanding of regulatory evolution in mammals. Here, we track the evolution of promoters and enhancers active in liver across 20 mammalian species from six diverse orders by profiling genomic enrichment of H3K27 acetylation and H3K4 trimethylation. We report that rapid evolution of enhancers is a universal feature of mammalian genomes. Most of the recently evolved enhancers arise from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements. In contrast, almost all liver promoters are partially or fully conserved across these species. Our data further reveal that recently evolved enhancers can be associated with genes under positive selection, demonstrating the power of this approach for annotating regulatory adaptations in genomic sequences. These results provide important insight into the functional genetics underpinning mammalian regulatory evolution.

We look forward to seeing those who can attend soon.


Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation

Tarmo Äijö (1,*), Vincent Butty (2), Zhi Chen (3), Verna Salo (3), Subhash Tripathi (3), Christopher B. Burge (2), Riitta Lahesmaa (3) and Harri Lähdesmäki (1,3,*)

(1) Department of Information and Computer Science, Aalto University, FI-00076 Aalto, Finland
(2) Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(3) Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520 Turku, Finland
(*) To whom correspondence should be addressed


Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course have not been proposed.

Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T cell activation (Th0). To quantify Th17-specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non-parametric Gaussian processes to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment-specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional timepoint-wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.

Availability: An implementation of the proposed computational methods will be available at http://research.ics.aalto.fi/csb/software/

Contact: tarmo.aijo@aalto.fi or harri.lahdesmaki@aalto.fi

Supplementary information: Supplementary data are available at Bioinformatics online.


Perturbation Biology: Inferring Signaling Networks in Cellular Systems

Evan J. Molinelli*, Anil Korkut*, Weiqing Wang*, Martin L. Miller, Nicholas P. Gauthier, Xiaohong Jing, Poorvi Kaushik, Qin He, Gordon Mills, David B. Solit, Christine A. Pratilas, Martin Weigt, Alfredo Braunstein, Andrea Pagnani, Riccardo Zecchina, Chris Sander (* equal contributors)


We present a powerful experimental-computational technology for inferring network models that predict the response of cells to perturbations, and that may be useful in the design of combinatorial therapy against cancer. The experiments are systematic series of perturbations of cancer cell lines by targeted drugs, singly or in combination. The response to perturbation is quantified in terms of relative changes in the measured levels of proteins, phospho-proteins and cellular phenotypes such as viability. Computational network models are derived de novo, i.e., without prior knowledge of signaling pathways, and are based on simple non-linear differential equations. The prohibitively large solution space of all possible network models is explored efficiently using a probabilistic algorithm, Belief Propagation (BP), which is three orders of magnitude faster than standard Monte Carlo methods. Explicit executable models are derived for a set of perturbation experiments in SKMEL-133 melanoma cell lines, which are resistant to the therapeutically important inhibitor of RAF kinase. The resulting network models reproduce and extend known pathway biology. They empower potential discoveries of new molecular interactions and predict efficacious novel drug perturbations, such as the inhibition of PLK1, which is verified experimentally. This technology is suitable for application to larger systems in diverse areas of molecular biology.


To comment, please see the continuation meeting post on 02.05.14.