Yearly Archives: 2014


Regression Analysis of Combined Gene Expression Regulation in Acute Myeloid Leukemia

Yue Li, Minggao Liang, Zhaolei Zhang


Gene expression is a combinatorial function of genetic and epigenetic factors such as copy number variation (CNV), DNA methylation (DM), transcription factor (TF) occupancy, and microRNA (miRNA) post-transcriptional regulation. As microarray and sequencing technologies have matured, large amounts of data measuring the genome-wide signals of those factors have become available from the Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA). However, there is a lack of an integrative model that takes full advantage of these rich yet heterogeneous data. To this end, we developed RACER (Regression Analysis of Combined Expression Regulation), which fits mRNA expression as the response, using as explanatory variables the TF data from ENCODE and the CNV, DM, and miRNA expression signals from TCGA. Briefly, RACER first infers the sample-specific regulatory activities of TFs and miRNAs, which are then used as inputs to infer specific TF/miRNA-gene interactions. Such a two-stage regression framework circumvents a common difficulty in integrating ENCODE data measured in a generic cell line with the sample-specific TCGA measurements. As a case study, we integrated Acute Myeloid Leukemia (AML) data from TCGA and the related TF binding data measured in K562 from ENCODE. As a proof of concept, we first verified our model formalism by 10-fold cross-validation on predicting gene expression. We next evaluated RACER on recovering known regulatory interactions, and demonstrated its superior statistical power over existing methods in detecting known miRNA/TF targets. Additionally, we developed a feature selection procedure, which identified 18 regulators whose activities clustered consistently with cytogenetic risk groups. One of the selected regulators is miR-548p, whose inferred targets were significantly enriched for leukemia-related pathways, implicating its novel role in AML pathogenesis. Moreover, survival analysis using the inferred activities identified C-Fos as a potential AML prognostic marker. Together, we provided a novel framework that successfully integrated the TCGA and ENCODE data in revealing the AML-specific regulatory program at the global level.
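
The two-stage idea can be illustrated with a minimal sketch. It is not the RACER implementation: it assumes a single regulator, ordinary least squares, and hypothetical names (`ols_slope`, `two_stage`, `expr`, `binding`), whereas the abstract describes many TFs and miRNAs together with CNV and DM signals. Stage 1 regresses expression on a generic binding score across genes within each sample to obtain a sample-specific activity; stage 2 regresses each gene's expression on those activities across samples to obtain a gene-specific interaction strength.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope b of the least-squares line y ~ a + b * x (x must not be constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def two_stage(expr, binding):
    """Toy two-stage regression for one regulator.

    expr[s][g] -- expression of gene g in sample s (sample-specific data)
    binding[g] -- generic, cell-line-derived binding score of the regulator at gene g
    """
    n_samples = len(expr)
    # Stage 1: one regression per sample, across genes -> sample-specific activity.
    activity = [ols_slope(binding, expr[s]) for s in range(n_samples)]
    # Stage 2: one regression per gene, across samples -> interaction strength.
    interaction = [
        ols_slope(activity, [expr[s][g] for s in range(n_samples)])
        for g in range(len(binding))
    ]
    return activity, interaction


# Synthetic data: expression equals activity * binding, with activities 1, 2, 3.
binding = [0.0, 1.0, 2.0, 3.0]
expr = [[a * b for b in binding] for a in (1.0, 2.0, 3.0)]
activity, interaction = two_stage(expr, binding)
```

On this noise-free toy input the inferred activities match the values used to generate the data, which is the point of the design: the generic binding scores are only ever compared within a sample, and the sample-specific signal enters the second stage through the activities.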


Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation

Tarmo Äijö 1,*, Vincent Butty 2, Zhi Chen 3, Verna Salo 3, Subhash Tripathi 3, Christopher B. Burge 2, Riitta Lahesmaa 3 and Harri Lähdesmäki 1,3,*

1 Department of Information and Computer Science, Aalto University, FI-00076 Aalto, Finland

2 Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

3 Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520 Turku, Finland

* To whom correspondence should be addressed.


Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course data have not been proposed.

Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T cell activation (Th0). To quantify Th17-specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non-parametric Gaussian processes to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment-specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional time point-wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.
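
A minimal sketch of the two modelling ingredients named in the abstract, a Gaussian process prior over time and a negative binomial likelihood for counts. The squared-exponential covariance, the mean/dispersion parameterization, the log link, and all function names are assumptions made for illustration; the abstract does not state these choices, and the MCMC sampler and the time rescaling between replicates are not shown.

```python
import math


def sq_exp_kernel(times, length_scale, variance):
    """Gaussian process covariance matrix (squared-exponential kernel, assumed here)."""
    return [
        [variance * math.exp(-0.5 * ((s - t) / length_scale) ** 2) for t in times]
        for s in times
    ]


def nb_log_pmf(count, mu, dispersion):
    """Log-probability of a count under a negative binomial with mean mu.

    The variance is mu + mu**2 / dispersion, so small dispersion values
    mean strong over-dispersion relative to a Poisson model.
    """
    r = dispersion
    return (
        math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
        + r * math.log(r / (r + mu))
        + count * math.log(mu / (r + mu))
    )


def count_log_likelihood(counts, log_means, dispersion):
    """Log-likelihood of a count time series given latent log-means.

    In a full model the log-means would be a draw from the Gaussian process.
    """
    return sum(
        nb_log_pmf(k, math.exp(f), dispersion) for k, f in zip(counts, log_means)
    )


# Covariance between measurement times (hours are a hypothetical unit).
K = sq_exp_kernel([0.0, 1.0], length_scale=1.0, variance=2.0)
```

Nearby time points get a large covariance under this kernel, which is how the temporal correlation enters; the negative binomial term then scores how well a smooth latent curve explains the observed read counts.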

Availability: An implementation of the proposed computational methods will be available at

Supplementary information: Supplementary data are available at Bioinformatics online.


Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications

Eduardo G. Gusmao 1,*, Christoph Dieterich 2, Martin Zenke 3,4 and Ivan G. Costa 1,5,6,*

1 IZKF Computational Biology Research Group, Institute for Biomedical Engineering, RWTH Aachen University Medical School, 52074 Aachen

2 Computational RNA Biology Lab and Bioinformatics Core, Max Planck Institute for Biology of Ageing, 50931 Cologne

3 Department of Cell Biology, Institute for Biomedical Engineering, RWTH Aachen University Medical School, 52074

4 Helmholtz Institute for Biomedical Engineering, 52074

5 Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, 52062 Aachen, Germany

6 Center of Informatics, Federal University of Pernambuco, 50740560 Recife-PE, Brazil

* To whom correspondence should be addressed.
  • Received October 28, 2013.
  • Revision received June 27, 2014.
  • Accepted July 25, 2014.


Motivation: The identification of active transcriptional regulatory elements is crucial to understand regulatory networks driving cellular processes such as cell development and the onset of diseases. It has recently been shown that chromatin structure information, such as DNase I hypersensitivity (DHS) or histone modifications, significantly improves cell-specific predictions of transcription factor binding sites. However, no method has so far successfully combined both DHS and histone modification data to perform active binding site prediction.

Results: We propose here a method based on hidden Markov models to integrate DHS and histone modification occupancy for the detection of open chromatin regions and active binding sites. We have created a framework that includes treatment of genomic signals, model training and genome-wide application. In a comparative analysis, our method obtained a good trade-off between sensitivity and specificity and superior area under the curve statistics compared with competing methods. Moreover, our technique does not require further training or sequence information to generate binding location predictions. Therefore, the method can be easily applied to new cell types and allows flexible downstream analysis such as de novo motif finding.
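
As a toy illustration of hidden Markov model decoding (not the authors' model), assume two hidden states, closed and open chromatin, and a single genomic signal discretized into "low" and "high". The states, the discretization, and every probability below are invented for the example; the method in the abstract combines DHS with histone modification signals. Viterbi decoding recovers the most likely state path.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path, computed in log space."""
    # Best log-probability of any path ending in each state at position 0.
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters for the toy model.
STATES = ("closed", "open")
START = {"closed": 0.9, "open": 0.1}
TRANS = {
    "closed": {"closed": 0.9, "open": 0.1},
    "open": {"closed": 0.2, "open": 0.8},
}
EMIT = {
    "closed": {"low": 0.9, "high": 0.1},
    "open": {"low": 0.2, "high": 0.8},
}

signal = ["low", "low", "high", "high", "high", "low", "low"]
path = viterbi(signal, STATES, START, TRANS, EMIT)
```

With these parameters the run of three "high" observations is labelled open and the flanks closed. Because decoding needs only the signal tracks and a trained model, a scheme of this kind can be applied to a new cell type without sequence information, which is the property the abstract highlights.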

Availability and implementation: Our framework is available as part of the Regulatory Genomics Toolbox. The software information and all benchmarking data are available at

Supplementary information: Supplementary data are available at Bioinformatics online.


Wigwams: identifying gene modules co-regulated across multiple biological conditions

Krzysztof Polanski 1, Johanna Rhodes 1, Claire Hill 2, Peijun Zhang 2, Dafyd J. Jenkins 1, Steven J. Kiddle 1,§, Aleksey Jironkin 1, Jim Beynon 1,2, Vicky Buchanan-Wollaston 1,2, Sascha Ott 1 and Katherine J. Denby 1,2,*

1 Warwick Systems Biology Centre and 2 School of Life Sciences, University of Warwick, CV4 7AL, UK

* To whom correspondence should be addressed.
  • Received September 17, 2013.
  • Revision received December 12, 2013.
  • Accepted December 13, 2013.


Motivation: Identification of modules of co-regulated genes is a crucial first step towards dissecting the regulatory circuitry underlying biological processes. Co-regulated genes are likely to reveal themselves by showing tight co-expression, e.g. high correlation of expression profiles across multiple time series datasets. However, numbers of up- or downregulated genes are often large, making it difficult to discriminate between dependent co-expression resulting from co-regulation and independent co-expression. Furthermore, modules of co-regulated genes may only show tight co-expression across a subset of the time series, i.e. show condition-dependent regulation.

Results: Wigwams is a simple and efficient method to identify gene modules showing evidence for co-regulation in multiple time series of gene expression data. Wigwams analyzes similarities of gene expression patterns within each time series (condition) and directly tests the dependence or independence of these across different conditions. The expression pattern of each gene in each subset of conditions is tested statistically as a potential signature of a condition-dependent regulatory mechanism regulating multiple genes. Wigwams does not require particular time points and can process datasets that are on different time scales. Differential expression relative to control conditions can be taken into account. The output is succinct and non-redundant, enabling gene network reconstruction to be focused on those gene modules and combinations of conditions that show evidence for shared regulatory mechanisms. Wigwams was run using six Arabidopsis time series expression datasets, producing a set of biologically significant modules spanning different combinations of conditions.
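
One way to test whether co-expression in two conditions is dependent is to ask whether the sets of genes most similar to a seed gene in each condition overlap more than chance would allow. The sketch below uses a hypergeometric tail probability for that purpose. The choice of test, the function name `overlap_p_value`, and the example numbers are assumptions for illustration; the abstract does not specify the statistic Wigwams uses.

```python
from math import comb


def overlap_p_value(n_genes, size_a, size_b, overlap):
    """P(overlap or more shared genes) when two gene sets are drawn independently.

    n_genes -- total number of genes considered
    size_a  -- genes co-expressed with the seed gene in condition A
    size_b  -- genes co-expressed with the seed gene in condition B
    overlap -- genes found in both sets
    """
    total = comb(n_genes, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(n_genes - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 1000 genes, two sets of 50, 10 genes shared.
p = overlap_p_value(n_genes=1000, size_a=50, size_b=50, overlap=10)
```

A small p-value suggests that the shared genes are unlikely under independent co-expression, so the pair of conditions is a candidate for a common regulatory mechanism. Because only set membership is compared, the two time series do not need matching time points or time scales.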

Availability and implementation: A Matlab implementation of Wigwams, complete with graphical user interfaces and documentation, is available at:


Supplementary Data: Supplementary data are available at Bioinformatics online.


Automatic Parameter Learning for Multiple Network Alignment

Jason Flannick1, Antal Novak1, Chuong B. Do1, Balaji S. Srinivasan2, and Serafim Batzoglou1
1Department of Computer Science, Stanford University, Stanford, CA 94305, USA
2Department of Statistics, Stanford University, Stanford, CA 94305, USA


We developed Græmlin 2.0, a new multiple network aligner with (1) a novel scoring function; (2) an algorithm that automatically learns the scoring function’s parameters; and (3) an algorithm that uses the scoring function to globally align multiple networks. Existing alignment tools use heuristic scoring functions, which must be hand-tuned to a given set of networks and do not apply to multiple network alignment.

Our scoring function can use arbitrary features of a multiple network alignment, such as protein deletions, protein duplications, protein mutations, and interaction losses. Our parameter learning algorithm uses a training set of known network alignments to learn parameters for our scoring function and thereby automatically adapts it to any set of networks. Our global alignment algorithm finds approximate multiple network alignments in linear time.

We tested Græmlin 2.0’s accuracy on protein interaction networks from IntAct, DIP, and the Stanford Network Database. We show that, on each of these datasets, Græmlin 2.0 has higher sensitivity and specificity than existing network aligners. Græmlin 2.0 is available under the GNU public license at


A Family of Algorithms for Computing Consensus about Node State from Network Data

Eleanor R. Brush, David C. Krakauer, Jessica C. Flack


Biological and social networks are composed of heterogeneous nodes that contribute differentially to network structure and function. A number of algorithms have been developed to measure this variation. These algorithms have proven useful for applications that require assigning scores to individual nodes–from ranking websites to determining critical species in ecosystems–yet the mechanistic basis for why they produce good rankings remains poorly understood. We show that a unifying property of these algorithms is that they quantify consensus in the network about a node’s state or capacity to perform a function. The algorithms capture consensus by either taking into account the number of a target node’s direct connections, and, when the edges are weighted, the uniformity of its weighted in-degree distribution (breadth), or by measuring net flow into a target node (depth). Using data from communication, social, and biological networks we find that how an algorithm measures consensus–through breadth or depth–impacts its ability to correctly score nodes. We also observe variation in sensitivity to source biases in interaction/adjacency matrices: errors arising from systematic error at the node level or direct manipulation of network connectivity by nodes. Our results indicate that the breadth algorithms, which are derived from information theory, correctly score nodes (assessed using independent data) and are robust to errors. However, in cases where nodes “form opinions” about other nodes using indirect information, like reputation, depth algorithms, like Eigenvector Centrality, are required. One caveat is that Eigenvector Centrality is not robust to error unless the network is transitive or assortative. In these cases the network structure allows the depth algorithms to effectively capture breadth as well as depth. Finally, we discuss the algorithms’ cognitive and computational demands. This is an important consideration in systems in which individuals use the collective opinions of others to make decisions.
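
The breadth and depth notions can be made concrete with a small sketch. It assumes that breadth is summarized by in-degree together with the Shannon entropy of a node's incoming edge weights, and that depth is Eigenvector Centrality computed by power iteration; the function names, the normalization, and the example network are illustrative choices, not the authors' exact definitions.

```python
import math


def in_degree(adj, node):
    """Number of nodes with an edge pointing at `node` (adj[i][j] is the weight i -> j)."""
    return sum(1 for row in adj if row[node] > 0)


def weighted_in_degree_entropy(incoming_weights):
    """Shannon entropy (bits) of a node's incoming weight distribution.

    High entropy means many sources contribute evenly, i.e. broad consensus.
    """
    total = sum(incoming_weights)
    probs = [w / total for w in incoming_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def eigenvector_centrality(adj, iterations=100):
    """Depth-style score: flow into a node weighted by the scores of its sources.

    Plain power iteration; it assumes a strongly connected, aperiodic network.
    """
    n = len(adj)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [sum(adj[i][j] * scores[i] for i in range(n)) for j in range(n)]
        norm = sum(new)
        scores = [v / norm for v in new]
    return scores


# Hypothetical 3-node directed network: edges 0->1, 0->2, 1->0, 1->2, 2->0.
adj = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
scores = eigenvector_centrality(adj)
```

The entropy uses only a node's own incoming edges, so an error elsewhere in the network cannot change it, whereas the centrality of a node depends on the scores of every node upstream of it. That locality is one plausible reading of why the breadth measures are reported as more robust to errors.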


Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM


Charles J. Vaske1,†, Stephen C. Benz2,†, J. Zachary Sanborn2, Dent Earl2, Christopher Szeto2, Jingchun Zhu2, David Haussler1,2 and Joshua M. Stuart2,*

1 Howard Hughes Medical Institute and 2 Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, UC Santa Cruz, CA, USA

* To whom correspondence should be addressed.


Motivation: High-throughput data is providing a comprehensive view of the molecular changes in cancer tissues. New technologies allow for the simultaneous genome-wide assay of the state of genome copy number variation, gene expression, DNA methylation and epigenetics of tumor samples and cancer cell lines.

Analyses of current data sets find that genetic alterations between patients can differ but often involve common pathways. It is therefore critical to identify relevant pathways involved in cancer progression and detect how they are altered in different patients.

Results: We present a novel method for inferring patient-specific genetic activities incorporating curated pathway interactions among genes. A gene is modeled by a factor graph as a set of interconnected variables encoding the expression and known activity of a gene and its products, allowing the incorporation of many types of omic data as evidence. The method predicts the degree to which a pathway’s activities (e.g. internal gene states, interactions or high-level ‘outputs’) are altered in the patient using probabilistic inference.

Compared with a competing pathway activity inference approach called SPIA, our method identifies altered activities in cancer-related pathways with fewer false-positives in both a glioblastoma multiforme (GBM) and a breast cancer dataset. PARADIGM identified consistent pathway-level activities for subsets of the GBM patients that are overlooked when genes are considered in isolation. Further, grouping GBM patients based on their significant pathway perturbations divides them into clinically relevant subgroups having significantly different survival outcomes. These findings suggest that therapeutics might be chosen that target genes at critical points in the commonly perturbed pathway(s) of a group of patients.

Availability: Source code available at


Supplementary information: Supplementary data are available at Bioinformatics online.


Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin

Katherine A. Hoadley1, 20, Christina Yau2, 20, Denise M. Wolf3, 20, Andrew D. Cherniack4, 20, David Tamborero5, Sam Ng6, Max D.M. Leiserson7, Beifang Niu8, Michael D. McLellan8, Vladislav Uzunangelov6, Jiashan Zhang9, Cyriac Kandoth8, Rehan Akbani10, Hui Shen11, 22, Larsson Omberg12, Andy Chu13, Adam A. Margolin12, 21, Laura J. van’t Veer3, Nuria Lopez-Bigas5, 14, Peter W. Laird11, 22, Benjamin J. Raphael7, Li Ding8, A. Gordon Robertson13, Lauren A. Byers10, Gordon B. Mills10, John N. Weinstein10, Carter Van Waes18, Zhong Chen19, Eric A. Collisson15, The Cancer Genome Atlas Research Network, Christopher C. Benz2, Charles M. Perou1, 16, 17, Joshua M. Stuart6

1 Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 USA
2 Buck Institute for Research on Aging, Novato, CA 94945, USA
3 Department of Laboratory Medicine, University of California San Francisco, 2340 Sutter St, San Francisco, CA, 94115, USA
4 The Eli and Edythe Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
5 Research Unit on Biomedical Informatics, Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Dr. Aiguader 88, Barcelona 08003, Spain
6 Department of Biomolecular Engineering, Center for Biomolecular Sciences and Engineering, University of California, Santa Cruz, 1156 High St., Santa Cruz, CA 95064, USA
7 Department of Computer Science and Center for Computational Molecular Biology, Brown University, 115 Waterman St, Providence RI 02912, USA
8 The Genome Institute, Washington University, St Louis, MO 63108, USA
9 National Cancer Institute, NIH, Bethesda, MD 20892, USA
10 UT MD Anderson Cancer Center, Bioinformatics and Computational Biology, 1400 Pressler Street, Unit 1410, Houston, TX 77030, USA
11 USC Epigenome Center, University of Southern California Keck School of Medicine, 1450 Biggy Street, Los Angeles, CA 90033, USA
12 Sage Bionetworks 1100 Fairview Avenue North, M1-C108, Seattle, WA 98109-1024, USA
13 Canada’s Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC V5Z 4S6, Canada
14 Catalan Institution for Research and Advanced Studies (ICREA), Passeig Lluís Companys, 23, Barcelona 08010, Spain
15 Department of Medicine, University of California San Francisco, 450 3rd St, San Francisco, CA, 94148, USA
16 Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 USA
17 Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 USA
18 Building 10, Room 4-2732, NIDCD/NIH, 10 Center Drive, Bethesda, MD 20892
19 Head and Neck Surgery Branch, NIDCD/NIH, 10 Center Drive, Room 5D55, Bethesda, MD 20892


An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding


Shaun Mahony*, Matthew D. Edwards*, Esteban O. Mazzoni, Richard I. Sherwood, Akshay Kakumanu, Carolyn A. Morrison, Hynek Wichterle, David K. Gifford

*equal contributor

Published: March 27, 2014    DOI: 10.1371/journal.pcbi.1003501


Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS’s multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulatory signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.

07.09.2014 and 07.23.14

Predicting Dynamic Signaling Network Response under Unseen Perturbations

Fan Zhu 1 and Yuanfang Guan 1,2,3,*

1 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA

2 Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA

3 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

* To whom correspondence should be addressed.


Motivation: Predicting trajectories of signaling networks under complex perturbations is one of the most valuable but challenging tasks in systems biology. Signaling networks are involved in most of the biological pathways and modeling their dynamics has wide applications including drug design and treatment outcome prediction.

Results: In this paper, we report a novel model for predicting the cell type-specific time course response of signaling proteins under unseen perturbations. This algorithm achieved the top performance in the 2013 8th Dialogue for Reverse Engineering Assessments and Methods (DREAM 8) sub-challenge: time course prediction in breast cancer cell lines. We formulate the trajectory prediction problem as a standard regularization problem; the solution then amounts to solving this discrete ill-posed problem. The algorithm has three steps: denoising, estimating regression coefficients and modeling trajectories under unseen perturbations. We further validated the accuracy of this method against simulation and experimental data. Furthermore, this method reduces computational time by orders of magnitude compared to state-of-the-art methods, allowing genome-wide modeling of signaling pathways and time course trajectories to be carried out in a practical time.
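
A standard way to stabilize a discrete ill-posed least-squares problem is Tikhonov (ridge) regularization, which minimizes ||Ax - b||^2 + lam^2 ||x||^2. The sketch below solves the corresponding normal equations in plain Python. Whether the authors use this particular regularizer is an assumption; the abstract only says "standard regularization", and the names `tikhonov` and `solve_linear` are hypothetical. The denoising and trajectory modeling steps are not shown.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def tikhonov(A, b, lam):
    """Regression coefficients x minimizing ||A x - b||^2 + lam^2 ||x||^2."""
    n = len(A[0])
    # Normal equations with a ridge penalty: (A^T A + lam^2 I) x = A^T b.
    ata = [
        [sum(row[i] * row[j] for row in A) + (lam ** 2 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(n)]
    return solve_linear(ata, atb)


# Toy example: identity design matrix, so the penalty simply shrinks b.
coefficients = tikhonov([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
```

The penalty term keeps the system well conditioned even when the columns of A are nearly collinear, at the cost of shrinking the coefficients toward zero; lam controls that trade-off and would normally be chosen by cross-validation or a similar criterion.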

Availability and Implementation: Source code is available at and as supplementary file online.