A systems level integrative framework for genome-wide DNA methylation and gene expression data

Integrative analysis of DNA methylation (DNAm) and gene expression data has great potential power to illuminate many processes related to the onset of dancer and other diseases. In particular, protein-protein interaction (PPI) networks have already been used to identify subnetworks associated with significant differential DNAm states, highlighting key nodes (genes) responsible in driving phenotypic differences in disease progression. We are interested in the work of Jiao et al. (http://bioinformatics.oxfordjournals.org/content/30/16/2360.full) for the method they have developed to address this question.

The effort of Jiao et al resulted in the development of the Functional Epigenetic Module (FEM) algorithm for integrative analysis of DNAm and gene expression data. This algorithm uses Illumina Infinium 450k DNAm data with matched gene expression data, and performs a supervised analysis with a chosen PPI network as a scaffold on which to identify gene modules (really a PPI subnetwork) that are epigenetically differentially regulated in a specific phenotype, such as a cancerous state. When applying this integrative analysis they use a gene set from the overlap of genes represented in the PPI network as well as in the DNAm, and gene expression data sets, which leads to a limitation for truly genome-wide analyses. For the DNAm data, the usual method is to assign each gene the average value of mapping to a location within 200 bp of the transcription start site of that gene. If no such probe mappings are found in the DNAm data, then the probes mapping to the first exon of the said gene are used, and failing that, the values for probes mapping to within 1500 bp of the gene TSS are used. The point here is that there are open questions about the best way to define per gene DNAm data values, and this is the chosen method in this work. Statistics for the association of the DNAm profile and the corresponding expression profile of each gene to the phenotype the data is representing are derived . These are basically regularized t-statistics for these measures of the genes using an empirical Bayesian framework, and these statistics are normalized for DNAm and expressions data, so as to not bias the results of the algorithm toward one data type over the other. For each gene then, an average t-statistic is ultimately produced. The central concept is to encapsulate the connection of genes in the POI with weighted edges, where the edge wright is the average of the t-statistics for the DNAm and expression data for those two connected gene nodes. The edges with higher weights represent the differentially expressed or methylated edges. Subnetworks identifying “hotspots” of differential methylation and expression are identified as “heavy” subnetworks, meaning subnetworks connected by edges with higher than average weight values for differential methylation and expression. The authors of this work call such a subnetwork a FEM.

In this method, the default mode is to select the top 100 most differential expressed and methylated genes as measured by the respective t-statistics for those genes. These genes seed the search for subnetworks, and each of the seeded genes can lead to the definition of a FEM. A clustering method called the spin-glass algorithm (http://journals.aps.org/pre/abstract/10.1103/PhysRevE.74.016110, http://www.nature.com/srep/2013/130409/srep01630/full/srep01630.html) is used to determine the sets of genes to be clustered in subnetworks, and those cases where a seed gene is really involved in an isolated edge of high association, but isn’t otherwise part of a meaningful subnetwork. The choice of 100 seed genes was found to be practical because it allowed the search space of the PPI network in the algorithm to cover most of the nodes of the network, and produced clusters of 10 to 100 genes, which they regard as optimal for finding outside validation in biology. The coverage of the chosen PPI network was the main criterion indicated for the choice of how many seed genes to work with.

The validation procedures used here are as follows. Firstly, the authors looked at the statistical significance of the variation in the DNAm and expression data of edges in the inferred modules. This was done in simulation by randomly permuting the node (gene) t-statistics of the initial network in 1000 separate randomized instances, and recomputing the modularity statistics of the initially inferred modules to see if they are still supported in the spin-glass analysis clustering method. Only modules that pass a 0.05 false discovery rate are considered validated. The definition of per-gene DNAm values was varied, to see how stable the results were to this degree of freedom. Then the algorithm was applied in these three modes (DNAm only, expression only, and DNAm+expression) to data for normal and cancerous endometrial samples from The Cancer Genome Atlas resource. The  e authors then tested the reproducibility of the FEM clustering algorithm based on a particularly well studied, and literature-supported, pathway involving the HAND2 gene in endometrial cancer. The support granted with these validation efforts for the FEM method was then used to substantiate more de novo results generated for a data set examining epithelial cell differentiation.

In moving forward, I would like to better understand the spin-glass algorithm that is used in the clustering, and study in greater detail the methods used by the authors of this work invalidating their algorithm, which points I hope to discuss in the journal club meeting.

A Hidden Markov Model Approach to Study Epigenome Dynamics in Breast Cancer

A recent 2012 paper from Bioinformatics, entitled “A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes” discusses a method for exploring Epigenome dynamics through a comparative analysis of different types of breast cancer cell lines. (http://bioinformatics.oxfordjournals.org/content/early/2012/10/26/bioinformatics.bts639) The work specifically relates to understanding the epigenomic pattern of dysregulation that leads to the resistance of certain breast cancer tumor types to a common chemotherapy agent, Tamoxifen.

Breast cancer is canonically understood as being in part induced by a change in the regulatory program of Estrogen Receptor α (ERα), an estrogen (E2)- inducible transcription factor (TF) . Some 70% of breast cancer cases are estimated to be associated with abnormality associated the regulatory role of this TF. A common way in which this manifests is an over production of this TF leads to oncogenesis. Cancers of this kind are called ERα-positive. Tamoxifen (Tam) works to undermine cancer cells by addressing the hormonal role in the aberrant over production of ERα, also referred to as anti-hormonal therapy. Some 25% of ERα-positive cancers initially respond to treatment with Tam but ultimately return, leaving a strong clinical motivation to understand the underlying causes of Tam resistance. An explicit goal of the work is to “understand the underlying mechanisms of epigenetic regulatory influence on tamoxifen resistance in breast cancer.”

Hidden Markov models (HMM) are a statistical method for inferring underlying chains of processes represented in complex data. This has notably been illustrated by Ernst et al. with their work developing the ChromHMM framework for analyzing combinatorial relationships of ChIP-seq data sets that interrogate multiple epigenomic markers. The authors of this work notably seek to apply such well-established methods for the specific case of studying the differences in epigenomic regulation associated with Tam-resistance.

The present work utilized a data set covering several epigenomic markers in the Tam-resistance in the MCF7-T cell line , and the Tam-sensitive MCF7 cell line. ChIP-seq data of ERα, RNA polymerase II (PolII), and three histone modification marks, as well as MBD-seq of DNA methylation in both cells lines were used to train a first-order HMM. These data were binned into 6,045,312 1000 bp regions in the hg18 reference sequence. The mark status of each bin across the eight data sets is assigned as  “either “0” for non-mark or “1” for mark if the number of reads in the bin is sufficient such that P < 10-4 under the Poisson distribution, as described in Ernst J. et al. (Ernst and Kellis, 2010).”  Using the combinatorial state of the collection of markers, 256 unique states were identified in this way, forming a basis for the HMM for training.

In the interest of understanding the steps involved in this analysis, I’ll describe the process used for generating and refining a HHM used in this study. HMM models were trained using 9-24 states, with 5 random initializations for each number of possible states, with 300 iterations required for each of the 5x(24-9) training analyses. The training procedure utilized the Baum-Welch algorithm (Baum et al., 1970), with “a minimum of 10^(-6) enforced for all transition, emission and start probabilities.” The resulting HMMs were classified by the BIC scores for each model to describe the data. A model of 20 states was found to agree best with the data. Because 6 of the combinatorial epigenomic states in this optimal HMM model were present in less than  350 bins (of over 6 million), these 6 states were removed from the model, and a 14-state HMM model was further refined with a training analysis of 100 iterations. A log-likeliood score was further used to confirm that the successive iterations of the HMM training improved the quality of the model to describe the data.

From the refined HMM model, a set of ERα-regulated collective epigenetic states (including those of promoter regions) were identified by the resulting Hmm, and categorized by their propensity to be associated with ERα production. With the results of the final 14-state HMM, the Viterbi decoding algorithm was used for determining the most likely state of each bin in MCF7 and MCF7-T. The percentage of bins having each combinatorial state within 2 kb of a TSS was computed to identify promoter states that could be mapped to genes. Similarly, the percentage of bins with each state within a gene (5’ end to 3’ end) was computed to detect transcription-associated states. With these methods, the information from the refined HMM states was used to identify promoter and transcribed regions of the genome, and map them to genes. Gene ontology (GO) analysis was then applied to the resulting gene lists associated with the states defined from the HMM analysis. A detailed listing of the functional role of each of these states is provided in Figure. 3 of this paper. With the ERα ChIP-seq data part of the definition of the combinatorial states, the states defined in this model can also be correlated to activity that is most likely associated with the arising of the ERα-positive state, and thus also those regions and states likely to be involved in Tam response.

The final steps of the study proceed in two directions. One vital and powerful inquiry to follow with such a data set is a direct comparison of the epigenomic states of like bins in the two cell lines. The paper explicitly identifies this as a key question for inquiry, but they also explain difficulties in applying a direct comparison method on a bin-by-bin basis. Explicitly, they say “Due to the number of different bins, we could not directly compare the state sequences between the cell lines. We believe that this is due in part to noise in the original ChIP data.” This is a sobering observation for the difficulty in producing reliable comparative analyses of such data. Nevertheless, I find this paper interesting because it provides a pioneering example of how to implement a comparative analysis of combinatorial epigenome states in multiple cell lines, and by proxy, species. I particularly find it illuminating to learn about the process of learning an HMM model on this kind of data.

The second thread of the study is the afore mentioned GO analysis to the genes ultimately associated with the states most associated with ERα activity. Specifically, six states are chosen for GO analysis that “have a probability greater than 0.1 of emitting ERα with E2 treatment in at least one cell line (we select these E2-associated states since Tam is an E2 antagonist). ” In this way, regions and genes that might resonably be expected to related to Tam-resistance are selected for this detailed study. The genes mapped to each particular state are unfortunately limited in number (O(2000)), and the resulting GO enrichments do not tend to be statistically meaningful. The overarching results are not conclusive in mapping epigenomic dynamics to Tam-resistance in the comparison of the MCF7 and MCF7-T cell lines. The work presented in this paper nonetheless illustrates the applicability of HMM-based analysis of genome-wide high throughput genomic data to study epigenetic patterns in general, including specific cases like E2/ERα regulation in breast cancer as explored here. It is interesting to note that this result can also be taken as an indication that something specific like Tam-resistance may not always be correlated to Epigenome dynamics. In such a light, it is valuable to allow for null results.

Critically, the authors of the paper suggest that the application of gene network analysis can reveal “key genes responsible for the regulation of the genes in the lists formulated by this study.” This represents an important path for future development and improvement for analyses of this kind, which our lab is uniquely capable to contribute to. An additional path for improvement would be to integrate gene expression data, and a wider array of important TF’s for the processes involved. Such an approach might facilitate studying combinatorial epigenome states in a context that is more directly takes network regulatory programs into account, and places the epigenome in it’s more natural context. In fact, the relative inconclusiveness of this study opens up the important question about the utility of studying the epigenome alone , but rather its contextual role in collective (emergent?) network properties, including TF regulation and the final patterns of gene expression that lead to specific phenotypes such as Tam-resistance in cancer.



Comparative Epigenome Studies Can Provide Insight into Both Evolution, and Cancer.

The epigenome is a dynamic layer of information encoded with in the genome and the chromatin super-structure of chromosomes in eukaryotic cells. The epigenome includes cytosine methylation on the DNA strand, and modifications and variants of histone proteins that shape the chromatin structure of DNA. This information is in addition to the protein coding information encoded in the DNA sequence, and notably relates to the context-specific regulation of gene expression in a cell. The epigenome is critical to the differentiation processes of embryonic stem (ES) cells and the determination of cell fate. The epigenome plays a fundamental role in the organization of cell-type-specific regulatory programs in complex organisms like mammals, i.e. humans.

Compared to the genome (DNA sequence) the epigenome is dynamic, undergoing large-scale changes on the time scale of cellular life. The classic epigenetic markers of DNA methylation and histone modification are widely studied, yet many mechanisms may remain unknown. Histone modifications are known to have both activating and repressive roles in enhancer and promoter regions of the genome, and control the availability of protein-coding genes for transcription (Ku et al. (2012) and Xiao et al. (2012)). In contrast DNA methylation has a characteristically repressive regulatory role, but has recently been associated with regulatory processes at secondary promoter regions within gene bodies (Maunakea et al., 2010). Genome-wide assays for such markers, such as ChIP-seq for histone modification markers like H2K27me, have become routine, allowing large compendia of data to be generated for various epi-marks and transcription factors. One example is the work of the ENCODE consortium, which has produced catalogues of such data for multiple species, including Homo sapiens (human) and Mus musculus (mouse). Such data are also cataloged by the NIH Roadmap Epigenomics Mapping Consortium (Bernstein, B. et al., 2010), and cover an array of epi-markers, species, and cell lines within those species (both normal and diseased).

The inter-species comparison of these data provides a novel view on the role of the epigenome in evolution. Recent work has produced a specifically rich data resource, including ChIP-seq data for a series of eight histone markers and four transcription factors in each of three mammalian species (Xiao et al., 2012). Additionally, DNA methylation (MeDIP-seq and MRE-seq) and gene expression (RNA-seq) are also assayed. The species studied here are Homo sapiens, Mus Musculus and Sus Scrofa (pig). This, and similar, data resources provide exciting new opportunities to study cellular processes and gene regulation in a context of an integrative analysis for multiple genomic features and functions.

Studies of epigenomic patterns in multiple species provide insight into the role of the epigenome in evolution. Studying gene regions that are orthologous between species highlights those epigenomic features that are important on the evolutionary time scale. Such analysis (Xiao et al., 2012) indicates that, in mammals, the epigenome is more highly conserved in regions of the genome where the rate of sequence change (as measured by the nucleotide substitution rate) is enhanced. The majority of interspecies variation in the epigenome is found to be in regions that are neutrally evolving. These results suggest that the epigenome acts as a buffer against the pressure of natural selection induced by phenotypic changes in organisms. This is a tantalizing picture of the dynamic role of the epigenome in evolution, and in complex organisms in general.

A chief extension to this initial, ground-breaking work that is of interest is to compare the clustering patterns and network module organization represented in these data, and study the functional role of the epigenome in that context. A recently developed algorithm, Arboretum, allows the simultaneous comparison of gene modules across multiple species, and has been successfully applied to the study of evolution in fungal Ascomycete (yeast) species. Multi-clustering methods of this kind are part of the expertise of the Roy research group (Roy et al., 2012 and 2011). In short, the necessary tools are now available to develop such an analysis for complex organisms like mammals.

Another vital topic of inquiry is the role of the epigenome in cancer. Carcinogenesis has been associated with genome-wide changes in the epigenome in adult stem cells (Aran et al., 2013), which mirror those that occur in ES cells during differentiation and proliferation (Kashyap et al., 2009). The increasing compendiums of data, as described above, provide a means for examining the epigenome in the context of cancer biology, too, as well as non-diseased states. It is possible to apply the same comparative analyses to cancer data, as described above for the comparison of species. In such a context we can learn about epigenetic processes on two very different evolutionary time scales; those of species differentiation, and organismal life. Importantly, such efforts can contribute to improved medical knowledge of cancer for the well fare of patients. Multi-clustering analyses of the kind described here can yield new insights into the dynamic and functional role of the epigenome, both in evolution and in complex organisms.


Aran, D. et al. (2013) DNA methylation of distal regulatory sites characterizes dysregulation of cancer genes. Genome Biology 14, R21

Bernstein, B. et al. (2010) The NIH epigenomics mapping consortium. Nature Biotechnology 28, 1045-1048.

Kashyap, V. et al. (2009) Regulation of stem cell pluripotency and differentiation involves a mutual regulatory circuit of the nanog, OCT4, and SOX2 pluripotency transcription factors with polycomb repressive complexes and stem cell microRNAs. Stem Cells and Development 7, 1093-1106.

Ku, M. et al. (2012) H2A . Z landscapes and dual modifications in pluripotent and multipotent stem cells underlie complex genome regulatory functions. Genome Biology 13, R85.

Maunakea, A. et al. (2010) Conserved role of intragenic DNA methylation in regulating alternative promotors. Nature 466, 253-260.

Roy, S. et al. (2011) A multiple network learning approach to capture system-wide condition-specific responses. Bioinformatics 27, 1832-1838.

Roy, S. et al. (The modENCODE Consortium ) (2010) Identification of functional elements and regulatory circuits in Drosophila modENCODE. Science 24, 1787-1797


Xiao, S. et al. (2012) Comparative epigenomic annotation of regulatory DNA. Cell 149, 1381-1392.