A Hidden Markov Model Approach to Study Epigenome Dynamics in Breast Cancer

A 2012 paper in Bioinformatics, entitled “A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes” (http://bioinformatics.oxfordjournals.org/content/early/2012/10/26/bioinformatics.bts639), discusses a method for exploring epigenome dynamics through a comparative analysis of different breast cancer cell lines. The work specifically aims to understand the epigenomic pattern of dysregulation that leads certain breast cancer tumor types to resist a common anti-hormonal therapy agent, tamoxifen.

Breast cancer is canonically understood as being induced in part by a change in the regulatory program of Estrogen Receptor α (ERα), an estrogen (E2)-inducible transcription factor (TF). Some 70% of breast cancer cases are estimated to involve an abnormality in the regulatory role of this TF. A common manifestation is overproduction of the TF, which leads to oncogenesis. Cancers of this kind are called ERα-positive. Tamoxifen (Tam) works to undermine cancer cells by addressing the hormonal role in the aberrant overproduction of ERα, an approach also referred to as anti-hormonal therapy. Some 25% of ERα-positive cancers initially respond to treatment with Tam but ultimately return, leaving a strong clinical motivation to understand the underlying causes of Tam resistance. An explicit goal of the work is to “understand the underlying mechanisms of epigenetic regulatory influence on tamoxifen resistance in breast cancer.”

Hidden Markov models (HMMs) are a statistical method for inferring the sequence of hidden states that underlies complex observed data. This has notably been illustrated by Ernst et al. with their ChromHMM framework for analyzing combinatorial relationships among ChIP-seq data sets that interrogate multiple epigenomic markers. The authors of the present paper seek to apply such well-established methods to the specific case of studying the differences in epigenomic regulation associated with Tam resistance.

The present work utilized a data set covering several epigenomic markers in the Tam-resistant MCF7-T cell line and the Tam-sensitive MCF7 cell line. ChIP-seq data for ERα, RNA polymerase II (PolII), and three histone modification marks, as well as MBD-seq data for DNA methylation, in both cell lines were used to train a first-order HMM. These data were binned into 6,045,312 regions of 1000 bp in the hg18 reference sequence. The mark status of each bin across the eight data sets is assigned as “either “0” for non-mark or “1” for mark if the number of reads in the bin is sufficient such that P < 10^(-4) under the Poisson distribution, as described in Ernst J. et al. (Ernst and Kellis, 2010).” With eight binary marks per bin, there are 2^8 = 256 possible combinatorial patterns, which form the basis for training the HMM.
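
To make the binarization step concrete, here is a minimal Python sketch. It assumes that the background rate `lam` is the expected number of reads per bin and that the P value is the upper tail P(X ≥ count). Both are my assumptions rather than details confirmed by the paper, and the function names `poisson_tail`, `binarize` and `encode_marks` are hypothetical.

```python
import math

def poisson_tail(count, lam):
    """P(X >= count) for X ~ Poisson(lam), computed as 1 - P(X < count)."""
    term = math.exp(-lam)      # P(X = 0)
    cdf = 0.0
    for i in range(count):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) derived from P(X = i)
    return max(0.0, 1.0 - cdf)

def binarize(count, lam, alpha=1e-4):
    """Return 1 (mark) if the read count is significant at level alpha, else 0."""
    return 1 if poisson_tail(count, lam) < alpha else 0

def encode_marks(marks):
    """Pack a list of 0/1 mark calls into a single integer pattern code."""
    code = 0
    for m in marks:
        code = (code << 1) | m
    return code
```

With a hypothetical background of 2 reads per bin, a bin would need at least 10 reads to be called as marked under this sketch, and eight mark calls map to a pattern code between 0 and 255.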

To make the steps involved in this analysis clear, I’ll describe the process used for generating and refining the HMM used in this study. HMMs were trained using 9-24 states, with 5 random initializations for each number of states and 300 iterations for each training run. The training procedure utilized the Baum-Welch algorithm (Baum et al., 1970), with “a minimum of 10^(-6) enforced for all transition, emission and start probabilities.” The resulting HMMs were ranked by their BIC (Bayesian information criterion) scores to assess how well each model described the data. A model of 20 states was found to agree best with the data. Because 6 of the combinatorial epigenomic states in this optimal HMM were present in fewer than 350 bins (of over 6 million), these 6 states were removed from the model, and the resulting 14-state HMM was further refined with 100 additional training iterations. A log-likelihood score was also used to confirm that successive iterations of the HMM training improved how well the model described the data.
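
The model selection step can be sketched as follows. This is only an illustration under stated assumptions: it uses the common convention BIC = -2 ln L + p ln n (lower is better), and it counts parameters as if each state emitted each mark independently, as in ChromHMM. The paper may count parameters differently, and the names `hmm_param_count`, `bic` and `best_model` are hypothetical.

```python
import math

def hmm_param_count(n_states, n_marks):
    """Free parameters of an HMM with independent binary emissions per mark (assumed)."""
    start = n_states - 1                # start probabilities sum to 1
    trans = n_states * (n_states - 1)   # each transition row sums to 1
    emit = n_states * n_marks           # one Bernoulli probability per state and mark
    return start + trans + emit

def bic(log_likelihood, n_states, n_marks, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    penalty = hmm_param_count(n_states, n_marks) * math.log(n_obs)
    return -2.0 * log_likelihood + penalty

def best_model(log_likelihoods, n_marks, n_obs):
    """Pick the state count with the lowest BIC.

    log_likelihoods maps a state count to the best log-likelihood found
    across the random initializations for that state count.
    """
    return min(log_likelihoods,
               key=lambda k: bic(log_likelihoods[k], k, n_marks, n_obs))
```

Under this parameter count, a 20-state model over 8 marks has 559 free parameters, so the penalty term grows quickly with the number of states and a larger model must fit substantially better to win.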

From the refined HMM, a set of ERα-regulated collective epigenetic states (including those of promoter regions) was identified and categorized by propensity to be associated with ERα production. With the results of the final 14-state HMM, the Viterbi decoding algorithm was used to determine the most likely state of each bin in MCF7 and MCF7-T. The percentage of bins having each combinatorial state within 2 kb of a TSS was computed to identify promoter states that could be mapped to genes. Similarly, the percentage of bins with each state within a gene (5’ end to 3’ end) was computed to detect transcription-associated states. With these methods, the information from the refined HMM states was used to identify promoter and transcribed regions of the genome and map them to genes. Gene ontology (GO) analysis was then applied to the resulting gene lists associated with the states defined by the HMM analysis. A detailed listing of the functional role of each of these states is provided in Figure 3 of the paper. Because the ERα ChIP-seq data are part of the definition of the combinatorial states, the states defined in this model can also be correlated with activity that is most likely associated with the emergence of the ERα-positive state, and thus with those regions and states likely to be involved in Tam response.
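
As a rough illustration of the decoding and promoter-mapping steps, the sketch below runs Viterbi decoding in log space and then computes, for each state, the percentage of its bins lying near a TSS. It assumes a single chromosome with bins laid out contiguously from position 0, observations encoded as integer symbols, and "within 2 kb" measured from the bin midpoint. These choices, and the names `viterbi` and `percent_near_tss`, are mine and not taken from the paper.

```python
import math

def viterbi(obs, start, trans, emit):
    """Most likely hidden state path for a sequence of observed symbols."""
    n = len(start)
    # score[s]: best log-probability of any path ending in state s
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + math.log(emit[s][o]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def percent_near_tss(path, tss_positions, bin_size=1000, window=2000):
    """For each state, percent of its bins whose midpoint is within window bp of a TSS."""
    total, near = {}, {}
    for i, state in enumerate(path):
        mid = i * bin_size + bin_size // 2
        total[state] = total.get(state, 0) + 1
        if any(abs(mid - t) <= window for t in tss_positions):
            near[state] = near.get(state, 0) + 1
    return {s: 100.0 * near.get(s, 0) / total[s] for s in total}
```

A state whose bins fall near TSSs far more often than the genome-wide average would be a candidate promoter state. Note that the enforced minimum of 10^(-6) on all probabilities means the logarithms above are always defined.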

The final steps of the study proceed in two directions. One vital and powerful line of inquiry with such a data set is a direct comparison of the epigenomic states of corresponding bins in the two cell lines. The paper explicitly identifies this as a key question, but the authors also explain the difficulties in applying a direct comparison on a bin-by-bin basis. Explicitly, they say “Due to the number of different bins, we could not directly compare the state sequences between the cell lines. We believe that this is due in part to noise in the original ChIP data.” This is a sobering observation about the difficulty of producing reliable comparative analyses of such data. Nevertheless, I find this paper interesting because it provides a pioneering example of how to implement a comparative analysis of combinatorial epigenome states in multiple cell lines and, by extension, species. I find it particularly illuminating to learn about the process of training an HMM on this kind of data.

The second thread of the study is the aforementioned GO analysis of the genes ultimately associated with the states most associated with ERα activity. Specifically, six states are chosen for GO analysis that “have a probability greater than 0.1 of emitting ERα with E2 treatment in at least one cell line (we select these E2-associated states since Tam is an E2 antagonist).” In this way, regions and genes that might reasonably be expected to relate to Tam resistance are selected for detailed study. The genes mapped to each particular state are unfortunately limited in number (on the order of 2000), and the resulting GO enrichments do not tend to be statistically meaningful. The overall results are not conclusive in mapping epigenomic dynamics to Tam resistance in the comparison of the MCF7 and MCF7-T cell lines. The work presented in this paper nonetheless illustrates the applicability of HMM-based analysis of genome-wide high-throughput data to the study of epigenetic patterns in general, including specific cases like E2/ERα regulation in breast cancer as explored here. It is interesting to note that this result can also be taken as an indication that something as specific as Tam resistance may not always be correlated with epigenome dynamics. In that light, it is valuable to allow for null results.
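
For readers unfamiliar with how a GO enrichment is scored, the sketch below shows the standard over-representation test based on the hypergeometric distribution. The paper does not say, as far as I can tell from this summary, which GO tool or test was used, so this is an assumption about the general technique rather than the authors' procedure, and the name `enrichment_p` is hypothetical.

```python
import math

def enrichment_p(n_genome, n_annotated, n_list, n_overlap):
    """One-sided hypergeometric P value for GO term over-representation.

    n_genome:    number of genes in the background set
    n_annotated: background genes annotated with the GO term
    n_list:      genes in the list mapped to an HMM state
    n_overlap:   genes in the list annotated with the GO term
    Returns P(overlap >= n_overlap) when the list is drawn at random.
    """
    upper = min(n_annotated, n_list)
    # Exact integer counts of gene lists with at least n_overlap annotated genes
    favorable = sum(
        math.comb(n_annotated, i) * math.comb(n_genome - n_annotated, n_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / math.comb(n_genome, n_list)
```

This P value is for a single GO term. Because many terms are tested at once, a multiple-testing correction is needed before calling any term enriched, which is one reason why modest gene lists often yield few significant terms.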

Critically, the authors of the paper suggest that the application of gene network analysis can reveal “key genes responsible for the regulation of the genes in the lists formulated by this study.” This represents an important path for future development and improvement of analyses of this kind, and one to which our lab is uniquely positioned to contribute. An additional path for improvement would be to integrate gene expression data and a wider array of TFs important for the processes involved. Such an approach might facilitate studying combinatorial epigenome states in a way that more directly takes network regulatory programs into account and places the epigenome in its more natural context. In fact, the relative inconclusiveness of this study raises the important question of whether it is useful to study the epigenome alone, rather than its contextual role in collective (emergent?) network properties, including TF regulation and the final patterns of gene expression that lead to specific phenotypes such as Tam resistance in cancer.