DA-seq: Detection of differentially abundant cell subpopulations in scRNA-seq data

By: Jeremy Parker Yang and Harmon Bhasin

General problem overview

An important area of analysis in scRNA-seq is comparing the transcriptomic profiles of cells from two biological states. The most common approaches cluster all cells from both samples and then measure differential abundance within clusters in downstream analysis. The issue with predefined clustering (an unsupervised learning approach) is that differentially abundant (DA) subpopulations may not align with a clear cluster structure based on cell type/sample. In addition, these clusters do not take the biological state of each cell into account. This can lead to cell subpopulations that are not representative of the differences between biological states.

This motivated the authors to take a supervised approach. An earlier method that applied multiple local two-sample tests to hyperspheres centered at randomly selected cells failed to accurately delineate DA subpopulations. Seeing this gap in the literature, the authors introduced differential abundance sequencing (DA-seq), a supervised approach that uses the biological state of each cell to identify and delineate the cell subpopulations most representative of the differences between two biological states. The key difference from the most commonly used approaches is the focus on individual cells instead of predefined clusters.

Brief description of method

In conventional DA analysis, cells from both biological states are clustered and the clusters are then tested for differential abundance. DA-seq takes a different approach: it does not use initial clusters, but instead measures the abundance of each state around every cell using that cell's neighbors. Clustering is then done only on the cells that show differential abundance. This process has four steps:

Step 1) Calculate multiscale cell score from kNN

  • Instead of a DA measure at the cluster level, we assign each cell its own DA measure. This cell score is based on the cell’s k nearest neighbors. The score is multiscale in that several values of k are used, both to improve the quality of the DA estimate (local density varies, and a k that is too small or too large can under- or over-smooth the estimate) and to avoid having to pick a single optimal k. Each element of the score vector is an estimate of the DA measure at a different scale.
  • Each cell x_i therefore gets a score vector with one entry per value of k (a minimal Python sketch of this computation is given after this list)
  • The score at scale k can be written as g(x_i; k) = [N_1(x_i; k)/k] / (n_1/n), i.e. the probability of the cell’s neighbors being in a state divided by the overall fraction of cells in that state
    • N_1(x_i; k)/k estimates the probability of a point being labeled as state 1
    • n_1/n estimates the probability of sampling state 1
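
The multiscale score is simple to sketch with a single kNN query. Below is a minimal Python illustration (not the authors' implementation); it assumes `X` is a cells × PCs matrix in a dimensionality-reduced space and `labels` is a 0/1 numpy vector marking the biological state of each cell. The particular k values are placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def multiscale_score(X, labels, k_values=(50, 100, 200, 500)):
    """Per-cell multiscale score vectors: for each k, the fraction of
    state-1 cells among the k nearest neighbors of each cell, normalized
    by the overall fraction of state-1 cells (g in the text above)."""
    n = X.shape[0]
    n1_frac = labels.mean()                        # n_1 / n
    nn = NearestNeighbors(n_neighbors=max(k_values)).fit(X)
    _, idx = nn.kneighbors(X)                      # (n, max_k) neighbor indices;
                                                   # the query cell counts as its own neighbor here
    scores = np.zeros((n, len(k_values)))
    for j, k in enumerate(k_values):
        frac_state1 = labels[idx[:, :k]].mean(axis=1)   # N_1(x_i; k) / k
        scores[:, j] = frac_state1 / n1_frac
    return scores
```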

Step 2) Calculate the DA measure utilizing a logistic regression classifier

  • The previous step gives a multiscale score vector for each cell; we want to map it to a single number representing the DA measure. It is reasonable to assume that cells in DA subpopulations, whose neighborhoods are enriched with cells from one biological state, tend to lie closer to each other in the l-dimensional score space than other cells. Hence, a logistic regression classifier is trained to predict the state of a cell given its multiscale score vector. L2 regularization is used so that the output is smooth and the DA measures are localized across cells.
  • The regularization hyperparameter lambda is chosen by cross-validation
  • The output of the logistic regression classifier estimates the posterior phat_i = Pr(y_i = 1 | x_i)
  • Finally, the DA measure contrasts the two posteriors, d_i = Pr(y_i = 1 | x_i) − Pr(y_i = 0 | x_i) = 2·phat_i − 1, which lies in [−1, 1] (a sketch of this step follows the list)
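
Continuing the sketch above, the mapping from score vectors to a DA measure can be approximated with scikit-learn's cross-validated, L2-penalized logistic regression. Note that the 2·phat − 1 form of the DA measure is my reading of the method, not a formula quoted from the paper.

```python
from sklearn.linear_model import LogisticRegressionCV

def da_measure(scores, labels):
    """Train an L2-regularized logistic regression on the multiscale score
    vectors and convert the posterior into a signed DA measure in [-1, 1]."""
    clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000)
    clf.fit(scores, labels)                     # lambda (1/C) chosen by cross-validation
    p_hat = clf.predict_proba(scores)[:, 1]     # Pr(y_i = 1 | x_i)
    return 2.0 * p_hat - 1.0                    # sign indicates which state is enriched
```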

Step 3) Cluster DA cells only using SNN + Louvain

  • Now that each cell has a DA score, we want to find clusters of DA cells
  • Cells are considered DA if their DA measure is above the threshold τ_h or below the threshold τ_l. A permutation test is used to determine these threshold values.
  • In this test, the null distribution is created by randomly permuting the state labels and recomputing the DA measure (steps 1 and 2) for each cell.
  • The Seurat pipeline is utilized for clustering. First, a shared nearest neighbor (SNN) graph is generated using Euclidean distance in the original dimensionality-reduced gene space. This graph is built from all cells, including cells that are not DA. Second, non-DA cells are removed from the SNN graph. Third, a modularity-based clustering algorithm (Louvain) clusters the DA cells (a rough sketch of this step is given after this list).
  • These clusters represent DA subpopulations
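
Here is a rough Python sketch of this step, reusing `multiscale_score` and `da_measure` from the snippets above and assuming an AnnData object `adata` with PCA already computed and the 0/1 state vector in `labels`. The percentile-based null cutoffs and the scanpy neighbor graph are simplifications of my own (the paper builds the SNN graph on all cells and then drops non-DA cells; here the graph is built on DA cells directly), not the authors' exact procedure.

```python
import numpy as np
import scanpy as sc

def da_thresholds(X, labels, n_perm=50, q=(0.5, 99.5)):
    """Null thresholds tau_l, tau_h: permute the state labels, recompute the
    DA measure (steps 1-2), and take extreme percentiles of the pooled null
    values (one simple choice of cutoff)."""
    rng = np.random.default_rng(0)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        null.append(da_measure(multiscale_score(X, perm), perm))
    null = np.concatenate(null)
    return np.percentile(null, q[0]), np.percentile(null, q[1])

X = adata.obsm["X_pca"]                                 # dimensionality-reduced space
da = da_measure(multiscale_score(X, labels), labels)
tau_l, tau_h = da_thresholds(X, labels)
sub = adata[(da > tau_h) | (da < tau_l)].copy()         # keep DA cells only
sc.pp.neighbors(sub, n_neighbors=20, use_rep="X_pca")   # kNN/SNN-style graph
sc.tl.louvain(sub)                                      # modularity (Louvain) clustering
```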

Step 4) Select features for DE using STG optimization

  • We want to identify characteristic genes for each DA subpopulation. This can be done with standard differential expression analysis techniques, but the authors propose their own method, framing it as a feature selection problem: find the smallest set of genes that distinguishes a DA subpopulation from the remaining cells. They select features using empirical risk minimization (ERM) with an l_0 constraint.
  • STG introduces a nonconvex (but continuous) relaxation of this optimization problem in which each stochastic gate is a relaxed Bernoulli variable. The ERM problem is thereby reformulated as a risk minimization over the gated input, with a penalty on the expected number of active gates; this can be solved via gradient descent, and lambda is a regularization parameter that controls how many genes are selected.
  • Formulate feature selection as empirical risk minimization (ERM) with an explicit l_0 constraint (the objective is written out in the sketch after this list)
    • where r is the number of selected features, L is the loss function, and θ are the parameters of a linear model or a more complex neural network
  • Introduce the STG (Stochastic Gate) approach to neural networks, which provides a nonconvex relaxation of that constrained optimization.
    • Each STG is a relaxed Bernoulli variable z_d, where P(z_d = 1) = π_d, d = 1, . . . , D, and D is the number of genes after an initial screening to remove genes with low expression.
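
For concreteness, here is the formulation as I understand it from the STG approach (Yamada et al.), written in my own notation rather than copied from the DA-seq paper:

```latex
% ERM with an explicit l_0 constraint on the binary gene-selection vector s:
\min_{\theta,\; s\in\{0,1\}^D}\ \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i \odot s;\theta),\, y_i\big)
\quad\text{s.t.}\quad \|s\|_0 \le r

% STG relaxation: replace each hard gate by a clipped-Gaussian "relaxed Bernoulli"
% z_d = max(0, min(1, mu_d + eps_d)) with eps_d ~ N(0, sigma^2), and penalize the
% expected number of open gates (Phi is the standard normal CDF):
\min_{\theta,\;\mu}\ \mathbb{E}_{z}\Big[\tfrac{1}{n}\sum_{i=1}^{n} L\big(f(x_i \odot z;\theta),\, y_i\big)\Big]
\;+\; \lambda \sum_{d=1}^{D}\Pr(z_d > 0),
\qquad \Pr(z_d > 0) = \Phi\!\big(\mu_d/\sigma\big)
```

Both terms are differentiable in θ and μ, so the gated risk plus sparsity penalty can be minimized by gradient descent, with λ tuned to trade off fit against the number of selected genes.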

How was the method evaluated?

This method was evaluated using four real datasets including (a) severe versus mild COVID-19 patients, (b) melanoma patients with immune checkpoint therapy responders versus non-responders, (c) embryonic development at two time-points, and (d) young versus aged brain tissue, along with two simulated datasets.

DA subpopulation analysis successfully identified COVID-19 severity markers in different cell types, such as nonresident macrophages (nrMa) and cytotoxic T cells (CTL), from the comparison of scRNA-seq nasopharyngeal/pharyngeal swab samples between mild and severe COVID-19 patients.

Similarly, when comparing two groups of melanoma patients, responders and non-responders to immune checkpoint therapy, a total of about 70% of cells in the scRNA-seq dataset were detected as having a significant DA measure.

DA-seq also successfully identified dermal cell differentiation patterns during mouse embryonic development.

In the comparison of scRNA-seq datasets of young versus aged brain tissue collected from mice, no DA subpopulation was identified, as expected.

The simulated datasets consisted of scRNA-seq data from a previous paper and a perturbed Gaussian mixture model (GMM). In both simulated datasets, DA-seq successfully extracted the artificial DA subpopulations and outperformed related methods such as Cydar.

What novel insights were obtained using this method?

This approach allows us to identify regions with differential abundance between two states. Compared with methods that cluster the data first, DA-seq generally identifies subpopulations with higher DA levels. DA-seq not only recovered results similar to those of standard approaches, but also revealed striking, previously unreported DA subpopulations, which informs on cellular functions and on known and unknown genes, and greatly increases the resolution of cell-type identity across clinical disease states. It also allows cells generated in a single scRNA-seq experiment to be compared conditioned on the expression status of a single marker. No initial clustering is needed to detect DA, and the STG-based optimization improves feature selection.

What are the strengths and weaknesses of this method?

Strengths

  1. detecting salient DA subpopulations not confined to any predefined cluster,
  2. no prior knowledge required for differentiation of specific cells, and
  3. the use of an STG provides a prediction score as a linear combination of the selected genes that best separates each DA subpopulation from the rest of the cells

Weaknesses

  1. reliance on large sample sizes and model validation which could lead to overfitting in regions with low data density, 
  2. lack of causal inference technique for gene selection, lack of batch effect removal, and 
  3. the use of kNN on all cells which may be a computational bottleneck for very large datasets.

Regulatory networks in blood cells

This recent review on mapping transcriptional regulatory networks in hematopoietic stem cells (HSCs) and blood cells is very nice (1). Transcriptional regulatory networks are central to specifying context-specific gene expression patterns. Such networks specify connections between transcription factor proteins and target genes, although it is common to also include upstream regulators such as signaling proteins, non-specific DNA-binding proteins, and chromatin remodelers. The review highlights the need to predict the connections between regulators and target genes. In particular, about 50 regulators have already been identified for HSCs, and for several of these ChIP-seq/ChIP-chip data have been measured. But our understanding of networks in this cellular state is still far from complete. While expression-based network inference methods, including module networks, have been applied to this lineage, there is still a lot to do to map these networks. The ability to measure single-cell expression profiles opens up new opportunities, because one can potentially resolve causal and temporal dynamics in these networks. Furthermore, the ability to make more targeted perturbations using CRISPR offers even more opportunities to build accurate pictures of such regulatory networks.

References

1. Göttgens, Berthold. “Regulatory network control of blood stem cells.” Blood 125 (April 2015): 2614-2620.

Predicting spatio-temporal patterns of expression

An open question in gene regulation is how spatio-temporal patterns of gene expression are encoded in the genome. We discussed this paper in our lab meeting. The paper describes a two-step approach to predicting spatial and temporal gene expression patterns in Drosophila: (a) first, find cis-regulatory modules (CRMs) that exhibit spatial or temporal activity; (b) second, link the CRMs to genes to predict spatio-temporal expression. Space and time are jointly captured by the term “context”. For (a), the authors use a Bayesian network trained on known CRM-context relationships. For (b), they use additional data based on insulators and H3K4me1 to predict which CRMs are associated with which genes. In (a), CRM-context information is known for only a few hundred of the roughly 8,000 CRMs present, so the authors use an EM-style idea in which the trained Bayesian network predicts the context-specific activity of each CRM. Then, using these soft labels for each CRM, they predict the expression of the gene. This second model is also a Bayesian network, but with additional variables for the distance between CRMs and genes and for whether there is an insulator binding site.

A systems level integrative framework for genome-wide DNA methylation and gene expression data

Integrative analysis of DNA methylation (DNAm) and gene expression data has great potential to illuminate many processes related to the onset of cancer and other diseases. In particular, protein-protein interaction (PPI) networks have already been used to identify subnetworks associated with significant differential DNAm states, highlighting key nodes (genes) responsible for driving phenotypic differences in disease progression. We are interested in the work of Jiao et al. (http://bioinformatics.oxfordjournals.org/content/30/16/2360.full) for the method they have developed to address this question.

The effort of Jiao et al. resulted in the Functional Epigenetic Module (FEM) algorithm for integrative analysis of DNAm and gene expression data. The algorithm uses Illumina Infinium 450k DNAm data with matched gene expression data, and performs a supervised analysis with a chosen PPI network as a scaffold on which to identify gene modules (really PPI subnetworks) that are epigenetically differentially regulated in a specific phenotype, such as a cancerous state. The analysis uses the set of genes represented in the PPI network as well as in both the DNAm and gene expression data sets, which is a limitation for truly genome-wide analyses. For the DNAm data, the usual method is to assign each gene the average value of probes mapping to within 200 bp of the transcription start site (TSS) of that gene. If no such probe mappings are found in the DNAm data, the probes mapping to the first exon of the gene are used, and failing that, the values for probes mapping to within 1,500 bp of the gene’s TSS are used. The point here is that there are open questions about the best way to define per-gene DNAm values, and this is the method chosen in this work. Statistics for the association of each gene’s DNAm profile and expression profile with the phenotype are then derived. These are essentially regularized t-statistics computed within an empirical Bayesian framework, and they are normalized separately for the DNAm and expression data so as not to bias the algorithm toward one data type over the other. For each gene, an average t-statistic is ultimately produced. The central idea is to encode the connection between genes in the PPI network as weighted edges, where the edge weight is the average of the statistics of the two connected gene nodes; edges with higher weights represent differentially expressed or methylated connections (a small sketch of this edge-weighting idea follows this paragraph). Subnetworks identifying “hotspots” of differential methylation and expression are identified as “heavy” subnetworks, meaning subnetworks connected by edges with higher-than-average weights. The authors call such a subnetwork a FEM.
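
To make the edge-weighting idea concrete, here is a small, hypothetical Python sketch (not the FEM code itself). It assumes `gene_stat` is a per-gene statistic that already averages the regularized DNAm and expression t-statistics; taking absolute values is one simple modelling choice, not something prescribed by the paper.

```python
import networkx as nx

def weight_ppi_edges(ppi: nx.Graph, gene_stat: dict) -> nx.Graph:
    """Give each PPI edge the average of its endpoints' per-gene statistics
    and flag 'heavy' edges (weight above the network mean)."""
    g = ppi.subgraph(gene_stat.keys()).copy()       # restrict to genes with data
    for u, v in g.edges():
        g[u][v]["weight"] = 0.5 * (abs(gene_stat[u]) + abs(gene_stat[v]))
    mean_w = sum(d["weight"] for _, _, d in g.edges(data=True)) / g.number_of_edges()
    for _, _, d in g.edges(data=True):
        d["heavy"] = d["weight"] > mean_w           # candidate "hotspot" edges
    return g
```

Heavy subgraphs seeded at top-ranked genes would then be grown from this weighted graph, e.g. with the spin-glass module detection described below.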

In this method, the default is to select the top 100 most differentially expressed and methylated genes, as measured by the respective t-statistics. These genes seed the search for subnetworks, and each seed gene can lead to the definition of a FEM. A clustering method called the spin-glass algorithm (http://journals.aps.org/pre/abstract/10.1103/PhysRevE.74.016110, http://www.nature.com/srep/2013/130409/srep01630/full/srep01630.html) is used to determine the sets of genes to be clustered into subnetworks, and to identify those cases where a seed gene is really involved only in an isolated edge of high association but is not otherwise part of a meaningful subnetwork. The choice of 100 seed genes was found to be practical because it allowed the algorithm’s search to cover most of the nodes of the PPI network and produced clusters of 10 to 100 genes, which the authors regard as optimal for finding outside validation in the biology. The coverage of the chosen PPI network was the main criterion for the choice of how many seed genes to work with.

The validation procedures used here are as follows. First, the authors looked at the statistical significance of the variation in the DNAm and expression data of edges in the inferred modules. This was done by randomly permuting the node (gene) t-statistics of the initial network in 1,000 separate randomized instances and recomputing the modularity statistics of the initially inferred modules to see whether they were still supported by the spin-glass clustering. Only modules that pass a 0.05 false discovery rate are considered validated. The definition of per-gene DNAm values was also varied, to see how stable the results were to this degree of freedom. The algorithm was then applied in three modes (DNAm only, expression only, and DNAm+expression) to data for normal and cancerous endometrial samples from The Cancer Genome Atlas. The authors then tested the reproducibility of the FEM clustering based on a particularly well studied, literature-supported pathway involving the HAND2 gene in endometrial cancer. The support granted by these validation efforts was then used to substantiate more de novo results generated for a data set examining epithelial cell differentiation.

In moving forward, I would like to better understand the spin-glass algorithm used in the clustering, and to study in greater detail the methods the authors used in validating their algorithm, points that I hope to discuss in the journal club meeting.

A Hidden Markov Model Approach to Study Epigenome Dynamics in Breast Cancer

A 2012 paper from Bioinformatics, entitled “A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes”, discusses a method for exploring epigenome dynamics through a comparative analysis of different types of breast cancer cell lines (http://bioinformatics.oxfordjournals.org/content/early/2012/10/26/bioinformatics.bts639). The work specifically relates to understanding the epigenomic pattern of dysregulation that leads to the resistance of certain breast cancer tumor types to a common chemotherapy agent, Tamoxifen.

Breast cancer is canonically understood to be, in part, induced by a change in the regulatory program of Estrogen Receptor α (ERα), an estrogen (E2)-inducible transcription factor (TF). Some 70% of breast cancer cases are estimated to be associated with abnormalities in the regulatory role of this TF. A common way in which this manifests is that overproduction of this TF leads to oncogenesis; cancers of this kind are called ERα-positive. Tamoxifen (Tam) works to undermine cancer cells by addressing the hormonal role in the aberrant overproduction of ERα, an approach also referred to as anti-hormonal therapy. Some 25% of ERα-positive cancers initially respond to treatment with Tam but ultimately return, leaving a strong clinical motivation to understand the underlying causes of Tam resistance. An explicit goal of the work is to “understand the underlying mechanisms of epigenetic regulatory influence on tamoxifen resistance in breast cancer.”

Hidden Markov models (HMM) are a statistical method for inferring underlying chains of processes represented in complex data. This has notably been illustrated by Ernst et al. with their work developing the ChromHMM framework for analyzing combinatorial relationships of ChIP-seq data sets that interrogate multiple epigenomic markers. The authors of this work notably seek to apply such well-established methods for the specific case of studying the differences in epigenomic regulation associated with Tam-resistance.

The present work utilized a data set covering several epigenomic markers in the Tam-resistant MCF7-T cell line and the Tam-sensitive MCF7 cell line. ChIP-seq data for ERα, RNA polymerase II (PolII), and three histone modification marks, as well as MBD-seq data for DNA methylation, in both cell lines were used to train a first-order HMM. These data were binned into 6,045,312 1,000-bp regions in the hg18 reference sequence. The mark status of each bin across the eight data sets is assigned as “either “0” for non-mark or “1” for mark if the number of reads in the bin is sufficient such that P < 10^-4 under the Poisson distribution, as described in Ernst J. et al. (Ernst and Kellis, 2010).” Using the combinatorial state of the collection of markers, 256 unique states were identified in this way, forming the basis of the HMM training.
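
As a rough illustration of the binarization step (my own sketch with assumed inputs, not the authors' code): each bin's read count is compared against a Poisson background whose rate is the genome-wide mean count per bin, and the bin is called "1" when the tail probability falls below 10^-4; the eight binary marks are then packed into a single combinatorial-state integer.

```python
import numpy as np
from scipy.stats import poisson

def binarize_counts(counts, p_cutoff=1e-4):
    """counts: array of read counts per 1,000-bp bin for one data set.
    Mark a bin as 1 if its count is unlikely under a Poisson background."""
    lam = counts.mean()
    pvals = poisson.sf(counts - 1, lam)   # sf(c - 1) = P(X >= c) under Poisson(lam)
    return (pvals < p_cutoff).astype(int)

def combinatorial_state(marks):
    """marks: (n_bins, 8) binary matrix from the eight data sets; treat the
    eight 0/1 calls as bits of one integer in [0, 255]."""
    return marks @ (2 ** np.arange(marks.shape[1]))
```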

In the interest of understanding the steps involved in this analysis, I’ll describe the process used for generating and refining the HMM used in this study. HMMs were trained with 9-24 states, with 5 random initializations for each number of states and 300 iterations for each training run. The training procedure utilized the Baum-Welch algorithm (Baum et al., 1970), with “a minimum of 10^(-6) enforced for all transition, emission and start probabilities.” The resulting HMMs were compared by their BIC scores for describing the data, and a model of 20 states was found to agree best with the data. Because 6 of the combinatorial epigenomic states in this optimal model were present in fewer than 350 bins (of over 6 million), these 6 states were removed, and a 14-state HMM was further refined with a training analysis of 100 iterations. A log-likelihood score was used to confirm that successive iterations of HMM training improved the model’s fit to the data.
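
A schematic of the model-selection loop, using hmmlearn as a stand-in for whatever implementation the authors used (the paper does not specify one). It assumes a recent hmmlearn with `CategoricalHMM` and takes `states`, the integer combinatorial-state sequence from the sketch above; the BIC is computed by hand from the fitted log-likelihood.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM

def select_hmm(states, n_symbols=256, state_range=range(9, 25), n_restarts=5):
    """Fit HMMs with 9-24 hidden states (several random restarts each) and
    keep the model with the lowest BIC."""
    X = states.reshape(-1, 1)
    best, best_bic = None, np.inf
    for n in state_range:
        for seed in range(n_restarts):
            hmm = CategoricalHMM(n_components=n, n_iter=300, random_state=seed)
            hmm.fit(X)
            logL = hmm.score(X)
            # free parameters: start probs + transition rows + emission rows
            n_params = (n - 1) + n * (n - 1) + n * (n_symbols - 1)
            bic = n_params * np.log(len(X)) - 2.0 * logL
            if bic < best_bic:
                best, best_bic = hmm, bic
    return best
```

On the full 6-million-bin data set this would be very slow; it is meant only to show the shape of the Baum-Welch / BIC selection procedure described above.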

From the refined HMM, a set of ERα-regulated collective epigenetic states (including those of promoter regions) was identified and categorized by its propensity to be associated with ERα production. With the final 14-state HMM, the Viterbi decoding algorithm was used to determine the most likely state of each bin in MCF7 and MCF7-T. The percentage of bins having each combinatorial state within 2 kb of a TSS was computed to identify promoter states that could be mapped to genes. Similarly, the percentage of bins with each state within a gene (5’ end to 3’ end) was computed to detect transcription-associated states. With these methods, the information from the refined HMM states was used to identify promoter and transcribed regions of the genome and map them to genes. Gene ontology (GO) analysis was then applied to the resulting gene lists associated with the states defined from the HMM analysis. A detailed listing of the functional role of each of these states is provided in Figure 3 of the paper. Because the ERα ChIP-seq data are part of the definition of the combinatorial states, the states defined in this model can also be correlated with activity most likely associated with the ERα-positive state, and thus with the regions and states likely to be involved in the Tam response.

The final steps of the study proceed in two directions. One vital and powerful inquiry to follow with such a data set is a direct comparison of the epigenomic states of like bins in the two cell lines. The paper explicitly identifies this as a key question for inquiry, but they also explain difficulties in applying a direct comparison method on a bin-by-bin basis. Explicitly, they say “Due to the number of different bins, we could not directly compare the state sequences between the cell lines. We believe that this is due in part to noise in the original ChIP data.” This is a sobering observation for the difficulty in producing reliable comparative analyses of such data. Nevertheless, I find this paper interesting because it provides a pioneering example of how to implement a comparative analysis of combinatorial epigenome states in multiple cell lines, and by proxy, species. I particularly find it illuminating to learn about the process of learning an HMM model on this kind of data.

The second thread of the study is the aforementioned GO analysis of the genes ultimately associated with the states most associated with ERα activity. Specifically, six states are chosen for GO analysis that “have a probability greater than 0.1 of emitting ERα with E2 treatment in at least one cell line (we select these E2-associated states since Tam is an E2 antagonist).” In this way, regions and genes that might reasonably be expected to relate to Tam-resistance are selected for detailed study. The genes mapped to each particular state are unfortunately limited in number (O(2000)), and the resulting GO enrichments do not tend to be statistically meaningful. The overarching results are not conclusive in mapping epigenomic dynamics to Tam-resistance in the comparison of the MCF7 and MCF7-T cell lines. The work presented in this paper nonetheless illustrates the applicability of HMM-based analysis of genome-wide, high-throughput genomic data for studying epigenetic patterns in general, including specific cases like E2/ERα regulation in breast cancer as explored here. It is interesting to note that this result can also be taken as an indication that something as specific as Tam-resistance may not always be correlated with epigenome dynamics. In such a light, it is valuable to allow for null results.

Critically, the authors suggest that the application of gene network analysis could reveal “key genes responsible for the regulation of the genes in the lists formulated by this study.” This represents an important path for future development and improvement of analyses of this kind, one that our lab is uniquely capable of contributing to. An additional path for improvement would be to integrate gene expression data and a wider array of the TFs important for the processes involved. Such an approach might facilitate studying combinatorial epigenome states in a context that more directly takes network regulatory programs into account and places the epigenome in its more natural context. In fact, the relative inconclusiveness of this study raises the important question of whether the epigenome should be studied alone, or rather in terms of its contextual role in collective (emergent?) network properties, including TF regulation and the final patterns of gene expression that lead to specific phenotypes such as Tam-resistance in cancer.


Random walk method to find causal pathways

This paper describes an algorithm to find paths of causal interactions between two genes. Let’s say we found a locus that is associated with changes in the expression level of a gene; the region might contain multiple genes, and even if we know the right gene, we don’t know how it affects the target gene. The authors use a random walk method to find a path from a possible source to the target gene in a biological network.

They create a biological network using protein-protein interactions (represented as undirected edges) and protein phosphorylation and TF-DNA binding (represented as directed edges). For a target gene g_t, we have the list of TFs T = (t_1, t_2, …, t_n) and the candidate causal genes C = (g_C1, …, g_Cm). For each t_k in T, they initiate a search. In this search, when they are at node g, they want to select a gene g_i from the neighborhood of g that is more likely to have a causal relation with the target gene g_t, so they use the correlation between g_i and g_t as the probability of selecting g_i (they set an upper limit on the correlation value so that other genes that might have a causal relation with the target gene, but have lower correlation due to post-translational regulation, still have a chance of being selected). Note that they don’t use the correlation between g_i and g but between g_i and the target gene. Using this method, they start at t_k and randomly visit one of its neighbors, and continue the search until they reach a gene in the causal set C, hit a dead end (a gene all of whose neighbors have already been visited), or reach the maximum number of transitions.

If the search reaches a gene g_c in C, they have a path from g_c to g_t. They repeat this procedure N times for each t_k and approximate the probability of reaching g_c from t_k, p(g_c, t_k), by the number of times g_c was visited in searches initiated from t_k. They then calculate the probability of each causal gene g_c as a weighted sum of p(g_c, t_k), weighted by the probability of t_k causing g_t. From the set of all possible causal genes C, they can select the gene g_c with the highest probability.
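
A toy Python version of one walk, with my own simplifications: a plain adjacency dict stands in for the mixed directed/undirected network, absolute correlation with the target (capped at an upper limit, as described above) is used as the selection weight, and the walk never revisits a node.

```python
import random

def causal_walk(adj, corr_to_target, start_tf, causal_set, cap=0.9, max_steps=100):
    """Walk from a TF toward the target gene.  At each node, pick an unvisited
    neighbor with probability proportional to its (capped) correlation with the
    target; stop on reaching a candidate causal gene, a dead end, or max_steps."""
    visited, path, node = {start_tf}, [start_tf], start_tf
    for _ in range(max_steps):
        nbrs = [g for g in adj[node] if g not in visited]
        weights = [min(abs(corr_to_target[g]), cap) for g in nbrs]
        if not nbrs or sum(weights) == 0:
            return None                               # dead end
        node = random.choices(nbrs, weights=weights, k=1)[0]
        visited.add(node)
        path.append(node)
        if node in causal_set:
            return path                               # a path from t_k to a candidate g_c
    return None

# Repeating this N times per TF and counting how often each candidate causal gene
# is reached approximates p(g_c, t_k), which is then combined over TFs as a weighted sum.
```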

To test their method, they used knock-out experiments to create simulated QTLs. For each knockout experiment, they select genes with a significant change in expression level as target genes, and create a QTL region by putting the knocked-out gene and 9 other genes together. They then run their procedure to select the causal gene. If the algorithm selects the knocked-out gene, the prediction is correct; if it selects another gene, it is incorrect. They reached 46% accuracy; since randomly selecting a gene as the causal gene is expected to succeed 10% of the time, their method is more than 4 times better than random. (They also ran their method on the yeast QTL data from Brem and Kruglyak and reported their findings.)

3D Chromatin Organization During Differentiation

Phillips-Cremins et al. studied chromatin interaction organization in their new publication, Architectural Protein Subclasses Shape 3D Organization of Genomes during Lineage Commitment. They looked at the topology of interactions in differentiation-related cell types as well as the factors enriched at these sites.  It is important to look at the 3D organization of chromatin to understand regulatory networks, because TF binding at a distal enhancer can affect the regulation of expression at an interacting promoter.

In this paper, they used 5C to make chromatin interaction maps in ES cells and ES-derived neural progenitor cells (NPCs) to determine cell-type specificity.  The 5C primers were located in six 1-2 Mb genomic regions around developmentally regulated genes (Oct4, Nanog, Nestin, Sox2, Klf4, and Olig1-Olig2).  They found that topological features in the 5C data were similar to those found in Hi-C data (Dixon et al. 2012), and the replicates within a cell type were highly correlated.

Chromatin interactions are thought to occur mostly between regions within a topologically associating domain (TAD), a genomic region on the scale of a megabase.  TADs are thought to be conserved across organisms and cell types.  Looking at the 5C interactions, the complex set of TADs is visible, but they also find that within these TADs there is a hierarchy of finer-grained sub-topologies (sub-TADs) in which chromatin interactions are more frequent.  Within these sub-TADs, interaction dynamics varied, suggesting that the small-scale interaction frequencies are specific to the function of the genomic region.

Using a stringent threshold to identify interactions, they found 83 ES-cell-specific interactions, 260 interactions that exist in both cell types, and 165 NPC-specific interactions, demonstrating that these interactions are cell-type specific.  A few epigenomic marks were frequently found at interacting regions, including H3K4me1, H3K27ac, and low levels of H3K4me3.  Greater than 80% of significant interactions were marked by some combination of CTCF, Med12 (Mediator), or Smc1 (Cohesin) in ES cells.  However, only 40% of interactions were occupied by some combination of Oct4, Nanog, and/or Sox2, suggesting that these are not as important for chromatin interaction architecture.  They tested the importance of Smc1 and Med12 by knocking these genes down: knockdown of Med12 and Smc1 altered the chromatin interaction landscape, and an interaction between Olig1 and an enhancer region that was present in wild type was lost.

This work is very interesting for the ongoing study of chromatin architecture.  The 5C map highlights the topology of chromatin interactions in regions around key differentiation-related genes, and these interactions change during differentiation.  The identification of Smc1, Med12, and CTCF as important factors for interactions suggests that there may be a subset of proteins largely responsible for chromatin organization.

Comparative Epigenome Studies Can Provide Insight into Both Evolution and Cancer

The epigenome is a dynamic layer of information encoded within the genome and the chromatin superstructure of chromosomes in eukaryotic cells. The epigenome includes cytosine methylation on the DNA strand, and modifications and variants of the histone proteins that shape the chromatin structure of DNA. This information is in addition to the protein-coding information encoded in the DNA sequence, and notably relates to the context-specific regulation of gene expression in a cell. The epigenome is critical to the differentiation of embryonic stem (ES) cells and the determination of cell fate, and plays a fundamental role in the organization of cell-type-specific regulatory programs in complex organisms such as mammals, including humans.

Compared to the genome (DNA sequence), the epigenome is dynamic, undergoing large-scale changes on the time scale of cellular life. The classic epigenetic markers of DNA methylation and histone modification are widely studied, yet many mechanisms may remain unknown. Histone modifications are known to have both activating and repressive roles in enhancer and promoter regions of the genome, and control the availability of protein-coding genes for transcription (Ku et al. (2012) and Xiao et al. (2012)). In contrast, DNA methylation has a characteristically repressive regulatory role, but has recently been associated with regulatory processes at secondary promoter regions within gene bodies (Maunakea et al., 2010). Genome-wide assays for such markers, for example ChIP-seq for histone modifications like H3K27me3, have become routine, allowing large compendia of data to be generated for various epi-marks and transcription factors. One example is the work of the ENCODE consortium, which has produced catalogues of such data for multiple species, including Homo sapiens (human) and Mus musculus (mouse). Such data are also catalogued by the NIH Roadmap Epigenomics Mapping Consortium (Bernstein, B. et al., 2010), and cover an array of epi-marks, species, and cell lines within those species (both normal and diseased).

The inter-species comparison of these data provides a novel view of the role of the epigenome in evolution. Recent work has produced a particularly rich data resource, including ChIP-seq data for eight histone markers and four transcription factors in each of three mammalian species (Xiao et al., 2012). Additionally, DNA methylation (MeDIP-seq and MRE-seq) and gene expression (RNA-seq) were assayed. The species studied are Homo sapiens, Mus musculus, and Sus scrofa (pig). This and similar data resources provide exciting new opportunities to study cellular processes and gene regulation in the context of an integrative analysis of multiple genomic features and functions.

Studies of epigenomic patterns in multiple species provide insight into the role of the epigenome in evolution. Studying gene regions that are orthologous between species highlights those epigenomic features that are important on the evolutionary time scale. Such analysis (Xiao et al., 2012) indicates that, in mammals, the epigenome is more highly conserved in regions of the genome where the rate of sequence change (as measured by the nucleotide substitution rate) is enhanced. The majority of interspecies variation in the epigenome is found to be in regions that are neutrally evolving. These results suggest that the epigenome acts as a buffer against the pressure of natural selection induced by phenotypic changes in organisms. This is a tantalizing picture of the dynamic role of the epigenome in evolution, and in complex organisms in general.

A chief extension to this initial, ground-breaking work that is of interest is to compare the clustering patterns and network module organization represented in these data, and study the functional role of the epigenome in that context. A recently developed algorithm, Arboretum, allows the simultaneous comparison of gene modules across multiple species, and has been successfully applied to the study of evolution in fungal Ascomycete (yeast) species. Multi-clustering methods of this kind are part of the expertise of the Roy research group (Roy et al., 2012 and 2011). In short, the necessary tools are now available to develop such an analysis for complex organisms like mammals.

Another vital topic of inquiry is the role of the epigenome in cancer. Carcinogenesis has been associated with genome-wide changes in the epigenome of adult stem cells (Aran et al., 2013), which mirror those that occur in ES cells during differentiation and proliferation (Kashyap et al., 2009). The growing compendia of data described above provide a means for examining the epigenome in the context of cancer biology as well as non-diseased states. It is possible to apply the same comparative analyses to cancer data as described above for the comparison of species. In such a context we can learn about epigenetic processes on two very different evolutionary time scales: those of species differentiation and of organismal life. Importantly, such efforts can contribute to improved medical knowledge of cancer for the welfare of patients. Multi-clustering analyses of the kind described here can yield new insights into the dynamic and functional role of the epigenome, both in evolution and in complex organisms.

Bibliography

Aran, D. et al. (2013) DNA methylation of distal regulatory sites characterizes dysregulation of cancer genes. Genome Biology 14, R21

Bernstein, B. et al. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nature Biotechnology 28, 1045-1048.

Kashyap, V. et al. (2009) Regulation of stem cell pluripotency and differentiation involves a mutual regulatory circuit of the nanog, OCT4, and SOX2 pluripotency transcription factors with polycomb repressive complexes and stem cell microRNAs. Stem Cells and Development 7, 1093-1106.

Ku, M. et al. (2012) H2A.Z landscapes and dual modifications in pluripotent and multipotent stem cells underlie complex genome regulatory functions. Genome Biology 13, R85.

Maunakea, A. et al. (2010) Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 466, 253-260.

Roy, S. et al. (2011) A multiple network learning approach to capture system-wide condition-specific responses. Bioinformatics 27, 1832-1838.

Roy, S. et al. (The modENCODE Consortium) (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787-1797.


Xiao, S. et al. (2012) Comparative epigenomic annotation of regulatory DNA. Cell 149, 1381-1392.


Chromatin marks and the cell-type specificity of SNPs

This paper came out recently in Nature and combines data from ENCODE, epigenome consortia, multiple GWAS studies, and the 1000 Genomes Project to address the question of the cell-type specificity of genetic variation affecting diseases.

The authors address two related questions: (a) what cell types are associated with a disease, judged by the chromatin activity surrounding SNPs associated with that disease, and (b) what marks confer this cell-type specificity to a disease; such marks are called informative marks. It all boils down to computing a statistic that measures how variable the strength of a mark is for the SNPs of a disease.

The authors started with SNPs associated with different diseases from GWAS studies. The analysis was done on a per-disease basis, for example for LDL or rheumatoid arthritis. The authors identified which SNPs were associated with each disease in a GWAS study and added to this list additional SNPs in high linkage disequilibrium with the associated SNPs. They then obtained chromatin mark peaks for different chromatin marks in different cell types and lines from ENCODE as well as the epigenome map. They then asked, for each SNP, to what extent it was associated with a particular mark in a particular cell type. This was done by defining a score that is the ratio of the height of a peak to its width.

Thus, if we were to think of these data as matrices, we would have one matrix per mark, whose columns correspond to the positions of the SNPs and whose rows correspond to different cell types. A mark is then considered informative for a disease if all or most of the SNPs exhibit a high score in the same few cell types, and uninformative if the SNPs with the highest scores differ across cell types. To compute this informativeness, the authors defined a metric that measures the variation in the SNP scores for a disease across cell types. Specifically, the statistic is a sum of squared differences of SNP scores, computed for each cell type and phenotype combination. If this number is small, then the mark is apparently cell-type specific. Finally, the authors use a permutation analysis to decide whether a particular score is high or low. Cell-type specificity for a disease is computed by summing the scores over all SNPs in a given cell type and assessing significance.

The statistic they use to define whether a mark is informative is the sum of squared differences between each SNP’s score and the mean over all SNPs in each disease-cell-type combination. If this is small, then we can assume that the mark does not vary too much, but there is no control over which cell types the mark must vary over. I am not sure how the method deals with the situation where a mark does not change much across cell types.
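
My schematic reading of that statistic, as a small numpy sketch (assumed inputs, not the authors' code): `score[c, s]` is the peak-height/width score of SNP s for one mark and one disease in cell type c; the statistic sums the squared deviations of SNP scores from the per-cell-type mean, and the permutation scheme below (shuffling each SNP's scores across cell types) is one simple choice of null, not necessarily the one used in the paper.

```python
import numpy as np

def mark_informativeness(score):
    """score: (n_cell_types, n_snps) matrix of peak-height/width scores.
    Sum of squared deviations of each SNP's score from the per-cell-type mean."""
    dev = score - score.mean(axis=1, keepdims=True)
    return (dev ** 2).sum()

def permutation_pvalue(score, n_perm=1000, seed=0):
    """Shuffle each SNP's scores across cell types, recompute the statistic,
    and report how often the null is as small as the observed value."""
    rng = np.random.default_rng(seed)
    obs = mark_informativeness(score)
    null = np.array([mark_informativeness(rng.permuted(score, axis=0))
                     for _ in range(n_perm)])
    return (null <= obs).mean()
```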

Directionality and heterogeneity of regulatory elements

A new paper from Kundaje et al., released as part of ENCODE, analyzes the arrangement of nucleosomes and histone modifications around transcription start sites (TSSs) and transcription factor binding sites (TFBSs) in two human cell lines. Their study is motivated by the standard aggregation plot, which aggregates genome-wide profiles of a particular signal (say, MNase-seq measures of nucleosome occupancy) around a ubiquitous anchor, such as TSSs. These aggregation plots show the general features of signal behavior around the anchor (in this case, the customary nucleosome peaks before and after the TSS), but fail to describe the diversity of signal patterns that contribute to the aggregate. Additionally, the study aims to describe anchor points where the directionality is known (such as TSSs, which are oriented in the direction of transcription) as well as unknown (such as distal TFBSs like CTCF sites).

They create a new tool, Clustered AGgregation Tool (CAGT), which automatically detects distinct clusters of nucleosome or histone modification signal around a regulatory element. CAGT has two major components. First, k-medians clustering is applied to a region of a given size around each anchor point to produce a large set of signal patterns. K-medians requires only a distance metric and a choice of k; they use one minus the base-wise correlation between two signals to quantify their distance. An example cluster might contain a strong peak a set distance to the left of the anchor, and no peak on the right. K-medians might produce a large number of redundant clusters, and does not appropriately flip and combine clusters centered on anchors with unknown polarity. Therefore, the second step of CAGT is hierarchical clustering with the option of reversing signals when the anchors have unknown direction. This produces a consensus set of distinct, diverse nucleosome or chromatin mark signals around anchors such as TSS or TFBS.
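
A stripped-down Python sketch of the two CAGT ingredients described above (my simplification, not the released tool): the 1 − correlation distance, made orientation-aware by also comparing against the flipped signal, and a hierarchical merge of the k-medians cluster medians.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def corr_distance(a, b, allow_flip=True):
    """1 - basewise Pearson correlation; if the anchor's polarity is unknown,
    take the smaller distance of the signal and its reverse."""
    d = 1.0 - np.corrcoef(a, b)[0, 1]
    if allow_flip:
        d = min(d, 1.0 - np.corrcoef(a, b[::-1])[0, 1])
    return d

def merge_medians(medians, n_final=10, allow_flip=True):
    """Second CAGT stage: hierarchically merge the first-stage cluster medians
    using the (flip-aware) correlation distance."""
    k = len(medians)
    dmat = np.array([[corr_distance(medians[i], medians[j], allow_flip)
                      for j in range(k)] for i in range(k)])
    condensed = dmat[np.triu_indices(k, 1)]         # condensed form for scipy linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_final, criterion="maxclust")
```

The first-stage k-medians step is omitted here; any clustering under the same correlation distance would produce the `medians` consumed by the merge step.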

They apply their method to human GM12878 and K562 cells. They find diverse positioning patterns for nucleosomes as well as various chromatin marks around gene TSSs and numerous TFBS sites. The majority of these patterns are asymmetrical, suggesting that the inherent polarity of regulatory elements strongly influences the positioning of these signals. Exceptions included DNase hypersensitivity signals, which were generally symmetric around all TFBSs. Interestingly, nucleosome positions anchored around CTCF/cohesin complex binding sites were the most symmetrical of all the TFBSs measured, suggesting a unique chromatin environment around insulator elements. In order to perform this computational analysis, they also produced high-quality genome-wide nucleosome positioning data for GM12878 and K562 cells using MNase-seq.