A systems-level integrative framework for genome-wide DNA methylation and gene expression data

Integrative analysis of DNA methylation (DNAm) and gene expression data has great potential to illuminate many processes related to the onset of cancer and other diseases. In particular, protein-protein interaction (PPI) networks have already been used to identify subnetworks associated with significant differential DNAm states, highlighting key nodes (genes) responsible for driving phenotypic differences in disease progression. We are interested in the work of Jiao et al. (http://bioinformatics.oxfordjournals.org/content/30/16/2360.full) for the method they have developed to address this question.

The effort of Jiao et al. resulted in the development of the Functional Epigenetic Module (FEM) algorithm for integrative analysis of DNAm and gene expression data. This algorithm uses Illumina Infinium 450k DNAm data with matched gene expression data, and performs a supervised analysis with a chosen PPI network as a scaffold on which to identify gene modules (in practice, PPI subnetworks) that are epigenetically differentially regulated in a specific phenotype, such as a cancerous state. The analysis is restricted to the genes that are represented in the PPI network and in both the DNAm and gene expression data sets, which limits how genome-wide the analysis can truly be.

For the DNAm data, each gene is assigned the average value of the probes mapping to within 200 bp of the transcription start site (TSS) of that gene. If no such probes are found in the DNAm data, then the probes mapping to the first exon of the gene are used, and failing that, the probes mapping to within 1500 bp of the gene TSS are used. The point here is that there are open questions about the best way to define per-gene DNAm values, and this is the method chosen in this work. For each gene, statistics are derived for the association of its DNAm profile and of its expression profile with the phenotype of interest. These are essentially regularized t-statistics obtained in an empirical Bayesian framework, and they are normalized across the DNAm and expression data so as not to bias the results of the algorithm toward one data type over the other. For each gene, then, an average t-statistic is ultimately produced. The central concept is to encapsulate the connections between genes in the PPI network with weighted edges, where the edge weight is the average of the t-statistics for the DNAm and expression data of the two connected gene nodes.
Edges with higher weights connect genes that are more strongly differentially expressed or methylated. “Hotspots” of differential methylation and expression are identified as “heavy” subnetworks, meaning subnetworks connected by edges with higher than average weight values for differential methylation and expression. The authors of this work call such a subnetwork a FEM.
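To make the two steps above concrete, here is a minimal Python sketch of the per-gene DNAm fallback rule and the edge weighting. It is an illustration under stated assumptions, not the authors' implementation: the function names, the input layout, and the region labels "TSS200", "1stExon" and "TSS1500" (chosen to resemble Illumina annotation groups) are all assumptions, and the plain average of the two node statistics simply follows the description given here, so any sign or absolute-value handling in the published method is not reproduced.

```python
from statistics import mean

# Region priority as described in the text: within 200 bp of the TSS first,
# then the first exon, then within 1500 bp of the TSS.
# The label strings are assumptions made for this sketch.
REGION_PRIORITY = ("TSS200", "1stExon", "TSS1500")


def gene_dnam_value(probes):
    """Return one DNAm value for a gene.

    probes: list of (region_label, beta_value) pairs for that gene.
    The first region in REGION_PRIORITY that has any probes wins, and the
    values of its probes are averaged. Returns None if no region matches.
    """
    for region in REGION_PRIORITY:
        values = [beta for label, beta in probes if label == region]
        if values:
            return mean(values)
    return None


def edge_weights(edges, node_stat):
    """Weight each PPI edge by the average of its two node statistics.

    edges: iterable of (gene_a, gene_b) pairs.
    node_stat: dict mapping gene -> combined (DNAm + expression) statistic.
    Edges with an endpoint missing from node_stat are skipped, mirroring
    the restriction to genes present in all data sets.
    """
    weights = {}
    for a, b in edges:
        if a in node_stat and b in node_stat:
            weights[(a, b)] = (node_stat[a] + node_stat[b]) / 2
    return weights


if __name__ == "__main__":
    # Toy gene with no TSS200 probe: the first-exon probes are averaged.
    print(gene_dnam_value([("TSS1500", 0.90), ("1stExon", 0.25), ("1stExon", 0.75)]))
    # Toy network: the edge to gene "Z" is dropped because Z has no statistic.
    print(edge_weights([("A", "B"), ("B", "Z")], {"A": 2.0, "B": 4.0}))
```

With weights of this kind in hand, a "heavy" subnetwork is simply one whose edges carry weights well above the network-wide average.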

In this method, the default mode is to select the top 100 most differentially expressed and methylated genes, as measured by the respective t-statistics for those genes. These genes seed the search for subnetworks, and each seed gene can lead to the definition of a FEM. A clustering method called the spin-glass algorithm (http://journals.aps.org/pre/abstract/10.1103/PhysRevE.74.016110, http://www.nature.com/srep/2013/130409/srep01630/full/srep01630.html) is used to determine the sets of genes to be clustered into subnetworks, and to identify those cases where a seed gene is involved only in an isolated edge of high association but is not otherwise part of a meaningful subnetwork. The choice of 100 seed genes was found to be practical because it allowed the search space of the algorithm to cover most of the nodes of the PPI network, and it produced clusters of 10 to 100 genes, which the authors regard as optimal for finding outside validation in biology. The coverage of the chosen PPI network was the main criterion indicated for the choice of how many seed genes to work with.
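The seed selection and the coverage criterion can be sketched as follows. This is a hedged illustration only: ranking by the absolute value of a single combined statistic, the function names, and the idea of measuring coverage as the fraction of nodes within a fixed number of steps (the hypothetical `radius` parameter) of any seed are assumptions made here, and the spin-glass clustering itself is not implemented.

```python
from collections import deque


def select_seeds(node_stat, n_seeds=100):
    """Return the n_seeds genes with the largest absolute statistic."""
    ranked = sorted(node_stat, key=lambda gene: abs(node_stat[gene]), reverse=True)
    return ranked[:n_seeds]


def coverage(seeds, adjacency, radius):
    """Fraction of network nodes within `radius` steps of any seed.

    adjacency: dict mapping every node to a list of its neighbours.
    Uses a multi-source breadth-first search starting from all seeds.
    """
    seen = {s for s in seeds if s in adjacency}
    queue = deque((s, 0) for s in seen)
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the allowed radius
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return len(seen) / len(adjacency)


if __name__ == "__main__":
    # Toy chain A-B-C-D plus an isolated node E.
    adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
    node_stat = {"A": 3.0, "B": -0.5, "C": 1.0, "D": -4.0, "E": 0.1}
    seeds = select_seeds(node_stat, n_seeds=2)
    print(seeds, coverage(seeds, adjacency, radius=1))
```

Repeating the coverage calculation for several values of `n_seeds` is one simple way to see how the number of seeds trades off against how much of the network the search can reach.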

The validation procedures used here are as follows. First, the authors looked at the statistical significance of the variation in the DNAm and expression data of edges in the inferred modules. This was done in simulation by randomly permuting the node (gene) t-statistics of the initial network in 1000 separate randomized instances, and recomputing the modularity statistics of the initially inferred modules to see if they are still supported in the spin-glass clustering analysis. Only modules that pass a 0.05 false discovery rate are considered validated. The definition of per-gene DNAm values was also varied, to see how stable the results were to this degree of freedom. The algorithm was then applied in three modes (DNAm only, expression only, and DNAm+expression) to data for normal and cancerous endometrial samples from The Cancer Genome Atlas resource. The authors then tested the reproducibility of the FEM clustering algorithm based on a particularly well-studied, literature-supported pathway involving the HAND2 gene in endometrial cancer. The support that these validation efforts lend to the FEM method was then used to substantiate more de novo results generated for a data set examining epithelial cell differentiation.
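The permutation idea can be sketched in a few lines of Python. This is a simplified stand-in rather than the procedure of the paper: instead of recomputing a spin-glass modularity, the sketch scores a module by its mean edge weight, and it applies a Benjamini-Hochberg adjustment as one common way of controlling the false discovery rate. The function names, the score, and the choice of adjustment are all assumptions made for illustration.

```python
import random


def module_score(module_edges, node_stat):
    """Mean edge weight of a module (a stand-in for the modularity statistic)."""
    total = sum((node_stat[a] + node_stat[b]) / 2 for a, b in module_edges)
    return total / len(module_edges)


def permutation_p_value(module_edges, node_stat, n_perm=1000, seed=0):
    """Empirical p-value obtained by shuffling node statistics across genes."""
    rng = random.Random(seed)
    genes = list(node_stat)
    values = [node_stat[g] for g in genes]
    observed = module_score(module_edges, node_stat)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # reassign statistics to genes at random
        permuted = dict(zip(genes, values))
        if module_score(module_edges, permuted) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy network: three high-scoring genes form the module under test.
    node_stat = {f"G{i}": 0.0 for i in range(20)}
    node_stat.update({"G0": 5.0, "G1": 5.0, "G2": 5.0})
    module = [("G0", "G1"), ("G1", "G2")]
    print(permutation_p_value(module, node_stat))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

In this sketch a module would be kept when its adjusted p-value falls below 0.05, which corresponds to the threshold quoted above.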

Moving forward, I would like to better understand the spin-glass algorithm that is used in the clustering, and to study in greater detail the methods the authors used in validating their algorithm. I hope to discuss both points in the journal club meeting.