A new paper from Kundaje et al., released as part of ENCODE, analyzes the arrangement of nucleosomes and histone modifications around transcription start sites (TSSs) and transcription factor binding sites (TFBSs) in two human cell lines. Their study is motivated by a limitation of the standard aggregation plot, which averages genome-wide profiles of a particular signal (say, MNase-seq measurements of nucleosome occupancy) around a common class of anchor, such as TSSs. These plots show the general behavior of the signal around the anchor (in this case, the customary nucleosome peaks before and after the TSS), but fail to describe the diversity of signal patterns that contribute to the aggregate. The authors also aim to handle anchors whose directionality is known (such as TSSs, which are oriented in the direction of transcription) as well as anchors whose directionality is unknown (such as distal TFBSs, for example those bound by CTCF).
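To make the limitation concrete, here is a minimal sketch of an aggregation plot computation. It is not code from the paper: the function name `aggregate_around_anchors`, its parameters, and the toy signal are all assumptions made for illustration. Each anchor contributes a fixed-width window, minus-strand windows are reversed so that all windows point the same way, and the windows are averaged.

```python
import numpy as np

def aggregate_around_anchors(signal, anchors, strands, half_width):
    """Average a per-base signal in windows centered on anchors.

    signal     : 1-D array of per-base signal (hypothetical, e.g. coverage)
    anchors    : integer anchor positions (e.g. TSSs)
    strands    : +1 or -1 per anchor; -1 windows are reversed
    half_width : number of bases kept on each side of the anchor
    """
    windows = []
    for pos, strand in zip(anchors, strands):
        lo, hi = pos - half_width, pos + half_width + 1
        if lo < 0 or hi > len(signal):
            continue  # skip anchors too close to the ends of the signal
        window = signal[lo:hi]
        windows.append(window[::-1] if strand < 0 else window)
    windows = np.array(windows)
    return windows, windows.mean(axis=0)

# Toy data: one anchor with a peak 2 bases upstream, one with a peak
# 2 bases downstream. Both anchors are on the + strand.
signal = np.zeros(100)
signal[18] = 1.0  # upstream of the anchor at position 20
signal[62] = 1.0  # downstream of the anchor at position 60
windows, profile = aggregate_around_anchors(signal, [20, 60], [1, 1], 3)
# profile is [0, 0.5, 0, 0, 0, 0.5, 0]: the average looks symmetric,
# although neither individual window is.
```

In this toy case the averaged profile suggests a peak on both sides of every anchor, while each individual window has a peak on only one side. That is the kind of hidden diversity the paper sets out to recover.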
They create a new tool, the Clustered AGgregation Tool (CAGT), which automatically detects distinct clusters of nucleosome or histone modification signal around a regulatory element. CAGT has two major components. First, k-medians clustering is applied to the signal in a window of a given size around each anchor point to produce a large set of signal patterns. K-medians requires only a distance metric and a choice of k; they use one minus the base-wise correlation between two signals as the distance. An example cluster might contain a strong peak a set distance to the left of the anchor and no peak on the right. K-medians can produce a large number of redundant clusters, and on its own it does not flip and combine mirror-image clusters centered on anchors with unknown polarity. Therefore, the second step of CAGT is hierarchical clustering of the k-medians clusters, with the option of reversing signals when the anchors have unknown direction. This produces a consensus set of distinct, diverse nucleosome or chromatin mark signal patterns around anchors such as TSS or TFBS.
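The two steps can be sketched roughly as follows. This is an illustrative approximation, not the CAGT implementation: the function names, the random initialization, the fixed iteration count, and the `threshold` stopping rule for merging are all assumptions, and the merge step is a simple greedy agglomeration standing in for the hierarchical clustering described above.

```python
import numpy as np

def corr_distance(a, b):
    """One minus the Pearson correlation between two signal vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    if denom == 0:
        return 1.0  # flat signal: treat as uncorrelated (assumption)
    return 1.0 - (a @ b) / denom

def k_medians(X, k, n_iter=20, seed=0):
    """Step 1: k-medians over rows of X using the correlation distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.array([[corr_distance(x, c) for c in centers] for x in X])
        labels = dists.argmin(axis=1)  # assign each row to its nearest center
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = np.median(members, axis=0)  # per-base median
    return labels, centers

def merge_clusters(X, labels, centers, threshold, allow_flip=True):
    """Step 2: greedily merge the closest pair of clusters, optionally
    reversing one of them, until no pair is closer than `threshold`."""
    groups = [X[labels == j] for j in range(len(centers)) if np.any(labels == j)]
    while len(groups) > 1:
        meds = [np.median(g, axis=0) for g in groups]
        best = (threshold, None, None, False)
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d, flip = corr_distance(meds[i], meds[j]), False
                if allow_flip:
                    d_rev = corr_distance(meds[i], meds[j][::-1])
                    if d_rev < d:
                        d, flip = d_rev, True
                if d < best[0]:
                    best = (d, i, j, flip)
        _, i, j, flip = best
        if i is None:
            break  # no remaining pair is similar enough to merge
        other = groups[j][:, ::-1] if flip else groups[j]
        groups[i] = np.vstack([groups[i], other])
        del groups[j]
    return groups

# Toy usage: a left-peak pattern and its mirror image, three copies each.
left = np.array([0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0])
X = np.vstack([left] * 3 + [left[::-1]] * 3)
labels, centers = k_medians(X, k=2)
groups = merge_clusters(X, labels, centers, threshold=0.5)
```

With `allow_flip=True`, a left-peak cluster and a right-peak cluster are treated as the same pattern seen from opposite orientations, which is the behavior wanted for anchors of unknown polarity. For anchors with known direction, such as TSSs, `allow_flip=False` keeps the two patterns separate.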
They apply their method to human GM12878 and K562 cells. They find diverse positioning patterns for nucleosomes and for various chromatin marks around gene TSSs and around the binding sites of numerous transcription factors. The majority of these patterns are asymmetrical, suggesting that the inherent polarity of regulatory elements strongly influences the position of these signals. Exceptions include DNase hypersensitivity signals, which are generally symmetric around all TFBS. Interestingly, nucleosome positions anchored around CTCF/cohesin complex binding sites are the most symmetrical of all the TFBS measured, suggesting a unique chromatin environment around insulator elements. To support this computational analysis, they also produce high-quality genome-wide nucleosome positioning data for GM12878 and K562 cells using MNase-seq.