Affinity regression predicts the recognition code of nucleic acid–binding proteins


Our next meeting will be at 3:00 on February 10th, in room 4160 of the Discovery building. Our selected paper is “Affinity regression predicts the recognition code of nucleic acid–binding proteins.”
The abstract is as follows.

Predicting the affinity profiles of nucleic acid–binding proteins directly from the protein sequence is a challenging problem. We present a statistical approach for learning the recognition code of a family of transcription factors or RNA-binding proteins (RBPs) from high-throughput binding data. Our method, called affinity regression, trains on protein binding microarray (PBM) or RNAcompete data to learn an interaction model between proteins and nucleic acids using only protein domain and probe sequences as inputs. When trained on mouse homeodomain PBM profiles, our model correctly identifies residues that confer DNA-binding specificity and accurately predicts binding motifs for an independent set of divergent homeodomains. Similarly, when trained on RNAcompete profiles for diverse RBPs, our model correctly predicts the binding affinities of held-out proteins and identifies key RNA-binding residues, despite the high level of sequence divergence across RBPs. We expect that the method will be broadly applicable to modeling and predicting paired macromolecular interactions in settings where high-throughput affinity data are available.

We welcome all who can join us for this discussion. Feel free to begin that discussion in the comments section below.



One thought on “Affinity regression predicts the recognition code of nucleic acid–binding proteins”

  • Debbie

    Some notes from our discussion:

    The goal is to predict the affinity between a protein and a nucleic acid using the protein’s amino acid sequence. Protein binding microarrays and RNAcompete assays measure the affinity between proteins and NAs. One important application of affinity regression is identifying the contributing residues in the protein, which helps in studying the similarities and differences between NA-binding proteins and how they contribute to gene regulation. Another is learning a motif for an unassayed protein by feeding the predicted binding affinities from affinity regression into a motif learning algorithm.

    The affinity regression approach is a recommender system trained on PBMs and similar data. It aims to (a) make predictions about the binding profile of an unknown protein based on similarities to proteins for which we have data and (b) reveal the kmer features important for affinity in both the nucleic acid sequence and amino acid sequence.

    First we talked through the original model and the optimizations they made to keep it tractable. The input data are protein binding microarray profiles Y (probes x TFs), NA-kmer features for the probes D (probes x kmers), and amino acid kmer features for the TFs P (TFs x kmers).

    The initial model is D W P^T = Y. We want to learn W, an interaction matrix between the NA kmers and the AA kmers. (Dimension check: (probes x NA-kmers) times (NA-kmers x AA-kmers) times (AA-kmers x TFs) gives probes x TFs.)

    The mismatch between the number of probes (thousands) and the number of TFs (hundreds) poses a problem for solving this, so they start by left-multiplying both sides of the equation by Y^T. As a result, the outputs are now Y^T Y, the pairwise similarities between TFs (TFs x TFs), instead of scores from probes to TFs. Now the model W predicts similarity between pairs of TFs. (Fig 1b)
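
    A minimal numpy sketch of this setup, with random data and made-up dimensions (names like n_probes are ours, not the paper’s):

        import numpy as np

        rng = np.random.default_rng(0)
        n_probes, n_tfs, n_dkmers, n_pkmers = 5000, 200, 400, 300

        D = rng.standard_normal((n_probes, n_dkmers))  # probe NA-kmer features
        P = rng.standard_normal((n_tfs, n_pkmers))     # TF AA-kmer features
        Y = rng.standard_normal((n_probes, n_tfs))     # PBM intensities (probes x TFs)

        # Original system: D @ W @ P.T = Y, with W the NA-kmer x AA-kmer
        # interaction matrix. Left-multiplying both sides by Y.T collapses
        # the large probe dimension: (Y.T @ D) @ W @ P.T = Y.T @ Y
        YtD = Y.T @ D   # TFs x NA-kmers
        YtY = Y.T @ Y   # TFs x TFs, pairwise TF similarities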

    We spent some time talking about how to interpret their additional optimizations. The first was to reduce the dimensionality of W (the interaction matrix): they did an SVD on Y^T D (TFs x NA-kmers). They also propose a second dimensionality reduction on P (TFs x AA-kmers), but they said the first one was sufficient.
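
    Continuing the sketch above, the reduction might look like this (the rank cutoff k is an arbitrary choice of ours, not the paper’s value):

        # SVD of the TFs x NA-kmers matrix; keep only the top-k components.
        U, s, Vt = np.linalg.svd(YtD, full_matrices=False)
        k = 50                      # rank cutoff (assumed)
        YtD_low = U[:, :k] * s[:k]  # TFs x k reduced features
        # W is then learned in this k-dimensional basis and mapped back via Vt[:k].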

    They also looked at two ways to regularize W: lasso and ridge. Apparently ridge worked better in their experiments.
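
    To see why ridge fits naturally here, note that the bilinear system vectorizes into an ordinary ridge regression. A toy-scale sketch under that framing (the explicit Kronecker product is only feasible at these tiny made-up sizes):

        import numpy as np

        rng = np.random.default_rng(1)
        A = rng.standard_normal((20, 8))   # stands in for Y.T @ D (TFs x NA-kmers)
        B = rng.standard_normal((20, 6))   # stands in for P (TFs x AA-kmers)
        C = rng.standard_normal((20, 20))  # stands in for Y.T @ Y (TFs x TFs)
        lam = 1.0                          # ridge penalty (arbitrary)

        # vec(A @ W @ B.T) == kron(B, A) @ vec(W) (column-major vec), so ridge
        # on W is ordinary ridge regression with design matrix kron(B, A).
        X = np.kron(B, A)
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                            X.T @ C.ravel(order="F"))
        W = w.reshape(8, 6, order="F")     # NA-kmer x AA-kmer interaction weights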

    Something we were wondering about is exactly how to interpret what happens when you left-multiply by Y^T.

    Now we make a prediction for a held-out protein (Fig 1c). Since the model doesn’t directly predict a binding profile over the probes, but instead a similarity vector to the training TFs, the profile has to be reconstructed from the training data. They propose two methods: “mapping reconstruction”, which predicts the binding profile as a linear combination of training profiles (but requires linearity of the output space?), and “nearest neighbor reconstruction”, which is simpler and predicts the profile as a weighted average of the held-out protein’s nearest neighbors, as measured by similarity between the predicted vector and the training intensity profiles.
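
    Hedged sketches of the two reconstruction schemes, assuming s_hat is the model’s predicted similarity vector to the training TFs (function names and the choice of k are ours, not the paper’s code):

        import numpy as np

        def mapping_reconstruction(Y, s_hat):
            # Solve Y.T @ y = s_hat by least squares; the minimum-norm solution
            # lies in the span of the training profiles (columns of Y).
            y_hat, *_ = np.linalg.lstsq(Y.T, s_hat, rcond=None)
            return y_hat

        def nearest_neighbor_reconstruction(Y, s_hat, k=5):
            # Weighted average of the k training profiles with the highest
            # predicted similarity (assumes nonnegative similarities).
            top = np.argsort(s_hat)[-k:]
            w = s_hat[top] / s_hat[top].sum()
            return Y[:, top] @ w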

    In Fig 1d-g, they compare their method’s ability to predict the binding intensities of held-out proteins across probes against some baselines, including using the profile of the most similar training example as chosen by BLOSUM similarity or by kmer similarity. They also use the similarity between two replicates of the experiment as an upper bound on how well any method can expect to do.

    Fig 2 is about using the model to identify the residues most important for binding, and how those residues vary across a family of homeobox proteins. They compute TF kmer importances via Y^T D W and score each position of a TF’s amino acid sequence by summing the weights of overlapping kmers. First, in 2a, they look at the weights across the sequences of several homeobox proteins and observe some strongly conserved regions as well as diverged regions. Next, they look at a few specific proteins and compare against PDB structures to see whether their predictions line up with what is known about those proteins’ interaction sites. They develop a null model to assess the positional significance of specific kmers, and show that two of the homeobox proteins have distinct predicted active sites that partially overlap with what was known from the PDB and partially predict new residues that are important.
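
    A small sketch of that position-scoring step: each residue accumulates the weights of every kmer that overlaps it. Here k = 4 and the weight lookup are assumptions; the weights would come from the protein’s row of Y^T D W.

        def residue_scores(seq, kmer_weights, k=4):
            # Sum, at each position, the weights of all k-mers covering it.
            scores = [0.0] * len(seq)
            for i in range(len(seq) - k + 1):
                w = kmer_weights.get(seq[i:i + k], 0.0)
                for j in range(i, i + k):
                    scores[j] += w
            return scores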

    Fig 3 is about comparing homeodomains across species. Their predictions were fairly accurate compared to the best possible (“oracle”: the prediction value when a TF is included in the training set) and, overall, better than or as good as BLOSUM similarity. They also assessed whether predicted affinity regression profiles can be used to learn motifs as well as other approaches do. In general, the nearest neighbor approach did almost as well; when restricted to just mouse proteins, affinity regression outperformed nearest neighbor.