A key step in the development of an adaptive immune response to pathogens or vaccines is the binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design of vaccines consists in part of identifying appropriate peptides to effect this process.
The task is complicated by the fact that genes of the MHC locus have some of the greatest allelic variability observed among functional loci [1]. Peptides that bind well to one allele may or may not bind well to another.
A variety of methods have been developed for predicting peptide binding to specific MHC alleles from the sequence information of the peptides. A comparative review of some of the most influential approaches, including Weight Matrix Models (WMM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), can be found in [2].
Prediction algorithms can be categorized broadly into two main classes: those based on pattern recognition and those based on classification. Pattern recognition methods seek to discover similarities among the peptides that bind at high affinity to a given MHC allele (henceforth denoted as "binders"), without considering the properties of non-binders. Classification methods, on the other hand, seek those characteristics that most effectively distinguish binders from non-binders. Pattern recognition-based methods include WMM, motif-based prediction, and profile HMM [3, 4]. Classification methods include Support Vector Machines [5] and Classification Trees [6–8]. These methods and the software implementing them are reviewed in [9] and [10].
Whether based on classification or pattern discovery, the peptides under investigation must have a representation in an appropriate space. The most commonly used prediction methods employ a simple categorical representation of the amino acids by chemical identity. In this representation, each amino acid is implicitly regarded as equidistant from every other amino acid.
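The equidistance implied by the categorical representation can be checked directly. The following is a minimal sketch, assuming the usual indicator ("one-hot") coding of each position; the function names and the example 9-mer are illustrative choices, not taken from the methods reviewed above.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def one_hot(residue):
    """Return a 20-element indicator vector for a single residue."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]

def encode_peptide(peptide):
    """Concatenate per-position indicator vectors (a 9-mer gives 180 features)."""
    return [bit for residue in peptide for bit in one_hot(residue)]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Collect the distances between all 190 pairs of distinct residues.
pair_distances = {
    squared_distance(one_hot(a), one_hot(b))
    for a, b in combinations(AMINO_ACIDS, 2)
}

# Arbitrary 9-mer used only to illustrate the encoding.
example_features = encode_peptide("LLFGYPVYV")
```

Under this coding `pair_distances` contains a single value, so the representation carries no notion that, say, Leucine is closer to Isoleucine than to Aspartic acid.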
Our aim in this paper is to determine whether more structured representations, based on the biophysical properties of the amino acids, can improve the effectiveness of standard classification methods. Classification must always balance parsimony, or simplicity in the model specification, against accuracy within the training set. A representation based on properties that may play a significant role in determining the binding characteristics of the peptide therefore has a fair chance of supporting simple models that are also accurate.
For example, one property of amino acids that is clearly related to protein binding is hydrophobicity. The Kyte-Doolittle [11] hydrophobicity index induces an order on the amino acids. We may distinguish one set of amino acids, e.g., (R, K, D, E, N, Q, H, P, Y, W, S, T, G) from the remainder (A, M, C, F, I, L, V) by stating that the first set is to contain all amino acids with KD index less than that of A (Alanine). There are 21 ways to form such subsets, so log2(21) ≈ 4.4 bits of information suffice to specify a split based on the above ordered set. An arbitrary split into two groups, on the other hand, requires log2(2^20) = 20 bits for its specification. Note also that a classification expressed in terms of hydrophobicity, when hydrophobicity is not strongly relevant, can quickly become inefficient to specify (such as "everything with KD index less than that of alanine, plus phenylalanine and valine; not including arginine, aspartic acid and tryptophan").
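The split and the two bit counts above can be reproduced with a short sketch. The index values are the published Kyte-Doolittle hydropathy values; the function and variable names are illustrative assumptions.

```python
import math

# Kyte-Doolittle hydropathy index, keyed by one-letter residue code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def split_below(reference):
    """Split residues into those with KD index below the reference and the rest."""
    below = {aa for aa, value in KD.items() if value < KD[reference]}
    return below, set(KD) - below

below_alanine, remainder = split_below("A")

# Following the count used in the text: an ordered list of 20 residues
# has 21 cut positions, while an arbitrary subset is one of 2**20.
n_residues = len(KD)
bits_ordered = math.log2(n_residues + 1)
bits_arbitrary = math.log2(2 ** n_residues)
```

Here `below_alanine` holds the 13 residues listed in the first set and `remainder` the 7 residues of the second, so a single threshold on one property specifies the whole partition.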
We do not intend to measure the information required for the specification of each classification model, but instead rely on the natural role of representational simplicity in the performance of classification methods. We may put it another way and ask whether a biophysical encoding makes it easier to find most of the binders by piling them up near each other in feature space, rather than having them scattered more diffusely at the level of individual residues.
To demonstrate the effectiveness of our feature space representation we compare the performance of several well-known classification methods under both a biophysical amino acid encoding and a simple categorical encoding. This paper does not focus on comparing the classification methods themselves. Instead, for each of the classification methods we compare the performance of the classifiers using the biophysical encoding against the usual categorical encoding.
In addition to effective classification, prediction methods based on amino acid biophysical properties may lend themselves naturally to the development of more comprehensive systems that combine purely empirical methods with de novo or first-principles prediction of peptide binding.
Prior Art
In contrast to the substantial literature on sequence-based peptide binding prediction, there has been relatively little focus on the use of amino acid biophysical properties in binding prediction. Information about the amino acid properties can be used for prediction in several ways. One may, for example, use the real-valued properties themselves in a regression model, or more simply use the order induced by the properties. Alternatively, one can use these properties to define new categorical variables and thus natural equivalence classes on the amino acids. [12] used a statistical dissimilarity defined on "property models", which showed increased sensitivity over other existing methods. Using the public database on Amino Acid Properties [13], containing a total of 484 properties, [14] and [15] build prediction rules using SVM, decision trees and C4.5 and C5 [16, 17]. [15] chose a list of 23 properties from different major and minor classes and measured the performance of the classification algorithms based on specificity, sensitivity and accuracy. Additionally, that paper summarizes the three most important variables together with the most important positions for each of the MHC-I alleles considered. On the other hand, [14] started with all 484 properties (leaving out the 10 properties containing missing values) and used a heuristic algorithm (based on the pairwise correlation coefficients among the properties) to remove redundancy. Finally, they report the misclassification error for C4.5, both with and without bagging, using all the variables that passed the redundancy test.
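To make the idea of correlation-based redundancy removal concrete, the following is a minimal sketch of one possible greedy filter. It is not the heuristic of [14], whose details are not given here; the threshold of 0.9, the function names, and the toy property values are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def drop_redundant(properties, threshold=0.9):
    """Keep a property only if its absolute correlation with every
    previously kept property is below the (assumed) threshold."""
    kept = []
    for name, values in properties.items():
        if all(abs(pearson(values, properties[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy, made-up values over five hypothetical residues (not AAindex data).
toy_properties = {
    "prop_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "prop_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # a rescaled copy of prop_a
    "prop_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_redundant(toy_properties)
```

A greedy filter of this kind depends on the order in which properties are visited, which is one reason the retained set may be hard to interpret.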
Using structural information, [18] describe a regression model to explain the binding affinity (pIC50) with the properties describing the 3-dimensional structure of the peptide. In particular, [18] regressed pIC50 values of peptides on two sets of position-specific structural parameters, namely Isotropic Surface Area (ISA) and Electronic Charge Index (ECI). Though none of the regression coefficients themselves were statistically significant, their approach provided a more desirable leave-one-out cross-validation error. Another approach using amino acid biophysical properties has been proposed by [19]. They employ an encoding based on the biophysical properties to classify the 20 amino acids into four binary factors: Hydrophobic = {A, V, F, P, M, I, L}, Polar = {S, T, Y, H, C, N, Q, W}, Charged = {D, E, K, R} and Glycine = {G}. This coding assigns a corresponding biochemical signature to each peptide, where each position now belongs to a 4-letter alphabet rather than a 20-letter alphabet. Though this coding does not allow one to distinguish between amino acids with the same code, e.g., Leucine and Isoleucine, it provides an important partition of reduced dimension, which is particularly relevant for peptide prediction. Using this dimension reduction, [19] report better misclassification error compared to algorithms based on the full unstructured 20-symbol alphabet. Empirical evidence of the superiority of property-based methods has also been well documented in an array of recent literature, including [20–26].
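The four-class coding described above can be written as a simple lookup. The following is a minimal sketch: the class memberships are those quoted from [19], while the single-letter class labels, the function name, and the example 9-mer are illustrative assumptions.

```python
# Class memberships as quoted in the text; the one-letter class labels
# (H, P, C, G) are an arbitrary choice made for this sketch.
CLASSES = {
    "H": "AVFPMIL",   # Hydrophobic
    "P": "STYHCNQW",  # Polar
    "C": "DEKR",      # Charged
    "G": "G",         # Glycine
}

RESIDUE_TO_CLASS = {
    residue: label
    for label, members in CLASSES.items()
    for residue in members
}

def signature(peptide):
    """Map a peptide to its signature over the 4-letter class alphabet."""
    return "".join(RESIDUE_TO_CLASS[residue] for residue in peptide)

# Arbitrary 9-mer used only to illustrate the mapping.
example_signature = signature("LLFGYPVYV")

# Number of distinct 9-mers under each alphabet.
n_full = 20 ** 9
n_reduced = 4 ** 9
```

Leucine and Isoleucine receive the same label, so peptides that differ only by such substitutions share one signature; this is the loss of resolution noted above, traded for a far smaller space of possible 9-mers.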
Our approach
While much of the research mentioned above examined the advantages of using bio-physio-chemical properties for MHC-peptide binding, often under particular classification frameworks, ours is the first article to provide a mathematically rigorous, general theory for representing 9-mer peptides in the space of amino acid properties. Our method was developed in parallel with [14] and [15] and is closely related to their approaches; that is, we use the full metric information of the amino acid properties, but we do not use any metric information on structural parameters. One major difference from the approaches of [14] and [15] is that the properties we analyze are first screened for their importance according to X-ray crystallography studies of peptide binding reported in the literature, rather than being completely determined by data from AAindex. This step is extremely important, for several reasons. First, the values of amino acid properties listed in the AAindex are based on experimental data, which are not standardized and often result in discrepant measurements of the same property. Moreover, there is considerable redundancy; e.g., the database contains three indices, one each for negative, positive and net charge. There are also instances of one index being a more precise version of another, e.g., the electron-ion interaction potential by [27] and [28]. Finally, choosing properties by exhaustively searching the AAindex is time consuming, and the resulting properties may not be easily interpretable. Our screening of amino acid properties based on their relevance in binding avoids these difficulties.
The main goal of this paper is to show that starting with a small set of properties known a priori to be of importance in protein-protein binding, and then using statistical techniques for variable selection to further refine this set, leads to a significant decrease in the misclassification error compared to simple sequence-based classification. Moreover, as our starting set is known a priori to be relevant in MHC binding, the final subset of properties can be directly interpreted and later used to formulate de novo or first-principles prediction of peptide binding. Finally, to our knowledge this is the first study to compare sequence-based and property-based classification of MHC-binding peptides using a number of competitive classification algorithms.
The layout of this article is as follows. The Methods section describes the steps used in choosing the biophysical properties. The Results section presents a direct application of the proposed algorithm to a training dataset of peptide binding for MHC allele A*0201, which was previously used by [2] to compare several sequence-based classification algorithms. The last section provides a detailed discussion and comparison with competing methods.