### Propensity scale methods

The propensity scale methods assign a propensity value to every amino acid of the query protein sequence. Fluctuations are reduced by applying a running mean window. At the N- and C-termini we used asymmetric windows to avoid discarding prediction examples. The scales used in this study are based on antigenicity [20], hydrophilicity [6], inverted hydrophobicity [21, 22], accessibility [9] and secondary structure [7, 8].
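As an illustration of this smoothing step, the running mean with truncated (asymmetric) windows at the termini can be sketched as follows; the `half_window` size and the input values are placeholders, not the settings used in the study.

```python
def smooth(values, half_window=3):
    """Running mean of per-residue propensity scores.

    At the N- and C-termini the window is truncated (asymmetric)
    instead of the terminal residues being discarded, so every
    residue keeps a smoothed score.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)       # truncated near the N-terminus
        hi = min(n, i + half_window + 1)   # truncated near the C-terminus
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```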

### Hidden Markov models

Let **i** = (*i*_{1}, *i*_{2}, ..., *i*_{w}) denote a sequence of amino acids extracted from a protein sequence, and let *j* = 1...*w* denote the position in this window. On the basis of **i**, the hidden Markov model predicts whether the center position of the window is annotated as part of an epitope. In the N- and C-termini, parts of the extracted windows exceed the terminals. For these residues the character 'X' is used, which does not count when the hidden Markov model is used for the predictions. The prediction score for a window is given by

$$S(\mathbf{i}) = \sum_{j=1}^{w} \log \frac{p_{i_j,j}}{q_{i_j}}, \qquad (1)$$

which is the log odds of the residue at the center position of the window being part of an epitope (Epitope model) as opposed to occurring by chance (Random model).
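A minimal sketch of this window scoring, assuming the score is the sum over window positions of the per-position log odds and that 'X' positions simply contribute nothing; the probability tables `p` and `q` below are illustrative, not the trained model.

```python
import math

def window_score(window, p, q):
    """Log-odds score of a window under the Epitope vs. Random model.

    p[j][aa]: epitope-model probability of amino acid aa at position j.
    q[aa]:    background (Swiss-Prot) frequency of amino acid aa.
    Positions holding the padding character 'X' (the window extends
    past a terminus) are skipped and contribute nothing to the score.
    """
    score = 0.0
    for j, aa in enumerate(window):
        if aa == 'X':                      # padding beyond a terminus
            continue
        score += math.log(p[j][aa] / q[aa])
    return score
```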

To construct the Random model, the background amino acid frequencies of the Swiss-Prot database [23], *q*_{i}, are used. For the Epitope model, *p*_{i,j} is the effective probability of having amino acid *i* at position *j* according to the model.

To calculate the values of *p*_{i,j}, all windows whose center position is annotated as part of an epitope are extracted from a training data set. Again, if an extracted window exceeds the N- or C-terminal, the character 'X' is used, which does not count when calculating the parameters.

These extracted peptide windows form a matrix of aligned peptides of width *w*. From this alignment, *p*_{i,j} is calculated as the pseudo-count corrected probability of occurrence of amino acid *i* in column *j*, estimated as in [24]. To make the pseudo-count correction, pseudo-count frequencies, *g*_{i,j}, are calculated. They are given by

$$g_{i,j} = \sum_{k} \frac{p_{k,j}}{q_k}\, b_{i,k}, \qquad (2)$$

where *p*_{k,j} is the observed frequency of amino acid *k* in column *j* of the alignment [25], and *q*_{k} is the background frequency of *k*. The variable *b*_{i,k} is the Blosum 62 substitution matrix frequency, i.e. the frequency with which *i* is aligned to *k* [26].

To give an example of using (2), let the window size *w* = 1. The model then only covers residues annotated as being part of linear B-cell epitopes. If the observed peptides consist of the single amino acid sequences L and V, with the frequencies *p*_{L,1} = 0.5 and *p*_{V,1} = 0.5, then the pseudo-count frequency for e.g. I is given by

$$g_{I,1} = \frac{p_{L,1}}{q_L}\, b_{I,L} + \frac{p_{V,1}}{q_V}\, b_{I,V} \approx 0.14.$$

The effective amino acid frequencies are calculated as a weighted average of the observed frequency and the pseudo-count frequency,

$$p_{i,j} = \frac{\alpha\, p_{i,j}^{\mathrm{obs}} + \beta\, g_{i,j}}{\alpha + \beta}. \qquad (3)$$

Here, *α* is the effective number of sequences in the alignment - 1, and *β* is the pseudo-count correction [25], which is also called the weight on low counts. To finish the calculation example, let *β* be very large, as it is in this work. Then *p*_{I,1} ≈ *g*_{I,1} = 0.14.
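The worked example can be reproduced numerically. The Blosum 62 background and joint frequencies below are approximate values rounded from the published frequency tables; they are included only to make the sketch runnable.

```python
# Approximate Blosum 62 background frequencies q_k and joint
# (substitution) frequencies b_{I,k}; rounded illustrative values.
q = {'L': 0.099, 'V': 0.073}
b_I = {'L': 0.0114, 'V': 0.0120}   # b_{I,L} and b_{I,V}

# Observed column frequencies from the w = 1 example.
p_obs = {'L': 0.5, 'V': 0.5}

# Pseudo-count frequency (2): g_{I,1} = sum_k p_{k,1} * b_{I,k} / q_k
g_I = sum(p_obs[k] * b_I[k] / q[k] for k in p_obs)   # ~0.14

# Effective frequency: weighted average of the observed frequency and
# the pseudo-count frequency; a very large beta collapses it onto g.
alpha, beta = 1.0, 1e6       # alpha = effective sequences - 1 = 1 here
p_eff_I = (alpha * 0.0 + beta * g_I) / (alpha + beta)   # observed p_{I,1} = 0
```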

Note that we shall use the term hidden Markov model throughout this work to refer to the weight matrix generated using (1). The parameters of the ungapped Markov model are calculated using a so-called Gibbs sampler, written by Nielsen et al. [24].

The result of applying (1) is a prediction score for every residue of the query sequence. To reduce fluctuations, a smoothing window is applied to every position. It is made asymmetric at the N- and C-termini in order to conserve prediction examples.

### ROC-curves

The result of applying a prediction method to a data set is a set of prediction examples, **x** = (*x*_{1}, *x*_{2}, ..., *x*_{N}). Let *n* denote the residue number. Every *x*_{n} consists of a target value and a predicted value. If the residue is annotated as part of an epitope, the target value is 1, and zero otherwise. If asymmetric smoothing windows are used at the N- and C-termini, the variable *N* is equal to the number of residues in the data set.

According to a variable threshold, the prediction examples are classified as positives or negatives, and according to the target values, the predictions can be true or false. The predictions can be either true positives (TP), true negatives (TN), false positives (FP) or false negatives (FN).

The prediction accuracy is measured by constructing Receiver Operating Characteristic (ROC) curves [27]. For every value of the threshold, the true positive proportion, TP/(TP+FN), and the false positive proportion, FP/(FP+TN), are calculated. A ROC-curve is constructed by plotting the true positive proportion against the false positive proportion for all values of the threshold. It is therefore a non-parametric measure.

The sensitivity is equal to the true positive proportion, and the specificity, given by TN/(FP+TN), is equal to 1 - the false positive proportion. In this way, a ROC-curve displays the trade-off between sensitivity and specificity for all possible thresholds. A good method has a high true positive proportion when it has a low false positive proportion; such a model has both a high sensitivity and a high specificity. The performance of the method is measured as the area under the curve, the *A*_{roc}-value. For a random prediction, the true positive proportion is equal to the false positive proportion for every value of the threshold, giving *A*_{roc} = 0.5. For a perfect method, *A*_{roc} = 1.
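A minimal sketch of the ROC construction and the *A*_{roc} area, using the trapezoidal rule over the threshold sweep; the function names are ours, and a residue is classified positive when its score is at or above the threshold.

```python
def roc_points(targets, scores):
    """(false positive proportion, true positive proportion) pairs
    for a sweep of the classification threshold."""
    thresholds = sorted(set(scores), reverse=True)
    P = sum(targets)                       # annotated epitope residues
    N = len(targets) - P                   # non-epitope residues
    points = [(0.0, 0.0)]                  # threshold above every score
    for t in thresholds:
        tp = sum(1 for y, s in zip(targets, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(targets, scores) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points

def a_roc(targets, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(targets, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```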

### Bootstrapping

Bootstrapping is used to estimate the standard error of the *A*_{roc}-value, as a measure of its uncertainty [28]. The relation between the standard error and the standard deviation, *s*, is that $se = s/\sqrt{r}$, where *r* is the number of repeats of the underlying experiment [29].

Bootstrapping is a method for generating pseudo-replica (bootstrap samples) of the predictions, denoted **x***, which deviate a little from **x**. The bootstrap sample, **x*** = (*x**_{1}, *x**_{2}, ..., *x**_{N}), is defined as a random sample of size *N*, drawn with replacement from **x**. Some of the prediction examples from **x** may appear zero times, some once, some twice etc. Drawing a bootstrap sample can in other words be done by copying randomly chosen prediction examples, *x*_{n}, from **x** into **x***. In this way, some variation from **x** is introduced into **x***.

In total, *B* bootstrap samples are drawn. Let **x**^{*b} denote the *b*'th bootstrap sample. The prediction accuracy of **x**^{*b} is calculated as *A*_{roc}^{*b}.

The result of the bootstrap experiment is **x**^{*1}, **x**^{*2}, ..., **x**^{*B} and hence *A*_{roc}^{*1}, *A*_{roc}^{*2}, ..., *A*_{roc}^{*B}. The standard error of the original *A*_{roc}-value is given by

$$se = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left( A_{roc}^{*b} - \bar{A}_{roc}^{*} \right)^2}, \qquad (4)$$

where

$$\bar{A}_{roc}^{*} = \frac{1}{B} \sum_{b=1}^{B} A_{roc}^{*b}$$

is the expected value of *A*_{roc}^{*b} [28]. Note the similarity to the way the standard deviation is calculated. $\bar{A}_{roc}^{*}$ approaches the original *A*_{roc}-value as *B* gets large.
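The bootstrap standard error of (4) can be sketched generically: resample the prediction examples with replacement, recompute the statistic (here it would be *A*_{roc}) on each sample, and take the spread of the *B* bootstrap values around their mean. The `statistic` callable and the parameter defaults are illustrative.

```python
import random

def bootstrap_se(x, statistic, B=1000, seed=0):
    """Bootstrap estimate of the standard error of statistic(x).

    Each bootstrap sample x*b is drawn from x with replacement; the
    statistic is recomputed on each sample, and the standard error is
    the spread of the B bootstrap values around their mean, as in (4).
    """
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(B):
        sample = [x[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    mean = sum(stats) / B
    var = sum((s - mean) ** 2 for s in stats) / (B - 1)
    return var ** 0.5
```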

### Paired t-tests

A paired t-test is performed in order to determine if one method is more accurate than another. The *H*_{0}-hypothesis for this test is that two means are equal, *μ*_{1} = *μ*_{2}. Instead of *μ*, the bootstrap estimate $\bar{A}_{roc}^{*}$, and hence *A*_{roc}, is used. The starting point is the performance measures of the two methods, *A*_{roc,M1} and *A*_{roc,M2}, where *M1* denotes method 1. By bootstrapping we have the vectors (*A*_{roc,M1}^{*1}, ..., *A*_{roc,M1}^{*B}) and (*A*_{roc,M2}^{*1}, ..., *A*_{roc,M2}^{*B}). Every bootstrap pair is drawn identically for every *b*, making the two *A*_{roc}-values paired.

The *H*_{0}-hypothesis is therefore *A*_{roc,M1} = *A*_{roc,M2}, and the alternative hypothesis is *A*_{roc,M1} > *A*_{roc,M2}. The test statistic *t* is given by

$$t = \frac{\bar{D}}{se_D}. \qquad (5)$$

The paired difference of the *b*'th bootstrap samples, *D*^{b}, is given by

$$D^{b} = A_{roc,M1}^{*b} - A_{roc,M2}^{*b}. \qquad (6)$$

The variable

$$\bar{D} = \frac{1}{B} \sum_{b=1}^{B} D^{b}$$

is calculated as the expected value of *D*^{b}, and *se*_{D} is calculated using (4), replacing *A*_{roc}^{*b} with *D*^{b}. The test statistic follows a t-distribution with *m* = *B* - 1 degrees of freedom, which approaches the normal distribution for *m* > 30, so that *t* ≈ *z*. The P-value for the test is then given by 1 - *F*(*z*), where *F*(*z*) is the cumulative normal distribution. See [29] for more information about the paired t-test.
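A sketch of this paired bootstrap test: the same resampled indices are used for both methods, so the two per-sample statistics are paired, and the t statistic is the mean paired difference over its standard error. The function and argument names are ours; in the study the statistics would be the two methods' *A*_{roc} computations.

```python
import math
import random

def paired_bootstrap_test(x, stat1, stat2, B=1000, seed=0):
    """One-sided paired test that method 1 outperforms method 2.

    The same bootstrap sample (identical resampled indices) is scored
    by both methods, so the two statistics are paired; the test
    statistic is t = mean(D) / se(D), with D^b the paired difference
    on the b'th bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(x)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # one draw, used by both
        sample = [x[i] for i in idx]
        diffs.append(stat1(sample) - stat2(sample))
    d_bar = sum(diffs) / B
    se = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (B - 1))
    return d_bar / se if se > 0 else float('inf')
```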

### Permutation tests

When testing the *H*_{0}-hypothesis that a method performs like a random model, a permutation experiment can be made. The alternative hypothesis is that the method performs better than a random model. From the predictions of the method, **x**, the target values are permuted to give a new prediction set, **x**^{perm,p}. This is done for *p* = 1...*p*_{max}. For every *p*, the prediction accuracy is calculated as *A*_{roc}^{perm,p}. The P-value for the *H*_{0}-hypothesis is calculated as the proportion of times for which *A*_{roc}^{perm,p} > *A*_{roc}.
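A sketch of the permutation experiment; the `a_roc` scorer is passed in as a callable so the sketch stays self-contained, and the strict inequality matches the definition in the text.

```python
import random

def permutation_p_value(targets, scores, a_roc, p_max=1000, seed=0):
    """P-value for H0: the method performs like a random model.

    The target values are shuffled p_max times; the P-value is the
    proportion of permutations whose A_roc exceeds the original one.
    """
    rng = random.Random(seed)
    observed = a_roc(targets, scores)
    perm = list(targets)
    exceed = 0
    for _ in range(p_max):
        rng.shuffle(perm)                  # permute the target values
        if a_roc(perm, scores) > observed:
            exceed += 1
    return exceed / p_max
```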