Setting the default simulation parameters
We first performed an initial exploration of the parameter space (results not shown), varying all parameter values simultaneously, and based on those results, we narrowed the default value ranges of the parameters. (The simulation parameters are summarized in Table 1.) Then, we proceeded to explore each of the simulation parameters further.
Number of generations and population size
The first parameter to be examined was the number of generations needed to achieve the system's steady state. This steady state is not necessarily a fixed state, as there are ongoing fluctuations in the population as it continues to evolve. However, the amplitude of these fluctuations is relatively small and the system stabilizes around a single state (that is, more or less constant average values of fitness, genome size and diversity) after a relatively small number of generations - around 500. The number of generations in the simulation was set to 20,000, or until all the organisms have perished. The latter condition was rarely encountered and used only in extreme conditions. The 20,000-generation limit was high enough so that the effect of the initial conditions was small and the system quickly reached equilibrium (Figure 2 A-C). The genome size under the default parameters has settled to an average of about 40, which is within the range of the known Ig genomes as described above [10–18], and the average diversity approaches 0.5, which is the maximum diversity possible for random bit strings.
The size of the population in the simulations had only a small impact on the results. As the number of organisms is increased, the variability in the results is reduced, but the overall system behavior remains the same (Figure 2 D-F).
Gene length and number
The gene length in the simulation has a fixed value. We used a size of 128 bits in most runs, but also checked the values 512 and 1,024. The effect of increasing the gene length is that, as the length is increased, fewer bits are identical in a bit match between any two genes. For example, in 128 bit-long sequences the random matching is 64 bits and for a threshold of 0.53 (the binding threshold, described in detail below, is the minimum number of identical bits needed for Ab-Ag binding) there are on average about 4 unmatched bits for totally random genes that need to be mutated in order to get an antibody string that would pass the binding threshold (128*(0.53 - 0.5)), while for sequence with 1024 bits there are about 31 unmatched bits for the same threshold (1024*(0.53 - 0.5)) that would have to be mutated. To compensate for this, more antibody genes are needed in order to cover the antigen space, and thus the genome size is increased, although it remains within the range of the known Ig genomes as described above [10–18]. Although the repertoire quickly reaches the maximum diversity of 0.5, where all the sequences are the least similar to each other (Figure 2I), genome size still increases considerably with sequence length (Figure 2H). The large number of sequences needed for covering all the possibilities when gene length is increased is probably the reason why Ig sequences in real genomes are relatively short compared to the full length of the original antigen proteins, and antibodies only bind to small epitopes on the antigens. This way, a smaller number of genes can cover a larger fraction of all possible antigens, and the required matching does not need to be perfect, but only above a binding threshold. Both effects of gene length - large genome sizes and poor bit matching - explain why the average fitness decreases as gene size is increased (Figure 2G).
The maximum number of genes in an organism (maximum genome size), used within a penalty function that reduces the fitness as a function of gene number, helps the simulation avoid the trivial but biologically incorrect solution, where all or most of the possible genes are included in the organism's genome. The penalty function reduces the fitness of an organism when the number of genes in its genome increases. We used the value of 250 genes, as it is in upper range observed in real species [10–18], and any increase above this value seemed to increase the fitness by a negligible amount. Increasing the maximum number of genes increased the genome size (Figure 2 K) until a new balance between the fitness (Figure 2 J) gained by the larger genome size and lost by the penalty was reached. This parameter did not affect the diversity of the population (Figure 2 L), as it was already close to the maximum for smaller genome sizes.
Ag library size and binding threshold
The Ag library size has a negligible effect on overall system behaviour (Figure 2M-O). As the library is recreated randomly in every generation, under the very gross approximation that most pathogens evolve faster than organisms, the evolution of Abs acts to increase the diversity and not to achieve specificity against a static Ag.
The binding threshold is the minimum fraction of bits in the Ab string that must be identical to the Ag string in order to protect the organism against the Ag. When comparing two random bit strings, 0.5 of the bits are identical on average. Hence a binding threshold lower than 0.5 would have resulted in a high organism fitness with almost no dependency on other parameters, since with such a threshold even a small number of genes - actually, one gene - has a high probability of protecting the organism against most Ags (Figure 3A). As the threshold approaches 0.5, genome size and diversity begin to rise (Figure 3B, D), since the chance of matching all Ag genes is lower for any given Ab gene. At threshold values of 0.42 and higher the genome size is increased, to compensate for this difficulty in binding and achieve a higher fitness. In addition, there is a large "jump" in the diversity due to the increase in genome size - as one random antibody gene is no longer enough to protect the organism against all antigens. When the binding threshold approaches 0.5, the fitness begins to drop; at threshold values of over 0.5 (the random match), even those compensations are not enough and the fitness is reduced; and when the threshold exceeds 0.58, the population usually dies out. We chose the value of 0.53 as a default for subsequent runs, as it is higher than the random match - which would have been unrealistic, as real antibodies must be selected not to match the "self" - and hence must be a little more specific than just a generally "sticky" molecule - without also losing their match to most Ag sequences, so they cannot be too specific. However, the value of 0.53 is not too small, so the fitness genome size and diversity are still acceptable.
Genotype/phenotype ratio
The parameter Phenotype/Genotype ratio gives the fraction of genes in the genome that are expressed in the Ab repertoire, and was varied using the values 1, 1/2, 1/3... down to 1/10. This parameter strongly influences fitness and genome size (Figure 3 D, E). Decreasing the ratio causes a decrease in the fitness and a compensating increase in genome size. There is a limit to the compensation that can be reached by increasing the genome size, because of the limitation on the genome size in the simulation. When the Phenotype/Genotype ratio is small enough, the compensation limit is reached and, since increasing the genome does not increase the fitness anymore, the genome size and fitness drop dramatically, together with the diversity (Figure 3D-F). It is difficult to compare the ratios we used to what happens in real organisms, as complete repertoire characterizations do not exist for most of them. In zebrafish, it was found that between 50 and 86% of all possible VDJ combinations are used, and different individual fish shared a similar frequency distribution [23]. From the graphs presented in the latter study, it can be seen that some individual V genes are rarely represented in the actual repertoire, while others are more common. Similar results were found in sequencing the repertoires of a few human subjects [24]. In both cases, a fraction of the repertoire is not represented, but it does not seem to be higher than 50% in any individual, and in most cases it is must smaller - i.e., the genotype/phenotype ration in real systems is in the range that our simulations show to be "safe" for the individual.
After exploring the impact of basic parameter values (Figure 3), we chose the default values for the simulation (Table 1). Using those values, we explored the effects of changing the various mutation rates.
Effects of Mutation rates
The rates of occurrence of different mutation types were varied between simulation runs (Table 1), and their effects on average fitness, genome size, diversity, and gene locus structure were examined. Some of the results are intuitively obvious, e.g., that increasing the gene duplication rate increases the average genome size (Figure 4A). Others are less obvious and several properties are influenced by several different mutation rates, as described in the following sections. In the results presented below, global alignment was used to find the members of a family, and the average over all organisms is shown for each gene locus property. The ranges of mutation rates we used cover the transition from the region where the impact of the mutation is negligible and its effects are controlled by other factors in the system, to the region where the mutations dominate the system's behaviour - as in the case of duplication mutations increasing the genome size uncontrollably.
Gene duplication rate
The gene duplication rate directly affects the genome size. However, this effect is significant only when the duplication rate is high enough - over 0.01 events per gene per generation, which can also be considered very high in terms of real systems (Figure 4A). Genome size, limited by the maximum genome size and by the Phenotype/Genotype ratio, has an optimal value, and the larger genomes created by higher duplication rates are corrected to the preferred size. However, when the duplication rate is high enough, the genome size still increases and so the fitness decreases (Figure 4B). The number of families in the genome - which is between 10 and 40 in our simulations, as in real genomes - decreases when the duplication rate increases (Figure 4C), again because overall genome size is limited. The degree of inter-family mixing (the average radius of the family divided by its size) increases with the gene duplication rate, as the frequent duplication of genes that belong to the same family but are separated by members of other families increases the mixing (Figure 4D), up to unrealistic values of >100 (that is, distances of 100 genes or more between genes from the same family) for the highest values of the gene duplication rates.
Gene deletion rate
The same set of optimization mechanisms that play a role in the case of gene duplication is also at work in the case of gene deletion. When the gene deletion rate is low, the average genome size converges to the optimal size, but it decreases if the deletion rate is too high (Figure 4E). Unlike in duplication, the fitness decline rate as function of gene deletion rate is negligible (Figure 4F). However, similarly to the effect of changing the duplication rate, the number of families decreases as the gene deletion rate is increased (Figure 4G). The degree of mixing is only high (>2 - which is the value we'd expect to get in real genomes) for very low segment deletion rates, and for higher values it decreases, as expected since deletion contradicts the mixing effects caused by gene duplications (Figure 4H).
Point mutation rate
The point mutation rate has a negligible effect on genome size (Figure 4I). The average family size decreases slightly and family number increases slightly with an increase in the point mutation rate, which is expected since the mutations decrease the similarity between genes and hence the gene families are smaller (Figure 4J, K). These changes are very small compare to the changes caused by gene deletion and duplication. Since families become smaller when the point mutation rate is higher, the mixing between them is also reduced (Figure 4L), similar to the effect of increasing the point mutation rates.