# The HKA Test Revisited

## A Maximum-Likelihood-Ratio Test of the Standard Neutral Model

- Stephen I. Wright
^{1}and - Brian Charlesworth

- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Ashworth Laboratories, Edinburgh EH9 3JT, United Kingdom

- 1
*Corresponding author:*Department of Biology, York University, 4700 Keele St., Toronto, ON M3J 1P3, Canada. E-mail: stephenw{at}yorku.ca

## Abstract

We present a maximum-likelihood-ratio test of the standard neutral model, using multilocus data on polymorphism within species and divergence between species. The model is based on the Hudson-Kreitman-Aguadé (HKA) test, but allows for an explicit test of selection at individual loci in a multilocus framework. We use coalescent simulations to show that the likelihood-ratio test statistic is conservative, particularly when the assumption of no recombination is violated. Application of the method to polymorphism data from 18 loci from a population of *Arabidopsis lyrata* provides significant evidence for a balanced polymorphism at a candidate locus thought to be linked to the centromere. The method is also applied to polymorphism data in maize, providing support for the hypothesis of directional selection on genes in the starch pathway.

THE neutral theory of molecular evolution predicts that the amount of within-species diversity should be correlated with levels of between-species divergence, due to the dependence of both on the neutral mutation rate (Kimura 1983). The widely used HKA test (Hudson *et al.* 1987) evaluates the fit of polymorphism and divergence data to this prediction, as a test for natural selection against the null hypothesis of neutrality. The method involves the use of within-species polymorphism data on a sample from one species and sequence divergence from a related species, so that the relative amounts of polymorphism and divergence can be compared across loci. When applied to multilocus data, the HKA test assesses the overall fit of the data to a neutral model that assumes the same ratios of polymorphism and divergence at each locus. The marginal contributions of each locus to the multilocus chi-square statistic, or the results of multiple pairwise HKA tests, are often used to assess which loci contribute most to any observed departure from neutrality (*e.g.*, Moore and Purugganan 2003). This approach, however, does not provide a way of rigorously comparing different models, for example, to test for selection at a specific locus.

Here, we develop a maximum-likelihood method for analyzing polymorphism and divergence, using the HKA framework. As with the HKA test, the method assumes no recombination within loci but free recombination between loci, and it assumes that the ancestral population size of the species from which polymorphism data were obtained was the same as the current population size. The method is based on the assumption that loci are statistically independent. The likelihood (Edwards 1972) of the observed numbers of segregating sites (*S _{i}*) and pairwise divergence (

*D*) across each locus

_{i}*i*, for a total of

*r*loci sampled from a population, is then given by 1where θ

*= 4*

_{i}*N*

_{e}

*u*,

_{i}*N*

_{e}is the effective population size,

*u*is the locus-specific mutation rate, and

_{i}*T*is the divergence time of the two species, in units of 2

*N*

_{e}generations.

As with the HKA test, the method also assumes that polymorphism and divergence are independent. Although a slight nonindependence is expected between polymorphism and divergence for a given locus, the effect is very weak, and violations of the independence assumption are expected to lead to slightly conservative tests (Hudson* et al.* 1987). Under the assumption of no recombination and the standard neutral model of coalescence in a panmictic population, the likelihood of θ* _{i}* given

*S*and sample size

_{i}*n*is proportional to the probability of

*S*given θ

_{i}*and sample size*

_{i}*n*(

*P*(

_{n}*S*|θ

_{i}*)) according to the recursion 2*

_{i}(Hudson 1990), where and

The likelihood of the number of pairwise differences between a random sequence chosen from each species is given by Takahata* et al.* (1995): 3

The full neutral model thus has *r* + 1 parameters: a θ parameter for each locus, plus the shared divergence time parameter. The maximum log-likelihood under this model can thus be compared with the maximum log-likelihood with a common *u* parameter for all loci (*u*_{1} = *u*_{2} = … = *u _{r}*) to provide a likelihood-ratio test for the null hypothesis of a common mutation rate for all loci.

Within the HKA framework, selection acts by uncoupling polymorphism from divergence; positive selection will reduce the θ parameter for polymorphism relative to that for divergence, since the fixation of a favorable allele will reduce levels of diversity at linked sites by hitchhiking, while long-term balancing selection will have the opposite effect of maintaining elevated levels of diversity at linked neutral sites (Kreitman 2000). A model with a single selected locus (the *r*th locus) thus has one additional scaling parameter *k*: 4

In this model, *k* measures the degree to which diversity is increased or decreased by the action of selection at locus *r*. The fit under this model can then be compared to that under the alternative of no selection at the *r*th locus by a likelihood-ratio test.

To maximize the likelihood of the models, we use a Monte Carlo Markov chain following an approach of simulated tempering similar to that of McVean and Vieira (2001), making use of the Metropolis-Hastings algorithm (Metropolis *et al.* 1953; Hastings 1970). Briefly, a parameter is chosen at random, and the value of this parameter is incremented using a uniform distribution between −λ and +λ, where λ is a predefined increment. The likelihood of the data under the new parameter combination is then calculated, and the change is accepted if the likelihood of the new parameter set is greater. If the likelihood is less than the previous likelihood, the change is accepted with probability proportional to the difference in log-likelihoods: 5

For changes with lower likelihood, the probability of acceptance is determined by multiplying the difference in log-likelihood by a factor *f* of 50. After 10,000 iterations, this acceptance probability is reduced to a factor *f* of 0.5 times the difference in log-likelihoods for 1000 iterations and then returned to the original value. The Markov chain is run multiple times, using different starting parameters and different random-number seeds to ensure convergence. To test a difference between two models that yield maximum likelihoods *L*_{1} and *L*_{2}, we use the likelihood-ratio test statistic 2(ln *L*_{2} − ln *L*_{1}), *i.e.*, twice the difference in log-likelihoods between two models. A program written in C++ to carry out the method is available for download at www.yorku.ca/stephenw.

Here we consider four likelihood models:

Model 1 (1 free parameter): fixed mutation (

*u*_{1}=*u*_{2}= …*u*), no selection Model 2 (_{r}*r*free parameters, where*r*= number of loci): free mutation (all θestimated independently), no selection (_{i}*k*_{1}=*k*_{2}= …*k*_{r}= 1)Model 3 (1 free parameter): fixed mutation (

*u*_{1}=*u*_{2}= …*u*_{r}), selection at candidate locus*l*(*k*estimated,_{l}*k*_{2}=*k*_{3}= …*k*_{i}_{≠l}= 1)Model 4 (

*r*+ 1 free parameters): free mutation (all*u*estimated independently), selection at candidate locus_{i}*l*(*k*estimated,_{l}*k*_{1}=*k*_{2}= …*k*_{i}_{≠l}= 1).

We have applied the method to polymorphism data for 18 genes (supplementary Table 1 at http://www.genetics.org/supplemental/) from a single Icelandic population of *Arabidopsis lyrata*, using divergence estimates from *A. thaliana* (S. Wright and D. Charlesworth, unpublished data). These loci have an average of 126 synonymous sites and 3.9 segregating synonymous sites, and the sample size averaged 16.7 haploid genomes. For this data set, 100,000 chains were found to be sufficient for convergence, which took ∼5 min to run on a 2.8-GHz Pentium computer. To evaluate the behavior of the model, and to assess the fit of the likelihood ratio statistic to the χ^{2} approximation, we ran neutral coalescent simulations (Hudson 2002) for 18 loci, with divergence time, sample size, and θ parameter values chosen to match estimates from the *A. lyrata* data set. All coalescent simulations included a population split and a single sample from the outgroup species, so that polymorphism and divergence were both estimated from the same coalescent process. We ran two sets of 1000 simulations; the first had no intragenic recombination, and the second included intragenic recombination, with values of the population recombination parameter ρ = 4*N*_{e}*r* estimated for each locus using genetic mapping data from *A. thaliana*, as previously described (Wright *et al*. 2003). Across loci, the assumed per-locus values of ρ ranged from 0.3 to 60, and the ratio ρ*/*θ ranged from 0.02 to 10. Our candidate locus for selection (“locus 10”) is a gene thought to be linked to the centromere, AT1G36310, which had been hypothesized (S. Wright and D. Charlesworth, unpublished data) to be linked to a site under selection, due to increased hitchhiking in a region of reduced recombination.

To test the assumption of heterogeneity in mutation rates with this data set, we compare a fixed-rate model (model 1) with a free-rate model (model 2). Significance is assessed using a likelihood-ratio test, using χ^{2} with *r* − 1 = 17 d.f. Model 2 provides a highly significant improvement to the likelihood compared with model 1, indicating the existence of heterogeneity in mutation rates across loci (Table 1). This highlights the importance of incorporating mutation rate variation into parameter estimation (*e.g.*, Wall 2003) and tests of natural selection. Simulations show that likelihood estimation of divergence and mutation parameters perform well under the neutral model, even in the presence of intragenic recombination (Table 2).

To test for selection, we first use the standard multilocus HKA test for comparison, using the program of J. Hey (http://lifesci.rutgers.edu/heylab/DistributedProgramsandData.htm#HKA). The standard multilocus HKA test for these 18 loci shows a significant departure from neutral expectation (χ^{2} = 43.6, *P* < 0.001), suggesting the action of natural selection at some loci. However, no individual locus shows evidence for selection using the HKA test; the maximum marginal χ^{2} deviation is contributed by the polymorphism cell for the candidate locus (locus 10), which shows elevated levels of polymorphism, but this value is not significantly greater than expected under neutral coalescent simulations (*P =* 0.13). Furthermore, pairwise HKA tests comparing locus 10 to all other loci give 7 of 17 significant tests at the 5% level. While the above results are suggestive of a departure from neutrality, it is difficult to make a clear inference of selection at locus 10 using standard methods.

If we treat this centromeric locus as a candidate gene for selection using the likelihood method, comparison of a free-mutation model (model 4), which allows for selection at this locus, with the corresponding strict neutral model (model 2) gives a likelihood-ratio statistic of 9.0, which is significant under the χ^{2} approximation with 1 d.f. (Table 1). Similarly, there is a highly significant improvement to the fixed-mutation-rate model when we allow for selection at the candidate locus (comparison of models 1 and 3 in Table 1). Furthermore, the maximum-likelihood estimate of the selection parameter *k* is 4.3, suggesting a fourfold elevation of diversity over neutral expectation at this locus.

Simulations show that the neutral model with no recombination follows the χ^{2} approximation well and that the inclusion of reasonable levels of intragenic recombination makes the test more conservative (Figure 1). Under neutral coalescent simulations with intragenic recombination, the mean and median value of *k* are close to 1 (mean *k* = 1.05, median = 0.97), as expected. Furthermore, none of the 1000 simulations shows an estimate of *k* greater than that observed in this data set (Figure 2). Thus, the likelihood method shows significant evidence for a balanced polymorphism at this locus. These results suggest that the likelihood method has higher power than a multilocus HKA test to test for selection at a candidate gene.

To assess the power of the method to detect directional selection, we applied our method to a multilocus data set in maize (Whitt* et al.* 2002), which compares patterns of diversity for six genes involved in the starch pathway, which have been hypothesized to be under directional selection during domestication, with 11 neutral reference loci. Because of the considerable increase in number of sites per locus (average across loci, 795), the number of segregating sites per locus (average across loci, 26.9), and the divergence (average number of differences between species, 46.5), this analysis took considerably longer to converge (1–10 million chains) and to run (∼17–100 hr on a 2.6-GHz Pentium III computer) in comparison with the *A. lyrata* data set. We first compare a neutral model, where all 17 loci have *k* = 1, with a selection model, allowing all six starch pathway genes to be under selection. The likelihood-ratio test is highly significant for this comparison, showing strong evidence for selection on starch pathway genes (Table 3). However, three of the genes have maximum-likelihood estimates of *k* that are either close to or greater than one, suggesting that only a subset of genes may have been under directional selection. A model that allows only the 3 loci with *k* < 1 to be under selection is also highly significant in comparison with the neutral model, and the model with 6 selected loci shows no significant improvement to the likelihood in comparison with the three-gene model (Table 3).

As shown using the maize data set, the method can also easily be applied to test for selection on multiple loci, and the program allows for such models to be tested. However, the power of the method relies on the assumption that at least one locus in the data set follows the strict neutral assumptions and therefore still requires some *a priori* selection of candidate genes. With the inclusion of a significant number of neutral loci, the method should be much less sensitive to unusual loci and should incorporate the inherent uncertainty in divergence time from individual loci. This should give the likelihood method much more consistency than using a large number of pairwise tests, which have been shown to be difficult to interpret (Moriyama and Powell 1996).

Although the method appears to be fairly robust to the assumption of no recombination, the power to detect a significant reduction or increase in diversity in a given region is likely to depend on the local rate of recombination and the size of the region analyzed; a signature of natural selection using this method may be unlikely in large sampled regions of high recombination, unless selection is very recent and strong. For these types of data, tests that explicitly analyze variation in patterns of diversity across a region (*e.g.*, Kim and Stephan 2002) would be more appropriate.

In addition to the effects of natural selection, violations from the infinite-sites model, intralocus variation in mutation rates, population subdivision, and recent changes in population size (*i.e*., population bottlenecks and expansion) could lead to a significant likelihood-ratio test, although the extent to which the variance across loci in the ratio of diversity to divergence would be affected by these violations remains unclear. The problems associated with violations of the assumptions of the standard neutral model are shared with most other tests of selection based on polymorphism data, although in principle the multilocus comparisons associated with the HKA framework should be more robust to such violations than tests based on comparing different aspects of diversity at a single locus (*e.g.*, Tajima 1989; Fay and Wu 2000). One important approach to addressing this question would be to first fit the multilocus data to a demographic or mutation model and then examine the distribution of the likelihood-ratio test statistic under this model. If the fit to the chi-square statistic is found to be poor, then more computationally intensive likelihoods could be estimated using simulated data rather than the given equations above, using the same basic likelihood framework. However, the method may be conservative under some demographic models; for example, population expansion is expected to reduce the variance in diversity across loci, and in this case direct use of the above method may be preferable to avoid the computational requirements of exploring by simulation a vast array of possible demographic models and mutation parameters to reach the maximum likelihoods.

## Acknowledgments

We thank D. Charlesworth for helpful discussion, E. Buckler for providing data, and B. Gaut, P. Andolfatto, and two anonymous reviewers for comments on the manuscript. This research was supported by a Royal Society Professorship to B.C. and a Commonwealth Scholarship to S.W.

## Footnotes

Communicating editor: A. H. Paterson

- Received January 15, 2004.
- Accepted June 25, 2004.

- Genetics Society of America