Not so Fst: Davide Piffer Can’t Read

Posted by Kevin Bird on July 17, 2025

The ignominious Davide Piffer recently wrote a substack post claiming to identify a fatal flaw in my 2021 paper (freely accessible here). A careful assessment of his arguments compared to the actual contents of my paper reveals a misunderstanding so severe that I can only conclude that Piffer cannot read and interpret plain text and certainly doesn’t understand population or evolutionary genetics. If you think that assessment is harsh, please withhold judgment until the end of the post.

Background: Piffer’s previous papers

The context for this post is a long-running dispute about whether genetic differences driven by natural selection contribute to supposed racial differences in IQ (the hereditarian hypothesis) and what genomic analyses have to say on the matter. In 2015 and again in 2019, Piffer used the results of genome-wide association studies (GWAS), which correlate individual DNA base pairs that differ between people (SNPs) with a trait of interest (e.g., how tall a person is) in a large sample of people. A GWAS gives you a list of SNPs with significant associations to a trait and an estimate of their effect on the trait. In these papers, Piffer used publicly available genome sequences from individuals around the world and the results from a GWAS on educational attainment to identify the alleles an individual had at education-associated SNPs and added together their effects to get a single genetic score (called a polygenic score). He then averaged these polygenic scores by the country of origin of the sample and correlated those averages against a table of average IQ estimates for those countries (see more about this dataset here). Piffer found that the genetic score for a country was strongly correlated with its national IQ and resoundingly declared that IQ differences between races were the result of evolved genetic differences.
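For readers unfamiliar with the mechanics, a polygenic score is just a weighted sum: count how many copies of the effect allele a person carries at each trait-associated SNP (0, 1, or 2) and multiply by the GWAS effect-size estimate. A minimal sketch with made-up genotypes and effect sizes (not real GWAS output):

```python
import numpy as np

# Hypothetical data: 4 individuals genotyped at 5 education-associated SNPs.
# Each entry is the count of the effect allele (0, 1, or 2).
genotypes = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 0, 2, 1],
    [2, 0, 1, 0, 2],
    [1, 2, 1, 1, 1],
])

# Hypothetical per-SNP effect sizes from a GWAS (trait change per allele copy).
effects = np.array([0.02, -0.01, 0.03, 0.015, -0.02])

# The polygenic score is the dot product of allele counts and effect sizes.
scores = genotypes @ effects

# Averaging scores within a sampling population gives the kind of
# group-level value Piffer correlated against national IQ estimates.
group_labels = np.array(["A", "A", "B", "B"])
group_means = {g: scores[group_labels == g].mean() for g in ["A", "B"]}
```

Everything downstream of `scores` inherits whatever biases sit inside `effects`, which is the crux of the problems described below.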

However, there was one problem; actually, there were many problems. First, it turns out that comparing polygenic scores of genetically diverse groups of people is very fraught. The last decade of work in human genetics has definitively shown that the effect-size estimates from GWAS are subtly biased by several factors, such that scores do not reflect genetic effects alone, and that the accuracy of polygenic scores declines as a study population becomes more genetically distant from the original GWAS population (see Sasha Gusev’s post for an excellent summary). This problem is especially pronounced for the traits Piffer was looking at, like educational attainment and cognitive performance (see another excellent post by Sasha). Therefore, it was very likely that the polygenic scores Piffer was using did not actually reflect underlying genetic differences in cognitive ability.

Second, Piffer was claiming natural selection was responsible for these patterns without ever actually performing a formal statistical test for selection. Natural selection is famously difficult to detect because genetic variation arises simply from random chance and the ways that people are distributed across space and time. Distinguishing genetic differences caused by selection from those caused by non-selective processes is a very hard thing to do. Many claims of natural selection, even for single genes, do not stand up to scrutiny (e.g., ASPM and Microcephalin). Simply looking at genetic scores across populations and their correlation with a trait tells you nothing about whether natural selection was responsible for the observed patterns. Confidently supporting that claim involves something like modeling what would be expected in the absence of selection and showing that the observed patterns of genetic variation are statistically different from the neutral expectation. However, this also has to be done carefully. The biases that make polygenic scores inappropriate for comparison across diverse groups also lead to false-positive results for many tests of selection (as discussed in Sasha’s post and famously reported in this pair of papers about selection on height).
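The logic of a neutral-null test can be sketched in a few lines: simulate the divergence that pure genetic drift would produce, then ask whether the observed divergence is extreme relative to that null distribution. This is a toy illustration of the reasoning, not any published test, and all the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

def drift_divergence(p0, n_gen, pop_size, rng):
    """Simulate neutral drift in two populations from a shared starting
    allele frequency; return the absolute frequency difference."""
    p1, p2 = p0, p0
    for _ in range(n_gen):
        p1 = rng.binomial(2 * pop_size, p1) / (2 * pop_size)
        p2 = rng.binomial(2 * pop_size, p2) / (2 * pop_size)
    return abs(p1 - p2)

# Null distribution: divergence expected under drift alone.
null = np.array([drift_divergence(0.5, 100, 500, rng) for _ in range(2000)])

# A hypothetical observed divergence at a SNP of interest.
observed = 0.12

# Empirical p-value: how often drift alone produces divergence this large.
p_value = (null >= observed).mean()
```

Real tests are far more sophisticated (demography, ascertainment, linkage all matter), but the skeleton is the same: no comparison to a neutral expectation, no claim of selection.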

Piffer has regularly claimed he has resolved these issues caused by population structure by measuring and controlling for “LD decay”. What he hasn’t done is define what this parameter means and how it relates to the underlying statistical and population genetic features that create polygenic score bias. Nor has he shown that his method properly measures or controls for this parameter. Normally, this would be done using simulations as ground truth, perhaps applied to an exemplar trait such as height, where the problem has been observed and where ample data exist to demonstrate the ability to avoid biases. In short, it’s unclear what Piffer is doing or whether it resolves the issues. This is not the first time Piffer has struggled with defining and demonstrating the validity of his methods. All he can show is that the result still falls in his favor after performing this supposed control.

Foreground: Bird (2021) and Piffer’s criticism

Around the time my 2021 paper was published, the consensus about biases in polygenic scores was just forming, and the extent to which traits like education and cognitive performance were especially affected was still not clearly understood. Results based on genetic comparisons of family members, which are less vulnerable to the biases affecting GWAS done on large populations of unrelated individuals, were just starting to emerge as a viable strategy. I sought to remedy the shortcomings of Piffer’s studies by using polygenic scores from a regular GWAS on educational attainment and a new sibling-based GWAS.

Part 1: Testing for natural selection

The first part of the paper involved 1. formally testing for divergent natural selection (selection driving Africans and Europeans to different trait values) using less biased data, and 2. seeing if polygenic scores from regular GWAS showed the signs of bias that had just been observed for height. I used two previously published tests for selection. The first was the Qx test from Berg and Coop (2014), which uses polygenic scores to test for polygenic selection (I had some guidance from Jeremy Berg, who appears in the acknowledgements section for his gracious help). From these results, the less biased sibling GWAS polygenic scores did not show signs of selection while the regular GWAS polygenic scores did; in other words, there was a lack of evidence for selection and evidence that education polygenic scores were biased, just like height.

A paper published around the same time suggested that the particular way the sibling GWAS I used was constructed may not entirely resolve the biases in polygenic scores. So, in addition to the Qx test, I also used a test from Guo et al. 2018 based on Fst, a measure of genetic divergence, looking for greater genetic divergence among education-associated SNPs than among matched SNPs not associated with education. This test is weaker for some kinds of highly polygenic selection (I’ve openly acknowledged this in my presentations, such as this one), though by no means useless, and it is almost entirely free of the biases affecting polygenic scores. The Fst test also failed to find evidence of natural selection.
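The shape of that test is easy to sketch: take per-SNP Fst values for the trait-associated SNPs, then repeatedly draw equally sized sets of control SNPs to build a null distribution for the mean. This toy version skips the careful allele-frequency matching that Guo et al. (2018) perform, and the Fst values are simulated rather than computed from real allele frequencies, so treat it as illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-SNP Fst values (real analyses compute these from
# allele frequencies in the populations being compared).
trait_fst = rng.beta(2, 20, size=100)       # education-associated SNPs
control_fst = rng.beta(2, 20, size=10000)   # matched background SNPs

observed_mean = trait_fst.mean()

# Null distribution: mean Fst of random control sets of the same size.
null_means = np.array([
    rng.choice(control_fst, size=trait_fst.size, replace=False).mean()
    for _ in range(1000)
])

# One-sided p-value: is the trait-SNP mean unusually high?
p_value = (null_means >= observed_mean).mean()
```

Because the null is built from SNPs genotyped in the same populations, the structure biases that plague cross-population polygenic scores largely cancel out of this comparison.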

In his substack post, Piffer attacks my paper for only looking at Fst, which ignores the allelic covariation that can occur in polygenic selection. According to him, I am failing to see the forest for the trees and have stupidly fallen for a false-negative. The problem here should be obvious from the last paragraph; I did use a method that leveraged allelic covariance to test for divergent polygenic selection: The Qx test. As the abstract from Berg and Coop (2014) says clearly, the Qx test is in fact a generalization of the Qst/Fst test that Piffer talks so much about. How Piffer missed an entire section of my paper, including two figures, and clear statements in the abstract of directly cited work is beyond me. As an aside, the value and importance of Qst (a quantitative genetic analog to Fst, measuring genetic differentiation of a quantitative trait) has not been ignored by most geneticists, as Piffer claims.

To prove his point, Piffer does his own hamfisted Qst/Fst analysis. There are notable problems with his work. First, polygenic score variance isn’t truly Qst. Qst is formally defined in terms of the entire additive genetic variance of a trait, estimated using pedigrees or common-garden experiments. A polygenic score captures just a fraction of the additive genetic variance for a trait. Treating polygenic score variance directly as Qst is clumsy and naive. If he wants to claim it’s fine and good, the validity of this approach must be demonstrated through simulation and/or positive controls, as is standard. The Qx test, in contrast, was designed for GWAS data and formally translates the logic of the Qst/Fst test to this setting. More importantly, Qst derived from polygenic scores does not “largely sidestep environmental noise”, since polygenic scores are known to be biased by population structure. The key part of my paper was using sibling-GWAS data to ameliorate this bias and more accurately test for polygenic selection. In fact, in Piffer’s own analysis (table 2), he also found that sibling-based GWAS results failed to find evidence of selection, in both his faulty “Qst”/Fst comparison and the Qx test. Somehow, a direct replication of my key results turned into a refutation of my paper!
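For reference, the textbook definition makes the distinction concrete (this is the standard formulation for a diploid, randomly mating population, not something specific to either analysis):

```latex
Q_{ST} \;=\; \frac{\sigma^{2}_{B}}{\sigma^{2}_{B} \;+\; 2\,\sigma^{2}_{W}}
```

where \(\sigma^{2}_{B}\) is the among-population additive genetic variance and \(\sigma^{2}_{W}\) the within-population additive genetic variance. Both terms refer to the full additive genetic variance; substituting polygenic score variance swaps in only the fraction of that variance the score captures, which is exactly why the substitution needs to be validated before it can carry any weight.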

Part 2: Fst from phenotypic data and a counterfactual exercise

The post continues with a focus on the section of my paper where I calculate an Fst estimate using phenotypic data. He seems not to have understood a single word of what was going on. In this section, the ultimate goal was to estimate what Fst would look like if all among-group IQ differences were genetic and compare that to the molecular Fst estimates from education-associated SNPs. If we see big discrepancies, then this would suggest that large environmental effects underlie the among-group variation. I exploited the previous results showing that genetic variants associated with education are consistent with neutral evolution, which let me use the extensive toolkit of neutral evolutionary models. One such model, developed by the cited Relethford and Blangero (1990), allows an estimation of Fst from phenotypic data (what I referred to as phenotypic Fst and they often call minimum Fst). They formally showed that, under a neutral model, you can estimate the minimum genetic differentiation (Fst) between populations from the within- and among-population phenotypic variance of a neutral trait and the trait’s heritability. I was not confusing Qst for Fst, as Piffer claims; I was using the models described in the papers I cited. It appears Piffer did not read them.

As I described, I can use this approach to show that if it were true that all global IQ differences between Africans and Europeans are genetic, then Fst should equal ~0.6 (Bird, 2021, equation 2). What I actually saw was that Fst based on education-associated SNPs was only 0.11. Since we expect the phenotype-derived Fst to match the molecular Fst, we have far too much variance among groups to be explained by genes. Because the Qx test supported that variation at education-associated SNPs was consistent with neutrality, the assumption of neutrality is likely safe. That leaves substantial environmental effects on the present-day global distribution of IQ scores as the most likely explanation for the excess variance.

I then take it a step further. I use the observed (phenotype-derived) Fst and the expected (molecular) Fst to estimate the expected amount of among-group genetic variance (Bird, 2021, equation 3). The ratio of expected among-group genetic variance (VB_G) to the observed among-group variance (VB_G + VB_E) tells us how much of the among-group variance could be explained by genetics. This is approximately the between-group heritability for the racial IQ gap. The results showed that genetic effects can explain ⪅10% of the observed among-group variance, once more strongly suggesting that extensive environmental effects underlie the among-group IQ variance. As I describe in the paper, there are a lot of interpretive caveats because the model makes many assumptions, although these assumptions would tend to overestimate genetic effects rather than underestimate them.

It is hopefully clear that this is an entirely different procedure from the botched Qst/Fst comparison of which Piffer accuses me. It is exploring what the values would look like if the strongest hereditarian case were true and then comparing that to what the molecular data show. It’s a simple way of showing that the existing data are wildly incompatible with hereditarianism. Charles Roseman and I more fully develop this kind of exercise in our preprint.
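To make the arithmetic concrete, here is a rough sketch of both steps with illustrative inputs. These are not the paper’s actual numbers, and the functions are my simplified rendering of the Relethford–Blangero-style logic (assuming equally sized groups and a purely additive trait), so read it as a demonstration of the procedure rather than a reproduction of Bird (2021):

```python
def min_fst(v_between_pheno, v_within_pheno, h2):
    """Minimum Fst implied if ALL among-group phenotypic variance
    were additive genetic (the hereditarian counterfactual)."""
    v_between_genetic = v_between_pheno        # counterfactual assumption
    v_within_genetic = h2 * v_within_pheno     # within-group genetic variance
    return v_between_genetic / (v_between_genetic + 2 * v_within_genetic)

def implied_between_group_variance(fst, v_within_pheno, h2):
    """Invert the same relation: the among-group genetic variance
    consistent with an observed molecular Fst."""
    return 2 * h2 * v_within_pheno * fst / (1 - fst)

# Illustrative numbers only (IQ-style scale: within-group SD of 15).
v_within = 15.0 ** 2
v_between = 250.0   # hypothetical observed among-group variance
h2 = 0.5            # hypothetical within-group heritability

# Step 1: Fst implied by the strongest hereditarian case.
fst_if_all_genetic = min_fst(v_between, v_within, h2)

# Step 2: how much among-group genetic variance a molecular Fst
# like the 0.11 seen at education SNPs actually permits.
v_b_genetic = implied_between_group_variance(0.11, v_within, h2)
between_group_h2 = v_b_genetic / v_between
```

With these invented inputs the counterfactual Fst comes out several-fold larger than 0.11, and the implied between-group heritability lands near 10%, so even a toy version of the exercise shows the shape of the conclusion: the molecular data leave little room for genetic explanations of the gap.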

Conclusion

For all the pompous sound and fury of Piffer’s substack post, every single point is predicated on misreading or misunderstanding my paper. He either ignores or completely misses the Qx analysis in my paper that addresses the criticism that I ignore allelic covariance. He ignores the problems of population structure that bias his own results, and the fact that his results using less-biased sibling data replicate mine. He misunderstands the evolutionary genetic models I used that employ a phenotype-derived Fst estimate because he didn’t read the cited papers, and he doesn’t understand the point of the counterfactual example I constructed (did he forget to eat breakfast?). If there’s a fatal flaw anywhere in all of this, it is Piffer’s poor reading comprehension. Bird (2021) isn’t perfect; I include my own list of limitations and future directions. But if you’re going to criticize the paper, I think you should at least have a basic understanding of the content.