Week 3 Ancestry and Scientific Racism

This week we discuss population genetics and how the statistical construct that we refer to as ancestry differs from the concepts of racial/cultural identity and the social construct of race.


  • Learn how semi-random patterns of mutation and inheritance, often within DNA regions of little to no functional consequence, give rise to the statistical concept of ancestry.
  • Understand the differences between the concepts of ancestry, identity, and race.
  • Introduce how wrong conclusions arise when ancestry, identity, and race are conflated in science research or communication.

Lecture Notes

These notes are intended to be a good enough introduction to a very complex process (population genetics), which is enough to build an entire course around (at UIUC: ANSC 446 / IB 416). For more details right now, I recommend Dr. Graham Coop’s lecture notes: https://cooplab.github.io/popgen-notes/

Population Genetics & Ancestry

Two processes increase genetic diversity within a population. Mutation introduces novel variants into the population. Recombination re-shuffles the existing patterns of variation (what we call haplotypes). Both of these processes take place in every individual in every generation. Each of us carries dozens of de novo (or new) mutations that we did not inherit from either of our parents. And each of us typically carries chromosomes that were not inherited intact from our parents but were instead newly formed, for example through a cross-over event, from chromosomes that had been separate in our genetic parents.

The fate of new mutations is affected by random drift, selection, and population history. Many think of selection, whether positive (increasing the frequency of the mutation in the population over generations) or negative (decreasing the frequency of the mutation in the population over generations), as the most powerful force acting on new mutations. But selection can only occur if the mutation has a consistent impact on an outcome that changes reproductive success (that is, something to be selected for or against). The vast majority of mutations have no or negligible impact on anything (this must be true, otherwise the new mutations we all carry would quickly make us so different we wouldn’t be able to successfully reproduce at all). As a result, most mutations are neither selected for nor against. Sometimes a mutation with no impact of its own is located physically close on the ladder of DNA to a functionally important variant, and so it may still systematically increase or decrease in frequency along with that nearby functional or causal variant. But, absent this close physical link to a causal variant, nonfunctional mutations may persist, increase, or decrease in frequency over time largely as a function of randomness.

The extent to which mutations that are physically close to one another on the ladder of DNA remain linked, or correlated, across generations is referred to as linkage disequilibrium (LD). Without the process of recombination, this LD would never be broken down and would extend a great distance along chromosomes. If Person 1 had Mutation A and Mutation B appear de novo on the same chromosome, and there was no recombination, every descendant of Person 1 who inherited that chromosome would carry both Mutation A and Mutation B. That is, the mutations would remain perfectly correlated, or in complete LD; if we knew someone was a carrier of Mutation A, we would also know they were a carrier of Mutation B, even without genotyping them at the location of Mutation B. Recombination breaks down the correlation between mutations or variants over many successive generations, so the window of correlation, or LD, around any given mutation narrows over time.

For an illustration of how LD between mutations breaks down over generations, see: https://cjbattey.shinyapps.io/LDsim/
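To put rough numbers on that breakdown, here is a minimal Python sketch. It assumes the standard textbook simplification (not derived in these notes) that, under random mating in a large population with no selection, the LD coefficient D between two loci shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between them. The function name `ld_after` and the starting value of 0.25 are illustrative choices only.

```python
def ld_after(d0, r, generations):
    """Expected LD coefficient D after some generations of random mating.

    Assumes the simplified model D_t = D_0 * (1 - r) ** t, where r is the
    recombination fraction between the two loci (0 = never separated,
    0.5 = effectively unlinked).
    """
    return d0 * (1.0 - r) ** generations


d0 = 0.25  # hypothetical starting LD between Mutation A and Mutation B
for r in (0.0, 0.001, 0.01, 0.1):
    decay = [round(ld_after(d0, r, t), 4) for t in (0, 10, 100, 1000)]
    print(f"r = {r}: D at generations 0, 10, 100, 1000 -> {decay}")
```

With r = 0 (no recombination) the correlation never decays, which is the Person 1 scenario above. The larger r is, which roughly corresponds to mutations sitting farther apart, the faster the correlation disappears.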

It just so happens that mutation and recombination happen at fairly predictable rates, and so we can simulate how far patterns of LD should extend across the genome if only those two processes were at play and they played out fairly neutrally. When we compare that to the observed pattern of LD (see Figure below), we consistently observe that LD extends over longer distances of the genome than would be expected under those assumptions. That suggests that one or more key assumptions likely do not hold in the population we are examining (in this course, humans; but the same pattern is observed across species). Those assumptions are neutral evolution (that is, no directional selection), homogeneous recombination (that is, recombination rates depending only on the distance between mutations), and random mating.

Under certain assumptions (neutral evolution, random mating, homogeneous recombination), we can model exactly how far this correlation should extend. Observed patterns of linkage disequilibrium are more expansive than would be predicted under these assumptions (image from https://doi.org/10.1038/35075590).

In human research, we usually don’t focus as much on heterogeneous recombination or directional selection. Selection can really only be observed across generations, so it isn’t feasible for human researchers studying human participants (for this, we look to non-human animal research on organisms with shorter generations). Heterogeneous recombination is often attributable to physical characteristics of DNA (so generally stuff that’s outside the focus of this course), such as the density of packing and folding structures that affect how likely any given segments of DNA are to get tangled. But it is important; very briefly, really important regions (like the MHC, a gene-rich region on chromosome 6 that basically builds the immune system) are packed really tightly to avoid mutation and recombination, because they work well enough and messing up those core regions with random changes could be really bad, like non-viable-organism bad.

More relevant to how we think about LD in this course is the pattern of non-random mating. Although disassortative mating does occur (selection of reproductive partners based on having different characteristics; most notably in sexually reproducing species: different available sex cells), assortative mating (or reproducing with partners who are similar to ourselves) is a common process that plays out across generations and cultures, for both behavioral and non-behavioral characteristics.

We typically classify assortative mating under three processes. In primary assortative mating, mates choose each other intentionally based on their trait similarity (for example, “I play soccer, I want to raise my children as soccer players, I will only consider partnering with someone else who plays soccer”). Under social homogamy, mates choose each other due to proximity that happens because they have self-selected or have been selected into environmental proximity (“I might consider a non-soccer-playing partner, but I play and talk about soccer so much that I only ever meet other soccer players”). Under convergence, mates become more similar to each other over time (“At first I played soccer and my partner didn’t, but now I’ve taught them to play soccer”). We tend to observe assortative mating across three broad domains: attractiveness (including height), moral attitudes (including religious and political attitudes), and cognitive ability (including educational attainment). For any of these domains (except the VERY SPECIFIC example of height), any one or more of the primary, homogamy, and convergence processes could be influencing the observed outcome of partners being more similar on the phenotype than expected by chance alone (again, except we do know that convergence cannot explain why people tend to end up with similar-height others). We also know that assortative mating does not occur for certain other domains, including personality and mental health (the average correlation between partners on these phenotypes tends to be 0 - partners aren’t likely to be systematically either similar or dissimilar to one another).

But beyond the individual differences that we tend to focus on when examining whether partners are more or less similar to one another, there is one factor that I know is shared by the overwhelming majority of all human mates (or, even beyond that, all mating pairs across all sexually reproducing species): a similarity in their location in place and time. Until the relatively recent developments of long-distance travel and assisted reproductive technology (such as sperm and egg donation), our genetic ancestors were limited in mating options by one criterion first and foremost - were they in the same place, at the same time?

(After that, of course, the options could be narrowed down further; but same time + same place was the first criterion that applied to everyone.)

Throughout the overwhelming majority of the history of our species, our ability to travel has been limited and slow. So most humans were born and died in the same general geographic location as their ancestors going back hundreds or thousands of years. With limited movement across many generations, who do you end up mating with? Well, you don’t get to choose from the entirety of the human population; you’re limited to those around you, whose ancestors have also been in the same place for a long time, over many generations mating only with people in the same place. So there’s a pretty good chance that your reproductive options were limited to people who were your relatively close cousins. Not necessarily your first cousins (although that is the route Charles Darwin took) but, commonly, people within a region or village would all end up being, genetically speaking, the equivalent of 3rd or 4th cousins. So, you’re not mating with your 50th cousins, who are the humans on the other side of the globe, but with your relatively close genetic cousins. And the more closely related two people are, the more likely it is that they both inherited some of the same mutations that arose de novo in a shared ancestor a few generations back. And so the pattern of LD that we observe is stronger than expected - because we aren’t recombining all available mutations across the entire population, we’re limited to what exists in the surrounding area. And that process of essentially assortative mating for time and place leads to even neutral, nonfunctional mutations being shared more than expected by chance by people who are from the same genetic and, usually throughout history, social family tree. This non-random genetic similarity is the single largest signature in our genome when we look at the <1% of DNA that differs between people, and it’s what we label “ancestry.”

Principal Components Analysis

The way we estimate ancestry in human genomics is often using a statistical procedure (not specific to genetics) called principal components analysis (PCA). PCA is applied to genotype data to describe continuous axes, or principal components (PCs), of genetic variation. The variables that emerge attempt to simplify a whole dataset of hundreds or thousands of participants, genotyped on thousands or millions of loci, into just a few variables (usually 20 - 100 components) that summarize how similar, or geometrically close, each participant is to each other participant. Each component “explains” as much of the genetic variation as possible, after accounting for the preceding components (so what we label PC1 summarizes the largest amount of genetic variation among the participants within the analyzed dataset, PC2 summarizes the next largest amount, and so on through however many PCs we’ve elected to estimate). PCs are used as covariates in analyses to attempt to statistically control for the non-random differences in mutations and LD between people. We take these PCs to have something to do with historical patterns of reproduction for two reasons. First, because theory suggests that these patterns (arising from the random mutations, plus nonrandom patterns of mating primarily for same-place-same-time) will be the largest effects observed in the human genome because selection/evolution is SLOW (requiring tens or hundreds of generations) compared to how quickly mutations are generated and passed down (every generation). Second, because when we’ve estimated these PCs for participants who are from locations where their ancestors have lived for hundreds or thousands of years, the PCs broadly map onto geographic variation. 
Below is one of the most well-known Figures in population genetics, from a paper (https://www.nature.com/articles/nature07331) demonstrating that the first two PCs estimated from the genomes of European participants (each represented by a dot, color-coded based on the place the person lived, restricted to folks who identified all four grandparents as being from that same place) essentially recreate the East/West and North/South gradients when shown on top of a map of Europe.

This pattern, where genetic mutations track geographic locations, has been shown over and over across the world, wherever there is a substantial population that has been relatively stationary for hundreds or thousands of years. Genomic ancestry patterns don’t just track geographic limitations on patterns of mating, however. If, for example, there exists a religion that for hundreds or thousands of years encourages its adherents to marry other members of the same religion, you end up with similar patterns of distinguishing mutations popping up and being passed down within that lineage; this is how we can estimate genetic signatures of certain religious or cultural groups.
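To make the PCA idea concrete, here is a small, self-contained Python sketch on simulated data. Everything in it is an assumption for illustration: the two groups, the allele frequencies, the sample sizes, and the function names `simulate_genotypes` and `first_pc` are all made up, and real analyses use dedicated genetics software on far larger datasets. To avoid any dependencies, PC1 is found by power iteration on the person-by-person similarity matrix rather than by a library PCA routine.

```python
import random


def simulate_genotypes(n_per_group, n_snps, drift, rng):
    """Two hypothetical groups whose allele frequencies differ slightly at each SNP."""
    base = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    rows, labels = [], []
    for group in (0, 1):
        # Each group's frequencies wander a little from the shared starting point.
        freqs = [min(0.95, max(0.05, p + rng.gauss(0.0, drift))) for p in base]
        for _ in range(n_per_group):
            # Genotype = number of copies (0, 1, or 2) of the alternate allele.
            rows.append([int(rng.random() < f) + int(rng.random() < f) for f in freqs])
            labels.append(group)
    return rows, labels


def first_pc(genotypes, n_iter=200):
    """Unit-length PC1 scores (one per person), via power iteration."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n matrix of genetic similarity between every pair of people.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-constant starting vector
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


rng = random.Random(3)
genotypes, labels = simulate_genotypes(n_per_group=20, n_snps=500, drift=0.15, rng=rng)
pc1 = first_pc(genotypes)
for group in (0, 1):
    scores = [s for s, g in zip(pc1, labels) if g == group]
    print(f"group {group}: mean PC1 score = {sum(scores) / len(scores):+.3f}")
```

Note that the group labels are never given to `first_pc`. The separation on PC1 comes only from many small, individually meaningless frequency differences adding up, which is the same logic by which PCs estimated in real data end up tracking geography or long-standing patterns of within-group marriage.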

One example of genomic transmission running in parallel to cultural transmission resulting in culturally identified PCs is among Ashkenazi Jews. Because of a long-running cultural emphasis on within-religion marriage, if you are of Ashkenazi Jewish heritage (which I am), you and I are likely the genetic equivalent of third or fourth cousins, even if our social genealogies would not identify a specific common ancestor that recently (within the past 4 or 5 generations). It does NOT mean, however, that the ancestry-informative markers that the PCA process uses to cluster us together as genetically similar have any impact on our traits, and it certainly doesn’t mean that there are any genes that make us Jewish. As an example, Judaism is traditionally considered to be transmitted through the mother’s line, but I am only “genetically Jewish” on my father’s side; following our culture’s traditional rules, I am considered Jewish because my Scandinavian-ancestry mother converted to Judaism before I was born; personally, I identify as Jewish because I was raised in the traditions, had a bat mitzvah (I had a moon bounce at my party!), and care deeply about bagels and deli (the Jewish side of my family is from Hoboken, New Jersey - so place and culture are both confounded with ancestry for me). Genomic ancestry does not tell us about identity or experience - it is simply a statistical signal that arises from who your genetic ancestors were (and even then, not all of them; the way probability works - you only get 50% of DNA from each genetic parent - means that many of your ancestors, especially several generations back, may not be directly represented in your DNA at all).

Unfortunately for our ability to figure out “what genes do”, genes have, for most people throughout most of human history, been passed down within families in a very similar pattern to how culture, or environmental exposures, or resource availability may track over time. And these spurious associations between background ancestry and environmental outcomes can arise even over a short period of time. For example, suppose it were suddenly decreed that all Icelanders must wear tophats. The policy takes effect on Jan 1. On Jan 2, you collect data from all over the world on tophat-wearing, along with DNA. You would find that the mutations that happen to appear among Icelanders more than among other ancestries are strongly correlated with tophat-wearing. But if you ‘controlled for’ (Icelandic and other) ancestry by including principal components as covariates in your statistical analysis, those correlations between genetic variants and tophat-wearing would disappear. The problem is that, because these spurious correlations between ancestry and environment can occur quickly, to fully account for them we must be certain that we are accounting for even relatively recent (that is, single generation) passing of novel mutations. Practically speaking, we cannot collect enough data (either in quantity or detail) from among putatively unrelated folks to estimate ancestry at this necessary level of detail. For this, we must use what is currently considered the gold standard for statistical control for background ancestry, which is to examine effects only among participants from the same family (that is, siblings).
Because it is much more difficult to collect family data than to just include all available participants, most studies seeking to identify genetic associations with any phenotype are still subject to the potential confounding effect of ancestry - it is incredibly difficult to rule out parallel environmental effects as an explanation (that is, this is a form of gene-environment correlation).
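The tophat thought experiment can be sketched as a toy simulation in Python. All of the numbers here are hypothetical (the share of Icelanders in the sample, the allele frequencies, the hat-wearing rates), and a simple 0/1 ancestry indicator stands in for the principal components that a real analysis would use as covariates. "Controlling for" ancestry is done by removing the part of each variable that ancestry explains and then correlating what is left over.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualize(values, covariate):
    """Remove the part of `values` explained by a straight-line fit on `covariate`."""
    n = len(values)
    mv, mc = sum(values) / n, sum(covariate) / n
    slope = (sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
             / sum((c - mc) ** 2 for c in covariate))
    return [v - mv - slope * (c - mc) for v, c in zip(values, covariate)]


rng = random.Random(1)
ancestry, genotype, tophat = [], [], []
for _ in range(5000):
    icelandic = 1 if rng.random() < 0.2 else 0  # hypothetical sample make-up
    freq = 0.6 if icelandic else 0.2            # variant is merely more common, not causal
    ancestry.append(icelandic)
    genotype.append(int(rng.random() < freq) + int(rng.random() < freq))
    # The decree: hat-wearing depends on group membership only, never on genotype.
    tophat.append(int(rng.random() < (0.95 if icelandic else 0.02)))

naive = pearson(genotype, tophat)
adjusted = pearson(residualize(genotype, ancestry), residualize(tophat, ancestry))
print(f"SNP-tophat correlation, no ancestry control:   {naive:.3f}")
print(f"SNP-tophat correlation, ancestry controlled:   {adjusted:.3f}")
```

In this toy version the ancestry variable is known perfectly, so the adjustment works cleanly. The difficulty described above is that, in real data, ancestry has to be estimated, and fine-scale or very recent structure can slip past the PCs.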

The Figures below show a real-world example of what happens when you fail to account for genomic ancestry in a situation where the outcome is confounded (or correlated for artificial, not causal, reasons) with genomic ancestry. The Figure below this paragraph illustrates an analysis looking for genetic variants associated with Multiple Sclerosis (MS) (https://doi.org/10.1038/nature10251). However, the available sample was not a random or representative sample (it never is). Panel A shows the number of cases (folks with MS, in red text) and controls (folks without MS, in black text) included in the study broken down by each country of origin. As you can see, the numbers aren’t similar proportions from each country: some contributed only cases, while others contributed substantially more controls than cases. Panel B shows the plot of the first two principal components for all the participants (1) split by case-control status and (2) color-coded by country of origin. As you can see, (1) the countries differ in terms of their average PCs (participants labeled with the same color tend to cluster together), and (2) the distribution of colors (representing countries) differs between the case and control groups. This sampling scheme has created a confound - or an unintended nuisance correlation - between ancestry and MS case-control status.

The Figure below this paragraph shows one of the common ways (called a ‘Q-Q plot’) that we summarize the results of the millions of tests that are done in a genome-wide association study (GWAS, pronounced “gee-WAHs”), when we check each available genotyped single nucleotide polymorphism (SNP, pronounced “snip”) for correlation with a phenotype of interest. When so many tests are done, there will be some very low p-values (how we tend to evaluate statistical significance) by chance alone. To take that into account, we can compare our observed distribution of p-values (the values used to array points, each of which represents the result of one tested SNP, along the vertical y-axis) to the distribution of p-values that would be expected by chance alone (those are the values used to array points along the horizontal x-axis). The dotted black line running along the diagonal is where we would expect the points to fall if there were no effects (that is, if the observed distribution and the expected distribution were the same). Points falling above the line indicate more SNPs with lower p-values than would be expected by chance alone. The plot to the left, labeled ‘No correction’, shows the result of the GWAS of MS when the PCs are NOT included as covariates. In that plot, it looks like ALL the SNPs have lower p-values than expected by chance, and so we might conclude that ALL the SNPs are associated with MS. This divergence is summarized by the lambda coefficient: here it’s 2.5, which is HUGE. Once the top 100 PCs are included as covariates (the results shown in the right-side plot), however, it drops to 1.221 (values closer to 1 are considered less likely to reflect substantial stratification).
There are still a lot of SNPs more strongly associated with MS than would be expected by chance alone (MS is heritable and polygenic - many genes, not just one), but there are also SNPs (densely plotted in that bottom left corner; it’s actually MOST of the SNPs) that are unrelated to MS status.
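For a sense of where a lambda value comes from, here is a hedged Python sketch. It assumes lambda is the usual genomic inflation factor: convert each p-value to a 1-degree-of-freedom chi-square statistic, take the median, and divide by the median expected under no association (about 0.455). The p-values below are simulated, not taken from the MS study, and the function name `genomic_inflation` is just an illustrative choice.

```python
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Median observed chi-square (1 df) divided by the median expected by chance."""
    # A two-sided p-value p corresponds to a z-score with tail area p / 2.
    chi2 = [STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


rng = random.Random(0)

# Scenario 1: no SNP is associated, so p-values are uniform between 0 and 1.
null_p = [1.0 - rng.random() for _ in range(20000)]

# Scenario 2: every test statistic is inflated (here, z-scores with SD 1.5
# instead of 1), a crude stand-in for uncorrected ancestry confounding.
inflated_p = [2.0 * STD_NORMAL.cdf(-abs(rng.gauss(0.0, 1.5))) for _ in range(20000)]

print(f"lambda with no inflation:    {genomic_inflation(null_p):.2f}")
print(f"lambda with inflated tests:  {genomic_inflation(inflated_p):.2f}")
```

Because lambda is based on the median, it reflects what is happening across the bulk of SNPs rather than the handful of strongest hits. That is why a large value suggests a problem affecting the whole analysis, such as stratification, rather than a few true associations.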

Ancestry and Scientific Racism

The complications that arise from the frequent co-occurrence of genetic transmission and exposure to experiences are subtle yet pervasive. Realistically, most people are not super excited about understanding the technical limitations and interpretations of this genetic ancestry thing. Genetic ancestry is easy to misunderstand or to intentionally misrepresent, and the consequences can be dire. Folks may wrongly conclude that there are substantial genetic differences between geographic, cultural, or ethnic groups; that groupings we perceive or enact are “real” or distinct (we never observe categorical grouping among participants; genetic distribution is a continuous swath that stretches across all of humanity at as fine-grained a level of measurement as we can undertake); or that associations between genes and behavior (or other traits) necessarily mean that genes are the cause of the outcome. We often assume that genetic associations must be causal (even though by association I mean correlation and correlations are not sufficient to prove causation) because genes came first. But, for most of us, our genes didn’t come first, or come together randomly - they came from our social parents, who provide both genes and environments, and who did not end up mixing their genetic material randomly. And for that reason, observing that a genetic variant is associated with a phenotype in a general population sample is only the first step to understanding how or why that association has occurred - and the explanation is not necessarily that the gene is the cause.

Prep Work

Below is a listing of materials to review early in the week. Although these activities do not earn points, they will prepare you to undertake the Participation Activities and Course Project assignments.

  • Fill out the Team Project Topic Preferences survey
    • I use these preferences to assign folks to groups for Team Projects in Weeks 5, 6, 7, and 12 (plus a peer review activity in Week 9). I can usually get everyone into one of their top-3 choices for each week.
  • Watch Crash Course Biology: Population Genetics (11:03) https://youtu.be/WhFKPaRnTdQ
  • Read the American Society of Human Genetics’ (2020) statement on Advancing Diverse Participation in Research with Special Consideration for Vulnerable Populations. The American Journal of Human Genetics 107, 379–380. https://doi.org/10.1016/j.ajhg.2020.08.011

Participation Activities

You can earn up to 4 points for participation activities each week by selecting and completing tasks from the “menu” listed below. You may complete more than four tasks if you’d like, but the maximum number of points awarded will be 4 per week. Each activity is worth 1 point.

  • Read & Discuss via Perusall: McLean 2020 Social constructions, historical grounds. PDF download: https://t.co/3Sq44TVbBc?amp=1

    • This review article discusses the conceptualization and consequences of race as a biological concept.
  • Read & Discuss via Perusall: Martin et al. 2019 Clinical use of current polygenic risk scores may exacerbate health disparities. https://doi.org/10.1038/s41588-019-0379-x

    • This review article illustrates the current lack of geographic and ancestry diversity among participants in modern genetic research and the consequences of that lack of diversity.
  • Read & Discuss via Perusall: Tsosie et al. 2019 Overvaluing individual consent ignores risks to tribal participants. https://doi.org/10.1038/s41576-019-0161-z

  • Find & Share a popular source from the past year about BIAS in human genetic research to the News, Memes, and Everything In Between Discussion Forum. Use the following structure for your post:

    • Post subject = Title of the popular media piece (Date it was published/posted)
    • Body of the post:
      • Link to the popular media piece.
      • A brief (no more than 1-2 sentences) description of the piece, specifically in terms of how it addresses bias in participation, interpretation, and/or application of human genetic research.
      • A brief (no more than 1-2 sentences) description of how the piece relates to AT LEAST ONE of the readings from Week 3 (including the ASHG 2020 statement under the Prep Work section and the McLean 2020, Martin et al. 2019, and Tsosie et al. 2019 papers under the Participation Activities section).
    • Remember: You can search general news feeds by topic, e.g. https://news.google.com/search?q=behavior+genetics
    • Caveat: A point will only be awarded to the first person to post any given popular media piece.
  • Computation Practical: Visualizing Ancestry

  • Class Chat on Thursday, 11:00 am - 12:20 pm (CT)