Week 4 Finding and Reading Behavior Genetic Papers

Because behavior genetics is a field defined by a common set of methods, regardless of topic there will be some conventions of data reporting that you will encounter over and over again in empirical (data analysis) papers.

Objectives

  • Understand the logic and common applications of twin studies and GWAS.
  • Become familiar with standard formats of tables, graphs, and statistics commonly used in twin studies and GWAS.

Lecture Notes

Behavior genetics as a field is defined by the use of a common set of methods, usually based on the availability of genetically informative data: samples organized into groups of known relatives, samples with genotype data, or both. Because of this strong overlap in methods, once you understand the general properties of a method (what it tests, how results are reported, and what you can or can’t learn from it), you can begin to dive into the literature on any phenotype.

This week we will focus on what are probably the two most common current methods:

  • The Classical Twin Study, where we estimate the similarity (correlation) between identical twins, and the correlation between fraternal twins, on some phenotype of interest and draw conclusions on the basis of how those two correlations compare; and
  • The Genome-Wide Association Study, where we test, one at a time, the correlation between each of millions of individual genetic variants and some phenotype of interest.

Both of these are rather broad analyses, involving a lot of simplification and resting on many assumptions. We tend to use both as a “first step” in understanding the genetics of a phenotype, and both can serve as a basis of or launching point for more niche, complex analyses that we’ll begin covering in-depth in Week 6.

Finding Behavior Genetic Papers

I personally prefer Google Scholar for searching for scholarly literature. Because behavior genetics is defined by a common set of methods, if you’re looking for material related to a particular phenotype, some combination of that phenotype name and a method or two will usually turn up the bulk of the relevant literature - for example, try starting with [phenotype] heritability and [phenotype] genetics, or specific concept/method names, like [phenotype] GWAS or [phenotype] “twin study”. As you find papers that seem relevant, keep track of synonyms that appear for your phenotype, method, or concept - many topics go by a handful of different terms depending on the particular research lab or subfield that they originate from. When you find a paper you like or that is particularly helpful, try looking at the papers that cite it, the papers that it cites, or for other research by the same author(s). I also quite like the CoCites tool by Dr. Cecile Janssens, a Chrome extension that identifies papers that tend to be cited alongside a given paper that you’ve found in either Google Scholar or PubMed.

Human behavior genetics is a field that sometimes moves quite rapidly in terms of technological and statistical developments, so papers that are more recent or have larger sample sizes/numbers of participants are generally assumed to be of higher quality or more reliable. But that doesn’t mean that smaller or older studies are not of value. It is important to evaluate any paper holistically, but especially in terms of its methods. Because genetically informative data are expensive (both in terms of the actual cost, as well as time investment), secondary data analysis (or re-analysis) of data is incredibly common. It is highly likely that the data used for a particular paper of interest to you were not designed or collected specifically for that purpose. No data are ever perfect - we always want more or better data - and no single study is ever definitive. When reading a paper, ask yourself: Did the authors do the best they could with the data and methods available, and are the conclusions they draw reasonable given the strengths and limitations of their particular study? Pay particular attention to how the phenotype was operationalized (that is, defined and measured): Does it seem reasonable (face validity)? Are there any concerns that they might not be measuring well what they’re trying to understand?

General tips for reading

Track common themes. As you’re reading the paper, consider which commonly addressed themes apply or are discussed (see Lecture Notes from Week 1). Take note of which themes the results support, and which if any they contradict. Recognizing these themes across different papers will help synthesize the information.

Note a favorite table or graph. Often I find a figure or table that does a better job of presenting a main/interesting point than all of the text of the paper. When you discover a particularly useful table or figure, make note of it so that you can start there when referring back to the paper.

Look for replications. One common “gold” standard for evidence in science is replication. Many GWAS include internal replication samples (that is, a smaller sample set aside specifically for the purpose of replicating any newly observed effects). You can also use Google Scholar to look for replications (either successful or null) by clicking on the “cited by” link below the paper’s search listing (any replication attempt will cite the original study). Keep in mind that science is slow, so new papers (especially < 2 years old) may not yet have any available replication attempts reported, successful or otherwise. For any topic with more than a couple of studies available, try searching for a meta-analysis (a systematic empirical summary of available studies). Some papers may themselves be replications of prior findings, in which case they would usually cite those prior studies in the Introduction.

Look for media coverage and reactions. Not all papers receive media or public attention, but where it exists it can reveal strengths and weaknesses both in terms of the study methods as well as the researchers’ attempt at communicating the conclusions. Especially high profile or controversial studies often receive a lot of discussion in public forums like Twitter. To find reactions, you can try searching for the article title or DOI (digital object identifier, now often included on the title page or at the very end of an article) in general or news search engines, Twitter, Reddit, etc. Some journals and pre-print servers (for example, bioRxiv) aggregate tweets about a paper directly under it, or may show you an altmetric badge: a colorful circle with a number in it. Altmetric is a service that aggregates general references to papers across the internet and is quite thorough; click on the badge to view the various non-scholarly sources that have made reference to the paper (the higher the number, the more sources have discussed it; the more colorful, the more it’s appeared across different platforms). Some journals embed it, but for those that don’t, you can access it from any paper posted online using a bookmark.

How to Read a Classical Twin Study

Classical twin studies follow from the logic that regards twins as a “natural experiment”, where observations of a known experience that differs between individuals or over time and is not under the control of the researchers can be used to infer effects of “exposure” to the experience. In the specific case of twin studies, we are drawing conclusions about the effects of sharing 100% versus 50% of your genetic material with a person born at the same time and raised in the same family.

Participants

Monozygotic (MZ, or identical) twins result from the fertilization of a single egg by a single sperm, which then splits creating two zygotes that share 100% of their genetic material. They are naturally occurring genetic clones. Dizygotic (DZ, or fraternal) twins result from the independent fertilization of two eggs by two sperm at the same time. They share on average 50% of their genetic material and are no more or less genetically similar than any pair of full siblings. Conveniently for scientific inference, both types of twins happen to be born at the same time and the overwhelming majority are raised together, so we can ask ourselves: How similar are MZ twins to one another? And are MZ twins more similar to one another than DZ twins are? Because the answer to these questions rests on comparing the correlation between MZ twins (rMZ) to the correlation between DZ twins (rDZ), there need to be enough twin pairs included in the study to estimate a correlation well (more is better; 20 of each type is too few; 100 of each type is probably ok; 200+ of each type and I feel a lot better).
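To get a feel for why 20 pairs is too few, here is a minimal sketch of how uncertain a twin correlation is at different sample sizes. It uses the standard Fisher z approximation for a confidence interval around a correlation; the function name and the example correlation of 0.50 are made up for illustration and do not come from any particular study.

```python
import math

def correlation_ci(r, n_pairs, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation (Fisher z method)."""
    z = math.atanh(r)                      # Fisher z-transform of r
    se = 1.0 / math.sqrt(n_pairs - 3)      # standard error on the z scale
    lower, upper = z - z_crit * se, z + z_crit * se
    return math.tanh(lower), math.tanh(upper)   # back-transform to the r scale

# Hypothetical twin correlation of 0.50 at the sample sizes mentioned above
for n_pairs in (20, 100, 200):
    lower, upper = correlation_ci(0.50, n_pairs)
    print(f"{n_pairs:3d} pairs: 95% CI roughly {lower:.2f} to {upper:.2f}")
```

The interval shrinks as the number of pairs grows, which is the whole reason larger twin samples make the rMZ versus rDZ comparison more trustworthy.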

Effect sizes

Most commonly, you will be looking for standardized (that is, sum-to-1.0 or sum-to-100%) estimates of additive genetic (A, or a^2), shared or common environmental (C, or c^2), dominant genetic (D, or d^2), and non-shared or unique environmental (E, or e^2) effects. For statistical reasons described below, we are limited to estimating three of those four potential sources of influence on individual differences, or “variance components,” so you will commonly find papers referring to fitting an ACE model or an ADE model. These are the factors they are referring to.

How similar are MZ twins? This question allows us to address the issue of genetic determinism, and to estimate a variance component that behavior geneticists label non-shared or unique environmental influence. We usually describe twin similarity in terms of the correlation (r) between Twin 1’s phenotype and Twin 2’s phenotype. Correlations range in strength from 0 to 1 and can be either positive or negative. A correlation that is closer to 1 indicates that you would be very accurate at guessing the result of Measure 2 if you knew Measure 1 (or vice versa). A correlation closer to 0 indicates that you wouldn’t do better than random chance at trying to guess Measure 2 from Measure 1 (or vice versa). A positive correlation indicates that as the scores on one measure go up, the scores on the other measure are expected to go up; a negative correlation means that as scores on one measure go up, scores on the other go down. If the only influence on an outcome were genetics, and MZ twins share 100% of the same DNA, then we would expect the MZ twin correlation (rMZ) to be exactly equal to 1. Any rMZ lower than 1 suggests that there are non-genetic influences on the twins’ phenotypes that make them different. They could potentially be anything: different experiences, similar experiences responded to differently by each twin, or just pure measurement error (when we measure something poorly, it cannot correlate highly with other measures). We refer to these difference-making experiential factors as non-shared or unique environmental influences (E), and we can estimate them from just rMZ using a Falconer equation (named for the guy who derived it):

(eq. 4.1) E = 1 - rMZ

We might have a problem if rMZ ever went negative (as in, if one twin becoming more something led the other twin to become less of that thing), but practically speaking that doesn’t happen for any of the measures that we use. It’s important to note that to calculate rMZ and estimate E, all we need to measure is the phenotype. We haven’t measured anything specific about E, so we can’t really say for sure what ‘E’ is - we just know that it’s stuff that is unshared or unique between MZ twins, even those raised in the same household.

Are MZ twins more similar than DZ twins? Knowing whether and how much MZ twins are more similar than DZ twins allows us to estimate two other influences on individual differences (or variance components). One is additive genetic influences (A), a second is shared or common environmental influences (C), and a third is dominant genetic influences (D). There are no typos in the preceding sentences; when we only know the MZ twin correlation (rMZ) and the DZ twin correlation (rDZ), we’re going to need to choose whether we estimate C or D, but we can use the twin correlations to guide the choice.

If rMZ is greater than rDZ, that suggests that something is systematically making MZ twins more similar to one another on the phenotype than DZ twins are. But what do MZ twins share more of than DZ twins? We assume that it is genetics. Falconer derived a handy estimate of additive genetic influences from this assumption: because DZ twins are half as genetically similar as MZ twins (but still more similar than any two random individuals in the population), additive effects can be estimated as twice the difference between the two correlations:

(eq. 4.2) A = 2 (rMZ - rDZ)

Note this would cause a problem (that is, a negative A estimate) if rDZ were larger than rMZ but, like rMZ being negative, it just doesn’t happen in practice.

Now, we need to decide whether to estimate shared or common environmental effects (C) or dominant genetic effects (D), and because of the limited information we have, we can’t estimate both at the same time. A key assumption here that has gone unstated so far is that the variance for the phenotype, whatever it is, is equal to 1 in both Twin 1 and Twin 2 (correlations are a standardized metric, meaning the measures have been adjusted or standardized to a variance of 1). So we actually have 4 pieces of information: the variance of the phenotype in Twin 1 and Twin 2, the MZ twin correlation, and the DZ twin correlation. In model fitting (even algebraically) you are limited to estimating: the number of pieces of information - 1. The short version is: the math breaks.

So, without more data (such as correlations among different kinds of family relationships, which absolutely do exist for some datasets but are an added time and money cost and aren’t as widely available as the basic just-the-twins recruitment approach), we cannot estimate A, C, D, and E simultaneously. Knowing also that A and E are always non-zero (because rMZ > rDZ and rMZ < 1), we are left to choose between C and D, and to do this it helps to define them.

Shared or common environmental influences (C) are (statistically speaking) things that make individuals who share an environment (usually defined as being raised together) more similar to one another, regardless of their genetic similarity. These sorts of influences would be expected to make twins more similar to one another across the board: C makes rMZ greater than 0, but it makes rDZ greater than 0, too. As a result, we would expect MZ twins and DZ twins to be closer in their degree of similarity than they would be if that similarity came down to differences in genetic similarity alone; that is, rMZ < 2*rDZ (MZ twins are less than twice as similar as DZ twins). C influences can be estimated using the Falconer equation:

(eq. 4.3) C = 2 rDZ - rMZ
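The three Falconer equations (eq. 4.1 to 4.3) are simple enough to do in your head, but a small sketch makes it easy to check your arithmetic. The function name and the example correlations below are hypothetical, chosen only so the numbers come out cleanly.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer estimates of standardized A, C, and E from twin correlations."""
    a = 2 * (r_mz - r_dz)   # eq. 4.2: additive genetic
    c = 2 * r_dz - r_mz     # eq. 4.3: shared environment
    e = 1 - r_mz            # eq. 4.1: non-shared environment
    return a, c, e

# Hypothetical correlations, for illustration only
a, c, e = falconer_ace(r_mz=0.60, r_dz=0.40)
print(f"A = {a:.2f}, C = {c:.2f}, E = {e:.2f}, sum = {a + c + e:.2f}")
# prints: A = 0.40, C = 0.20, E = 0.40, sum = 1.00
```

Note that the three estimates always sum to 1 by construction, which is what “standardized” means here.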

Dominant genetic influences (D) are defined as influences that make MZ twins more than twice as similar as DZ twins (so anything that makes them more similar and/or DZ twins less similar than the simple twice-as-much genotype comparison between the two twin types). This could refer to Mendelian-like dominance patterns but, statistically speaking, could arise from ANY non-linear relationship between genetic similarity and phenotype outcome, including gene-environment correlation and gene-environment interaction. D effects are implied when rMZ > 2rDZ and are relatively rare across all phenotypes with one notable exception: personality. When we examine meta-analyses of twin studies of personality, we reliably find that the broad-sense heritability, or what is often labeled H or h^2 (encompassing both A and D components), is approximately 40%, with no-to-little C, and the rest due to E. But that overall heritability of 40% is about half A, half D (typical twin correlations are around rMZ = 0.45, rDZ = 0.15).

Common tables and figures

When we actually estimate ACE or ADE variance components in a twin study, although it is quite typical to report the raw twin correlations (because BG folk love to estimate Falconer equations in our heads - it’s a fun party trick?), we don’t usually rely on these estimates for the primary conclusions, and they become increasingly unhelpful as more complicated models are attempted, such as examining the relationship between two phenotypes, or the effect of one on another, or change over time.

Path diagrams. Structural equation modeling (SEM) or path analysis allows us to derive predictions for the variances and covariances among variables (whether observed or inferred) under a specified model (that is, we have to define the possible relationships that may exist and should be estimated). To do that, we present relationships between variables using diagrams. The relationships can also be represented as structural equations or covariance matrices, but most people would rather have a picture. The first-ever path diagram (below, by Sewall Wright) was all about guinea pig inheritance.

Now we sadly use circles, squares, and triangles in place of guinea pigs, but the interpretation is largely the same. Arrows indicate a directional influence from one thing (at the base of the arrow) to another (at the point). A double-headed arrow indicates a correlation between two things, without implying a particular direction. Concepts that are represented by squares are manifest variables (things that were actually observed or measured; in a twin study, these would be the phenotype measures of Twin 1 and Twin 2). Circles indicate latent variables (things that are not actually observed or measured, but rather are inferred to exist based on the statistical properties of the observed data; in a twin study, A, C, D, and E are latent variables). The number placed along a single- or double-headed arrow indicates the strength of the relationship that exists between the two variables. There are some really neat characteristics of path diagrams that allow you to infer the relationship between disparate manifest and/or latent variables by multiplying the parameters along connecting paths. Most importantly, it’s a flexible, universal notation methodology that allows us to change, add, and remove concepts and relationships and still be able to convey our statistical model to a wide audience. So, you’re going to FREQUENTLY encounter path diagrams in the twin study literature.

There is exactly one difference between the pictures below, which are the path diagrams describing a standard ACE twin model. It is the entire basis of all of our estimates from classical twin studies.

Note: The difference between the setup for an ACE model versus the setup for an ADE model is whether the correlation between the Twin 1 and Twin 2 “middle” component is 1.0 in both the MZ and DZ models (estimating C, which is shared 100% between twins regardless of zygosity), or 1.0 in the MZ model and 0.25 in the DZ model (which would estimate D, which is shared more than twice as much between MZs than DZs).
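To make the sharing coefficients concrete, here is a rough sketch of what the path diagrams imply for the twin correlations: each variance component is multiplied by how much of it a twin pair shares (A: 1.0 for MZ and 0.5 for DZ; C: 1.0 for both; D: 1.0 for MZ and 0.25 for DZ; E: never shared). The function names and the example values are hypothetical. The second function simply solves the two ADE expectations for A and D, in the same spirit as the Falconer equations for ACE.

```python
def implied_twin_correlations(a, e, c=0.0, d=0.0):
    """Expected rMZ and rDZ from standardized variance components."""
    if abs(a + c + d + e - 1.0) > 1e-9:
        raise ValueError("standardized components should sum to 1")
    r_mz = 1.0 * a + 1.0 * c + 1.0 * d     # MZ pairs share all of A, C, and D
    r_dz = 0.5 * a + 1.0 * c + 0.25 * d    # DZ pairs share half of A, a quarter of D
    return r_mz, r_dz

def falconer_ade(r_mz, r_dz):
    """Solve rMZ = A + D and rDZ = 0.5*A + 0.25*D for A and D (C fixed to 0)."""
    a = 4 * r_dz - r_mz
    d = 2 * r_mz - 4 * r_dz
    e = 1 - r_mz
    return a, d, e

# Hypothetical ADE example: A = 0.20, D = 0.20, E = 0.60
r_mz, r_dz = implied_twin_correlations(a=0.20, d=0.20, e=0.60)
print(f"rMZ = {r_mz:.2f}, rDZ = {r_dz:.2f}")            # rMZ is more than 2 x rDZ
a, d, e = falconer_ade(r_mz, r_dz)
print(f"recovered A = {a:.2f}, D = {d:.2f}, E = {e:.2f}")
```

Running the example forward and then backward is a quick check that the ADE algebra and the path-diagram coefficients agree with one another.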

Model fit statistics. When these and similar models are estimated, we use a method called Maximum Likelihood Estimation, where a program estimates the most likely values of the population or model parameters given the observed data. Essentially, it runs through a guess-and-check process: pick some parameter values, see how well they reproduce the observed data, adjust, check again, until further adjustments don’t get any closer to matching the actual data. This is (thankfully) an automated process, so typically 10,000 iterations (or guess-and-check steps) get you a model that is as close as you’ll get to observed reality, given the model that you’ve asked the computer to estimate.
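The guess-and-check idea can be sketched in a few lines. This is a toy illustration, not how real twin-modeling software works: it assumes phenotypes are standardized to variance 1, summarizes the data by just the two twin correlations and the number of pairs, and uses a very simple “nudge one parameter and keep it if the fit improves” search. All function names and the example numbers are hypothetical.

```python
import math

def group_loglik(k, r_obs, n_pairs):
    """Bivariate-normal log-likelihood (up to a constant) for n_pairs twin
    pairs with observed correlation r_obs, given a model-implied correlation k.
    Assumes both twins' scores are standardized to variance 1."""
    det = 1.0 - k * k
    return -0.5 * n_pairs * (math.log(det) + (2.0 - 2.0 * k * r_obs) / det)

def ace_loglik(a, c, r_mz, r_dz, n_mz, n_dz):
    """Total log-likelihood of an ACE model; E is implied as 1 - a - c."""
    return (group_loglik(a + c, r_mz, n_mz)            # MZ pairs share all of A and C
            + group_loglik(0.5 * a + c, r_dz, n_dz))   # DZ pairs share half of A, all of C

def fit_ace(r_mz, r_dz, n_mz, n_dz, step=0.1, tol=1e-6):
    """Guess-and-check search: nudge A or C, keep the nudge if the fit improves."""
    a, c = 0.3, 0.3                                    # arbitrary starting guess
    best = ace_loglik(a, c, r_mz, r_dz, n_mz, n_dz)
    while step > tol:
        improved = False
        for da, dc in ((step, 0), (-step, 0), (0, step), (0, -step)):
            new_a, new_c = a + da, c + dc
            if new_a < 0 or new_c < 0 or new_a + new_c >= 1:
                continue                               # keep A, C >= 0 and E > 0
            ll = ace_loglik(new_a, new_c, r_mz, r_dz, n_mz, n_dz)
            if ll > best:
                a, c, best, improved = new_a, new_c, ll, True
        if not improved:
            step /= 2                                  # no nudge helped: try finer nudges
    return a, c, 1 - a - c

# Hypothetical data: 500 pairs of each type, rMZ = 0.70, rDZ = 0.45
a, c, e = fit_ace(r_mz=0.70, r_dz=0.45, n_mz=500, n_dz=500)
print(f"A = {a:.3f}, C = {c:.3f}, E = {e:.3f}")
```

In this simplified setting the search should land very close to the Falconer values (A near 0.50, C near 0.20, E near 0.30). The advantage of the likelihood approach only shows up in the more complicated models, where no handy equation exists.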

One way we commonly compare alternative models is by looking at model fit statistics. There are a huge variety of ways we can summarize “how well the final model fits the observed data.” Typically model comparisons are made in a table that will have a (hopefully) clear note including what models were compared and how the “best” model was determined (some common measures are AIC and BIC, where smaller numbers are better, and RMSEA, where smaller numbers are also better but for different reasons, and CFI and TLI, where larger numbers are better). But, in the end, it’s the conclusion that matters, and these model fit statistics rarely (or debatably) have any absolute interpretation - there’s never going to be a conclusion that the authors have found the True model, just identification of the tested model that fits the current data better than the others that happened to be tested/testable (again, constrained by the information provided by the available data). Alternative models to compare typically try adding or dropping concepts, or fixing or freeing parameters, to see which of many models does the best job of reproducing the observed data. Typically, the researchers will identify the best model that was tested on the current data (using whatever metric) and focus on the results of that one model in particular.
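For a sense of how two of these statistics work, here is a minimal sketch of the standard AIC and BIC formulas, both of which reward a higher log-likelihood and penalize extra parameters. The log-likelihood values and sample size below are invented purely for illustration; real papers will report their own.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: smaller is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: smaller is better, with a heavier
    penalty per parameter as the sample size grows."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical comparison: a full ACE model (3 parameters) versus an AE model
# (2 parameters, C dropped), with made-up log-likelihoods from 1,000 pairs
models = {"ACE": (-1000.0, 3), "AE": (-1000.4, 2)}
for name, (log_lik, n_params) in models.items():
    print(f"{name}: AIC = {aic(log_lik, n_params):.1f}, "
          f"BIC = {bic(log_lik, n_params, 1000):.1f}")
```

In this made-up case, dropping C barely changes the log-likelihood, so the simpler AE model ends up with the smaller AIC and BIC and would be reported as the “best” of the models tested.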

How to Read a GWAS

A genome-wide association study (GWAS, pronounced “GEE-wah-s”) is an exploratory method for discovering correlations between any one of many (often millions) of available genotyped variants (usually single-nucleotide polymorphisms, SNPs, pronounced “snips”) and a phenotype of interest. It follows the process of (literally) checking each variant in turn for correlation with the phenotype. Because of this massive “multiple testing” problem, we adjust the typical threshold for statistical significance (p < 0.05) for the equivalent of 1 million tests (that is, we divide 0.05 by 1 million); so a statistically significant result in a GWAS usually requires p < 5 x 10^-8. (We do not divide by the total number of tested SNPs even if it is >1 million - up to 17 million tests is now common - because the tests are not independent. That is, linkage disequilibrium means many tests are highly correlated, so it’s like doing the exact same test over and over again, where the result is guaranteed to be the same, so we don’t count it against the multiple testing burden.) Beyond evaluating whether there are significant associations observed, there are a variety of common ways that papers use to summarize the (millions of) results.
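The arithmetic behind the genome-wide significance threshold is just a Bonferroni-style division. A minimal sketch follows; the variable and function names are made up for illustration.

```python
alpha = 0.05                   # conventional significance threshold
effective_tests = 1_000_000    # equivalent number of independent tests assumed in GWAS

gwas_threshold = alpha / effective_tests
print(f"genome-wide significance threshold: p < {gwas_threshold:.0e}")

def is_genome_wide_significant(p_value, threshold=gwas_threshold):
    """True if a SNP's p-value clears the genome-wide threshold."""
    return p_value < threshold

print(is_genome_wide_significant(1e-9))   # clears the threshold
print(is_genome_wide_significant(1e-6))   # would pass p < 0.05, but not here
```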

Participants

Because of the large number of tests, the stringent threshold for statistical significance, and the anticipated small effect of any one SNP, we need a very large sample to detect any of these effects as statistically significant. (P-values are the result of a function depending on both the effect size AND the sample size - so if we’re looking for small p-values, we need either large effects - which we don’t expect in genetics - or a large sample.) For normal-range phenotypes (i.e., anything not restricted to cases of extreme, severe, rare outcomes), which is the vast majority of work in behavior genetics, we typically need sample sizes over 100,000 participants for adequate “power” to detect our anticipated reasonable effect sizes as being statistically significant.

Take note of any specific participant characteristics as well. Was the sample limited to a certain ancestry group? (Most work so far has relied on samples of European ancestry; more research has been emerging in other ancestry groups in the past few years, but papers do still typically restrict to a single narrowly defined ancestry group.) Also note if they’re looking at only males or only females, or only a certain age group. This can inform how we interpret the phenotype and results.

Common tables and figures

Manhattan plot: This will be included in >95% of all GWAS you encounter. It is a way to summarize the results of the millions of tests that have been performed. For examples, see this Twitter bot that only tweets Manhattan plots derived from publicly available GWAS results: https://twitter.com/SbotGwa. The horizontal x-axis is a map of the genome, organized from left to right by chromosome, and within chromosome by location. Each dot represents a single tested SNP. The vertical location of the SNP along the y-axis is the p-value for its correlation with the phenotype - but in Manhattan plots, the p-values are negative-log-transformed (that’s the -log10(p) label on the y-axis) so that SMALLER values are HIGHER on the plot. Basically, statistically significant SNPs are jumping up, saying “Look at me!”. There is usually a horizontal dotted line provided at the level of statistical significance (again, usually p < 5x10^-8) so you can visually see where in the genome (by chromosome location) the statistically significant SNPs are located. Occasionally you will see a circular Manhattan plot; these are the same, except they are harder to read. People who use circular Manhattan plots are wrong.
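If the -log10(p) scale feels unintuitive, a quick sketch of the transformation may help: each additional unit on the y-axis means the p-value is ten times smaller. The function name and the example p-values are made up for illustration.

```python
import math

def manhattan_height(p_value):
    """Height of a SNP on a Manhattan plot: the negative log10 of its p-value."""
    return -math.log10(p_value)

# Hypothetical p-values, from unremarkable to very small
for p in (0.05, 1e-4, 5e-8, 1e-12):
    print(f"p = {p:.0e} -> height = {manhattan_height(p):.2f}")
```

The usual significance line at p = 5x10^-8 therefore sits at a height of about 7.3, and anything above it is a genome-wide significant SNP.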

QQ Plot: These are falling out of fashion, but see the example from the Lecture Notes, Week 3, Principal Components Analysis section. The x-axis is the distribution of p-values under the null (assuming no effects, but randomly occurring low p-values due to multiple testing alone). The y-axis is the observed distribution of p-values across all tests. If the dots (again, representing each tested SNP) fall along the diagonal, then there are no more low p-values than would be expected by chance. If there is “lift-off” from the diagonal (and the GWAS methods specify that ancestry principal components were included as covariates), then those SNPs have lower p-values than would be expected by chance alone.
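Here is a rough sketch of how the points on a QQ plot can be constructed. It assumes one common convention: under the null, p-values are uniformly distributed, so the expected value of the rank-th smallest of n p-values is rank / (n + 1), and both axes are shown on the -log10 scale. The function name and the toy p-values are hypothetical, and plotting software may use slightly different conventions.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ plot."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected_p = rank / (n + 1)   # expected rank-th smallest p-value under the null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Toy example with five made-up p-values, one of them much smaller than expected
for expected, observed in qq_points([0.9, 0.6, 0.4, 0.2, 1e-6]):
    print(f"expected = {expected:.2f}, observed = {observed:.2f}")
```

Points where the observed value is well above the expected value are the “lift-off” described above.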

Barplots summarizing evidence: There are a wide variety of follow-up analyses that can be done using the GWAS results. Barplots are commonly used to summarize these results. Because the metrics can vary widely, it’s important to read the text to figure out what the axes are indicating. Usually, higher bars mean that thing/label is more “important”/strongly represented by lower p-values within the GWAS results; because the actual metric being depicted can be so variable, we often transform metrics to get them into this “bigger bar = more important” aesthetic.

Effect Sizes

NOTE: A p-value is not an effect size. It is the result of a function depending on both the effect size AND the sample size. A low p-value may occur either because there is a large effect OR because the sample size is large.
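To see the point of this note in numbers, here is a minimal sketch that holds a tiny effect size fixed and varies only the sample size. It uses the Fisher z normal approximation for testing a correlation, which is one standard textbook approach and an assumption here; actual GWAS software typically uses regression-based tests. The function name and the example values are hypothetical.

```python
import math

def p_value_for_correlation(r, n):
    """Approximate two-sided p-value for a correlation r observed in n people."""
    z = math.atanh(r) * math.sqrt(n - 3)       # Fisher z test statistic
    return math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail probability

# Same tiny effect (r = 0.01, about 0.01% of variance), different sample sizes
for n in (1_000, 100_000, 500_000):
    print(f"n = {n:>7,}: p = {p_value_for_correlation(0.01, n):.1e}")
```

The effect size never changes, but the p-value goes from unremarkable to far beyond genome-wide significance as the sample grows.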

Largest genetic effect: Depending on the paper, this may be presented in the text, or it may be buried in the supplement. Because of consistently small individual effect sizes, many papers now restrict reporting of individual SNP effects to summary depictions, such as the Manhattan and QQ plots, and provide a limited table of “top results”, or even a full table of millions of results, in the supplement. Usually, the effect size of a single SNP is reported in terms of its odds ratio (OR; for a binary yes/no outcome) or its correlation (r) or standardized regression weight (beta) (for continuous outcomes). Another commonly used effect size reporting metric is often labeled variance explained (r^2 - conveniently, the square of what the correlation metric would be).

After the individual SNP results, there is a MULTITUDE of ways that the set of results may be summarized. Some common approaches are:

Polygenic scores: Aggregation of genome-wide variants, summed like items on a test. Usually reported in terms of correlation (r) or variance explained (r^2). We know that effect sizes of polygenic scores tend to be inflated by parallel cultural transmission processes - the strongest test of a polygenic score is WITHIN FAMILIES (that is, how well does it correlate with phenotype differences within a family, e.g., between siblings, where ancestry/cultural confounds are controlled). Under NO CIRCUMSTANCE should a polygenic score be tested in the exact same sample in which the SNP effects (GWAS) were estimated - this leads to CATASTROPHIC overestimation of the effect (e.g., it’s not hard to get r^2 = 1.0, because the number of variables is greater than the number of participants, so “overfitting” to the sample is a BIG problem).
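The “summed like items on a test” idea can be sketched very simply. This assumes the usual setup in which each person’s count of effect alleles at a SNP (0, 1, or 2) is weighted by that SNP’s effect size from a GWAS and the products are added up. The function name, the three SNPs, and their effect sizes are all invented for illustration; real scores use thousands to millions of SNPs.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Weighted sum of effect-allele counts (0, 1, or 2 per SNP)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("need one effect size per SNP")
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))

# Hypothetical person genotyped at three SNPs, with made-up GWAS effect sizes
counts = [0, 1, 2]
betas = [0.02, -0.01, 0.03]
print(f"polygenic score = {polygenic_score(counts, betas):.2f}")
# prints: polygenic score = 0.05
```

The effect sizes must come from a GWAS in a different sample than the one being scored, for the overfitting reason given above.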

Heritability among unrelated participants: The logic of twin studies can be extended to “unrelated” folks who have been genotyped, where we can ask the same question: to what extent are more closely related (more genetically similar) individuals more similar in their phenotypes?

Genetic correlation with other phenotypes: Just like we can estimate correlations between observed phenotypes, so we can estimate the extent to which the pattern of results in a GWAS is similar to patterns of GWAS results observed for other phenotypes. Genetic correlation is typically scaled to range from 0.0 to +/- 1.0, REGARDLESS of the observed phenotypic correlation and the observed heritabilities of the phenotypes (so, even if the observed phenotypes are weakly correlated and/or weakly heritable, they can still show a high degree of genetic correlation, indicating that what phenotypic correlation/heritability there is can be traced to similar patterns of associated SNPs). Because most GWAS results these days are shared publicly in full, it is relatively easy to take results from a single GWAS and estimate genetic correlation with any GWAS that has ever been done, without needing access to the underlying data (that is, the method requires only the results/summary statistics).

Gene-based or Pathway analyses: Evaluation of whether p-values are lower than would be expected by chance within certain pre-defined sets of SNPs, such as individual genes or “pathways” (sets of multiple genes that share some common function, such as “dopamine genes” or “skeletal genes” or “genes expressed in the brain”). This is one of the RARE EXCEPTIONS to the “p-value is not an effect size” rule - most gene- or pathway-tests ONLY give a p-value as the result. The paper will usually tell you what the adjusted threshold for statistical significance is here - it will depend on the number of genes/pathways tested. Keep in mind that ONLY pre-defined genes/pathways CAN be tested - so it cannot test what we don’t yet know.

Tissue/expression enrichment: There is a huge, rapidly developing array of methods for examining tissue expression and epigenetic mechanism “enrichment,” or (as in the gene- or pathway-based methods) overrepresentation of SNPs with low p-values in regions known to be expressed in certain tissues or known to be susceptible to variations in expression from a variety of factors (including but not limited to methylation, which is one commonly studied form of epigenetic modification).

Participation Activities

You can earn up to 4 points for participation activities each week by selecting and completing tasks from the “menu” listed below. You may complete more than four tasks if you’d like, but the maximum number of points awarded will be 4 per week. Each activity is worth 1 point.

  • Read & Discuss via Perusall: Eftedal 2020 Estimating heritability of psychological traits using the classical twin design. PsyArXiv, version 2020, June 17, https://doi.org/10.31234/osf.io/g3f9c
  • Read & Discuss via Perusall: Tam et al 2019 Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484. https://doi.org/10.1038/s41576-019-0127-1
  • Activity: Citation Competition
    • One of my most frequent professional tasks is to come up with a quick answer to the question: “What is the heritability of X?” or “Do we know any genes for Y?”. In this activity, you will practice finding references for and extracting key information from heritability and GWAS research.
  • Computation Practical Exercise: Twin Heritability Two Ways
  • Class Chat on Thursday, 11:00 am - 12:20 pm

Course Project Assignment Due

Unless otherwise specified, assignments are due by Friday 5:00 pm CT for the Week in which they are listed.

  • Project Milestone 1: Topic + 5 references