gwern branwen

Shared publicly  - 
Everything is heritable: maternal selection of sperm donors for height increases offspring height as much as predicted by heritability estimates.

"Quantitative Genetics in the Postmodern Family of the Donor Sibling Registry", Lee 2013:

"Quantitative genetics is primarily concerned with two subjects: the correlation between relatives and the response to selection. The correlation between relatives is used to determine the heritability of a trait--the key quantity that addresses the question of nature vs. nurture. Heritability, in turn, is used to predict the response to selection--the main driver of improvements in crops and livestock. The theory of quantitative genetics has been thoroughly tested and applied in plants and animals, but heritability and selection remain open questions in humans due to limited natural experimental designs. The Donor Sibling Registry (DSR) is an organization that helps individuals conceived as a result of sperm, egg, or embryo donation make contact with genetically related individuals. Families who conceived children via anonymous sperm donation join the DSR and match with other families who used the same donor ID at the same sperm bank. The resulting donor pedigree consists of heterosexual, lesbian, and single mother families who are connected through the common anonymous sperm donor used to conceive their children. Here, we introduce a new quantitative genetic study design based on the unprecedented family relationships found in the donor pedigree. We surveyed 945 individual families constituting 159 donor pedigrees from the Donor Sibling Registry and used their demographic, physical, and behavioral characteristics to conduct a quantitative genetic study of selection and heritability. A direct measurement of phenotypic assortment showed mothers actively selected mates for height, eye color, and religion. Artificial selection for donor height increased mean child height in a manner consistent with the selection differential. Reared-apart donor-conceived paternal half-siblings provided unbiased heritability estimates for traits influenced by maternal and contrast effects.
Maternal effects were important in determining the variance of birth weight while eliminating contrast effects revealed sociability to be a highly heritable childhood temperament. Thus, the unprecedented family relationships in the donor pedigree enable a universal model for quantitative genetics.

[Arslan & Penke comment: "Estimates of heritability derived from twin studies have held up remarkably well when re-examined using different family relationships (e.g. parents, siblings, half- and adopted siblings) and can be easily extended to novel data such as the sometimes numerous offspring of sperm donors. In cases where selection is fairly clear-cut, estimates of heritability have borne out their usefulness as predictors of the response to selection. For example, children of sperm donors are taller in a manner consistent with their mothers’ selection on donor height (J.C. Lee, 2013)."]
[My comment: I can't believe I've never seen a study like this before. Sperm/egg donation is a great adjunct to classic twin and adoption studies. That it lets one verify that selection+heritability has the eugenic effects expected is even better.]

1.1.3 Specialty Designs: Specialty study designs explore a wide range of family relationships:
- Twins reared apart – Studies of twins reared apart fuse twin and adoption methodologies to eliminate shared environment. Twins separated at birth are extremely rare (9, 10).
- Children of MZ twins – Nominally cousins, children of MZ twins are genetic half-siblings who are reared apart in different households (11, 12).
- Half-siblings – In an indirect study, half-siblings are compared with full-siblings to eliminate the effect of shared environment (13-17). The more powerful direct study examines half-siblings who have been reared apart. Such half-siblings may vary in their exposure to a common environment or biological parent (8).
- Exact genetic relationship – Genome-wide DNA markers can be used to determine the exact coefficient of relatedness. The variation in r between full siblings enables a direct estimate of the heritability (18). This approach can be taken to its logical extreme using SNPs to calculate distant genetic relationships between putatively unrelated individuals, but this method cannot yet fully account for observed heritability (19).

The theory of quantitative genetics has been thoroughly tested and applied in plants and animals, but the genetic architecture of complex traits and the response to selection are open questions in humans. Here, we introduce a new family study design to address these issues.
The donor pedigree is a historically unprecedented family structure made possible by modern reproductive medicine. It consists of heterosexual, lesbian, and single mother families who are connected through the common anonymous sperm donor used to conceive their children. We used the unique gender and kinship arrangements found in the donor pedigree to conduct a quantitative genetic study with three aims: (i) to examine female mate choice preferences and determine which traits undergo active selection in humans, (ii) to describe the response to selection as a realized heritability, and (iii) to establish donor-conceived reared-apart paternal half-siblings as a model to measure the heritability of traits intractable to other study designs.

Study data was collected and managed using REDCap electronic data capture tools hosted at UCSF (12). Subjects self-identified as either a biological mother of a donor-conceived child, a donor, a donor-conceived child (age 7-12), a donor-conceived adolescent (age 13-18), a donor-conceived adult (age 18+), or a non-biological parent of a donor-conceived child. Subjects completed surveys tailored to their self-identified group.
We restricted our analysis to data from biological mother reports due to the low absolute number of self-reports from donors, donor-conceived children, donor-conceived adolescents, donor-conceived adults, and non-biological parents. We further restricted our analysis to complete paternal half-sibling pedigrees that included a biological mother, a sperm donor with a known ID and clinic, donor-conceived children, and a non-biological parent (if applicable). The complete pedigree requirement excludes mothers who used egg donation, mothers who did not have full donor ID/clinic information, and mothers who did not advance far enough into the survey to complete the parent report on their child.
Biological mothers completed self-report surveys regarding demographic, physical, and behavioral characteristics. They provided information about their donor and partner (if applicable) and then completed a parent report about their donor-conceived child's physical characteristics, medical history, temperament, symptoms of mental disorders, and birth and early development. Temperament was evaluated using the Emotionality, Activity, and Sociability (EAS) Temperament Survey for Children (13). Symptoms of mental disorders were measured using the Strengths and Difficulties Questionnaire (SDQ) (14). Birth and early development events were assessed using the NCS-A birth and early development questionnaire (15). Variable definitions for survey responses are shown in Table 2.1.

Parent age, height, and body mass index (BMI) distributions are shown in Figure 2.2. Male partners displayed greater positive skew for age than mothers or female partners. Donors were taller and had lower BMIs than male partners. The complete pedigrees contain a total of 1213 children. Descriptive statistics for these children are shown in Table 2.4. The prevalence of multiple births in the DSR was higher than the US national average of 3.3% due to the use of assisted reproductive technology (16).
Child age, height, and BMI distributions are shown in Figure 2.3. Donor-conceived children were taller and had higher BMIs compared to median CDC growth curves (17). Cross-tabulation tables for mother/donor/child eye and hair color are shown in Table 2.5 and Table 2.6, respectively. These tables are symmetric across the main diagonal, demonstrating internal consistency with regards to eye and hair color genetics. Table 2.7 shows the distribution of the number of children per biological mother. Approximately 25% of mothers had more than one donor-conceived child, either as a result of multiple births or from multiple donor-assisted conceptions. Based on shared donor ID and clinic information, 576 out of the 1213 children matched with a half-sibling internally within our sample, capturing 8.1% of the 7155 children who had matched with a half-sibling in the DSR at the start of the study. The 576 children who matched with a half-sibling formed 159 donor pedigrees in which each child shares the same sperm donor. The distribution of the number of children in each donor pedigree is shown in Table 2.8; the largest donor pedigree contains 10 children, with a median size of three children.
We assessed the reliability of the donor ID/clinic half-sibling matching process by measuring the inter-rater reliability of mother-reported physical characteristics for the donors. Krippendorff's alpha (18) values for height (α = 0.67), weight (α = 0.76), eye color (α = 0.78), and hair color (α = 0.79) support the claim that mothers in each donor pedigree used the same donor.

Figure 2.5 shows maternal age at time of birth. Although maternal age was not a significant covariate for birth weight, the spike at age 38 illustrates the inevitability of the biological clock.

Numerous surveys have been conducted to determine which traits women value when selecting a mate, but they all suffer from the same fault: women's stated preferences may not be reflected in their mate choices (2-5).

Anonymous sperm donation eliminates these confounding processes to reveal a clear link between female mate choice preferences and mate selection. Mothers freely choose a donor from a sperm bank, particularly single mothers who express their preferences without input from a partner. Thus, the single mother/donor correlation is the first direct measure of phenotypic assortment in humans, standing in contrast to previous indirect methods based on twins and their spouses (5, 11-13). Any heterosexual mother/male partner or lesbian mother/female partner correlation in excess of the single mother/donor correlation is attributable to social homogamy (Table 2.2).

The majority of mothers said race/ethnicity, education level, height, body mass index (BMI), eye color, and hair color were important, but employment status and religion were not (Table 3.3). Given the donor's wholly genetic contribution, these ratings can be interpreted as reflecting mothers' beliefs about heritability.

[Silly mothers. Of course employment status and religion are heritable. Employment, if nothing else, through race, education, and intelligence.]

Height and Eye Color
We used phenotypic assortment to determine which traits underwent active selection. For height and eye color, the single mother/donor correlations were greater than the heterosexual mother/male partner and lesbian mother/female partner correlations (Table 3.3). This suggests mutual mate choice limits phenotypic assortment in a monogamous mating system, as compared to unconstrained female choice of donor (15). A log-linear regression analysis of eye color shows single mothers preferentially selected donors with recessive blue eyes (Table 3.4, Figure 3.1). Thus, mothers actively selected for height and eye color in accordance with their stated preferences.
Examining the remaining parental correlations for height and eye color, the positive male partner/donor and female partner/donor correlations (Table 3.3) indicate biological mothers in heterosexual and lesbian couples chose a donor to match their partner in a transitive form of phenotypic assortment. Log-linear regression results show heterosexual and lesbian couples matched partner and donor eye colors (Table 3.4, Figure 3.2, Figure 3.3). The lesbian mother/donor correlations were significant, but the heterosexual mother/donor correlations were not. Lesbian couples could choose a donor to match either parent because they do not contend with the same paternity issues facing heterosexual couples (16, 17).

Education Level and BMI
For education level and BMI, the heterosexual mother/male partner (education level only) and lesbian mother/female partner correlations were significant, but the single mother/donor correlations were not (Table 3.3). Assortative mating for education level and BMI was therefore driven by passive social homogamy, contrary to expectations from mothers' stated preferences. The single mother/donor correlations may have been influenced by a ceiling effect in which the majority of donors were college-educated and had healthy BMIs (Table 2.3).

Religion
For religion, the heterosexual mother/male partner and lesbian mother/female partner correlations were greater than the single mother/donor correlation (Table 3.3). All three correlations were significant, signaling the influence of both phenotypic assortment and social homogamy. Phenotypic assortment was driven by the association of atheist/atheist and Jewish/Jewish pairings between single mothers and donors (Table 3.5, Figure 3.4). Social homogamy was driven by atheist/atheist, Jewish/Jewish, and other/other pairings between mothers and their partners (Table 3.5, Figure 3.5, Figure 3.6). Judaism's dual role as a religion and an ethnicity could partially explain why there was phenotypic assortment for religion despite it not being an important factor in donor choice.

Race/Ethnicity
For race/ethnicity, our sample was approximately 90% white (Table 2.3). Heterosexual couples (Figure 3.7) and single mothers (Figure 3.9) were less diverse than lesbian couples (Figure 3.8). Lesbian couples demonstrated concordance for race/ethnicity via statistically significant correlations (Table 3.3) and associations (Table 3.6) between lesbian mother/female partner, lesbian mother/donor, and female partner/donor. The single mother/donor correlation was statistically significant (Table 3.3), but no log-linear regression associations were found between single mother and donor due to small non-white sample size (Table 3.6). The heterosexual mother/male partner and heterosexual mother/donor correlations were not significant because the small number of non-white mothers in heterosexual couples almost exclusively had white male partners and they all chose white donors (Figure 3.7). This idiosyncratic pattern and lack of diversity precludes a general inference about selection for race/ethnicity.

Hair Color and Employment Status
We did not observe any assortative mating for hair color or employment status between single mother/donor, heterosexual mother/male partner, or lesbian mother/female partner. Lesbian couples matched partner/donor hair color, while heterosexual couples did not (Table 3.3).

Children in the DSR were taller than the median growth curve by R = 1.23 inches, averaged across all ages for both sexes (Figure 2.3B). The selection differential, S, measures the strength of selection and is defined as the difference between the mean height of the selected parents and the mean height of the population. Biological mothers were taller than the median Caucasian female by 0.7 inches and selected donors were taller than the median Caucasian male by 2 inches, resulting in a selection differential of S = 1.35 inches (Table 2.3) (18). The response to selection is related to the selection differential by the realized heritability h^2 = R / S. Assisted reproduction created a rare natural experiment to study artificial selection for height in humans; the effect of selection as described by the realized heritability h^2 = 0.91 was consistent with the heritability of adult height calculated using traditional methods (19).

The donor pedigree enables the first direct measure of phenotypic assortment in humans. We compared mother/donor with mother/partner correlations and found mothers actively selected for height, eye color, and religion. The response to selection for height matched theoretical predictions; taller donors begat taller children in a manner consistent with previous heritability estimates. Our results represent a unique experimental validation of ethical artificial selection in humans. Mothers who selected for height endowed their children with an economic advantage because height is positively associated with social status and labor market outcomes (20)."
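
The realized-heritability arithmetic in the excerpt above can be checked in a few lines. This is a minimal sketch, not the paper's code: the function names are hypothetical, and it assumes S is the simple average of the maternal and donor height deviations (0.7 and 2 inches), which is the reading that reproduces the reported S = 1.35.

```python
def selection_differential(mother_excess_in, donor_excess_in):
    """Selection differential S, ASSUMED here to be the mean of the two
    parental deviations from their respective population medians (inches)."""
    return (mother_excess_in + donor_excess_in) / 2.0


def realized_heritability(response_in, differential_in):
    """Realized heritability h^2 = R / S (breeder's equation rearranged)."""
    return response_in / differential_in


# Figures quoted in the excerpt: mothers +0.7 in, donors +2 in, children R = 1.23 in.
S = selection_differential(0.7, 2.0)
h2 = realized_heritability(1.23, S)
print(f"S = {S:.2f} in, h^2 = {h2:.2f}")
```

With the quoted inputs this gives S = 1.35 and h^2 of about 0.91, matching the figures in the excerpt.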

gwern branwen

Shared publicly  - 
"The phenotypic legacy of admixture between modern humans and Neandertals", Simonti et al 2016:

"Many modern human genomes retain DNA inherited from interbreeding with archaic hominins, such as Neandertals, yet the influence of this admixture on human traits is largely unknown. We analyzed the contribution of common Neandertal variants to over 1000 electronic health record (EHR)–derived phenotypes in ~28,000 adults of European ancestry. We discovered and replicated associations of Neandertal alleles with neurological, psychiatric, immunological, and dermatological phenotypes. Neandertal alleles together explained a significant fraction of the variation in risk for depression and skin lesions resulting from sun exposure (actinic keratosis), and individual Neandertal alleles were significantly associated with specific human phenotypes, including hypercoagulation and tobacco use. Our results establish that archaic admixture influences disease risk in modern humans, provide hypotheses about the effects of hundreds of Neandertal haplotypes, and demonstrate the utility of EHR data in evolutionary analyses."

[Risk factors for:

    Mood disorders
    Actinic keratosis
    Seborrheic keratosis
    Acute upper respiratory infections
    Coronary atherosclerosis

Apropos of writing yesterday, I realized I had missed sharing some of the more interesting recent genetics discoveries. For example, this one!

What an age we live in. Not only do we know which genetic variants come from Neanderthals (!), we can also link them to specific health problems (!!).

For the majority of our analyses, we used a set of 1689 hierarchically related phenotypes (including 1087 leaf phenotypes) defined from the use of International Classification of Diseases (ICD-9) billing codes in the EHRs (11). We analyzed a set of 28,416 adults of European ancestry from across the eMERGE sites who had been genotyped on genome-wide arrays and had sufficient EHR data to define phenotypes. These individuals naturally fell into separate discovery and replication cohorts on the basis of their inclusion in the eMERGE Network Phase 1 (E1; N = 13,686 individuals) or Phase 2 (E2; N = 14,730 individuals) data releases (12).

Neandertal variants have been hypothesized to influence many phenotypes in AMHs, including lipid metabolism, immunity, depression, digestion, hair, and skin, on the basis of the enrichment of Neandertal variants in regions of the genome relevant to these traits (3, 5, 6, 9). Accordingly, we first tested these hypotheses using genome-wide complex trait analysis (GCTA) to estimate the phenotypic risk explained by 1495 genotyped common (minor allele frequency > 1%) Neandertal SNPs for a set of 46 high-prevalence phenotypes from the hypothesized categories, using age, sex, and eMERGE site as covariates (Fig. 1, B and C) (15). Neandertal SNPs explained a significant [likelihood ratio test; false discovery rate (FDR) < 0.05 over all phenotype tests] percent of the risk in three traits in the E1 discovery cohort (Table 1): depression (2.03%, P = 0.0036), myocardial infarction (1.39%, P = 0.0026), and corns and callosities (1.26%, P = 0.01). Neandertal SNPs also explained a nominally significant (P < 0.1) percent of risk for nine additional traits, including actinic and seborrheic keratosis, coronary atherosclerosis, and obesity (Table 1).
Of the 12 nominally significant associations, 8 were replicated in the independent E2 data set, including actinic keratosis (P = 0.0059), mood disorders (P = 0.018), depression (P = 0.020), obesity (P = 0.030), and seborrheic keratosis (P = 0.045) at P < 0.1 (Table 1; likelihood ratio test). We also tested whether the percent of phenotypic variance explained by Neandertal SNPs remained significant in the context of non-Neandertal SNPs by including an additional genetic relationship matrix (GRM) computed from non-Neandertal SNPs across the rest of the human genome in the mixed linear model (11). Depression (P = 0.031), mood disorders (P = 0.029), and actinic keratosis (P = 0.036) were replicated with these stricter criteria in the independent E2 cohort.


gwern branwen

Shared publicly  - 
A proposal for a distributed darknet market: "Beaver: A Decentralized Anonymous Marketplace with Secure Reputation", Soska et al 2016

"Reputation systems play a crucial role in establishing trust online, especially in e-commerce settings. Users in reputation systems provide feedback for other users, thereby incentivizing good behavior and disincentivizing bad behavior. With growing concerns of government surveillance and corporate data sharing, it is increasingly common that users on the web demand tools for preserving their privacy without placing trust in a third party. Unfortunately, existing centralized reputation systems need to be trusted for either privacy, correctness, or both. Existing decentralized approaches, on the other hand, are either vulnerable to Sybil attacks, present inconsistent views of the network, or leak critical information about the actions of its users. In this paper, we present Beaver, a decentralized anonymous marketplace that is resistant against Sybil attacks on vendor reputation, while preserving the anonymity of its customers. Beaver allows its participants to enjoy free open enrollment, and provides every user with the same global view of the reputation of other users through public ledger based consensus. Our use of various cryptographic primitives allow Beaver to offer high levels of usability and practicality, along with strong anonymity guarantees."

(One issue is use of fees instead of proof-of-burn; see my /r/darknetmarkets writeup.) #bitcoin #darknetmarkets #drugs  

gwern branwen

Shared publicly  - 
"Continuous evolution of Bacillus thuringiensis toxins overcomes insect resistance", Badran et al 2016

"The Bacillus thuringiensis δ-endotoxins (Bt toxins) are widely used insecticidal proteins in engineered crops that provide agricultural, economic, and environmental benefits. The development of insect resistance to Bt toxins endangers their long-term effectiveness. Here we have developed a phage-assisted continuous evolution selection that rapidly evolves high-affinity protein–protein interactions, and applied this system to evolve variants of the Bt toxin Cry1Ac that bind a cadherin-like receptor from the insect pest Trichoplusia ni (TnCAD) that is not natively bound by wild-type Cry1Ac. The resulting evolved Cry1Ac variants bind TnCAD with high affinity (dissociation constant Kd = 11–41 nM), kill TnCAD-expressing insect cells that are not susceptible to wild-type Cry1Ac, and kill Cry1Ac-resistant T. ni insects up to 335-fold more potently than wild-type Cry1Ac. Our findings establish that the evolution of Bt toxins with novel insect cell receptor affinity can overcome insect Bt toxin resistance and confer lethality approaching that of the wild-type Bt toxin against non-resistant insects."
David Liu's group at Harvard has been working on a technique for a few years now called PACE, and I keep meaning to write about it. A new paper in Nature gives me the opportunity. The acronym stands for "Phage-assisted continuous evolution", and it's as neat an example of directe
Boris Borcic:
Interesting set up. Bees will love the particular application, I'm sure.

gwern branwen

Shared publicly  - 
"Genome-wide association study identifies 74 [162] loci associated with educational attainment", Okbay et al 2016; abstract:

"Educational attainment is strongly influenced by social and other environmental factors, but genetic factors are estimated to account for at least 20% of the variation across individuals (ref. 1). Here we report the results of a genome-wide association study (GWAS) for educational attainment that extends our earlier discovery sample (refs. 1, 2) of 101,069 individuals to 293,723 individuals, and a replication study in an independent sample of 111,349 individuals from the UK Biobank. We identify 74 [162 total] genome-wide significant loci associated with the number of years of schooling completed. Single-nucleotide polymorphisms associated with educational attainment are disproportionately found in genomic regions regulating gene expression in the fetal brain. Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. Our findings demonstrate that, even for a behavioural phenotype that is mostly environmentally determined, a well-powered GWAS identifies replicable associated genetic variants that suggest biologically relevant pathways. Because educational attainment is measured in large numbers of individuals, it will continue to be useful as a proxy phenotype in efforts to characterize the genetic influences of related phenotypes, including cognition and neuropsychiatric diseases."

supplementary info:!data/kuzq8

[The long-awaited SSGAC followup to Rietveld et al 2013. This almost quadruples the sample size, yielding no less than 162 genome-wide-significant SNP hits! The full polygenic score explains >3.8% of years-of-education. This is going to make Mendelian randomization and other designs far more powerful, and of course, has considerable implications for embryo selection/editing. With all the hits to play with, Okbay et al 2016 is able to do a lot of fun things like noting that the SNPs seem heavily involved in fetal brain development (sorry, everyone who was dreaming of a 'CRISPR booster shot' for increasing your adult intelligence), changes in involvement over a life span, changes in education heritability over time (reduced, probably because schooling requires less intelligence), and fine-mapping/annotation estimation of probability hits are causal (decent for a subset). Very cool.

The downside: the paper is atrociously written and very confusing. For example, the 162 hits is mentioned nowhere in the paper. You have to read the supplementary information to learn that, astoundingly enough. Many of the results are unclear about what datasets went into using them. The paper itself is half-useless since all the details are in the 150-page supplement and the separate tables. Some things never get reported; for example, I can't find the full polygenic score for predicting cognitive performance, even though they casually use it in a mediation analysis, showing that intelligence explains much of how these particular genes are boosting educational achievement. Overall, the paper is overstuffed and underdone: more like 1.5 papers, really.

I think what happened is this. I heard that the original paper was accepted by Nature sometime around October 2015, and I've been waiting impatiently ever since. Then the UK Biobank papers started rolling out while the paper was stuck in peer review or publishing hell. They tried to incorporate the first batch of n=100k Biobank data, but only had time to sort of half-ass the updated paper and use it as a heldout validation set. This is why it's so confusing, and why the title and abstract emphasize 74 hits (which was the original SSGAC results) while the full meta-analysis yields 162 - because it was easier to throw that into the supplement than revise the entire paper.]

Extended Data Fig. 2 shows the estimated effect sizes of the lead SNPs. The estimates range from 0.014 to 0.048 standard deviations per allele (2.7 to 9.0 weeks of schooling), with incremental R^2 in the range 0.01% to 0.035%.
To quantify the amount of population stratification in the GWAS estimates that remains even after the stringent controls used by the cohorts (Supplementary Information section 1.4), we used linkage-disequilibrium (LD) score regression (ref. 4). The regression results indicate that ~8% of the observed inflation in the mean χ^2 is due to bias rather than polygenic signal (Extended Data Fig. 3a), suggesting that stratification effects are small in magnitude. We also found evidence for polygenic association signal in several within-family analyses, although these are not powered for individual SNP association testing (Supplementary Information section 2 and Extended Data Fig. 3b). To further test the robustness of our findings, we examined the within-sample and out-of-sample replicability of SNPs reaching genome-wide significance (Supplementary Information sections 1.7–1.8). We found that SNPs identified in the previous educational attainment meta-analysis replicated in the new cohorts included here, and conversely, that SNPs reaching genome-wide significance in the new cohorts replicated in the old cohorts. For the out-of-sample replication analyses of our 74 lead SNPs, we used the interim release of the UK Biobank (UKB; ref. 5) (n = 111,349). As shown in Extended Data Fig. 4, 72 out of the 74 lead SNPs have a consistent sign (P = 1.47 × 10^−19), 52 are significant at the 5% level (P = 2.68 × 10^−50), and 7 reach genome-wide significance in the UK Biobank data set (P = 1.41 × 10^−42). For comparison, the corresponding expected numbers, assuming each SNP’s true effect size is its estimated effect adjusted for the winner’s curse, are 71.4, 40.3, and 0.6 (Supplementary Information section 1.8.2). We also find out-of-sample replicability of our overall GWAS results: the genetic correlation between EduYears in our meta-analysis sample and in the UKB data is 0.95 (s.e. = 0.021; Supplementary Table 1.14).
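
The sign-concordance P value quoted above (72 of 74 lead SNPs with a consistent sign) is what a one-sided binomial sign test against a 50% null would give. The excerpt does not say which test was used, so this is an assumption, and the helper name below is hypothetical. A minimal sketch using only the standard library:

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p).

    Used here as a one-sided sign test; the choice of test is an
    ASSUMPTION, not something stated in the excerpt.
    """
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 72 of the 74 lead SNPs replicate with the same sign in the UK Biobank.
p_sign = binomial_upper_tail(72, 74)
print(f"one-sided sign-test P = {p_sign:.3g}")
```

Under that assumption the result agrees with the reported 1.47 × 10^−19 to three significant figures. The P values for the other two counts (52 and 7) depend on null probabilities the excerpt does not spell out, so they are not reproduced here.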

As shown in Fig. 2, based on overall summary statistics for associated variants, we find genetic covariance between increased educational attainment and increased cognitive performance (P = 9.9 × 10^−50), increased intracranial volume (P = 1.2 × 10^−6), increased risk of bipolar disorder (P = 7 × 10^−13), decreased risk of Alzheimer’s (P = 4 × 10^−4), and lower neuroticism (P = 2.8 × 10^−8). We also found positive, statistically significant, but very small, genetic correlations with height (P = 5.2 × 10^−15) and risk of schizophrenia (P = 3.2 × 10^−4).

[hm, doesn't increased bipolar & schizophrenia risk contradict the previous Biobank papers?]

To consider potential biological pathways, we first tested whether SNPs in particular regions of the genome are implicated by our GWAS results. Unlike what has been found for other phenotypes, SNPs in regions that are DNase I hypersensitive in the fetal brain are more likely to be associated with EduYears by a factor of ~5 (95% confidence interval 2.89–7.07; Extended Data Fig. 7). Moreover, the 15% of SNPs residing in regions associated with histones marked in the central nervous system (CNS) explain 44% of the heritable variation (Extended Data Fig. 8a and Supplementary Table 4.4.2). This enrichment factor of ~3 for CNS (P = 2.48 × 10^−16) is greater than that of any of the other nine tissue categories in this analysis.
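
The "enrichment factor of ~3" follows directly from the two percentages in the paragraph above: the share of heritable variation explained divided by the share of SNPs in the category. A small sketch of that ratio, with a hypothetical function name and the rounded percentages from the excerpt as inputs:

```python
def enrichment_factor(heritability_share, snp_share):
    """Share of heritable variation explained by a SNP category,
    divided by the share of all SNPs that fall in that category."""
    return heritability_share / snp_share


# CNS histone-mark regions: 15% of SNPs explain 44% of heritable variation.
cns_enrichment = enrichment_factor(0.44, 0.15)
print(f"CNS enrichment = {cns_enrichment:.2f}")
```

This gives roughly 2.9, consistent with the quoted ~3. The inputs are rounded figures, so the paper's own estimate may differ slightly.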
Given that our findings disproportionately implicate SNPs in regions regulating brain-specific gene expression, we examined whether genes located near EduYears-associated SNPs show elevated expression in neural tissue. We tested this hypothesis using data on mRNA transcript levels in the 37 adult tissues assayed by the Genotype-Tissue Expression Project (GTEx; ref. 10). Remarkably, the 13 GTEx tissues that are components of the CNS—and only those 13 tissues—show significantly elevated expression levels of genes near EduYears-associated SNPs (false discovery rate <0.05; Extended Data Fig. 8b and Supplementary Table 4.5.2).
To investigate possible functions of the candidate genes from the GWAS-implicated loci, we examined the extent of their overlap with groups of genes (‘gene sets’) whose products are known or predicted to participate in a common biological process (ref. 11). We found 283 gene sets significantly enriched by the candidate genes identified in our GWAS (false discovery rate <0.05; Supplementary Table 4.5.1). To facilitate interpretation, we used a standard procedure (ref. 11) to group the 283 gene sets into ‘clusters’ defined by degree of gene overlap. The resulting 34 clusters, shown in Fig. 3, paint a coherent picture, with many clusters corresponding to stages of neural development: the proliferation of neural progenitor cells and their specialization (the cluster npBAF complex), the migration of new neurons to the different layers of the cortex (forebrain development, abnormal cerebral cortex morphology), the projection of axons from neurons to their signalling targets (axonogenesis, signalling by Robo receptor), the sprouting of dendrites and their spines (dendrite, dendritic spine organization), and neuronal signalling and synaptic plasticity throughout the lifespan (voltage-gated calcium channel complex, synapse part, synapse organization). Many of our results implicate candidate genes and biological pathways that are active during distinct stages of prenatal brain development. To directly examine how the expression levels of candidate genes identified in our GWAS vary over the course of development, we used gene expression data from the BrainSpan Developmental Transcriptome (ref. 12). As shown in Extended Data Fig. 9, these candidate genes exhibit above-baseline expression in the brain throughout life but especially higher expression levels in the brain during prenatal development (1.36 times higher prenatally than postnatally, P = 6.02 × 10^−8).

We constructed polygenic scores 13 to assess the joint predictive power afforded by the GWAS results (Supplementary Information section 5.2). Across our two holdout samples, the mean predictive power of a polygenic score constructed from all measured SNPs is 3.2% (P = 1.18 × 10^-39; Supplementary Table 5.2 and Supplementary Information section 5).
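
[For readers unfamiliar with the term: in its simplest form a polygenic score is a weighted sum of allele counts, with the GWAS effect estimates as weights. The sketch below shows only that basic idea. The weights and genotypes are invented, and the paper's actual construction (Supplementary Information section 5.2) may involve further steps not quoted here.]

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts (0, 1 or 2 per SNP) using GWAS effect sizes."""
    if len(allele_counts) != len(weights):
        raise ValueError("need one weight per SNP")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


# Hypothetical per-allele effects on EduYears for three SNPs (not from the paper).
example_weights = [0.02, -0.01, 0.015]
# Hypothetical individual: number of reference alleles carried at each SNP.
example_person = [2, 1, 0]

example_score = polygenic_score(example_person, example_weights)
print(f"polygenic score: {example_score:.3f}")
```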

we studied mediation of the association between the all-SNPs polygenic score and EduYears in two of our cohorts. We found that cognitive performance can statistically account for 23–42% of the association (P < 0.001) and the personality trait ‘openness to experience’ for approximately 7% (P < 0.001; Supplementary Information section 6).

Extended Data Figure 10 | The predictive power of a polygenic score (PGS) varies in Sweden by birth cohort. Five-year rolling regressions of years of education on the PGS (left axis in all four panels), share of individuals not affected by the comprehensive school reform (a, right axis), and average distance to nearest junior high school (b, right axis), nearest high school (c, right axis) and nearest college/university (d, right axis). The shaded area displays the 95% confidence intervals for the PGS effect.

[Since now everyone goes to high school and it has less to do with intelligence, of course the genetic contribution matters less... Not all college majors are created equal - a 4 year degree in Education is not the same as a 4-year degree in physics.]

We examined two phenotypes: a continuous variable measuring the number of years of schooling completed (EduYears, N = 293,723) and an indicator variable for college completion (College, N = 280,007). All analyses were performed at the cohort level according to a pre-specified and publicly archived analysis plan. Summary statistics provided by cohorts were uploaded to a central server and subsequently meta-analyzed. The lead PI of each cohort affirmed that the results contributed to the study were based on analyses approved by the local Research Ethics Committee and/or Institutional Review Board responsible for overseeing research. All participants provided written informed consent. Supplementary Table 1.1 provides basic information about the participating cohorts. Our Analysis Plan was preregistered at With one exception, the analyses reported here follow the original plan. The exception is that the original plan treated EduYears and College symmetrically whereas throughout the manuscript, we treat EduYears as the primary variable and de-emphasize College. After circulation of the Analysis Plan to our cohorts, a paper was posted on bioRxiv showing that the genetic correlation between the two measures is very high, with the point estimate suggesting a perfect genetic correlation 1 . Previously, we had considered as plausible the possibility that College would have better power for detecting associations at the upper end of the distribution of EduYears. However, since College is constructed by dichotomizing EduYears, the very high genetic correlation suggests that the College phenotype is for all intents and purposes merely a coarsening of the EduYears phenotype.

Supplementary Table 1.5 provides study-specific details on the analysis. Column 2 shows the association software used by each study analyst. The EduYears analyses are based on summary statistics from all 64 samples listed in Supplementary Table 1.1. Of the 64 samples, whose combined sample size is N=293,723, 5 were from single-sex cohorts, and 59 contained pooled results from mixed-sex cohorts (who additionally uploaded separate results for men and women).

Some cohorts made no adjustment for nonindependence but instead sought to restrict the estimation samples to conventionally unrelated individuals. For example, 23andMe restrict their estimation sample to conventionally unrelated individuals by ensuring that no pair of participants in the final estimation sample share more than 700 centimorgans of their genome identical-by-descent 9 .

For most EA cohorts, the average F_ST value was below 0.004, which agrees well with previous reports that F_ST is around 0.004 between European nations 17 . The largest F_ST, a value of 0.02, was observed for the cohort OGP-Talana. It is known that the central-eastern Sardinia region, Ogliastra, has been secluded from the surrounding regions for most of its history. Such isolation is expected to generate an unusually high F_ST 18 .

1.7 Within-Sample Replication
Following the suggestion of a referee, we attempted to replicate the genome-wide associations reported in our previous GWAS of EA 3 in the new cohorts that were added to this study. Conversely, we also examined if the SNPs that reach genome-wide significance in a meta-analysis of the new cohorts replicate in the Rietveld et al. cohorts.
Cohort Overlap with Rietveld et al. (2013)
The analyses of EduYears in Rietveld et al. 3 were based on a discovery sample of 101,069 individuals and a combined sample (discovery + replication) of 126,559 individuals. Some of the cohorts that contributed to the Rietveld et al. study did not participate in the present study (N = 13,981). Overall, the combined sample size of the Rietveld et al. cohorts that contributed to our study is N = 126,413 individuals. This number exceeds the difference between 126,559 and 13,981 because some of the original Rietveld et al. cohorts completed additional genotyping since 2013, and were hence able to contribute larger samples to the current study.
Methods in Within-Sample Replication Analyses
Rietveld et al. reported three genome-wide significant SNPs in their discovery sample, all of which replicated in their replication sample. These three SNPs also yielded lower P-values in the “combined” (discovery + replication) sample. In a meta-analysis of the combined sample, four additional SNPs reached genome-wide significance. Of these seven SNPs, five were genome-wide significant in the EduYears analyses. The remaining two only reached genome-wide significance in the analyses of College, but both had P-values just shy of genome-wide significance in the combined-sample EduYears analysis. Given our decision to make EduYears the primary phenotype, and to facilitate comparisons of effect sizes, we attempt to replicate all of the seven original associations in our meta-analyses of the EduYears variable. To examine if the seven associations replicate in our new cohorts, we split our overall sample into two subsamples comprising: (1) cohorts that participated in Rietveld et al. 3 and (2) all new cohorts that were added to the current study. In what follows we refer to the former as the “Rietveld Cohorts” and the latter as the “New Cohorts.” We refer to the combined-sample meta-analysis results reported by Rietveld et al. 3 as the “Rietveld et al. (2013) Cohorts.”
Within-Sample Replication Results
Supplementary Table 1.13 reports the results of the replication analysis. In the upper panel, we report for the seven SNPs, their standardized effect sizes, standard errors, and P-values. We report these statistics from three separate meta-analyses of EduYears conducted in: (i) the Rietveld et al. (2013) Cohorts (ii) the Rietveld Cohorts, and (iii) the New Cohorts. The reference allele is chosen to be the allele associated with higher values of EduYears in Rietveld et al.’s analysis (2013).
Given the high degree of overlap between cohorts in the previous EA meta-analysis 3 and the Rietveld Cohorts, the similarity of the effect-size estimates is unsurprising. Reassuringly, the sign of the estimated coefficient in the New Cohorts is always in the predicted direction, and for all but one of the seven SNPs we can reject the null hypothesis of no effect at the 5% significance level (two SNPs, rs4851266 and rs9320913, also reach genome-wide significance in the replication sample). For six of the seven SNPs, the 95% confidence intervals for the estimated effect sizes overlap across the Rietveld Cohorts and the New Cohorts.
To further examine replicability, we examined if SNPs that reach genome-wide significance in a meta-analysis of the New Cohorts replicate in the Rietveld Cohorts. Applying the pruning algorithm described in Supplementary Information section 1.6.1 to meta-analysis results for the New Cohorts resulted in 14 approximately independent SNPs. The results from this replication analysis are reported in Panel B of Supplementary Table 1.13. The results are similar to those of the replication of the associations from the Rietveld Cohorts in the New Cohorts: the signs align for all 14 SNPs, and 12 SNPs replicate at P-value < 0.05 in the Rietveld Cohorts (none of them at genome-wide significance, but 5 at P-value < 10^-5). In the two replication analyses, the average effects in the replication samples are about 35% smaller than the estimated effect of the genome-wide significant association, roughly consistent with the degree of inflation one would expect from a Winner’s Curse correction of the sort described and performed in the next subsection.
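
[Why replication effects shrink: a toy simulation of the winner's curse. This is only an illustration of the selection effect, not the correction the paper performs. The true z-score of 4.0 is a made-up value, and 5.45 is the approximate |z| corresponding to P < 5 × 10^-8.]

```python
import random
import statistics


def simulate_winners_curse(true_z=4.0, threshold=5.45, n_draws=20000, seed=1):
    """Draw noisy z-scores around a true value; keep only the 'significant' ones.

    Returns the mean z-score among estimates that pass the threshold,
    and how many estimates passed.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_z, 1.0) for _ in range(n_draws)]
    significant = [z for z in estimates if z > threshold]
    return statistics.mean(significant), len(significant)


mean_selected_z, n_selected = simulate_winners_curse()
inflation = mean_selected_z / 4.0  # ratio of selected mean to the true z-score
print(f"{n_selected} estimates passed; mean z among them = {mean_selected_z:.2f}")
print(f"inflation relative to the true effect: {inflation:.2f}x")
```

[Estimates that clear the significance bar are upwardly biased by construction, so an unbiased replication sample should on average show smaller effects, as the excerpt reports.]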

Using procedures identical to those described in SI Section 1.6, we conducted a meta-analysis of the EduYears phenotype, combining the results from our discovery cohorts (N = 293,723) and the results from the UKB replication cohort (N = 111,349). Expanding the overall sample size to N = 405,072 increases the number of approximately independent genome-wide significant loci from 74 to 162.

[huh? Why does the title & abstract focus on 74 hits rather than 162 hits?]

Running the LD Score regression on these data, we estimate an intercept of 1.0491 (Extended Data Fig. 3a), which is significantly larger than 1 (the standard error reported by the LDSC software is 0.0091). By comparison, the mean χ^2 statistic for all the SNPs in the LD Score regression is 1.5966. This suggests that there is some confounding bias (due to population stratification, cryptic relatedness, or other confounds) but that it accounts for only a small part of the inflation in the chi-square statistics. Thus, the inflation is largely attributable to true polygenic signal throughout the genome.
We note that the amount of inflation due to confounding bias is likely to be even smaller in our main GWAS results (e.g., in the estimates for the genome-wide significant SNPs)
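
[The "small part of the inflation" can be put into numbers with the three figures quoted above. The ratio (intercept - 1) / (mean chi-square - 1) is the usual summary for the share of inflation attributable to confounding; applying it here is my own arithmetic, not a number stated in the excerpt.]

```python
# Figures quoted in the excerpt above.
ldsc_intercept = 1.0491
ldsc_intercept_se = 0.0091
mean_chi2 = 1.5966

# How many standard errors the intercept sits above the null value of 1.
z_intercept = (ldsc_intercept - 1.0) / ldsc_intercept_se

# Share of the excess in mean chi-square that the intercept accounts for.
confounding_share = (ldsc_intercept - 1.0) / (mean_chi2 - 1.0)

print(f"intercept is {z_intercept:.1f} standard errors above 1")
print(f"share of inflation attributed to confounding: {confounding_share:.1%}")
```

[On these numbers, roughly 8% of the inflation would be confounding and the remaining ~92% polygenic signal.]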

As a first step, we estimate genetic overlap between EA and several other phenotypes. We define genetic overlap as the degree to which common regions of the genome are associated with different traits, i.e., the extent to which multiple phenotypes are associated with the same underlying genetic variants. Instead of relying on family data or individual-level genetic data, we estimate genetic overlap using the LD Score Regression procedure developed by Bulik-Sullivan et al 6 . Note that this procedure does not require that the GWAS samples are independent. In addition, we develop another SNP-based estimate of genetic overlap based on different assumptions that requires GWAS results from independent samples as inputs. This new measure of genetic overlap is conceptually similar, but not identical, to the measure estimated in bivariate GREML 7 . We compare the different measures of genetic overlap theoretically and empirically.
Next, we systematically investigate evidence of genetic overlap between EA and phenotypes related to (1) mental health and psychometric traits (including general cognitive performance and neuroticism), (2) brain anatomy, and (3) anthropometric traits. Henceforth, we refer to these phenotypes collectively as “MHBA” phenotypes. We chose to include in the analysis phenotypes for which the phenotypic correlation between EA and the trait has previously been established and GWAS summary statistics of the trait are available in the public domain. The final list of phenotypes includes: Alzheimer’s disease 8 , bipolar disorder 9 , schizophrenia 10 , cognitive performance 11,12 , neuroticism 13 , volumes of subcortical brain regions and total intracranial volume 14 , BMI 15 , and height 16 . The links we used to access the GWAS results for these traits are listed in Supplementary Table 3.1.

In Fig. 2 in the main text, we report the estimates of genetic overlap from the LD Score regression, along with 95% confidence intervals. In Supplementary Table 3.1, we report estimation results from both methods described above. Cognitive performance shows the strongest genetic overlap with EduYears (r = 0.82 & r_LD = 0.75). We also find substantial genetic overlap for mental health phenotypes, in particular for neuroticism (-0.37 & -0.41), Alzheimer’s disease (-0.20 & -0.31), and bipolar disorder (0.25 & 0.28). The positive genetic overlap between EduYears and bipolar disorder is noteworthy given that the phenotypic correlation is negative 19,20 .
Furthermore, we see substantial positive genetic overlap for intracranial volume (0.39 & 0.34) and height (0.16 & 0.13), as well as a strong negative overlap for BMI (-0.44 & -0.26).

Consistent with our finding of sign concordance with EduYears less than 50%, we find negative correlation of SNP coefficients with EduYears for Alzheimer’s, BMI, and neuroticism. Consistent with their sign concordance greater than 50%, we find positive correlation of SNP coefficients for cognitive performance, intracranial volume, and height (although for height, the sign concordance is not statistically distinguishable from 50%). An intriguing pattern is found for schizophrenia, which has a positive but near-zero estimated genetic correlation (r_LD = 0.08 with P = 3.2×10^-4) and a nearly equal percentage of concordant SNPs and discordant SNPs among the set of 74 that we tested (51% concordant)—and yet, as reported above, the enrichment of association of these SNPs for schizophrenia is strong (P < 0.002). We now turn to potential explanations for this result and discuss related literature.
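
[How indistinguishable from a coin flip is "51% concordant" out of 74? A rough exact sign test, assuming the 51% corresponds to 38 of 74 SNPs (my rounding; the excerpt gives only the percentage, and the paper's own test may differ).]

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided sign-test P-value against a 50% null (doubled smaller tail)."""
    total = 2 ** trials
    upper = sum(comb(trials, k) for k in range(successes, trials + 1)) / total
    lower = sum(comb(trials, k) for k in range(0, successes + 1)) / total
    return min(1.0, 2 * min(upper, lower))


assumed_concordant = 38  # assumption: 51% of 74, rounded to a whole SNP
sign_test_p = binomial_two_sided_p(assumed_concordant, 74)
print(f"two-sided P for {assumed_concordant}/74 concordant: {sign_test_p:.2f}")
```

[The P-value is far from any conventional threshold, which is what makes the strong enrichment result for the same SNPs look odd.]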
Discussion
Our work builds on earlier epidemiological research using genetically informative designs 3–5,25–29 . First, our results corroborate earlier findings that the genetic contribution to the positive relationship between cognitive performance and EA is substantial, but not perfect 1,30,31 . Second, earlier studies found that neuroticism is a powerful negative predictor of achievement across various domains including job performance, academic achievement, and performance on tests of cognitive performance, partly through test anxiety 32–36 . The strong negative genetic overlap between EA and neuroticism suggests that SNPs associated with EA may be good candidates for association with neuroticism. Third, our finding of a negative genetic correlation between EA and BMI corroborates earlier evidence from twin studies suggesting that the negative relationship between EA and BMI 37–41 is partially due to common genetic factors 2,25,42 . A possible hypothesis to explain this finding is that the genetic effects on BMI may be partially mediated by individual differences in self-control, impulsivity, and reward sensitivity 43–48 , which are also linked to learning and academic achievement 45–48 . Interestingly, the most recent GWAS on BMI found that genes associated with BMI are much more strongly expressed in the nervous system and sense organs than in the digestive system 15 . However, future research is needed to better understand the mechanisms underlying these findings.

Fifth, our results relate to ongoing research on schizophrenia and bipolar disorder. Earlier work has demonstrated links between these mental disorders on the one hand, and school performance, cognitive performance, creativity, and educational attainment on the other. Although these latter measures are related to each other and share a genetic basis, the phenotypic and genetic correlations between them are far from perfect 30,50,51 . Furthermore, their relationship with schizophrenia and bipolar disorder is rather complex and possibly U-shaped. On the one hand, low cognitive performance and low school performance have been reported as risk factors for schizophrenia and bipolar disorder 19,52–55 . For example, evidence from a large, population-based Swedish Multi-Generation Register suggests a weak negative correlation (-0.11) between IQ and psychosis (a term referring to mental disorders including both schizophrenia and bipolar disorder) 5 . Furthermore, it is demonstrated in ref. 28 that rare copy-number variants that are known to cause schizophrenia also predict lower cognitive performance in healthy individuals.
On the other hand, a higher prevalence of psychosis among individuals high in cognitive performance and creativity has been frequently reported 56–58 , and polygenic risk scores for bipolar disorder and schizophrenia have been reported to predict creativity in independent samples 29 . This suggests that some genetic variants that increase the risk for psychosis may also have positive effects on cognitive performance.
The relationship between educational attainment and schizophrenia specifically is similarly complex. Although early-onset schizophrenia is associated with school dropout 59 , no clear relationship is found between educational attainment and risk of schizophrenia 60 . More generally, the relationship between education and schizophrenia appears to depend on age at onset, duration, and severity of the disease, factors that often are not measured 61 . The failure to account for these factors in many empirical studies may contribute to the relatively weak or even seemingly contradictory results.
As suggested in ref. 62, it is possible that the clinical diagnoses of schizophrenia and bipolar disorder mask several disease subtypes that are caused by different biological mechanisms. This is one possible interpretation of our results for schizophrenia: The strong enrichment for association of our EA lead SNPs with schizophrenia, combined with a nearly equal percentage of concordant and discordant associations of our lead SNPs with these mental disorders, could point to different sub-types of schizophrenia that are lumped together by the current disease classification system. Alternatively, it may be that SNPs that are associated with schizophrenia happen to be in LD with SNPs that are associated with educational attainment simply because both sets of SNPs are primarily located in genes or genomic regions that are expressed in the brain. Such co-localization would generate a haphazard pattern of sign concordance. Follow-up research will need to differentiate between these different interpretations of our results.

This background suffices to motivate the biological questions that arise in the interpretation of GWAS results and the means by which these questions might be tentatively addressed. For starters, since a GWAS locus typically contains many other SNPs in LD with the defining lead SNP and with each other, it is natural to ask: which of these SNPs is the actual causal site responsible for the downstream phenotypic variation? Many SNPs in the genome appear to be biologically inert—neither encoding differences in protein composition nor affecting gene regulation—and a lead GWAS SNP may fall into this category and nonetheless show the strongest association signal as a result of statistical noise or happenstance LD with multiple causal sites. Fortunately, much is known from external sources of data about whether variation at a particular site is likely to have biological consequences, and exploiting these resources is our general strategy for fine-mapping loci: nominating individual sites that may be causally responsible for the GWAS signals. Descriptions of genomic sites or regions based on external sources of data are known as annotations, and readers will not go far astray if they interpret this term rather literally (as referring to a note of explanation or comment added to a text in one of the margins). If we regard the type genome as the basic text, then annotations are additional comments describing the structural or functional properties of particular sites or the regions in which they reside. For example, all nonsynonymous sites that influence protein structures might be annotated as such. An annotation can be far more specific than this; for instance, all sites that fall in a regulatory region active in the fetal liver might bear an annotation to this effect.
A given causal site will exert its phenotypic effect through altering the composition of a gene product or regulating its expression. Conceptually, once a causal site has been identified or at least nominated, the next question to pursue is the identity of the mediating gene. In practice, because only a handful of genes at most will typically overlap a GWAS locus, we can make some progress toward answering this question without precise knowledge of the causal site. The difficulty of the problem, however, should still not be underestimated. It is natural to assume that a lead GWAS SNP lying inside the boundaries of a particular gene must reflect a causal mechanism involving that gene itself, but in certain cases such a conclusion would be premature. It is possible for a causal SNP lying inside a certain gene to exert its phenotypic effect by regulating the expression of a nearby gene or for several genes to intervene between the SNP and its regulatory target.
Supplementary Table 4.1 ranks each gene overlapping a DEPICT-defined locus by the number of discrete evidentiary items favoring that gene (see Supplementary Information section 4.5 for details regarding DEPICT). These lines of evidence are taken from a number of our analyses to be detailed in the following subsections. Our primary tool for gene prioritization is DEPICT, which can be used to calculate a P-value and associated FDR for each gene. It is important to keep in mind, however, that a gene-level P-value returned by DEPICT refers to the tail probability under the null hypothesis that random sampling of loci can account for annotations and patterns of co-expression shared by the focal gene with genes in all other GWAS-identified loci. Although it is very reasonable to expect that genes involved in the same phenotype do indeed share annotations and patterns of co-expression, it may be the case that certain causal genes do not conform to this expectation and thus fail to yield low DEPICT P-values. This is why we do not rely on DEPICT alone but also the other lines of evidence described in the caption of Supplementary Table 4.1.

However, a priori we know that some SNPs are more likely to be associated with the phenotype than others; for example, it is often assumed that nonsynonymous SNPs are more likely to influence phenotypes than sites that fall far from all known genes. So a P-value of 5×10^-7, say, though not typically considered significant at the genome-wide level, might merit a second look if the SNP in question is nonsynonymous.
Formalizing this intuition can be done with Bayesian statistics, which combines the strength of evidence in favor of a hypothesis (in our case, that a genomic site is associated with a phenotype) with the prior probability of the hypothesis. Deciding how to set this prior is often subjective. However, if many hypotheses are being tested (for example, if there are thousands of nonsynonymous polymorphisms in the genome), then the prior can be estimated from the data themselves using what is called “empirical Bayes” methodology. For example, if it turns out that SNPs with low P-values tend to be nonsynonymous sites rather than other types of sites, then the prior probability of true association is increased at all nonsynonymous sites. In this way a nonsynonymous site that otherwise falls short of the conventional significance threshold can become prioritized once the empirically estimated prior probability of association is taken into account. Note that such favorable reweighting of sites within a particular class is not set a priori, but is learned from the GWAS results themselves. In our case, we split the genome into approximately independent blocks and estimate the prior probability that each block contains a causal SNP that influences the phenotype and (within each block) the conditional prior probability that each individual SNP is the causal one. Each such probability is allowed to depend on annotations describing structural or functional properties of the genomic region or the SNPs within it. We can then empirically estimate the extent to which each annotation predicts association with the focal phenotype. For a complete description of the fgwas method, see ref. 1.
For application to the GWAS of EduYears, we used the same set of 450 annotations as ref. 1; these are available at
Reweighted GWAS and Fine Mapping
We reweighted the GWAS results using the functional-genomic results described above. Using a regional posterior probability of association (PPA) greater than 0.90 as the cutoff, we identified 102 regions likely to harbor a causal SNP with respect to EduYears (Extended Data Fig. 7c and Supplementary Table 4.2.1). All but two of our 74 lead EduYears-associated SNPs fall within one of these 102 regions. The exceptions are rs3101246 and rs2837992, which attained PPA > 0.80 (Extended Data Fig. 7c). In previous applications of fgwas, the majority of novel loci that attained the equivalent of genome-wide significance only upon reweighting later attained the conventional P < 5×10^-8 in larger cohorts 1 .
Within each region attaining PPA > 0.90, each SNP received a conditional posterior probability of being the causal SNP (under the assumption that there is just one causal SNP in the region). The method of assigning this latter posterior probability is similar to that of ref. 6, except that the input Bayes factors are reweighted by annotation-dependent and hence SNP-varying prior probabilities. In essence, the likelihood of causality at an individual SNP derives from its Bayes factor with respect to phenotypic association (which is monotonically related to the P-value under reasonable assumptions), whereas the prior probability is derived from any empirical genome-wide tendency for the annotations borne by the SNP to predict evidence of association. Thus, the SNPs with the largest posterior probabilities of causality tend to exhibit among the strongest P-values within their loci and functional annotations that predict association throughout the genome. Note that proper calibration of this posterior probability requires that all potential causal sites have been either genotyped or imputed, which may not be the case in our application; we did not include difficult-to-impute non-SNP sites such as insertions/deletions in the GWAS meta-analysis. With this caveat in mind, we identified 17 regions where fine mapping amassed over 50 percent of the posterior probability on a single SNP (Supplementary Table 4.2.2). Of our 74 lead EduYears SNPs, 9 are good candidates for being the causal sites driving their association signals [12%]. One of our top SNPs, rs4500960, is in nearly perfect LD with the causal candidate rs2268894 (and is indeed the second most likely causal SNP in this region according to fgwas). The causal candidate rs6882046 is within 75kb of two lead SNPs on chromosome 5 (rs324886 and rs10061788), but no two of these three SNPs show strong LD.
Interestingly, the remaining 6 causal candidates lie in genomic regions that only attain the equivalent of genome-wide significance upon Bayesian reweighting. Of the 17 causal candidates, 9 lie in regions that are DNase I hypersensitive in the fetal brain.
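
[The single-causal-SNP posterior described above boils down to normalizing prior × Bayes factor within a region. A minimal sketch of that step only: the Bayes factors and priors are invented, and fgwas itself estimates the priors from annotations in a way not reproduced here.]

```python
def causal_posteriors(bayes_factors, priors):
    """Posterior that each SNP is the one causal SNP in its region.

    Each SNP's weight is prior * Bayes factor; weights are normalized to sum
    to 1, which encodes the 'exactly one causal SNP per region' assumption.
    """
    weighted = [bf * prior for bf, prior in zip(bayes_factors, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical region with three SNPs (values are illustrative only).
region_bayes_factors = [50.0, 40.0, 5.0]

flat_posteriors = causal_posteriors(region_bayes_factors, [1 / 3, 1 / 3, 1 / 3])
# Suppose the second SNP carries an annotation that predicts association.
annotated_posteriors = causal_posteriors(region_bayes_factors, [0.2, 0.6, 0.2])

print("flat priors:     ", [round(p, 3) for p in flat_posteriors])
print("annotated priors:", [round(p, 3) for p in annotated_posteriors])
```

[With flat priors the SNP with the strongest association wins; with an informative prior the annotated SNP overtakes it and crosses the 50% mark, which is the kind of reranking the excerpt describes.]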

Table 4.2.2:
Posterior probability of causality

[mean(c(0.524760, 0.567184, 0.629610, 0.632536, 0.682947, 0.697862, 0.725158, 0.755457, 0.766500, 0.781563, 0.784373, 0.832675, 0.837746, 0.842271, 0.885280, 0.968627, 0.992035)) = 0.76, 0.76*17=12.9]
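
[The same calculation spelled out: if the posteriors are well calibrated, their sum is the expected number of correctly identified causal SNPs among the 17 candidates. The calibration caveat from the excerpt applies, so this is only a rough reading.]

```python
# Posterior probabilities of causality for the 17 fine-mapped candidates,
# as listed in the note above.
posteriors = [
    0.524760, 0.567184, 0.629610, 0.632536, 0.682947, 0.697862,
    0.725158, 0.755457, 0.766500, 0.781563, 0.784373, 0.832675,
    0.837746, 0.842271, 0.885280, 0.968627, 0.992035,
]

mean_posterior = sum(posteriors) / len(posteriors)
# Expected count of true causal SNPs = sum of the individual probabilities.
expected_causal = sum(posteriors)

print(f"{len(posteriors)} candidates, mean posterior {mean_posterior:.2f}")
print(f"expected number that are truly causal: {expected_causal:.1f}")
```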

The results from both approaches show that prediction accuracy increases as more SNPs are used to construct the score, with the maximum predictive power achieved when using all the genotyped SNPs (with Approach 1). In that case, the weighted average across the two cohorts of the incremental R^2 is ~3.85%.

[Versus 2% from Rietveld's n=100k; this is in line with the rough doubling of the main SSGAC sample size. The additional UK Biobank sample of n=111k does not seem to have been used but if it was used, should boost the polygenic score to ~5.3%?]

...The magnitude of predictive power that we observe is less than one might have expected on the basis of statistical genetics calculations 6 and GCTA-GREML estimates of “SNP heritability” from individual cohorts. Indeed, Rietveld et al. (2013) 7 reported GCTA-GREML estimates of SNP heritability for each of two cohorts (STR and QIMR), and the mean estimate was 22.4%. Assuming that 22.4% is in fact the true SNP heritability, the calculations outlined in the SOM of Rietveld et al. (pp. 22-23) generate a prediction of R^2 = 11.0% for a score constructed from the GWAS estimates of this paper and of R^2 = 6.1% for a score constructed from the combined (discovery + replication cohorts, but excluding the validation cohorts) GWAS sample of N = ~117,000-119,000 in Rietveld et al.—substantially higher than the 3.85% that we achieve here (with the score based on all genotyped SNPs) and the 2.2% Rietveld et al. achieved, respectively.
These discrepancies between the scores’ predicted and estimated R^2 may be due to the failure of some of the assumptions underlying the calculation of the predicted R^2. An alternative (or additional) explanation is that the true SNP heritability for the GWAS sample pooled across cohorts is lower than 22.4%. That would be the case if the true GWAS coefficients differ across cohorts, perhaps due to heterogeneity in phenotype measurement or gene-by-environment interactions. If so, then a polygenic score constructed from the pooled GWAS sample would be expected to have lower predictive power in an individual cohort than implied by the calculations above. Based on that reasoning, the R^2 of 2.2% observed by Rietveld et al. (2013) could be rationalized by assuming that the proportion of variance accounted for by common variants across the pooled Rietveld cohorts is only 12.7% 6 . (We obtain a similar estimate, 11.5% with a standard error of 0.45%, when we use LD Score regression 5 to estimate the SNP heritability using our pooled-sample meta-analysis results from this paper, excluding deCODE and without GC. While we believe this estimate is based on cohort results without GC, it is biased downward if any cohort in fact applied GC.) If we assume that the 12.7% is valid also for the cohorts considered in this study, we would predict an R^2 equal to 4.5%, somewhat higher than we observe in HRS and STR but much closer. However, the degree of correlation in coefficients across cohorts appears to be relatively high (Supplementary Table 1.10 reports estimates of the genetic correlation between selected cohorts and deCODE; although the correlation estimates vary a lot across cohorts, they tend to be large for the largest cohorts, and the weighted average is 0.76). We do not know whether a pooled-cohort SNP heritability of 12.7% or lower can be reconciled with the observed degree of correlation in coefficients across cohorts.
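
[A rough way to reproduce the predicted R^2 figures above, using the standard approximation R^2 = h^2 × N·h^2 / (N·h^2 + M). Here M is the effective number of independent SNPs, which I set to 70,000 as an assumption. The excerpt does not quote the exact calculation from the Rietveld et al. SOM, so treat this as a sketch that happens to land close to the quoted numbers.]

```python
def expected_r2(n: float, h2: float, m_eff: float = 70_000) -> float:
    """Approximate out-of-sample R^2 of a polygenic score.

    n     : GWAS discovery sample size
    h2    : SNP heritability of the trait
    m_eff : effective number of independent SNPs (assumed value, not from the source)
    """
    return h2 * (n * h2) / (n * h2 + m_eff)


scenarios = [
    ("Rietveld N~118k, h2=22.4%", 118_000, 0.224),
    ("Rietveld N~118k, h2=12.7%", 118_000, 0.127),
    ("this paper N=293,723, h2=12.7%", 293_723, 0.127),
    ("with UK Biobank N=405,072, h2=12.7%", 405_072, 0.127),
]
for label, n, h2 in scenarios:
    print(f"{label}: predicted R^2 = {expected_r2(n, h2):.1%}")
```

[Under these assumptions the first two scenarios come out near the 6.1% and 2.2% quoted above, and adding the UK Biobank sample gives something a little over 5%, in the same range as my guess earlier.]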

The results are reported in Supplementary Tables 6.3 and 6.4. In both the STR and the HRS, cognitive performance significantly mediates the effect of PGS on EduYears; in the HRS, Openness to Experience is also a significant mediator. The indirect effects for the other mediating variables are not significant.
The results for cognitive performance are similar across STR and HRS. In both datasets, a one-standard-deviation increase in PGS is associated with ~0.6-0.7 more years of education, and a one-standard-deviation increase in cognitive performance is associated with ~0.15 more years of education. In both datasets, the direct effect (θ₁) of PGS on EduYears is ~0.3-0.4 and the total indirect effect (β₁θ₂) is ~0.19-0.31. This implies that a one-standard-deviation increase in PGS is associated with ~0.3-0.4 more years of education, keeping the mediating variables constant, and that changing the mediating variables to the levels they would have attained had PGS increased by one standard deviation (but keeping PGS fixed) increases years of education by ~0.19-0.31 years. Lastly, in both datasets, the partial indirect effect (θ₂₁β₁₁) of cognitive performance is large and very significant: the estimates are equal to 0.29 and 0.14 (42% and 23% of the total effect, γ₁) in STR and HRS, respectively. The results also suggest that a one-standard-deviation increase in Openness to Experience is associated with ~0.06 more years of education, and the estimated partial indirect effect for Openness to Experience is equal to 0.04, or 7% of the total effect (γ₁).
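
The mediation figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes the usual decomposition in which the total effect is the direct effect plus the total indirect effect, and the function names are hypothetical. It backs out the total effect implied by each reported partial indirect effect and its stated share.

```python
def implied_total_effect(partial_indirect: float, share_of_total: float) -> float:
    """Total effect implied by a partial indirect effect and its share of the total."""
    return partial_indirect / share_of_total


def indirect_share(direct: float, indirect: float) -> float:
    """Share of the total effect (direct + indirect) that runs through mediators."""
    return indirect / (direct + indirect)


# Reported partial indirect effect of cognitive performance and its share, by cohort.
reported = {"STR": (0.29, 0.42), "HRS": (0.14, 0.23)}

implied = {
    cohort: implied_total_effect(partial, share)
    for cohort, (partial, share) in reported.items()
}

for cohort, total in implied.items():
    print(cohort, round(total, 2))  # STR 0.69, then HRS 0.61
```

Both implied totals land in the ~0.6-0.7 range quoted for a one-standard-deviation increase in PGS, so the reported shares are at least internally consistent.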

Razib comments:

> But look at all the functional associations and analysis in this paper! Some serious biology in this. The figure from the paper to the left shows how the genes associated with these SNP hits are expressed in different tissue types and organs. These are the biggest effect SNPs for years of education in the genome, so it makes sense that they’d be way over-expressed in the brain. It is definitely more convincing to those who might be skeptical a priori than some statistically robust associations (well, it should be more convincing at least).
"The findings have proved divisive. Some researchers hope that the work will aid studies of biology, medicine and social policy, but others say that the emphasis on genetics obscures factors that have a much larger impact on individual attainment, such as health, parenting and quality of schooling. “Policymakers and funders should pull the plug on this sort of work,” said anthropologist Anne Buchanan and genetic anthropologist Kenneth Weiss at Pennsylvania State University in University Park in a statement to Nature. “We gain little that is useful in our understanding of this sort of trait by a massively large genetic approach in normal individuals.”"

'you can't prove genetics matters, which is why your funding should be cut, so you can't prove genetics matters'

"The results of this study and future work will enable us to better understand how these pathways interact," King continued. "Perhaps ultimately, we'll be able to learn why and how educational attainment seems to be protective of cognition in later life."
[specifically, the Mendelian randomization will demonstrate that education has no protective effect...]


#psychology #intelligence #education #gwas #genetics  

gwern branwen

Shared publicly  - 
Everything is heritable: "Detection of human adaptation during the past 2,000 years", Field et al 2016:

"Detection of recent natural selection is a challenging problem in population genetics, as standard methods generally integrate over long timescales. Here we introduce the Singleton Density Score (SDS), a powerful measure to infer very recent changes in allele frequencies from contemporary genome sequences. When applied to data from the UK10K Project, SDS reflects allele frequency changes in the ancestors of modern Britons during the past 2,000 years. We see strong signals of selection at lactase and HLA, and in favor of blond hair and blue eyes. Turning to signals of polygenic adaptation we find, remarkably, that recent selection for increased height has driven allele frequency shifts across most of the genome. Moreover, we report suggestive new evidence for polygenic shifts affecting many other complex traits. Our results suggest that polygenic adaptation has played a pervasive role in shaping genotypic and phenotypic variation in modern humans."


Previously: "Genome-wide patterns of selection in 230 ancient Eurasians", Mathieson et al 2015. Background on methods:

As I mentioned then and recently on SSC, and as Cochran emphasizes, as more ancient genomes become available, we will spot more and more signs of selection in the form of soft polygenic sweeps creating between-population differences. Like with finding IQ genes, it is a matter of sample size / power, and as the sample size increases geneticists will be 'surprised' to discover all the selection which has been going on over the past 10,000 years.

This paper is particularly neat because it bypasses the slow accumulation of ancient DNA for direct analysis of abundant modern SNP / whole-genome data. As they only use ~3k whole genomes†, we may be able to expect much bigger upcoming finds of selection.

† I am not clear why the sample is so small; they only report signs of selection on SNPs, and I'm not sure whether they use whole-genomes simply because they have access to them and not, say, the UK Biobank, or whether the whole-genomes are genuinely needed to do the ancestry/coalescing reconstruction and all the selection is on SNPs because their functions have often been established already in GWASes, so they can say 'there was selection on SNP X for trait Y'. EDIT: apparently the method does depend on whole-genomes to find unique mutations, according to one of the authors:

"We use 3k genomes (from UK10K) because we need whole-genome data, not genotyping data, to compute our singleton density score (SDS). Singletons are variants private to 1 individual which aren't detectable in genotyping for variants already known; as a result, we are limited to using whole-genome sequencing data. As we get more of that for more populations, we hope to zoom in even closer on evolution - won't that be exciting! :)"

gwern branwen

Shared publicly  - 
Web interface to scores of GWAS results, allowing combined inference over them all. Gets around privacy & scaling problems by using LD score regression on summary statistics.

gwern branwen

Shared publicly  - 
"Genetic link between family socioeconomic status and children's educational achievement estimated from genome-wide SNPs", Krapohl & Plomin 2016:

"One of the best predictors of children’s educational achievement is their family’s socioeconomic status (SES), but the degree to which this association is genetically mediated remains unclear. For 3000 UK-representative unrelated children we found that genome-wide single-nucleotide polymorphisms could explain a third of the variance of scores on an age-16 UK national examination of educational achievement and half of the correlation between their scores and family SES. Moreover, genome-wide polygenic scores based on a previously published genome-wide association meta-analysis of total number of years in education accounted for ~3.0% variance in educational achievement and ~2.5% in family SES. This study provides the first molecular evidence for substantial genetic influence on differences in children’s educational achievement and its association with family SES.

Here we report the first investigation of genetic influence on the variance of children’s educational achievement using DNA alone. The same DNA-based methods can also be used to estimate genetic influence on the covariance between traits.17 This enabled us to investigate possible genetic mediation of the best predictor of children’s educational achievement, their family’s SES.18, 19 This correlation is often interpreted causally as family SES causing differences in children’s educational achievement.20 However, it remains unclear whether and to what extent the association between family SES and children’s educational achievement is genetically mediated, because twin and family research is limited to studying phenotypes that can vary within a family. Key aspects of children’s environment such as poverty, parental education and neighborhood cannot be investigated using the twin method because it is methodologically impossible to decompose variance in phenotypes shared within twin pairs.

GWA attempts aimed at identifying individually significant SNPs have generally captured only extremely small fractions of genetic variance of complex traits, the so-called missing heritability problem.22 However, evidence has been accumulating that significant portions of phenotypic variation can be explained by the ensemble of markers not achieving genome-wide significance.23 Markers are identified from GWAs using an initial discovery sample to construct a genome-wide polygenic score (GPS) in an independent replication sample by calculating the effect-size-weighted sum of trait-associated alleles for each individual. An aggregate GPS score can be used to assess genetic influence on trait variation.
As they are tapping into the same genetic signal, GPS based on GWA results and GCTA can be applied to the same data sets, with both estimating the polygenic contribution to trait variance or a shared polygenic covariance between traits captured by the additive effects of common SNPs. We therefore employ a two-method approach using GCTA and GPS to explore the genetic influence on the variance of children’s educational achievement and on the covariance between family SES and children’s educational achievement. Our study had four objectives:
(1) To estimate, for the first time using DNA data, genetic influences on children’s educational achievement on an age-16 UK national examination of educational achievement using genome-wide genotypes from >3000 conventionally unrelated children. Specifically, we conduct GCTA11 to quantify pairwise genomic similarity between each pair of individuals across millions of SNPs throughout the genome in order to estimate the proportion of phenotypic variation in children’s educational achievement captured by all SNPs simultaneously.
(2) To investigate genetic mediation of the phenotypic correlation between family SES and children’s educational achievement, we conduct bivariate GCTA to estimate the proportion of phenotypic covariation between children’s family SES and children’s educational achievement captured by children's genotypes.
(3) To create a GPS based on the results of a large GWA study on adults’ total years of schooling13 and investigate its association with variance in children’s educational achievement and their family SES.
(4) To examine the role of general cognitive ability (intelligence) in the genetic nexus between children’s educational achievement and their family SES. Molecular evidence as well as twin studies have shown that cognitive ability is heritable and accounts for substantial portion of genetic variance in educational achievement.7, 24, 25, 26 In addition, recent molecular evidence from the present sample of unrelated individuals showed high genetic correlation between family SES and children’s intelligence at age 7 and 12 years.27 Based on this evidence, it is important to address the question to what extent the genetic link between family SES and children’s educational achievement is mediated by intelligence. For this reason, we perform GCTA mediation analyses to test for a direct genetic link between family SES and children’s educational achievement independent of cognitive ability. Complementarily, we test whether the GPS of adults’ total years of schooling explains variance in children’s educational achievement independently of cognitive ability.

DNA data were available for 3747 children whose first language was English and had no major medical or psychiatric problems. From that sample, 3665 DNA samples were successfully hybridized to Affymetrix GeneChip 6.0 SNP genotyping arrays (Affymetrix, Santa Clara, CA, USA) using standard experimental protocols as part of the WTCCC2 project (for details see Trzaskowski et al.).31 In addition to nearly 700 000 genotyped SNPs, more than one million other SNPs were imputed from HapMap 2, 3 and WTCCC controls using IMPUTE v.2 software.32 A total of 3152 DNA samples (1446 males and 1706 females) survived quality control criteria for ancestry, heterozygosity, relatedness and hybridization intensity outliers. To control for ancestral stratification, we performed principal component analyses on a subset of 100 000 quality-controlled SNPs after removing SNPs in linkage disequilibrium (r²>0.2).33 Using the Tracy–Widom test,34 we identified 8 axes with P<0.05 that were used as covariates in GCTA and polygenic score analyses.

Educational achievement: Educational achievement was operationalized as performance on the standardized UK-wide examination, the General Certificate of Secondary Education (GCSE), taken by almost all (>99%) pupils at the end of compulsory education, typically at the age of 16 years. English, mathematics and science are compulsory subjects. Five or more GCSEs with grades A*–C are required for further education, including GCSE English and GCSE mathematics. The joint performance on these three compulsory subjects determines admission to further education and employability...The GCSE measure for the present analyses was the mean grade of the three compulsory core subjects, mathematics, English (mean grade of ‘English Language’ and ‘English Literature’), and science (mean of any science subjects taken), requiring at least two measures to be nonmissing. Scores on the three compulsory core subjects were highly correlated (0.65–0.81).
Intelligence (IQ): Individuals were assessed at the ages of 2, 3, 4, 7, 9, 10, 12, 14, and 16 years on general cognitive ability using a battery of parent-administered and phone- and web-based tests. At ages 2, 3, and 4, tests were parent-administered and validated against standard tests administered by a trained tester. At age 7, tests were administered over the phone; at age 9, parents administered the tests; and at the ages 10–16, tests were web based. At each testing age, individuals completed at least two ability tests that assessed verbal and nonverbal intelligence. Psychometric properties of the tests have been described in detail elsewhere,36 with the exception of the measurements used at age 16 years, where subjects completed a web-based adaptation of Raven’s Standard and Advanced Progressive Matrices and the Mill-Hill Vocabulary Scale.37, 38, 39

The present sample size of ~3000 yields 80% power to detect a GCTA heritability estimate of 30% (α=0.05) and genetic correlation estimate of 0.6 (α=0.05; VG1=0.20; VG2=0.30; rPh=0.50).

Polygenic scores: We created polygenic scores from genome-wide data of over 3000 unrelated children using GWA results for total years of schooling from an independent discovery sample.13 The same quality control criteria as for the GCTA analyses were applied to the data. Polygenic risk scores were constructed using the P-values and β-weights from the recent large (N=126 559) GWA based on years of education.6 Quality-controlled SNPs were pruned for linkage disequilibrium based on P-value informed clumping in PLINK,44 using R²=0.25 cutoff within a 200-kb window. We removed the major histocompatibility complex region of the genome because of its complex linkage disequilibrium structure. 144 890 SNPs survived linkage disequilibrium pruning. For each individual, multiple polygenic scores were generated using the PLINK score option based on the top SNPs from the GWA analysis of educational attainment for varying significance thresholds (from 0.01 to 0.50). Numbers of SNPs per threshold are summarized in Supplementary Table 3. The scores were calculated as the sum across SNPs of the number of reference alleles for each SNP multiplied by the effect size (β-coefficient) derived from the GWA analysis of years of education.
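
The scoring rule described above (sum over SNPs of reference-allele count times β, restricted to SNPs passing a P-value threshold) can be sketched in a few lines. This is an illustration only: the SNP ids, effect sizes, P-values and genotype are made up, the function name is hypothetical, and the sketch does not reproduce PLINK's own scoring or the LD clumping step.

```python
from typing import Dict, Tuple

# Hypothetical summary statistics: SNP id -> (beta per reference allele, GWA P-value).
GWA_RESULTS: Dict[str, Tuple[float, float]] = {
    "snp_a": (0.020, 0.001),
    "snp_b": (-0.010, 0.20),
    "snp_c": (0.015, 0.45),
    "snp_d": (0.030, 0.80),
}


def polygenic_score(
    allele_counts: Dict[str, int],
    gwa: Dict[str, Tuple[float, float]],
    p_threshold: float,
) -> float:
    """Sum of (reference-allele count x beta) over SNPs with P <= p_threshold."""
    score = 0.0
    for snp, count in allele_counts.items():
        if snp not in gwa:
            continue  # SNP absent from the discovery GWA: contributes nothing
        beta, p_value = gwa[snp]
        if p_value <= p_threshold:
            score += count * beta
    return score


# Hypothetical genotype: number of reference alleles (0, 1 or 2) at each SNP.
child = {"snp_a": 2, "snp_b": 1, "snp_c": 0, "snp_d": 2}

# One score per significance threshold, as in the text (0.01 to 0.50).
scores = {t: polygenic_score(child, GWA_RESULTS, t) for t in (0.01, 0.50)}
```

With the stricter threshold only `snp_a` contributes; loosening it to 0.50 adds `snp_b` and `snp_c` but still excludes `snp_d`, which is why each individual ends up with several scores.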

Phenotypically, children’s educational achievement correlated 0.50 (0.02 s.e.) with their family SES. Both variables also correlated with intelligence: 0.55 (0.02 s.e.) for educational achievement and 0.38 (0.02 s.e.) for family SES (Supplementary Table 1).
Bivariate GCTA: Bivariate GCTA showed that the estimated proportion of variance tagged by the sampled SNPs was 0.31 (0.12 s.e.) in educational achievement, and 0.20 (0.11 s.e.) in family SES (Figure 1). The genetic correlation, indicating the extent to which the same SNPs are associated with family SES and children’s educational achievement, was near unity (rG=1.02 (0.25 s.e.)).
Based on the genetic correlation between the two traits and the genetic contribution to variance of each trait respectively, GCTA estimates the genetic contribution to the phenotypic correlation between the two traits: C(G) = r₁,₂(G) × √(V₁(G) × V₂(G)); applied to the data: 0.25 = 1.02 × √(0.31 × 0.20). Hence, GCTA estimated the genetic contribution to the phenotypic correlation between family SES and children’s educational achievement as 0.25 (0.09 s.e.), indicating that the proportion of the observed correlation tagged by the additive effects of available SNPs was 50% (that is, 0.25/0.50; Figure 1). This suggests approximately half of the phenotypic correlation between children’s family SES and their educational achievement was mediated genetically.
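
The calculation in the paragraph above is easy to reproduce. A minimal sketch, using only the point estimates quoted in the text (the function and variable names are hypothetical, and standard errors are ignored):

```python
import math


def genetic_contribution(r_g: float, v1_g: float, v2_g: float) -> float:
    """C(G) = rG * sqrt(V1(G) * V2(G)): genetic part of the phenotypic correlation."""
    return r_g * math.sqrt(v1_g * v2_g)


# Point estimates quoted in the text.
r_g = 1.02            # genetic correlation between family SES and GCSE scores
v_achievement = 0.31  # SNP-tagged variance in educational achievement
v_ses = 0.20          # SNP-tagged variance in family SES
r_phenotypic = 0.50   # observed correlation between the two traits

c_g = genetic_contribution(r_g, v_achievement, v_ses)
share = c_g / r_phenotypic

print(round(c_g, 2), round(share, 2))  # 0.25 0.51
```

The unrounded share comes out at about 51%; the paper's 50% follows from dividing the rounded 0.25 by 0.50.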

Our GCTA heritability estimate of 20% for family SES tagged by children’s genotypes is very similar to GCTA heritability estimates of years of education in adulthood and socioeconomic measures tagged by adults’ genotypes themselves in previous studies.13, 14, 15 This is remarkable as children’s genotypes are only a proxy for their parents’ genotypes. In other words, GCTA effects on family SES estimated from children’s DNA only reflect the extent to which children inherit parental characteristics associated with the family SES created by the parents. One such factor is intelligence, and we find that children’s intelligence accounts for about one-third of the GCTA association between family SES and children’s educational achievement. However, it is interesting that two-thirds of the GCTA association is not accounted for by children’s intelligence. This finding of intelligence-independent shared genetic variance between family SES and children’s educational achievement suggests that differences in educational achievement at the end of compulsory education and the level of education and occupation attained in adulthood are not merely the manifestation of differences in intelligence. This is in line with twin research that suggests that the heritability of educational achievement reflects many genetically influenced traits such as personality and self-efficacy, not just intelligence.48

Our results also contribute to the extensive debate about meritocracy and social mobility62 that has largely ignored the fact that parents and their offspring are genetically related. Usually a lower correlation between parental and offspring SES is seen as an index of social mobility.63 However, considering genetics, we know that removing environmental sources of variation will not remove genetically driven resemblance between parents and offspring. To the contrary, as environmental differences diminish, individual differences that remain will to a larger proportion be due to genetic differences; that is, heritability would increase, which has also been demonstrated empirically.55 That way, heritability could be seen as an index of social mobility."
Molecular Psychiatry publishes work aimed at elucidating biological mechanisms underlying psychiatric disorders and their treatment

gwern branwen

Shared publicly  - 
"One-shot Learning with Memory-Augmented Neural Networks", Santoro et al 2016:

"Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of "one-shot learning." Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms."

gwern branwen

Shared publicly  - 
Local food is not about eating locally. A deep investigation this did not require - as the restaurateurs point out, no one would pay the genuine prices of local food, because local food is comically expensive.

Are they villains here? Certainly not. Qui vult decipi decipiatur. Since local food is not about eating locally, they are doing their customers a favor by providing the fantasy & requisite signaling at an affordable price, and then taking the blame when discovered.
Application to homeopathic remedies is actually quite apt here.

gwern branwen

Shared publicly  - 
Rapamycin trials summary. Unfortunately, I must still have that old 'cloud-to-butt' browser plugin installed, because for some reason, most of the places this article should read 'human', it keeps reading 'dog'. Very odd. Still, I'm glad to hear that the human rapamycin trials are showing initial positive results, as expected, and doubtless rapamycin enrolment will start ramping up as fast as an adaptive trial process deems optimal for saving QALYs.
A drug that lengthened the lives of laboratory mice is being tested on dogs as scientists look for alternatives to treating one disease at a time.

gwern branwen

Shared publicly  - 
Margaret, are you grieving
Basic Information