"Genome-wide association study identifies 74 [162] loci associated with educational attainment", Okbay et al 2016; abstract:

"Educational attainment is strongly influenced by social and other environmental factors, but genetic factors are estimated to account for at least 20% of the variation across individuals1. Here we report the results of a genome-wide association study (GWAS) for educational attainment that extends our earlier discovery sample1, 2 of 101,069 individuals to 293,723 individuals, and a replication study in an independent sample of 111,349 individuals from the UK Biobank. We identify 74 [162 total] genome-wide significant loci associated with the number of years of schooling completed. Single-nucleotide polymorphisms associated with educational attainment are disproportionately found in genomic regions regulating gene expression in the fetal brain. Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. Our findings demonstrate that, even for a behavioural phenotype that is mostly environmentally determined, a well-powered GWAS identifies replicable associated genetic variants that suggest biologically relevant pathways. Because educational attainment is measured in large numbers of individuals, it will continue to be useful as a proxy phenotype in efforts to characterize the genetic influences of related phenotypes, including cognition and neuropsychiatric diseases."

supplementary info: http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature17671-s1.pdf http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature17671-s2.xlsx http://www.thessgac.org/#!data/kuzq8 http://ssgac.org/documents/FAQ_74_loci_educational_attainment.pdf

[The long-awaited SSGAC followup to Rietveld et al 2013. This almost quadruples the sample size, yielding no fewer than 162 genome-wide-significant SNP hits! The full polygenic score explains >3.8% of the variance in years of education. This is going to make Mendelian randomization and other designs far more powerful, and of course, has considerable implications for embryo selection/editing. With all the hits to play with, Okbay et al 2016 is able to do a lot of fun things like noting that the SNPs seem heavily involved in fetal brain development (sorry, everyone who was dreaming of a 'CRISPR booster shot' for increasing your adult intelligence), changes in involvement over a life span, changes in education heritability over time (reduced, probably because schooling requires less intelligence), and fine-mapping/annotation-based estimation of the probability that hits are causal (decent for a subset). Very cool.

The downside: the paper is atrociously written and very confusing. For example, the figure of 162 hits is mentioned nowhere in the paper. You have to read the supplementary information to learn that, astoundingly enough. For many of the results, it is unclear which datasets went into them. The paper itself is half-useless since all the details are in the 150-page supplement and the separate tables. Some things never get reported; for example, I can't find the full polygenic score for predicting cognitive performance, even though they casually use it in a mediation analysis, showing that intelligence explains much of how these particular genes are boosting educational achievement. Overall, the paper is overstuffed and underdone: more like 1.5 papers, really.

I think what happened is this. I heard that the original paper was accepted by Nature sometime around October 2015, and I've been waiting impatiently ever since. Then the UK Biobank papers started rolling out while the paper was stuck in peer review or publishing hell. They tried to incorporate the first batch of n=100k Biobank data, but only had time to sort of half-ass the updated paper and use it as a heldout validation set. This is why it's so confusing, and why the title and abstract emphasize 74 hits (which were the original SSGAC results) while the full meta-analysis yields 162 - because it was easier to throw that into the supplement than revise the entire paper.]

Extended Data Fig. 2 shows the estimated effect sizes of the lead SNPs. The estimates range from 0.014 to 0.048 standard deviations per allele (2.7 to 9.0 weeks of schooling), with incremental R^2 in the range 0.01% to 0.035%.
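As a rough check on these figures, the sketch below backs out the standard deviation of EduYears implied by the quoted SD-to-weeks conversions, and computes a per-SNP R^2 under the standard additive model, R^2 = 2p(1-p)β^2 for a standardized phenotype. The 52-weeks-per-year conversion and the allele frequency of 0.5 are assumptions for illustration; the excerpt does not give the lead SNPs' frequencies.

```python
WEEKS_PER_YEAR = 52.0  # assumption: the exact conversion used in the paper is not quoted here


def implied_sd_years(beta_sd, weeks):
    """SD of EduYears (in years) implied by 'beta_sd SDs per allele = weeks of schooling'."""
    return weeks / (beta_sd * WEEKS_PER_YEAR)


def snp_r2(beta_sd, maf):
    """Share of phenotypic variance explained by one biallelic SNP (additive model)."""
    return 2.0 * maf * (1.0 - maf) * beta_sd ** 2


# The two (effect size, weeks) pairs quoted in the text.
print(implied_sd_years(0.014, 2.7))
print(implied_sd_years(0.048, 9.0))

# Hypothetical allele frequency of 0.5 for the smallest quoted effect.
print(snp_r2(0.014, 0.5))
```

Both conversions imply an SD of roughly 3.6 to 3.7 years of schooling, and the smallest effect at frequency 0.5 gives about 0.01%, the lower end of the quoted R^2 range. The largest effect at frequency 0.5 would exceed the quoted upper bound of 0.035%, which suggests (but does not prove) that the largest per-allele effects belong to less common alleles.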
To quantify the amount of population stratification in the GWAS estimates that remains even after the stringent controls used by the cohorts (Supplementary Information section 1.4), we used linkage-disequilibrium (LD) score regression 4 . The regression results indicate that ~8% of the observed inflation in the mean χ^2 is due to bias rather than polygenic signal (Extended Data Fig. 3a), suggesting that stratification effects are small in magnitude. We also found evidence for polygenic association signal in several within-family analyses, although these are not powered for individual SNP association testing (Supplementary Information section 2 and Extended Data Fig. 3b). To further test the robustness of our findings, we examined the within-sample and out-of-sample replicability of SNPs reaching genome-wide significance (Supplementary Information sections 1.7–1.8). We found that SNPs identified in the previous educational attainment meta-analysis replicated in the new cohorts included here, and conversely, that SNPs reaching genome-wide significance in the new cohorts replicated in the old cohorts. For the out-of-sample replication analyses of our 74 lead SNPs, we used the interim release of the UK Biobank 5 (UKB) (n = 111,349). As shown in Extended Data Fig. 4, 72 out of the 74 lead SNPs have a consistent sign (P = 1.47 × 10^−19), 52 are significant at the 5% level (P = 2.68 × 10^−50), and 7 reach genome-wide significance in the UK Biobank data set (P = 1.41 × 10^−42). For comparison, the corresponding expected numbers, assuming each SNP’s true effect size is its estimated effect adjusted for the winner’s curse, are 71.4, 40.3, and 0.6 (Supplementary Information section 1.8.2). We also find out-of-sample replicability of our overall GWAS results: the genetic correlation between EduYears in our meta-analysis sample and in the UKB data is 0.95 (s.e. = 0.021; Supplementary Table 1.14).
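The sign-concordance P-value above is what a one-sided exact binomial test against a 50% chance of agreement would give. That this is the test the authors used is an assumption here, and the function name is illustrative.

```python
from math import comb


def sign_test_p(k, n):
    """One-sided P(X >= k) for X ~ Binomial(n, 1/2), computed exactly with integers."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# 72 of the 74 lead SNPs had the same sign in the UK Biobank replication.
print(f"{sign_test_p(72, 74):.3g}")
```

This prints 1.47e-19, which agrees with the quoted P = 1.47 × 10^−19.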

As shown in Fig. 2, based on overall summary statistics for associated variants, we find genetic covariance between increased educational attainment and increased cognitive performance (P = 9.9 × 10^−50), increased intracranial volume (P = 1.2 × 10^−6), increased risk of bipolar disorder (P = 7 × 10^−13), decreased risk of Alzheimer’s (P = 4 × 10^−4), and lower neuroticism (P = 2.8 × 10^−8). We also found positive, statistically significant, but very small, genetic correlations with height (P = 5.2 × 10^−15) and risk of schizophrenia (P = 3.2 × 10^−4).

[hm, doesn't increased bipolar & schizophrenia risk contradict the previous Biobank papers?]

To consider potential biological pathways, we first tested whether SNPs in particular regions of the genome are implicated by our GWAS results. Unlike what has been found for other phenotypes, SNPs in regions that are DNase I hypersensitive in the fetal brain are more likely to be associated with EduYears by a factor of ~5 (95% confidence interval 2.89–7.07; Extended Data Fig. 7). Moreover, the 15% of SNPs residing in regions associated with histones marked in the central nervous system (CNS) explain 44% of the heritable variation (Extended Data Fig. 8a and Supplementary Table 4.4.2). This enrichment factor of ~3 for CNS (P = 2.48 × 10^−16) is greater than that of any of the other nine tissue categories in this analysis.
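The "enrichment factor of ~3" follows directly from the two percentages quoted: enrichment, as usually defined in partitioned-heritability analyses, is the share of heritability explained by a category divided by the share of SNPs in that category. A minimal sketch (the function name is illustrative):

```python
def enrichment(share_of_heritability, share_of_snps):
    """Heritability explained per unit of genome covered by the annotation."""
    return share_of_heritability / share_of_snps


# CNS category: 15% of SNPs explain 44% of the heritable variation.
print(round(enrichment(0.44, 0.15), 2))
```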
Given that our findings disproportionately implicate SNPs in regions regulating brain-specific gene expression, we examined whether genes located near EduYears-associated SNPs show elevated expression in neural tissue. We tested this hypothesis using data on mRNA transcript levels in the 37 adult tissues assayed by the Genotype-Tissue Expression Project (GTEx) 10 . Remarkably, the 13 GTEx tissues that are components of the CNS—and only those 13 tissues—show significantly elevated expression levels of genes near EduYears-associated SNPs (false discovery rate <0.05; Extended Data Fig. 8b and Supplementary Table 4.5.2).
To investigate possible functions of the candidate genes from the GWAS-implicated loci, we examined the extent of their overlap with groups of genes (‘gene sets’) whose products are known or predicted to participate in a common biological process 11 . We found 283 gene sets significantly enriched by the candidate genes identified in our GWAS (false discovery rate <0.05; Supplementary Table 4.5.1). To facilitate interpretation, we used a standard procedure 11 to group the 283 gene sets into ‘clusters’ defined by degree of gene overlap. The resulting 34 clusters, shown in Fig. 3, paint a coherent picture, with many clusters corresponding to stages of neural development: the proliferation of neural progenitor cells and their specialization (the cluster npBAF complex), the migration of new neurons to the different layers of the cortex (forebrain development, abnormal cerebral cortex morphology), the projection of axons from neurons to their signalling targets (axonogenesis, signalling by Robo receptor), the sprouting of dendrites and their spines (dendrite, dendritic spine organization), and neuronal signalling and synaptic plasticity throughout the lifespan (voltage-gated calcium channel complex, synapse part, synapse organization). Many of our results implicate candidate genes and biological pathways that are active during distinct stages of prenatal brain development. To directly examine how the expression levels of candidate genes identified in our GWAS vary over the course of development, we used gene expression data from the BrainSpan Developmental Transcriptome 12 . As shown in Extended Data Fig. 9, these candidate genes exhibit above-baseline expression in the brain throughout life but especially higher expression levels in the brain during prenatal development (1.36 times higher prenatally than postnatally, P = 6.02 × 10^−8).

We constructed polygenic scores 13 to assess the joint predictive power afforded by the GWAS results (Supplementary Information section 5.2). Across our two holdout samples, the mean predictive power of a polygenic score constructed from all measured SNPs is 3.2% (P = 1.18 × 10^−39; Supplementary Table 5.2 and Supplementary Information section 5).

we studied mediation of the association between the all-SNPs polygenic score and EduYears in two of our cohorts. We found that cognitive performance can statistically account for 23–42% of the association (P < 0.001) and the personality trait ‘openness to experience’ for approximately 7% (P < 0.001; Supplementary Information section 6).

Extended Data Figure 10 | The predictive power of a polygenic score (PGS) varies in Sweden by birth cohort. Five-year rolling regressions of years of education on the PGS (left axis in all four panels), share of individuals not affected by the comprehensive school reform (a, right axis), and average distance to nearest junior high school (b, right axis), nearest high school (c, right axis) and nearest college/university (d, right axis). The shaded area displays the 95% confidence intervals for the PGS effect.

[Since now everyone goes to high school and it has less to do with intelligence, of course the genetic contribution matters less... Not all college majors are created equal - a 4 year degree in Education is not the same as a 4-year degree in physics.]

We examined two phenotypes: a continuous variable measuring the number of years of schooling completed (EduYears, N = 293,723) and an indicator variable for college completion (College, N = 280,007). All analyses were performed at the cohort level according to a pre-specified and publicly archived analysis plan. Summary statistics provided by cohorts were uploaded to a central server and subsequently meta-analyzed. The lead PI of each cohort affirmed that the results contributed to the study were based on analyses approved by the local Research Ethics Committee and/or Institutional Review Board responsible for overseeing research. All participants provided written informed consent. Supplementary Table 1.1 provides basic information about the participating cohorts. Our Analysis Plan was preregistered at https://osf.io/paj9m/. With one exception, the analyses reported here follow the original plan. The exception is that the original plan treated EduYears and College symmetrically whereas throughout the manuscript, we treat EduYears as the primary variable and de-emphasize College. After circulation of the Analysis Plan to our cohorts, a paper was posted on bioRxiv showing that the genetic correlation between the two measures is very high, with the point estimate suggesting a perfect genetic correlation 1 . Previously, we had considered as plausible the possibility that College would have better power for detecting associations at the upper end of the distribution of EduYears. However, since College is constructed by dichotomizing EduYears, the very high genetic correlation suggests that the College phenotype is for all intents and purposes merely a coarsening of the EduYears phenotype.

Supplementary Table 1.5 provides study-specific details on the analysis. Column 2 shows the association software used by each study analyst. The EduYears analyses are based on summary statistics from all 64 samples listed in Supplementary Table 1.1. Of the 64 samples, whose combined sample size is N=293,723, 5 were from single-sex cohorts, and 59 contained pooled results from mixed-sex cohorts (who additionally uploaded separate results for men and women).

Some cohorts made no adjustment for nonindependence but instead sought to restrict the estimation samples to conventionally unrelated individuals. For example, 23andMe restrict their estimation sample to conventionally unrelated individuals by ensuring that no pair of participants in the final estimation sample share more than 700 centimorgans of their genome identical-by-descent 9 .

For most EA cohorts, the average F_st value was below 0.004, which agrees well with previous reports that F_st is around 0.004 between European nations 17 . The largest F_st, a value of 0.02, was observed for the cohort OGP-Talana. It is known that the central-eastern Sardinia region, Ogliastra, has been secluded from the surrounding regions for most of its history. Such isolation is expected to generate an unusually high F_st 18 .

Indeed, Rietveld et al. (2013) 7 reported GCTA-GREML estimates of SNP heritability for each of two cohorts (STR and QIMR), and the mean estimate was 22.4%. Assuming that 22.4% is in fact the true SNP heritability, the calculations outlined in the SOM of Rietveld et al. (pp. 22-23) generate a prediction of R^2 = 11.0% for a score constructed from the GWAS estimates of this paper and of R^2 = 6.1% for a score constructed from the combined (discovery + replication cohorts, but excluding the validation cohorts) GWAS sample of N = ~117,000-119,000 in Rietveld et al., substantially higher than the 3.85% that we achieve here (with the score based on all genotyped SNPs) and the 2.2% Rietveld et al. achieved, respectively.
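The excerpt does not reproduce the Rietveld et al. calculation, so the following is only a sketch using one widely used approximation for expected polygenic-score accuracy (Daetwyler et al. 2008): R^2 ≈ h^2 · N·h^2 / (N·h^2 + M), where M is the effective number of independent loci. Whether this matches the SOM calculation is an assumption, and M = 70,000 is a hypothetical value chosen for illustration, not a number taken from either paper.

```python
def predicted_r2(n, h2, m_eff):
    """Expected out-of-sample R^2 of a polygenic score (Daetwyler-style approximation).

    n:     GWAS discovery sample size
    h2:    SNP heritability of the trait
    m_eff: effective number of independent loci (hypothetical input)
    """
    signal = n * h2
    return h2 * signal / (signal + m_eff)


H2 = 0.224       # mean GCTA-GREML estimate quoted in the text
M_EFF = 70_000   # assumption for illustration only

print(f"{predicted_r2(293_723, H2, M_EFF):.3f}")  # this paper's discovery sample
print(f"{predicted_r2(118_000, H2, M_EFF):.3f}")  # roughly the Rietveld et al. combined sample
```

Under these assumptions the approximation lands in the neighbourhood of the quoted predictions (about 11% and 6%), and it makes the same qualitative point: the achieved R^2 values fall well short of what a SNP heritability of 22.4% would predict.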

1.7 Within-Sample Replication
Following the suggestion of a referee, we attempted to replicate the genome-wide associations reported in our previous GWAS of EA 3 in the new cohorts that were added to this study. Conversely, we also examined if the SNPs that reach genome-wide significance in a meta-analysis of the new cohorts replicate in the Rietveld et al. cohorts.
1.7.1
Cohort Overlap with Rietveld et al. (2013)
The analyses of EduYears in Rietveld et al. 3 were based on a discovery sample of 101,069 individuals and a combined sample (discovery + replication) of 126,559 individuals. Some of the cohorts that contributed to the Rietveld et al. study did not participate in the present study (N = 13,981). Overall, the combined sample size of the Rietveld et al. cohorts that contributed to our study is N = 126,413 individuals. This number exceeds the difference between 126,559 and 13,981 because some of the original Rietveld et al. cohorts completed additional genotyping since 2013, and were hence able to contribute larger samples to the current study.
1.7.2
Methods in Within-Sample Replication Analyses
Rietveld et al. reported three genome-wide significant SNPs in their discovery sample, all of which replicated in their replication sample. These three SNPs also yielded lower P-values in the “combined” (discovery + replication) sample. In a meta-analysis of the combined sample, four additional SNPs reached genome-wide significance. Of these, five were genome-wide significant in the EduYears analyses. The remaining two only reached genome-wide significance in the analyses of College, but both had P-values just shy of genome-wide significance in the combined-sample EduYears analysis. Given our decision to make EduYears the primary phenotype, and to facilitate comparisons of effect sizes, we attempt to replicate all of the seven original associations in our meta-analyses of the EduYears variable. To examine if the seven associations replicate in our new cohorts, we split our overall sample into two subsamples comprising: (1) cohorts that participated in Rietveld et al. 3 and (2) all new cohorts that were added to the current study. In what follows we refer to the former as the “Rietveld Cohorts” and the latter as the “New Cohorts.” We refer to the combined-sample meta-analysis results reported by Rietveld et al. 3 as the “Rietveld et al. (2013) Cohorts.”
1.7.3
Within-Sample Replication Results
Supplementary Table 1.13 reports the results of the replication analysis. In the upper panel, we report for the seven SNPs, their standardized effect sizes, standard errors, and P-values. We report these statistics from three separate meta-analyses of EduYears conducted in: (i) the Rietveld et al. (2013) Cohorts (ii) the Rietveld Cohorts, and (iii) the New Cohorts. The reference allele is chosen to be the allele associated with higher values of EduYears in Rietveld et al.’s analysis (2013).
Given the high degree of overlap between cohorts in the previous EA meta-analysis 3 and the Rietveld Cohorts, the similarity of the effect-size estimates is unsurprising. Reassuringly, the sign of the estimated coefficient in the New Cohorts is always in the predicted direction, and for all but one of the seven SNPs we can reject the null hypothesis of no effect at the 5% significance level (two SNPs, rs4851266 and rs9320913, also reach genome-wide significance in the replication sample). For six of the seven SNPs, the 95% confidence intervals for the estimated effect sizes overlap across the Rietveld Cohorts and the New Cohorts.
To further examine replicability, we examined if SNPs that reach genome-wide significance in a meta-analysis of the New Cohorts replicate in the Rietveld Cohorts. Applying the pruning algorithm described in Supplementary Information section 1.6.1 to meta-analysis results for the New Cohorts resulted in 14 approximately independent SNPs. The results from these replication analyses are reported in Panel B of Supplementary Table 1.13. The results are similar to those of the replication of the associations from the Rietveld Cohorts in the New Cohorts: the signs align for all 14 SNPs, and 12 SNPs replicate at P-value < 0.05 in the Rietveld Cohorts (none of them at genome-wide significance, but 5 at P-value < 10^−5). In the two replication analyses, the average effects in the replication samples are about 35% smaller than the estimated effect of the genome-wide significant association, roughly consistent with the degree of inflation one would expect from a Winner’s Curse correction of the sort described and performed in the next subsection.
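Why replication effects should shrink at all can be seen in a toy simulation: estimates that happen to clear the significance threshold are, on average, overestimates, while an independent replication estimate is unbiased. This is not the correction procedure used in the supplement, and the true effect, standard error, and number of draws below are made-up values.

```python
import random
from statistics import mean


def winners_curse_shrinkage(true_beta, se, z_threshold, n_draws, seed=0):
    """Expected fractional drop from discovery to replication for 'significant' estimates."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n_draws):
        estimate = rng.gauss(true_beta, se)      # noisy discovery estimate
        if abs(estimate / se) > z_threshold:     # keep only genome-wide significant draws
            selected.append(estimate)
    discovery_mean = mean(selected)
    # An independent replication is unbiased, so its expected value is true_beta.
    return 1.0 - true_beta / discovery_mean


# z > 5.45 corresponds roughly to the conventional two-sided P < 5e-8 threshold.
shrink = winners_curse_shrinkage(true_beta=0.02, se=0.005, z_threshold=5.45, n_draws=100_000)
print(f"{shrink:.0%}")
```

With these hypothetical inputs the replication effect is expected to be roughly a third smaller than the discovery estimate, the same order as the ~35% reported above. The size of the shrinkage depends heavily on how far the true effect sits from the threshold.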

Using procedures identical to those described in SI Section 1.6, we conducted a meta-analysis of the EduYears phenotype, combining the results from our discovery cohorts (N = 293,723) and the results from the UKB replication cohort (N = 111,349). Expanding the overall sample size to N = 405,072 increases the number of approximately independent genome-wide significant loci from 74 to 162.

[huh? Why does the title & abstract focus on 74 hits rather than 162 hits?]

Running the LD Score regression on these data, we estimate an intercept of 1.0491 (Extended Data Fig. 3a), which is significantly larger than 1 (the standard error reported by the LDSC software is 0.0091). By comparison, the mean χ^2 statistic for all the SNPs in the LD Score regression is 1.5966. This suggests that there is some confounding bias (due to population stratification, cryptic relatedness, or other confounds) but that it accounts for only a small part of the inflation in the chi-square statistics. Thus, the inflation is largely attributable to true polygenic signal throughout the genome.
We note that the amount of inflation due to confounding bias is likely to be even smaller in our main GWAS results (e.g., in the estimates for the genome-wide significant SNPs)
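The intercept and mean χ^2 quoted above can be turned into the share of inflation attributable to confounding using the ratio LD Score regression conventionally reports, (intercept − 1) / (mean χ^2 − 1). The sketch also computes a simple z-statistic for the intercept against 1; the function names are illustrative.

```python
INTERCEPT = 1.0491
INTERCEPT_SE = 0.0091
MEAN_CHI2 = 1.5966


def confounding_share(intercept, mean_chi2):
    """Fraction of the inflation in mean chi-square attributable to bias."""
    return (intercept - 1.0) / (mean_chi2 - 1.0)


def z_against_one(intercept, se):
    """How many standard errors the intercept lies above its null value of 1."""
    return (intercept - 1.0) / se


print(f"{confounding_share(INTERCEPT, MEAN_CHI2):.3f}")
print(f"{z_against_one(INTERCEPT, INTERCEPT_SE):.1f}")
```

The ratio comes out at about 0.08, in line with the "~8%" figure quoted from the main text, and the intercept sits a little over five standard errors above 1.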

As a first step, we estimate genetic overlap between EA and several other phenotypes. We define genetic overlap as the degree to which common regions of the genome are associated with different traits, i.e., the extent to which multiple phenotypes are associated with the same underlying genetic variants. Instead of relying on family data or individual-level genetic data, we estimate genetic overlap using the LD Score Regression procedure developed by Bulik-Sullivan et al 6 . Note that this procedure does not require that the GWAS samples are independent. In addition, we develop another SNP-based estimate of genetic overlap based on different assumptions that requires GWAS results from independent samples as inputs. This new measure of genetic overlap is conceptually similar, but not identical, to the measure estimated in bivariate GREML 7 . We compare the different measures of genetic overlap theoretically and empirically.
Next, we systematically investigate evidence of genetic overlap between EA and phenotypes related to (1) mental health and psychometric traits (including general cognitive performance and neuroticism), (2) brain anatomy, and (3) anthropometric traits. Henceforth, we refer to these phenotypes collectively as “MHBA” phenotypes. We chose to include in the analysis phenotypes for which the phenotypic correlation between EA and the trait has previously been established and GWAS summary statistics of the trait are available in the public domain. The final list of phenotypes includes: Alzheimer’s disease 8 , bipolar disorder 9 , schizophrenia 10 , cognitive performance 11,12 , neuroticism 13 , volumes of subcortical brain regions and total intracranial volume 14 , BMI 15 , and height 16 . The links we used to access the GWAS results for these traits are listed in Supplementary Table 3.1.

In Fig. 2 in the main text, we report the estimates of genetic overlap from the LD Score regression, along with 95% confidence intervals. In Supplementary Table 3.1, we report estimation results from both methods described above. Cognitive performance shows the strongest genetic overlap with EduYears (r = 0.82 & r_LD = 0.75). We also find substantial genetic overlap for mental health phenotypes, in particular for neuroticism (-0.37 & -0.41), Alzheimer’s disease (-0.20 & -0.31), and bipolar disorder (0.25 & 0.28). The positive genetic overlap between EduYears and bipolar disorder is noteworthy given that the phenotypic correlation is negative 19,20 .
Furthermore, we see substantial positive genetic overlap for intracranial volume (0.39 & 0.34) and height (0.16 & 0.13), as well as a strong negative overlap for BMI (-0.44 & -0.26).

Consistent with our finding sign concordance with EduYears less than 50%, we find negative correlation of SNP coefficients with EduYears for Alzheimer’s, BMI, and neuroticism. Consistent with their sign concordance greater than 50%, we find positive correlation of SNP coefficients for cognitive performance, intracranial volume, and height (although for height, the sign concordance is not statistically distinguishable from 50%). An intriguing pattern is found for schizophrenia, which has a positive but near-zero estimated genetic correlation (r_LD = 0.08 with P = 3.2 × 10^−4) and a nearly equal percentage of concordant SNPs and discordant SNPs among the set of 74 that we tested (51% concordant), and yet, as reported above, the enrichment of association of these SNPs for schizophrenia is strong (P < 0.002). We now turn to potential explanations for this result and discuss related literature.
3.3.4
Discussion
Our work builds on earlier epidemiological research using genetically informative designs 3–5,25–29 . First, our results corroborate earlier findings that the genetic contribution to the positive relationship between cognitive performance and EA is substantial, but not perfect 1,30,31 . Second, earlier studies found that neuroticism is a powerful negative predictor of achievement across various domains including job performance, academic achievement, and performance on tests of cognitive performance, partly through test anxiety 32–36 . The strong negative genetic overlap between EA and neuroticism suggests that SNPs associated with EA may be good candidates for association with neuroticism. Third, our finding of a negative genetic correlation between EA and BMI corroborates earlier evidence from twin studies suggesting that the negative relationship between EA and BMI 37–41 is partially due to common genetic factors 2,25,42 . A possible hypothesis to explain this finding is that the genetic effects on BMI may be partially mediated by individual differences in self-control, impulsivity, and reward sensitivity 43–48 , which are also linked to learning and academic achievement 45–48 . Interestingly, the most recent GWAS on BMI found that genes associated with BMI are much more strongly expressed in the nervous system and sense organs than in the digestive system 15 . However, future research is needed to better understand the mechanisms underlying these findings.

Fifth, our results relate to ongoing research on schizophrenia and bipolar disorder. Earlier work has demonstrated links between these mental disorders on the one hand, and school performance, cognitive performance, creativity, and educational attainment on the other. Although these latter measures are related to each other and share a genetic basis, the phenotypic and genetic correlations between them are far from perfect 30,50,51 . Furthermore, their relationship with schizophrenia and bipolar disorder is rather complex and possibly U-shaped. On the one hand, low cognitive performance and low school performance have been reported as risk factors for schizophrenia and bipolar disorder 19,52–55 . For example, evidence from a large, population-based Swedish Multi-Generation Register suggests a weak negative correlation (-0.11) between IQ and psychosis (a term referring to mental disorders including both schizophrenia and bipolar disorder) 5 . Furthermore, it is demonstrated in ref. 28 that rare copy-number variants that are known to cause schizophrenia also predict lower cognitive performance in healthy individuals.
On the other hand, a higher prevalence of psychosis among individuals high in cognitive performance and creativity has been frequently reported 56–58 , and polygenic risk scores for bipolar disorder and schizophrenia have been reported to predict creativity in independent samples 29 . This suggests that some genetic variants that increase the risk for psychosis may also have positive effects on cognitive performance.
The relationship between educational attainment and schizophrenia specifically is similarly complex. Although early-onset schizophrenia is associated with school dropout 59 , no clear relationship is found between educational attainment and risk of schizophrenia 60 . More generally, the relationship between education and schizophrenia appears to depend on age at onset, duration, and severity of the disease, factors that often are not measured 61 . The failure to account for these factors in many empirical studies may contribute to the relatively weak or even seemingly contradictory results.
As suggested in ref. 62, it is possible that the clinical diagnoses of schizophrenia and bipolar disorder mask several disease subtypes that are caused by different biological mechanisms. This is one possible interpretation of our results for schizophrenia: The strong enrichment for association of our EA lead SNPs with schizophrenia, combined with a nearly equal percentage of concordant and discordant associations of our lead SNPs with these mental disorders, could point to different sub-types of schizophrenia that are lumped together by the current disease classification system. Alternatively, it may be that SNPs that are associated with schizophrenia happen to be in LD with SNPs that are associated with educational attainment simply because both sets of SNPs are primarily located in genes or genomic regions that are expressed in the brain. Such co-localization would generate a haphazard pattern of sign concordance. Follow-up research will need to differentiate between these different interpretations of our results.

This background suffices to motivate the biological questions that arise in the interpretation of GWAS results and the means by which these questions might be tentatively addressed. For starters, since a GWAS locus typically contains many other SNPs in LD with the defining lead SNP and with each other, it is natural to ask: which of these SNPs is the actual causal site responsible for the downstream phenotypic variation? Many SNPs in the genome appear to be biologically inert—neither encoding differences in protein composition nor affecting gene regulation—and a lead GWAS SNP may fall into this category and nonetheless show the strongest association signal as a result of statistical noise or happenstance LD with multiple causal sites. Fortunately, much is known from external sources of data about whether variation at a particular site is likely to have biological consequences, and exploiting these resources is our general strategy for fine-mapping loci: nominating individual sites that may be causally responsible for the GWAS signals. Descriptions of genomic sites or regions based on external sources of data are known as annotations, and readers will not go far astray if they interpret this term rather literally (as referring to a note of explanation or comment added to a text in one of the margins). If we regard the type genome as the basic text, then annotations are additional comments describing the structural or functional properties of particular sites or the regions in which they reside. For example, all nonsynonymous sites that influence protein structures might be annotated as such. An annotation can be far more specific than this; for instance, all sites that fall in a regulatory region active in the fetal liver might bear an annotation to this effect.
A given causal site will exert its phenotypic effect through altering the composition of a gene product or regulating its expression. Conceptually, once a causal site has been identified or at least nominated, the next question to pursue is the identity of the mediating gene. In practice, because only a handful of genes at most will typically overlap a GWAS locus, we can make some progress toward answering this question without precise knowledge of the causal site. The difficulty of the problem, however, should still not be underestimated. It is natural to assume that a lead GWAS SNP lying inside the boundaries of a particular gene must reflect a causal mechanism involving that gene itself, but in certain cases such a conclusion would be premature. It is possible for a causal SNP lying inside a certain gene to exert its phenotypic effect by regulating the expression of a nearby gene or for several genes to intervene between the SNP and its regulatory target.
Supplementary Table 4.1 ranks each gene overlapping a DEPICT-defined locus by the number of discrete evidentiary items favoring that gene (see Supplementary Information section 4.5 for details regarding DEPICT). These lines of evidence are taken from a number of our analyses to be detailed in the following subsections. Our primary tool for gene prioritization is DEPICT, which can be used to calculate a P-value and associated FDR for each gene. It is important to keep in mind, however, that a gene-level P-value returned by DEPICT refers to the tail probability under the null hypothesis that random sampling of loci can account for annotations and patterns of co-expression shared by the focal gene with genes in all other GWAS-identified loci. Although it is very reasonable to expect that genes involved in the same phenotype do indeed share annotations and patterns of co-expression, it may be the case that certain causal genes do not conform to this expectation and thus fail to yield low DEPICT P-values. This is why we do not rely on DEPICT alone but also the other lines of evidence described in the caption of Supplementary Table 4.1.

However, a priori we know that some SNPs are more likely to be associated with the phenotype than others; for example, it is often assumed that nonsynonymous SNPs are more likely to influence phenotypes than sites that fall far from all known genes. So a P-value of 5×10⁻⁷, say, though not typically considered significant at the genome-wide level, might merit a second look if the SNP in question is nonsynonymous.
Formalizing this intuition can be done with Bayesian statistics, which combines the strength of evidence in favor of a hypothesis (in our case, that a genomic site is associated with a phenotype) with the prior probability of the hypothesis. Deciding how to set this prior is often subjective. However, if many hypotheses are being tested (for example, if there are thousands of nonsynonymous polymorphisms in the genome), then the prior can be estimated from the data themselves using what is called “empirical Bayes” methodology. For example, if it turns out that SNPs with low P-values tend to be nonsynonymous sites rather than other types of sites, then the prior probability of true association is increased at all nonsynonymous sites. In this way a nonsynonymous site that otherwise falls short of the conventional significance threshold can become prioritized once the empirically estimated prior probability of association is taken into account. Note that such favorable reweighting of sites within a particular class is not set a priori, but is learned from the GWAS results themselves. In our case, we split the genome into approximately independent blocks and estimate the prior probability that each block contains a causal SNP that influences the phenotype and (within each block) the conditional prior probability that each individual SNP is the causal one. Each such probability is allowed to depend on annotations describing structural or functional properties of the genomic region or the SNPs within it. We can then empirically estimate the extent to which each annotation predicts association with the focal phenotype. For a complete description of the fgwas method, see ref. 1.
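[A toy illustration of the reweighting idea, not the fgwas implementation (which estimates the priors from the genome-wide data). The function name `posterior_prob`, the Bayes factor, and both prior values are invented for illustration; the point is only that the same evidence yields a very different posterior under an annotation-dependent prior.

```python
def posterior_prob(bayes_factor: float, prior: float) -> float:
    """Posterior probability of association from a Bayes factor and a prior probability."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical inputs: one SNP-level Bayes factor, two candidate priors.
bf = 2000.0            # assumed strength of evidence, identical in both cases
prior_generic = 1e-4   # assumed prior for a site with no informative annotation
prior_nonsyn = 1e-3    # assumed (10x enriched) prior for a nonsynonymous site

p_generic = posterior_prob(bf, prior_generic)
p_nonsyn = posterior_prob(bf, prior_nonsyn)

print(f"generic site:       {p_generic:.3f}")
print(f"nonsynonymous site: {p_nonsyn:.3f}")
```

In an empirical Bayes analysis the two priors would not be chosen by hand as here but estimated from how strongly each annotation is enriched among low-P-value SNPs.]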
4.2.3 Methods
For application to the GWAS of EduYears, we used the same set of 450 annotations as ref. 1; these are available at https://github.com/joepickrell/1000-genomes.
...4.2.6 Reweighted GWAS and Fine Mapping
We reweighted the GWAS results using the functional-genomic results described above. Using a regional posterior probability of association (PPA) greater than 0.90 as the cutoff, we identified 102 regions likely to harbor a causal SNP with respect to EduYears (Extended Data Fig. 7c and Supplementary Table 4.2.1). All but two of our 74 lead EduYears-associated SNPs fall within one of these 102 regions. The exceptions are rs3101246 and rs2837992, which attained PPA > 0.80 (Extended Data Fig. 7c). In previous applications of fgwas, the majority of novel loci that attained the equivalent of genome-wide significance only upon reweighting later attained the conventional P < 5×10⁻⁸ in larger cohorts (ref. 1).
Within each region attaining PPA > 0.90, each SNP received a conditional posterior probability of being the causal SNP (under the assumption that there is just one causal SNP in the region). The method of assigning this latter posterior probability is similar to that of ref. 6, except that the input Bayes factors are reweighted by annotation-dependent and hence SNP-varying prior probabilities. In essence, the likelihood of causality at an individual SNP derives from its Bayes factor with respect to phenotypic association (which is monotonically related to the P-value under reasonable assumptions), whereas the prior probability is derived from any empirical genome-wide tendency for the annotations borne by the SNP to predict evidence of association. Thus, the SNPs with the largest posterior probabilities of causality tend to exhibit among the strongest P-values within their loci and functional annotations that predict association throughout the genome. Note that proper calibration of this posterior probability requires that all potential causal sites have been either genotyped or imputed, which may not be the case in our application; we did not include difficult-to-impute non-SNP sites such as insertions/deletions in the GWAS meta-analysis. With this caveat in mind, we identified 17 regions where fine mapping amassed over 50 percent of the posterior probability on a single SNP (Supplementary Table 4.2.2). Of our 74 lead EduYears SNPs, 9 are good candidates for being the causal sites driving their association signals [12%]. One of our top SNPs, rs4500960, is in nearly perfect LD with the causal candidate rs2268894 (and is indeed the second most likely causal SNP in this region according to fgwas). The causal candidate rs6882046 is within 75kb of two lead SNPs on chromosome 5 (rs324886 and rs10061788), but no two of these three SNPs show strong LD.
Interestingly, the remaining 6 causal candidates lie in genomic regions that only attain the equivalent of genome-wide significance upon Bayesian reweighting. Of the 17 causal candidates, 9 lie in regions that are DNase I hypersensitive in the fetal brain.
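[A minimal sketch of the single-causal-SNP calculation described above: each SNP's Bayes factor is weighted by its prior and the weights are normalized within the region. The function name `finemap_posteriors` and all numbers are made up; this is a generic illustration, not the fgwas code.

```python
def finemap_posteriors(bayes_factors, priors):
    """Posterior probability that each SNP is the one causal SNP in its region.

    Assumes exactly one causal SNP in the region and that every candidate
    site is present in the input lists.
    """
    weights = [bf * pi for bf, pi in zip(bayes_factors, priors)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical region with four SNPs in LD; SNPs 0 and 1 have similar evidence.
bfs = [500.0, 450.0, 20.0, 5.0]

flat_priors = [0.25, 0.25, 0.25, 0.25]        # no annotation information
annotated_priors = [0.10, 0.60, 0.15, 0.15]   # assumed: SNP 1 sits in an enriched annotation

post_flat = finemap_posteriors(bfs, flat_priors)
post_annot = finemap_posteriors(bfs, annotated_priors)

print([round(p, 3) for p in post_flat])
print([round(p, 3) for p in post_annot])
```

With flat priors the two top SNPs split the posterior and neither is a clear candidate; with the assumed annotation-dependent priors most of the posterior moves onto a single SNP, which is the situation the ">50 percent on a single SNP" criterion picks out.]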

Table 4.2.2:
Posterior probability of causality
0.992035
0.766500
0.842271
0.567184
0.697862
0.524760
0.632536
0.885280
0.968627
0.781563
0.629610
0.837746
0.725158
0.755457
0.784373
0.682947
0.832675


[mean(c(0.524760, 0.567184, 0.629610, 0.632536, 0.682947, 0.697862, 0.725158, 0.755457, 0.766500, 0.781563, 0.784373, 0.832675, 0.837746, 0.842271, 0.885280, 0.968627, 0.992035)) = 0.76; with 17 candidates, 0.76*17=12.9 expected true causal SNPs]
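[The arithmetic can be checked directly. If the posteriors are well calibrated (which the excerpt itself flags as uncertain), their sum is the expected number of candidates that really are causal. A short Python check over the 17 values listed in the table:

```python
from statistics import mean

# Posterior probabilities of causality, in the order listed above.
posteriors = [
    0.992035, 0.766500, 0.842271, 0.567184, 0.697862, 0.524760,
    0.632536, 0.885280, 0.968627, 0.781563, 0.629610, 0.837746,
    0.725158, 0.755457, 0.784373, 0.682947, 0.832675,
]

n_candidates = len(posteriors)
avg = mean(posteriors)
expected_causal = sum(posteriors)  # expected count of truly causal SNPs, if calibrated

print(n_candidates, round(avg, 2), round(expected_causal, 1))
```
]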

The results from both approaches show that prediction accuracy increases as more SNPs are used to construct the score, with the maximum predictive power achieved when using all the genotyped SNPs (with Approach 1). In that case, the weighted average across the two cohorts of the incremental R² is ~3.85%.

[Versus 2% from Rietveld's n=100k; this is in line with the rough doubling of the main SSGAC sample size. The additional UK Biobank sample of n=111k does not seem to have been used but if it was used, should boost the polygenic score to ~5.3%?]

...The magnitude of predictive power that we observe is less than one might have expected on the basis of statistical genetics calculations (ref. 6) and GCTA-GREML estimates of “SNP heritability” from individual cohorts. Indeed, Rietveld et al. (2013, ref. 7) reported GCTA-GREML estimates of SNP heritability for each of two cohorts (STR and QIMR), and the mean estimate was 22.4%. Assuming that 22.4% is in fact the true SNP heritability, the calculations outlined in the SOM of Rietveld et al. (pp. 22-23) generate a prediction of R² = 11.0% for a score constructed from the GWAS estimates of this paper and of R² = 6.1% for a score constructed from the combined (discovery + replication cohorts, but excluding the validation cohorts) GWAS sample of N = ~117,000-119,000 in Rietveld et al. These predictions are substantially higher than the 3.85% that we achieve here (with the score based on all genotyped SNPs) and the 2.2% Rietveld et al. achieved, respectively.
These discrepancies between the scores’ predicted and estimated R² may be due to the failure of some of the assumptions underlying the calculation of the predicted R². An alternative (or additional) explanation is that the true SNP heritability for the GWAS sample pooled across cohorts is lower than 22.4%. That would be the case if the true GWAS coefficients differ across cohorts, perhaps due to heterogeneity in phenotype measurement or gene-by-environment interactions. If so, then a polygenic score constructed from the pooled GWAS sample would be expected to have lower predictive power in an individual cohort than implied by the calculations above. Based on that reasoning, the R² of 2.2% observed by Rietveld et al. (2013) could be rationalized by assuming that the proportion of variance accounted for by common variants across the pooled Rietveld cohorts is only 12.7% (ref. 6). (We obtain a similar estimate, 11.5% with a standard error of 0.45%, when we use LD Score regression (ref. 5) to estimate the SNP heritability using our pooled-sample meta-analysis results from this paper, excluding deCODE and without GC. While we believe this estimate is based on cohort results without GC, it is biased downward if any cohort in fact applied GC.) If we assume that the 12.7% is valid also for the cohorts considered in this study, we would predict an R² equal to 4.5%, somewhat higher than we observe in HRS and STR but much closer. However, the degree of correlation in coefficients across cohorts appears to be relatively high (Supplementary Table 1.10 reports estimates of the genetic correlation between selected cohorts and deCODE; although the correlation estimates vary a lot across cohorts, they tend to be large for the largest cohorts, and the weighted average is 0.76). We do not know whether a pooled-cohort SNP heritability of 12.7% or lower can be reconciled with the observed degree of correlation in coefficients across cohorts.
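[To get a feel for how strongly the predicted R² depends on the assumed SNP heritability, here is a rough sketch using a common textbook approximation, R² ≈ h² / (1 + M/(N·h²)). This is not claimed to be the calculation in the Rietveld et al. SOM. The function name `predicted_r2` is hypothetical, and `M_EFF`, the effective number of independent SNPs, is an assumed round number, not a value from the paper; N is the discovery sample size from the abstract.

```python
def predicted_r2(h2_snp: float, n: int, m_eff: float) -> float:
    """Approximate out-of-sample R^2 of a polygenic score.

    Uses R^2 ~= h2 / (1 + M / (N * h2)), where h2 is SNP heritability,
    N the GWAS sample size, and M the effective number of independent SNPs.
    """
    return h2_snp / (1.0 + m_eff / (n * h2_snp))


M_EFF = 70_000         # ASSUMPTION: illustrative effective number of independent SNPs
N_DISCOVERY = 293_723  # discovery sample size quoted in the abstract

r2_high_h2 = predicted_r2(0.224, N_DISCOVERY, M_EFF)  # GCTA-GREML heritability
r2_low_h2 = predicted_r2(0.127, N_DISCOVERY, M_EFF)   # lower pooled-sample heritability

print(f"h2 = 22.4%: predicted R^2 = {r2_high_h2:.1%}")
print(f"h2 = 12.7%: predicted R^2 = {r2_low_h2:.1%}")
```

Under these assumed inputs, lowering the heritability from 22.4% to 12.7% cuts the predicted R² by more than half, the same direction and rough size as the gap discussed in the excerpt.]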

The results are reported in Supplementary Tables 6.3 and 6.4. In both the STR and the HRS, cognitive performance significantly mediates the effect of PGS on EduYears; in the HRS, Openness to Experience is also a significant mediator. The indirect effects for the other mediating variables are not significant.
The results for cognitive performance are similar across STR and HRS. In both datasets, a one-standard deviation increase in PGS is associated with ~0.6-0.7 more years of education, and a one-standard deviation increase in cognitive performance is associated with ~0.15 more years of education. In both datasets, the direct effect (θ₁) of PGS on EduYears is ~0.3-0.4 and the total indirect effect (β₁θ₂) is ~0.19-0.31. This implies that a one-standard-deviation increase in PGS is associated with ~0.3-0.4 more years of education, keeping the mediating variables constant, and that changing the mediating variables to the levels they would have attained had PGS increased by one standard deviation (but keeping PGS fixed) increases years of education by ~0.19-0.31 years. Lastly, in both datasets, the partial indirect effect (θ₂₁β₁₁) of cognitive performance is large and very significant: the estimates are equal to 0.29 and 0.14 (42% and 23% of the total effect, γ₁) in STR and HRS, respectively. The results also suggest that a one-standard deviation increase in Openness to Experience is associated with ~0.06 more years of education, and the estimated partial indirect effect for Openness to Experience is equal to 0.04 (7% of the total effect, γ₁).
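[A quick consistency check on the mediation numbers: dividing each reported partial indirect effect of cognitive performance by its reported share of the total effect should recover a total effect in the stated ~0.6-0.7 range. The inputs are the rounded figures quoted above, so the implied totals are approximate; the dictionary layout is just for illustration.

```python
# Rounded values quoted in the excerpt for cognitive performance as a mediator.
reported = {
    "STR": {"partial_indirect": 0.29, "share_of_total": 0.42},
    "HRS": {"partial_indirect": 0.14, "share_of_total": 0.23},
}

# share = partial_indirect / total  =>  total = partial_indirect / share
implied_total = {
    cohort: vals["partial_indirect"] / vals["share_of_total"]
    for cohort, vals in reported.items()
}

for cohort, total in implied_total.items():
    print(f"{cohort}: implied total effect of PGS = {total:.2f} years per SD")
```

Both implied totals fall inside the quoted ~0.6-0.7 years per standard deviation of PGS.]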

[Razib comments: http://www.unz.com/gnxp/74-loci-for-cognitive-development-yes-this-is-happening/

> But look at all the functional associations and analysis in this paper! Some serious biology in this. The figure from the paper to the left [http://www.unzcloud.com/wp-content/uploads/2016/05/Screenshot-2016-05-11-12.20.14-300x184.png] which shows how the genes associated with this SNP hits are expressed in different tissue/types and organs. These are the biggest effect SNPs for years of education in the genome, so it makes sense that they’d be way over-expressed in the brain. It is definitely more convincing to those who might be skeptical a priori than some statistically robust associations (well, it should be more convincing at least).
]
[http://infoproc.blogspot.com/2016/05/74-snp-hits-from-ssgac-gwas.html]
[http://www.nature.com/news/gene-variants-linked-to-success-at-school-prove-divisive-1.19882 "The findings have proved divisive. Some researchers hope that the work will aid studies of biology, medicine and social policy, but others say that the emphasis on genetics obscures factors that have a much larger impact on individual attainment, such as health, parenting and quality of schooling. “Policymakers and funders should pull the plug on this sort of work,” said anthropologist Anne Buchanan and genetic anthropologist Kenneth Weiss at Pennsylvania State University in University Park in a statement to Nature. “We gain little that is useful in our understanding of this sort of trait by a massively large genetic approach in normal individuals.”"

'you can't prove genetics matters, which is why your funding should be cut, so you can't prove genetics matter'
]

"The results of this study and future work will enable us to better understand how these pathways interact," King continued. "Perhaps ultimately, we'll be able to learn why and how educational attainment seems to be protective of cognition in later life."' https://www.sciencedaily.com/releases/2016/05/160511134721.htm
[specifically, the Mendelian randomization will demonstrate that education has no protective effect...]

[http://www.theatlantic.com/science/archive/2016/05/the-genetics-of-staying-in-school/482052/?single_page=true]

#psychology #intelligence #education #gwas #genetics  