Not quite my genome: my 23andMe exome analysis.

After months of anxious waiting, I finally received my exome data from the 23andMe Exome Pilot about a week ago. When I enrolled last year, I was quite excited to join the ranks of those who have had their genome sequenced, at least armed with my exome – the 1% of the genome that encodes proteins. With the 30,000+ genomes reportedly sequenced to date, and probably even more exomes, this is nowadays a dubious distinction. However, I wanted to do this on my own terms, not held hostage by research protocols that usually forbid research subjects from receiving their results. And I was really curious to see if there is anything medically relevant to learn from such data given that I am “healthy.” Well, the moment finally came.

The file download took quite some time - it was big. This means the raw reads were included. Good, I will align these and make variant “calls” myself at a later date to see if there are discrepancies. Fortunately, I had already installed the tool that 23andMe requires to unpack and mount the file, and I quickly browsed through the report provided. It was clear that this short report showed just some examples of interesting variants, and that this was far from a satisfying result. There is no deception here: these were exactly the expectations set by 23andMe for this pilot from the beginning.

Knowing this, I had prepared for months to take matters into my own hands, and therefore I was ready for the next step. For this analysis, I was lucky enough to have previously secured a beta testing account for the genome/exome analysis tool that Omicia Inc., a startup in genome interpretation from Emeryville CA, has been developing and testing in recent months (disclosure: I advise them). Through this tool, I had previously looked at the genomes of other people - the famous first 10[1], the Complete Genomics diversity set and other genomes from ancient individuals that I was analyzing for a joint research project. However, this time it was my own exome! The anticipation was building up as I logged in to the Omicia beta test server and tried to speed through the variant submission process.

I located the file with my genetic variants provided by 23andMe (the VCF file), selected it in the genome upload dialogue and, in a minute, my file was accepted and submitted to the Omicia annotation pipeline. After just 30 minutes, I was ready to explore my exome using their "variant mining" report. All of my variants had been decorated with chromosome location, the gene harboring them, depth of coverage and quality of the calls, the class of variant as it relates to the protein encoded by each gene (synonymous, nonsynonymous, splice, stop gain/loss, frameshift), the change in amino acid (if any), a number of scores attempting to evaluate the damage the variant might cause in the protein product (SIFT, PolyPhen, MutationTaster, PhyloP, and their own), the allele frequencies around the world (if available), and finally, for some of them, cross-references to entries in databases that collect genotype-phenotype relationships of medical significance (OMIM, HGMD, PharmGKB, GWAS hits). In total, close to 9,000 of my variants had some protein impact and of these, over 500 had some database annotation of medical or functional relevance. How could someone live with that many protein-changing variants? Well, it is now clear that this is pretty normal for human beings. It looked like it was going to be a big task to figure out which of these really could be relevant for me.

After browsing some of the annotations, a surprise - I recognized one of the genes right away: ATG16L1. I was homozygous for a disease-associated variant that I helped to discover! Here it was: the link to my own paper[2]. Wow! Even if in principle I should be worried because I carry two bad copies of that gene (it has been shown to be a causative variant), somehow I felt amazed to be reacquainted with this old friend. Besides, this variant was associated with Crohn’s disease, a gastrointestinal disease that I certainly don't have and that at my age I am unlikely ever to develop. This is a very common situation with variants associated with common/complex disease - they point to relevant genes and pathways, and may lead to new understanding of the disease or even new therapies, but per se have low predictive power in an individual. And then the environment plays a significant role - perhaps having a good exposure to intestinal disease bugs in my childhood helped me. But, wait, can this explain some of the problems one of my children has experienced? Hmm; let's see NOD2, a gene that is also associated with the disease - a homozygous, rare, probably deleterious variant, but never before associated with the disease. Could this be a double hit? Maybe. However, these are just 2 of about 75 genes currently associated reproducibly with this disease. I created a list of these genes from a recent review, saved it as a personal gene set, and filtered my data again. I have 27 variants in these genes, 3 of which seem severe; now what? And this is how my exploration began. I won't forget that day - I was really excited.

I needed to organize and focus my search. Here is how I decided to proceed. First I searched for homozygous variants previously associated with disease, since having two copies of a bad gene is more likely to result in disease. This was easily accomplished by selecting for homozygous genotype in the filters and restricting to evidence from the Online Mendelian Inheritance in Man database (OMIM). The quality of the genotype call is important, because if the depth of coverage is low (e.g. 6x), chances are that a homozygote call could in reality be a heterozygote whose second allele was simply not sampled in the process of the random shotgun sequencing. Filtering by coverage (>15x) was quite easy to do, although interestingly it removes 33% of the variants. This analysis yielded some interesting hits, but (luckily) nothing really bad - just some susceptibility variants. One of them was a variant in FAAH that has been associated with drug addiction. It exhibited a large odds ratio for drug addiction susceptibility in some small studies (so confidence is not very high), but thankfully it may not be associated with alcohol abuse and thus my occasional wine consumption is hopefully innocuous. I have never done drugs, and in hindsight this may have been a very good choice given that I carry this variant – let's not tempt fate in the future. Then, I cast my net a bit wider and looked for heterozygous variants in these genes, which could mean I am a carrier of the disease (I relaxed the minimum coverage to 6x to recover more variants). Okay, I found the same carrier variant that 23andMe previously reported from my genotypes, but now I see a few more autosomal recessives. All this was interesting, but nothing really bad appeared. I also carry a variant in the F9 gene that has been associated with reduced risk for deep vein thrombosis. Good, given the number of flights that I take per year. Some of the other variants in this list are just phenotypic polymorphisms (e.g. hair thickness, eye and skin color) but other disease-associated ones are difficult to interpret as at times they increase risk and, at other times, they decrease risk - how do you add all this up?
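To make the coverage argument concrete, here is a minimal Python sketch of the homozygous filter. It is not Omicia's implementation: it assumes a single-sample VCF whose FORMAT column carries GT and DP, the names Call, parse_calls, confident_homozygotes and chance_het_looks_homozygous are made up for illustration, and the toy records are invented. The last function assumes each read samples either allele with equal probability and ignores sequencing errors.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    genotype: str  # e.g. "0/1" or "1/1"
    depth: int     # reads covering the site (FORMAT/DP)


def parse_calls(vcf_lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per data line of a single-sample VCF.

    Assumes the FORMAT column lists GT and DP; a missing DP counts as 0.
    """
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        depth = sample.get("DP", "")
        yield Call(fields[0], int(fields[1]),
                   sample["GT"].replace("|", "/"),
                   int(depth) if depth.isdigit() else 0)


def confident_homozygotes(calls: Iterable[Call], min_depth: int = 15) -> List[Call]:
    """Keep homozygous-alternate calls with depth strictly above min_depth."""
    return [c for c in calls if c.genotype == "1/1" and c.depth > min_depth]


def chance_het_looks_homozygous(depth: int) -> float:
    """Chance that all reads at a true heterozygous site carry the alternate
    allele, assuming both alleles are sampled with probability 0.5."""
    return 0.5 ** depth


def row(chrom: str, pos: int, gt: str, dp: int) -> str:
    """Build one invented, tab-separated VCF data line for the toy example."""
    return "\t".join([chrom, str(pos), ".", "A", "G", "50", "PASS", ".",
                      "GT:DP", f"{gt}:{dp}"])


toy_vcf = [
    "##fileformat=VCFv4.1",
    row("1", 1000, "1/1", 6),    # homozygous call, but too shallow to trust
    row("1", 2000, "1/1", 40),   # homozygous call with good coverage
    row("2", 3000, "0/1", 35),   # heterozygous, not part of this filter
]
kept = confident_homozygotes(parse_calls(toy_vcf))
print([(c.chrom, c.pos) for c in kept])   # [('1', 2000)]
print(chance_het_looks_homozygous(6))     # 0.015625
```

Under these assumptions, about 1.6% of true heterozygous sites would show only the alternate allele at 6x, while above 15x the chance drops below 1 in 30,000, which is why the stricter cutoff is worth the lost variants for homozygous calls.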

Omicia also allows filtering variants by collections of genes curated by their association to disease. One of these collections is what they call the "top 10" list, which includes the most actionable genes for a given disease/area (e.g. Alzheimer's, cancer, Parkinson’s, cardiology, epilepsy, psychiatric, aging, respiratory) selected by a panel of experts. This list is short enough that I was able to look at all variants in these genes carefully, but I mostly focused on variants with high probability of being damaging by using the impact scores. Most disease areas came up empty, which is already great news, with a few exceptions of heterozygous and thus less concerning variants. One of these exceptions was a likely deleterious variant in a gene involved in Alzheimer’s disease, A2M. That doesn’t sound encouraging. Now, evidence suggests this gene is involved in the clearance of components of the beta-amyloid deposits, which could make the disease advance faster, but there are no reports linking this to the onset of the disease. I think I am good for now.

Some advice here: beware of long genes such as BRCA1 or HTT (the Huntington’s gene), which, because they are very long, frequently carry benign heterozygous variants that have not been associated with disease. When I saw HTT come up in the disease areas, I was suddenly concerned. However, the variant I carry was not the repeat expansion associated with Huntington’s disease (which, by the way, would be very difficult to identify with the current technology), so this is likely just a benign variant (I also lack any family history of such disease). In addition, some heterozygous benign variants in the BRCA1 gene showed up, but these are common in the population. There is also a set of highly polymorphic genes that often show damaging variants, such as olfactory receptors, mucins, etc., which are rarely interesting, so filtering them out is appropriate and easily done. Also beware of disease-associated variants in homozygous state where the alternative allele (different from the reference) is the common one in the population - this just means that the reference genome assembly carries the rare or disease-associated allele, and illustrates a problem with the standard analysis where only discrepancies to the reference are reported as variants. You can filter those by excluding variants with alternate allele frequency between 0.5 and 1.
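The last filter is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the tool's own logic: the record layout (dictionaries with an alt_freq key), the function names and the toy variants are all hypothetical, and keeping variants with no frequency information is my own assumption.

```python
from typing import Dict, List, Optional


def alt_is_major_allele(alt_freq: Optional[float]) -> bool:
    """True when the alternate allele frequency lies between 0.5 and 1,
    i.e. the reference assembly carries the less common allele."""
    return alt_freq is not None and 0.5 <= alt_freq <= 1.0


def drop_reference_minor_sites(variants: List[Dict]) -> List[Dict]:
    """Remove variants whose alternate allele is the common one.

    Variants with no frequency information are kept (an assumption).
    """
    return [v for v in variants if not alt_is_major_allele(v["alt_freq"])]


# Invented records, for illustration only.
toy_variants = [
    {"id": "var_a", "genotype": "1/1", "alt_freq": 0.92},  # reference is the rare allele
    {"id": "var_b", "genotype": "1/1", "alt_freq": 0.03},  # genuinely rare variant
    {"id": "var_c", "genotype": "0/1", "alt_freq": None},  # no frequency data
]
remaining = drop_reference_minor_sites(toy_variants)
print([v["id"] for v in remaining])  # ['var_b', 'var_c']
```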

Next, I wanted to see if rarer, novel variants that I carry, and presumably my close ancestors carried[3], could damage genes of medical relevance. Omicia has curated sets of genes linked by literature to medical specialty areas, following the index of Harrison's textbook of internal medicine. This is a much bigger list of genes, so I restricted my search again to likely deleterious variants, but in this case I also filtered by allele frequency to variants of less than 5% in global populations (those without frequency information are considered zero). Here, many more genes surface, from a variety of phenotypes, and this is where the big challenge lies: how to evaluate new variants that are not specifically associated with disease, but could be more damaging to medically important genes? I think this is where I will spend much more time, chasing these variants and awaiting new methods to deal with this conundrum.
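The combination of filters in this step can be sketched as follows. Again, this is a rough Python illustration under my own assumptions rather than the tool's implementation: the gene names are placeholders, the likely_damaging flag stands in for whatever impact scores one chooses to trust, and the function names are invented. The one rule taken directly from the text is that a missing allele frequency counts as zero.

```python
from typing import Dict, List, Optional, Set


def is_rare(alt_freq: Optional[float], max_freq: float = 0.05) -> bool:
    """Rare means a global allele frequency below max_freq (5% by default).
    Variants without frequency information are treated as frequency zero."""
    freq = 0.0 if alt_freq is None else alt_freq
    return freq < max_freq


def rare_damaging_in_gene_set(variants: List[Dict], gene_set: Set[str],
                              max_freq: float = 0.05) -> List[Dict]:
    """Keep variants that fall in the gene set, are flagged as likely
    damaging, and are rare."""
    return [v for v in variants
            if v["gene"] in gene_set
            and v["likely_damaging"]
            and is_rare(v["alt_freq"], max_freq)]


# Placeholder gene set and invented variants, for illustration only.
specialty_genes = {"GENE_A", "GENE_B"}
toy_variants = [
    {"gene": "GENE_A", "alt_freq": None, "likely_damaging": True},   # novel: kept
    {"gene": "GENE_A", "alt_freq": 0.20, "likely_damaging": True},   # too common
    {"gene": "GENE_B", "alt_freq": 0.01, "likely_damaging": False},  # predicted benign
    {"gene": "GENE_C", "alt_freq": 0.01, "likely_damaging": True},   # outside the set
]
hits = rare_damaging_in_gene_set(toy_variants, specialty_genes)
print(len(hits))  # 1
```

Treating missing frequencies as zero keeps every novel variant in play, which is the point of this search, at the cost of also keeping variants that are simply absent from the frequency databases.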

Finally, a dream comes true. In my early days as a geneticist I experimented with mutations in the genome of the bacterium E. coli that introduce a premature termination signal in a gene instead of an amino acid, and result in truncated proteins when the gene is translated. This often results in loss of a critical function, say for example, one needed by a bacteriophage (a bacterial virus) to infect the bacterium and grow. These mutations were so special that they carried mystical names: amber, opal, ochre[4]. Since then I had been wondering when I could do the first real experiment on my genome: Do I carry any “loss of function” mutations in any of the 20,000 genes? This list is likely short. Daniel MacArthur and collaborators from the 1000 Genomes Project just published a nice paper in Science about the abundance of loss of function variants among human populations[5], and here I am, just weeks later, doing the same analysis in my exome in just a few minutes. How cool is that? In total my exome carried 41 stop gained variants, 5 of them in homozygous state. This identified 2 additional potential carrier mutations in heterozygous state – the stop codon fell upstream of mutations previously reported in rare autosomal recessive diseases. My eyes lit up, and, of course, I was fervently following the direct links to the primary literature provided by Omicia. At first sight, probably nothing to worry about for me - but this is information definitely worth sharing with my children.
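For readers who want to repeat this count on their own annotated variants, here is a minimal Python sketch. It assumes each variant record already carries an effect label (I use the hypothetical string "stop_gained") and a VCF-style genotype; the function names and the toy records are invented.

```python
from collections import Counter
from typing import Dict, List


def zygosity(genotype: str) -> str:
    """Classify a diploid VCF-style genotype such as '0/1' or '1|1'."""
    alleles = genotype.replace("|", "/").split("/")
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"


def tally_stop_gained(variants: List[Dict]) -> Counter:
    """Count stop-gained variants by zygosity.

    Assumes every record is a real variant call, so a homozygous genotype
    means two copies of the alternate allele.
    """
    return Counter(zygosity(v["genotype"]) for v in variants
                   if v["effect"] == "stop_gained")


# Invented records, for illustration only.
toy_variants = [
    {"gene": "GENE_A", "effect": "stop_gained", "genotype": "0/1"},
    {"gene": "GENE_B", "effect": "stop_gained", "genotype": "1/1"},
    {"gene": "GENE_C", "effect": "missense", "genotype": "1/1"},
    {"gene": "GENE_D", "effect": "stop_gained", "genotype": "0|1"},
]
counts = tally_stop_gained(toy_variants)
print(counts["heterozygous"], counts["homozygous"])  # 2 1
```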

An additional complexity in this analysis is that I am of mixed ancestry, with ancestors originating in different continental populations that were isolated for a long time and mixed relatively recently. What is the effect of this mixed ancestry on my susceptibility to disease? Most disease studies have been carried out in populations of European descent. Therefore, there may be novel susceptibility or protective variants in other continental populations that underwent different population bottlenecks, expansions, and perhaps even adaptations to new environments. A rare variant in Europeans could be common in other continents due to genetic drift. And if rare variants in my immediate "clan" have a stronger influence on my health, these are less likely to be shared between populations[6]. Again, more data is needed and I just hope that “GWAS fatigue” doesn’t kill the studies in other continental populations that would help illuminate the non-European part of my genome, as suggested previously by Carlos Bustamante, Esteban Burchard and me[7].

At this point the analysis of my exome seems a bit boring as it appears I have a relatively healthy genome - nothing too serious, a handful of complex disease-associated variants for diseases that so far I don’t suffer from, and new personal variants of unknown significance. And I could discover that in a few hours doing my own “genome project” (all right; exome project) – this is the closest thing to instant gratification in genomics. More work and new knowledge are needed to assess the impact of my novel variants, and to combine these with the multiple disease-associated susceptibility alleles in scores that predict whether I carry an extra genetic “load” for a given disease. I will definitely keep an eye on the variants I found in the disease gene sets, and will keep re-analyzing them.

While I try to figure all this out, and await my full genome sequencing to find out whether I carry regulatory or structural variants of consequence, I think I will follow the wise advice from my wife – eat better, do more exercise, and enjoy life while you can.

Literature cited

1. B. Moore et al., Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics, Genetics in Medicine 13, 210–217 (2011).
2. J. Hampe et al., A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1, Nat. Genet. 39, 207–211 (2007).
3. J. R. Lupski, J. W. Belmont, E. Boerwinkle, R. A. Gibbs, Clan Genomics and the Complex Architecture of Human Disease, Cell 147, 32–43 (2011).
4. F. W. Stahl, The amber mutants of phage T4, Genetics 141, 439–442 (1995).
5. D. G. MacArthur et al., A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes, Science 335, 823–828 (2012).
6. A. Keinan, A. G. Clark, Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants, Science 336, 740–743 (2012).
7. C. D. Bustamante, E. G. Burchard, F. M. De La Vega, Genomics for the world, Nature 475, 163–165 (2011).