June 9, 2020

Largest catalog of human genetic diversity

At a Glance

  • Researchers have created a massive catalog of human genome data, along with tools to understand it.
  • Using DNA from over 140,000 people, they analyzed genomic variation, how variants affect gene function, and which may cause disease or serve as new drug targets.
A set of new papers shows the potential of the new gnomAD resource, which includes DNA sequences from more than 140,000 individuals, including people from Asia and Africa. Image credit: Mkarco / iStock / Getty Images Plus

The genome is the complete set of your DNA, including all of your genes. The human genome was first decoded nearly two decades ago. The genetic sequencing of thousands of genomes has allowed researchers to begin to understand how the human body is built and maintained.

But each person’s genome is unique. Not enough genomes have been sequenced to understand all the ways that genetic variation can contribute to disease. To better understand the genetic diversity of the human genome, the Genome Aggregation Database (gnomAD) Consortium was formed over eight years ago to collect and study the genomes of people around the world.

The international gnomAD team of over 100 scientists released its first set of discoveries in a collection of seven papers published on May 27, 2020, in Nature, Nature Communications, and Nature Medicine. The work was funded in part by several NIH institutes (see the Funding section below for the full list).

The flagship paper cataloged the genetic variation in both the protein coding and non-coding regions of human DNA. Included were more than 125,000 exomes (which include only the parts that code for proteins) and 15,000 whole genomes, from populations in Europe, East and South Asia, Africa, and more. The researchers identified a total of 241 million variants that were either small single point mutations (changes in a single DNA building block, called a nucleotide) or insertions or deletions of short pieces of DNA.

The team explored how likely certain variants are to cause a loss of function in the proteins produced from the gene. Protein-coding genes were categorized based on their ability to tolerate genetic variations without being disrupted or inactivated by them. This analysis found more than 443,000 genetic variants that were likely to cause a loss of protein function.
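The article does not spell out how tolerance was scored. As a rough sketch, one simple way to express the idea is to compare the number of loss-of-function variants observed in a gene with the number that would be expected if the gene were not under selection. The function names, the example counts, and the cutoff values below are hypothetical and chosen only for illustration; they are not the consortium's published method.

```python
def loss_of_function_ratio(observed: int, expected: float) -> float:
    """Ratio of observed to expected loss-of-function variants in one gene."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def tolerance_category(ratio: float,
                       intolerant_below: float = 0.35,
                       tolerant_above: float = 0.8) -> str:
    """Bin a gene by its ratio. Both cutoffs are hypothetical illustration values."""
    if ratio < intolerant_below:
        return "likely intolerant"
    if ratio > tolerant_above:
        return "likely tolerant"
    return "intermediate"


# Hypothetical genes: (observed, expected) loss-of-function variant counts.
example_genes = {"GENE_A": (2, 20.0), "GENE_B": (10, 20.0), "GENE_C": (18, 20.0)}
for name, (obs, exp) in example_genes.items():
    ratio = loss_of_function_ratio(obs, exp)
    print(name, round(ratio, 2), tolerance_category(ratio))
```

A gene with far fewer loss-of-function variants than expected is a candidate for being intolerant to disruption, since such variants appear to have been removed from the population over time.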

The second paper explored why mutations identified as likely to cause a loss of function don’t always cause the problems that might be expected. The team found that such variants are within segments of DNA that are often spliced out of the final mRNA copies of the gene used to produce proteins.

A third paper detailed the analysis of more than 433,000 structural variants in the human genome. Structural variants are changes that span long stretches of DNA, of at least 50 nucleotides. Structural variants were less likely to appear in protein coding regions than in non-protein coding regions. The team estimated that only about 0.13% of people carry a structural variant with any clinical significance.
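The size-based distinction between small variants and structural variants can be sketched in a few lines. This is a simplified illustration that uses only the 50-nucleotide threshold stated above; the function name and the allele-string representation are assumptions, and real structural variants (such as inversions or duplications) cannot always be described by a simple length difference.

```python
def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Classify a variant by comparing reference and alternate allele lengths.

    Simplification: only length differences are considered, so rearrangements
    that do not change length are not recognized as structural variants.
    """
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "single-nucleotide variant"
        return "multi-nucleotide substitution"
    size = abs(len(ref) - len(alt))
    if size >= sv_min_length:
        return "structural variant"
    return "short insertion" if len(alt) > len(ref) else "short deletion"


print(classify_variant("A", "G"))
print(classify_variant("A", "ATG"))
print(classify_variant("A" * 60, "A"))
```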

The fourth paper explored how loss-of-function variants could be used to identify new drug targets. The fifth paper provided an example of how gnomAD could be used to validate drug targets. It analyzed the effects of loss-of-function variants in a gene called LRRK2, which has been associated with Parkinson’s disease. The results suggest that therapies to inhibit the LRRK2 protein would be unlikely to cause severe side effects.

The sixth paper described the impacts of variants in the region that sits immediately before the protein coding region of genes, called the 5’ untranslated region. The researchers identified specific genes where variants in this region could lead to disease. One novel variant they uncovered was tied to neurofibromatosis. Finally, the last paper showed how gnomAD could be used to analyze multi-nucleotide variants—clusters of two or more variants that are often inherited together.

“The wide-ranging impact this resource has already had on medical research and clinical practice is a testament to the incredible value of genomic data sharing and aggregation,” says Dr. Daniel MacArthur at the Broad Institute of MIT and Harvard, who is a lead author on the papers. “More than 350 independent studies have already made use of gnomAD for research on cancer predisposition, cardiovascular disease, rare genetic disorders, and more since we made the data available.”

The consortium’s next steps are to expand gnomAD to increase the number of genomes and diversity of populations included. “We are very far from saturating discoveries or solving variant interpretation,” MacArthur says. “The next steps for the consortium will be focused on increasing the size and population diversity of these resources, and linking the resulting massive-scale genetic data sets with clinical information.”

—by Tianna Hicklin, Ph.D.

References: 

Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, Walters RK, Tashman K, Farjoun Y, Banks E, Poterba T, Wang A, Seed C, Whiffin N, Chong JX, Samocha KE, Pierce-Hoffman E, Zappala Z, O'Donnell-Luria AH, Minikel EV, Weisburd B, Lek M, Ware JS, Vittal C, Armean IM, Bergelson L, Cibulskis K, Connolly KM, Covarrubias M, Donnelly S, Ferriera S, Gabriel S, Gentry J, Gupta N, Jeandet T, Kaplan D, Llanwarne C, Munshi R, Novod S, Petrillo N, Roazen D, Ruano-Rubio V, Saltzman A, Schleicher M, Soto J, Tibbetts K, Tolonen C, Wade G, Talkowski ME; Genome Aggregation Database Consortium, Neale BM, Daly MJ, MacArthur DG. Nature. 2020 May;581(7809):434-443. doi: 10.1038/s41586-020-2308-7. Epub 2020 May 27. PMID: 32461654.

Cummings BB, Karczewski KJ, Kosmicki JA, Seaby EG, Watts NA, Singer-Berk M, Mudge JM, Karjalainen J, Satterstrom FK, O'Donnell-Luria AH, Poterba T, Seed C, Solomonson M, Alföldi J; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Daly MJ, MacArthur DG. Nature. 2020 May;581(7809):452-458. doi: 10.1038/s41586-020-2329-2. Epub 2020 May 27. PMID: 32461655.

Collins RL, Brand H, Karczewski KJ, Zhao X, Alföldi J, Francioli LC, Khera AV, Lowther C, Gauthier LD, Wang H, Watts NA, Solomonson M, O'Donnell-Luria A, Baumann A, Munshi R, Walker M, Whelan CW, Huang Y, Brookings T, Sharpe T, Stone MR, Valkanas E, Fu J, Tiao G, Laricchia KM, Ruano-Rubio V, Stevens C, Gupta N, Cusick C, Margolin L; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Taylor KD, Lin HJ, Rich SS, Post WS, Chen YI, Rotter JI, Nusbaum C, Philippakis A, Lander E, Gabriel S, Neale BM, Kathiresan S, Daly MJ, Banks E, MacArthur DG, Talkowski ME. Nature. 2020 May;581(7809):444-451. doi: 10.1038/s41586-020-2287-8. Epub 2020 May 27. PMID: 32461652.

Minikel EV, Karczewski KJ, Martin HC, Cummings BB, Whiffin N, Rhodes D, Alföldi J, Trembath RC, van Heel DA, Daly MJ; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Schreiber SL, MacArthur DG. Nature. 2020 May;581(7809):459-464. doi: 10.1038/s41586-020-2267-z. Epub 2020 May 27. PMID: 32461653.

Whiffin N, Armean IM, Kleinman A, Marshall JL, Minikel EV, Goodrich JK, Quaife NM, Cole JB, Wang Q, Karczewski KJ, Cummings BB, Francioli L, Laricchia K, Guan A, Alipanahi B, Morrison P, Baptista MAS, Merchant KM; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Ware JS, Havulinna AS, Iliadou B, Lee JJ, Nadkarni GN, Whiteman C; 23andMe Research Team, Daly M, Esko T, Hultman C, Loos RJF, Milani L, Palotie A, Pato C, Pato M, Saleheen D, Sullivan PF, Alföldi J, Cannon P, MacArthur DG. Nat Med. 2020 May 27. doi: 10.1038/s41591-020-0893-5. Online ahead of print. PMID: 32461697.

Whiffin N, Karczewski KJ, Zhang X, Chothani S, Smith MJ, Evans DG, Roberts AM, Quaife NM, Schafer S, Rackham O, Alföldi J, O'Donnell-Luria AH, Francioli LC; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Cook SA, Barton PJR, MacArthur DG, Ware JS. Nat Commun. 2020 May 27;11(1):2523. doi: 10.1038/s41467-019-10717-9. PMID: 32461616.

Wang Q, Pierce-Hoffman E, Cummings BB, Alföldi J, Francioli LC, Gauthier LD, Hill AJ, O'Donnell-Luria AH; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium, Karczewski KJ, MacArthur DG. Nat Commun. 2020 May 27;11(1):2539. doi: 10.1038/s41467-019-12438-5. PMID: 32461613.

Funding: NIH’s National Institute of General Medical Sciences (NIGMS), National Human Genome Research Institute (NHGRI), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute of Mental Health (NIMH), National Heart, Lung, and Blood Institute (NHLBI), National Institute of Allergy and Infectious Diseases (NIAID), National Center for Advancing Translational Sciences (NCATS), National Institute of Dental and Craniofacial Research (NIDCR), and National Center for Research Resources (NCRR); Swiss National Science Foundation; BioMarin Pharmaceutical Inc.; Sanofi Genzyme Inc.; Broad Institute; Wellcome Trust; Medical Research Council (UK); University of Sheffield; Barts Charity; Health Data Research UK; NHS National Institute for Health Research; Rosetrees/Stoneygate Imperial College; Simons Foundation; National Science Foundation; Desmond and Ann Heathwood; Southern California Diabetes Endocrinology Research Center; Michael J. Fox Foundation; Estonian Research Council; Royal Brompton and Harefield NHS Foundation; Imperial College London; Fondation Leducq; Department of Health, UK; Imperial College Academic Health Science Centre; Nakajima Foundation Scholarship.