Genetic Assignment Methods for Gaining Insight into the Management of Infectious Disease by Understanding Pathogen, Vector, and Host Movement


Affiliation Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, Georgia, United States of America

Affiliation Institute of Parasitic Disease, Sichuan Provincial Center for Disease Control and Prevention, Chengdu, Sichuan, People's Republic of China

Affiliation Environmental Health Sciences, School of Public Health, University of California, Berkeley, Berkeley, California, United States of America

Affiliation School of Marine and Tropical Biology, James Cook University, Townsville, Queensland, Australia

  • Justin V. Remais, 
  • Ning Xiao, 
  • Adam Akullian, 
  • Dongchuan Qiu, 
  • David Blair

PLOS

Published: April 28, 2011

  • https://doi.org/10.1371/journal.ppat.1002013

Table 1

For many pathogens with environmental stages, or those carried by vectors or intermediate hosts, disease transmission is strongly influenced by pathogen, host, and vector movements across complex landscapes, and thus quantitative measures of movement rate and direction can reveal new opportunities for disease management and intervention. Genetic assignment methods are a set of powerful statistical approaches useful for establishing population membership of individuals. Recent theoretical improvements allow these techniques to be used to cost-effectively estimate the magnitude and direction of key movements in infectious disease systems, revealing important ecological and environmental features that facilitate or limit transmission. Here, we review the theory, statistical framework, and molecular markers that underlie assignment methods, and we critically examine recent applications of assignment tests in infectious disease epidemiology. Research directions that capitalize on use of the techniques are discussed, focusing on key parameters needing study for improved understanding of patterns of disease.

Citation: Remais JV, Xiao N, Akullian A, Qiu D, Blair D (2011) Genetic Assignment Methods for Gaining Insight into the Management of Infectious Disease by Understanding Pathogen, Vector, and Host Movement. PLoS Pathog 7(4): e1002013. https://doi.org/10.1371/journal.ppat.1002013

Editor: Marianne Manchester, University of California San Diego, United States of America

Copyright: © 2011 Remais et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported in part by the Ecology of Infectious Disease program of the National Science Foundation under Grant No. 0622743, by the National Institute for Allergy and Infectious Disease (grant K01AI091864), and the Global Health Institute Faculty Distinction Fund at Emory University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

For many infectious diseases, transmission is strongly influenced by pathogen, host, and vector migration across complex landscapes [1] . This is especially true for pathogens with environmental stages, or those carried by vectors and intermediate hosts. The spread of rabies, for instance, has been shown to be regulated by rivers that act as barriers to host movement [2] , and the onset of diseases such as measles or foot-and-mouth disease is governed in part by human or animal hosts migrating across heterogeneous landscapes [3] , [4] . Disease persistence, synchrony, and establishment are known to be modified by host migrations between populations [5] – [9] , and thus direct measures of migration rates in real transmission systems are very much needed to optimize disease management and improve intervention campaigns.

Genetic assignment methods can provide such measures; they are a set of powerful statistical approaches that, at their most basic, can be used to establish population membership of individuals. When applied to organisms distributed among spatially distinct, interconnected populations, the techniques can be used to derive quantitative estimates of movement across a network, and determine the degree to which landscape features aid or impede movement. Genetic assignment methods have, for the most part, been limited to applications in ecology and conservation biology. This is despite their utility for estimating the magnitude and direction of key movements in infectious disease systems, where they could reveal important environmental and ecological features that facilitate or limit the spread of disease with important implications for control.

For example, estimates of pathogen transport can be used to design more efficient anthelmintic treatment campaigns for important macroparasites of humans [10] , and where environmental change is occurring, estimates of the associated change in migration can aid in the identification of new risks that arise from vectors and hosts moving effectively closer than they have been historically [1] . Genetic assignment tests (ATs) have potential for estimating these pathogen, host, and vector movements, and recent improvements in theory underpinning ATs have increased their utility at fine spatial and temporal scales, while overcoming the cost, time, and scale limitations of traditional approaches such as mark-recapture experiments [11] . Here, we discuss the molecular and statistical methodologies that make possible the application of ATs. We review current applications of ATs in infectious disease epidemiology, and discuss research directions that are positioned to capitalize on use of the techniques. We use the term “migration” to encompass the movement of human hosts, the dispersal of animal hosts and vectors, and the transport of pathogens in environmental media (e.g., flowing water).

Estimating Migration Rates

While many free-living pathogens, vectors, and intermediate hosts are capable of moving several kilometers, their specific mobilities are rarely estimated or incorporated into efforts to control disease [10], [12]. Historically, ecological migration rates were estimated using direct measures such as mark-recapture and radio tagging, which present limitations when applied to small organisms, large populations with small numbers of migrants, or organisms that are difficult to durably mark [13]. Indirect genetic methods are also available, such as inferring Nm, the number of migrants exchanged between populations per generation, using gene flow estimators based on Wright's infinite island model [14], [15]. This approach makes a number of simplifying assumptions, including symmetrical, constant migration and constant population size; these assumptions were partially relaxed with the development of coalescent-based methods [16].
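As a rough illustration of the indirect approach: under Wright's infinite island model at equilibrium, the fixation index is approximately F_ST ≈ 1/(4Nm + 1) for diploid nuclear loci. The sketch below simply inverts that relation; the function name and the example F_ST values are hypothetical and not taken from any study reviewed here.

```python
def nm_from_fst(fst: float) -> float:
    """Approximate migrants per generation (Nm) under Wright's island model.

    Inverts the equilibrium relation F_ST ~= 1 / (4*Nm + 1), which assumes
    symmetric, constant migration and constant population size.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("F_ST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


# Hypothetical F_ST values, for illustration only.
for fst in (0.01, 0.05, 0.20):
    print(f"F_ST = {fst:.2f} -> Nm ~= {nm_from_fst(fst):.2f}")
```

Because the estimate rests entirely on the equilibrium assumptions listed above, it is best read as a long-term average rather than a measure of current movement.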

Coalescent theory describes the statistical properties of gene trees under a standard demographic model (namely the Fisher-Wright model). Present day samples of a non-recombining gene can be seen as lying on a branch of a gene tree rooted at the most recent common ancestor of the sample. Moving backward in time from each branch, genes coalesce until the common ancestor is reached, and in this way, present-day samples can be used to infer the past, including past migration among mating populations. Coalescent-based estimates of migration rates, obtained by comparison of allele frequency distributions observed in population samples, assume that all potential source populations have been sampled and that populations have followed relatively simple demographic progressions (constant size or deterministic expansion) while experiencing constant migration [16] , [17] . Migration rates obtained in this fashion reflect the effect of migration occurring over long time scales, and do not reflect (i.e., are insensitive to) contemporary changes such as interventions (e.g., vector control) and recent environmental change. ATs, through the combination of highly variable genetic markers with Bayesian statistical methods, allow the estimation of recent migration rates that strongly reflect the influence of contemporary changes.

Assignment Tests

ATs use multilocus genotypes to identify the source population of individuals that have migrated within the past several generations [18] . Early ATs estimated the probability of an individual's multilocus genotype in relation to the frequency of alleles at different loci in potential source populations. After all sampled individuals were assigned, the migration rate between two populations was estimated by dividing the number of identified migrants by the sample size of the origin population [18] – [20] . A notable recent Bayesian method [21] directly estimates migration rates (and infers inbreeding coefficients and individual migrant ancestries) by detecting the temporary disequilibrium in immigrants' genotypes relative to the population under consideration, while relaxing the assumption that genotypes within subpopulations are in Hardy–Weinberg equilibrium. A related class of clustering methods [19] , [22] , [23] aims to partition individuals into genetically distinct subpopulations without prior assumptions about population membership; i.e., the methods calculate the probability that each individual genotype originates from one of K populations, with K , the number of subpopulations, among the inferred parameters.
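The frequency-based logic of the early ATs can be sketched as follows. This is a minimal illustration, assuming diploid genotypes, Hardy-Weinberg proportions within populations, independent loci, and a small floor frequency for alleles not observed in a population; the allele labels, frequencies, and function names are hypothetical. The denominator of the migration rate (the number of individuals sampled in the population where the migrants were caught) is one reading of the rule described above.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs, floor=0.01):
    """Log-probability of a diploid multilocus genotype in one population.

    genotype: list of (allele_a, allele_b) tuples, one per locus.
    allele_freqs: list of dicts (one per locus) mapping allele -> frequency.
    Alleles absent from a population get a small floor frequency
    (an ad hoc choice made for this sketch).
    """
    total = 0.0
    for (a, b), freqs in zip(genotype, allele_freqs):
        pa = max(freqs.get(a, 0.0), floor)
        pb = max(freqs.get(b, 0.0), floor)
        prob = pa * pa if a == b else 2.0 * pa * pb
        total += math.log(prob)
    return total


def assign(genotype, populations):
    """Return the name of the population with the highest genotype likelihood."""
    return max(
        populations,
        key=lambda name: genotype_log_likelihood(genotype, populations[name]),
    )


# Hypothetical allele frequencies at two microsatellite loci.
populations = {
    "A": [{"120": 0.8, "124": 0.2}, {"200": 0.7, "204": 0.3}],
    "B": [{"120": 0.1, "124": 0.9}, {"200": 0.2, "204": 0.8}],
}

# Each sample: (population where it was collected, multilocus genotype).
samples = [
    ("A", [("120", "120"), ("200", "200")]),
    ("A", [("120", "124"), ("200", "200")]),
    ("A", [("124", "124"), ("204", "204")]),  # genotype typical of B
    ("B", [("124", "124"), ("204", "204")]),
]

n_a = sum(1 for pop, _ in samples if pop == "A")
migrants_into_a = sum(
    1 for pop, g in samples if pop == "A" and assign(g, populations) != "A"
)
print("Estimated migration rate into A:", migrants_into_a / n_a)
```

With shared alleles, as in this toy data, every individual has a nonzero likelihood in both populations, which is why the assignment is probabilistic rather than exact.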

Bayesian models (also known as fully probabilistic models) provide a convenient means to deal with complex (and inherently stochastic) phenomena that determine the genetic properties of individuals and populations [24]. Like other Bayesian approaches, Bayesian ATs take the position that model parameters and data are random variables with a joint probability distribution specified by a probabilistic model. The model structure and parameters proposed by Wilson and Rannala's [21] notable recent method are described in detail in Text S1. The data and parameters of the inference model implemented in [21] are summarized in Table S1, and Figure S1 shows a probabilistic graphical model indicating the conditional dependencies in [21]. Population assignment is a trivial task if there are fixed differences (no shared alleles) between populations. However, this is rarely the case: typically historical connections, ongoing gene flow, and perhaps convergent evolution lead to the sharing of alleles between populations. Consequently, computationally intensive approaches are required to identify the likely source population of any given individual (see Text S1). Software implementations of Bayesian and maximum likelihood–based methods for inferring migration and population clustering parameters are widely available (Table 1). The extent of population differentiation, the number of individuals that can be sampled, the number of loci, and the specific genetic markers and their polymorphism all interact in determining the power of any approach [25]. Markers appropriate for ATs are reviewed in detail in Text S2, and different classes of genetic markers and their corresponding advantages and disadvantages are summarized in Table S2.

Table 1 (table body not reproduced here).

https://doi.org/10.1371/journal.ppat.1002013.t001

Application of ATs in Infectious Disease Systems

Recent infectious disease applications of ATs have estimated pathogen, vector, and host dispersal characteristics in order to explain patterns of transmission and better target control activities. Here, we review four such applications.

Case 1: Chagas Disease

In the absence of a vaccine or effective therapeutics, Chagas disease control is largely dependent on elimination of the vector, members of the genus Triatoma, using insecticides. The hematophagous triatomines carry Trypanosoma cruzi, the protozoan parasite that causes Chagas disease in much of Latin America. The insects are present in sylvatic and peridomestic populations, with transient and seasonal invasion of homes leading to blood meals and transmission [26]. In the Mexican Yucatán, Dumonteil, Tripet, and colleagues [26] evaluated the genetic structure of T. dimidiata to assess dispersal of individuals, better understand domestic infestation, and inform vector control. Insects were sampled from domestic, peridomestic, and sylvatic populations, genotyped at eight microsatellite loci, and analyzed using F statistics and both Bayesian- and likelihood-based ATs [18], [27]. The authors found that T. dimidiata is capable of dispersal over large geographic distances in the Yucatán Peninsula (up to 280 km), as suggested by low population differentiation and weak genetic structure. In this case, ATs provided a clearer picture than conventional Fst, allowing for the identification of immigrants even among populations with low genetic differentiation and no detectable correlation between genetic and geographic distance (isolation by distance). ATs indicated that 10%–22% of the insects collected within homes were immigrants from the peridomestic and sylvatic areas. Dispersal was detected in the opposite direction as well, with several insects in peridomestic and sylvatic areas having originated from populations within homes. The ecological basis of genetic structure in this study provided dispersal information that supports pesticide application and refuge removal in peridomestic areas. This zone appears to serve as an important “transit area” between sylvatic and domestic populations, contributing to household reinfestation after control, and largely agreeing with the findings from a small study in Bolivia [28].

Case 2: Coccidioides Species

The Coccidioides soil fungi, found in arid zones of the southwestern United States and northwestern Mexico, can cause community-acquired pneumonia and severe disseminated disease (coccidioidomycosis) when inhaled by a vertebrate host [29] . Several western US states have seen dramatic increases in the incidence of coccidioidomycosis (from 2.5 to 8.4 cases per 100,000 in California between 1996 and 2006, and from 21 to 91 cases per 100,000 in Arizona between 1997 and 2006), raising the need for improved surveillance measures [30] , [31] . The diagnosis and clinical management of coccidioidomycosis in areas such as New York, where the disease is not endemic, pose unique challenges, and the source of Coccidioides infections in these settings is poorly understood. To improve molecular surveillance, identify sources of infection, and allow the early detection and management of outbreaks, Fisher et al. [32] used an AT to assign Coccidioides spp. clinical isolates to their populations of origin. The application of ATs to these organisms was complicated by their haploid, rather than diploid, genome, requiring the authors to modify existing AT methods.

More than 160 isolates from eight geographical populations of Coccidioides immitis and Coccidioides posadasii were genotyped at nine microsatellite loci. Isolates were both clinical and environmental in origin, and spanned the worldwide distribution of Coccidioides spp. Sixteen clinical isolates of unknown origin were obtained from patients diagnosed in the nonendemic state of New York. Using a modified AT procedure, 12 of these isolates were assigned to source populations with high probability, most to a source that matched the recent travel history of the patient. Thus, source identification in this nonendemic area was able to detect common-source infections. In two cases, however, travel history did not match assignment, raising questions about whether genetic differentiation was driven by host travel or pathogen dispersal; either an incomplete travel history or exposure to an isolate that had dispersed a great distance could explain the mismatches [32] .

Case 3: Hosts and Vectors of Yersinia pestis

Yersinia pestis, the bacterium that causes plague, is readily passed between wildlife and humans via flea vectors. In the plains regions of North America, black-tailed prairie dogs (Cynomys ludovicianus) live in high-density, communal colonies that favor the spread of plague, making this species an important host for Y. pestis. Oropsylla hirsuta is a flea very commonly associated with C. ludovicianus, and is thought to contribute substantially to Y. pestis transmission [33]. Because fleas (and many other ectoparasitic disease vectors) rely on their hosts for dispersal, quantifying host movement can aid in understanding the spread of flea-borne diseases. In a study in the northern US, Jones and Britten [33] investigated the role that prairie dogs play in dispersing fleas infected with Y. pestis. The dominant hypothesis in this transmission system, and many others, is that host movements determine vector movements, and thus concordance between host and vector population genetic characteristics would be expected. The study used ATs, among other genetic analyses, to test this hypothesis, sampling 112 prairie dogs from six colonies in north-central Montana and genotyping them at 14 microsatellite loci. At the same time, 84 fleas were collected directly from prairie dog burrows and genotyped at seven microsatellite loci. Genetic structure and variability were analyzed using multiple methods, including the estimation of recent migration rates of prairie dogs and fleas using the Bayesian technique described in detail in Text S1 [21].

The authors found that the host and vector differed widely in genetic structure: prairie dog hosts exhibited low intercolony migration (eight of 30 intercolony migration rates showed m ≥0.05), and the scale of their genetic neighborhood was on the order of a typical colony size. In contrast, the vector was well mixed, showing considerable migration between colony pairs (22 of 30 intercolony migration rates showed m ≥0.05) and limited colony-level population structure. Because fleas and prairie dog hosts sampled from the same locations show limited concordance in population genetics, it is likely that prairie dogs are not the primary means of O. hirsuta dispersal in these colonies. Thus, the authors concluded that other hosts should be considered when responding to plague outbreaks, as O. hirsuta occurs on a variety of host species that may be important in dispersing Y. pestis –infected fleas [33] .
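The denominator of 30 follows from the design: six colonies give 6 × 5 = 30 directed intercolony rates. A summary of this kind can be computed from any estimated migration matrix, as in the sketch below. The 3 × 3 matrix is hypothetical, and the convention that `m[i][j]` is the fraction of population `i` made up of migrants from population `j` is an assumption of this example.

```python
def count_rates_at_least(m, threshold=0.05):
    """Count off-diagonal entries m[i][j] (i != j) that are >= threshold.

    Returns (number of rates meeting the threshold, number of directed pairs).
    """
    n = len(m)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    hits = sum(1 for i, j in pairs if m[i][j] >= threshold)
    return hits, len(pairs)


# Hypothetical migration matrix for three populations; rows sum to 1.
m = [
    [0.90, 0.08, 0.02],
    [0.01, 0.93, 0.06],
    [0.03, 0.02, 0.95],
]

hits, total = count_rates_at_least(m)
print(f"{hits} of {total} intercolony rates have m >= 0.05")
```

Comparing this count between host and vector matrices estimated for the same colonies gives a simple, if coarse, check on concordance.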

Case 4: Oral Rabies Vaccination of Raccoons

The common raccoon (Procyon lotor) is widely distributed throughout North and Central America, and is capable of occupying a broad range of habitats in close proximity to humans. P. lotor is also the most frequently reported rabid wildlife species, and is a particularly important carrier of the rabies virus in the mid-Atlantic and northeastern US. Because of the risk of transmission of rabies to humans, the US Department of Agriculture conducts routine oral rabies vaccination programs targeting P. lotor and several other important wildlife species. In a large and expensive annual program, recombinant virus vaccine is delivered to P. lotor populations in the eastern US in attractive baits. A key question in optimizing these oral rabies vaccine programs is how geographic features (e.g., rivers and mountains) can be used to better target delivery of baits along important P. lotor dispersal corridors, reducing their virus trafficking potential. In a study in southwestern Pennsylvania, Root, Puskas, and colleagues [34] used ATs to investigate which geographic features, if any, hinder or enhance P. lotor dispersal, and thus can be used to improve oral vaccination programs.

Live raccoons were trapped from five study sites distributed along valleys separated by a high elevation ridge; the authors aimed to test the hypothesis that the ridge isolated the populations on either side. DNA from a total of 185 raccoons was genotyped at nine microsatellite loci, and Bayesian clustering [19] and ATs [18] were used to assess the number of genetic clusters and infer the population of origin of P. lotor specimens. Specimens from all five study sites were found to compose a single genetic population, and few animals were assigned to their population of origin, with many assigned to sources across the ridge (i.e., sampled from one valley, but assigned to the valley on the opposite side of the ridge; [34] ). The results indicate that neither ridge nor valley features in this setting influence P. lotor dispersal, as individuals can transcend ridges and can readily traffic virus between (and within) valleys. Thus, ridge and valley features may not be suitable for use in optimizing the geographic placement of oral vaccine baits, despite the finding in other settings that major rivers and mountains may constrain P. lotor dispersal [34] .

Contemporary movements of hosts can contribute to increased frequency and intensity of malaria epidemics in some regions [35] , [36] , while transport of free-living pathogen stages can determine the effectiveness of strategies for reducing schistosomiasis infections [10] . Thus, quantifying these movements is of great interest to the study of complex epidemiological systems, and the routine use of ATs for this purpose is anticipated [24] .

Among the epidemiological methods that can benefit from ATs are spatial models of infectious disease transmission, which incorporate knowledge of the location, movement rate, and travel direction of hosts, vectors, and pathogens to explain observed patterns of transmission and evaluate intervention options. ATs can provide a quantitative description of migration between populations in transmission models, particularly in the context of network models that explicitly represent the exchange of individuals between populations [1] . Indeed, rigorous quantification of movement between nodes has been called for in network models [4] , [37] , and ATs offer a powerful alternative to traditional methods (e.g., mark-recapture) that are difficult to apply to these systems.
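To make the role of migration estimates in such network models concrete, the sketch below couples discrete-time SIR dynamics in several patches through a movement matrix. It is only an illustration under stated assumptions: the model form, the parameter values, and the convention that `move[a][b]` is the fraction of patch `a` relocating to patch `b` per time step are all hypothetical, and in practice the movement matrix is the quantity that AT-derived migration rates would inform.

```python
def step(state, beta, gamma, move):
    """Advance a multi-patch SIR model by one time step.

    state: list of (S, I, R) tuples, one per patch.
    move[a][b]: fraction of patch a's individuals relocating to patch b.
    """
    n = len(state)
    # Local transmission and recovery within each patch.
    local = []
    for s, i, r in state:
        total = s + i + r
        new_inf = beta * s * i / total if total > 0 else 0.0
        new_rec = gamma * i
        local.append((s - new_inf, i + new_inf - new_rec, r + new_rec))
    # Exchange of individuals between patches.
    after = []
    for b in range(n):
        leaving = sum(move[b][a] for a in range(n) if a != b)
        comps = tuple(
            local[b][k] * (1.0 - leaving)
            + sum(local[a][k] * move[a][b] for a in range(n) if a != b)
            for k in range(3)
        )
        after.append(comps)
    return after


def run(state, beta, gamma, move, steps):
    """Iterate the model for a fixed number of steps."""
    for _ in range(steps):
        state = step(state, beta, gamma, move)
    return state


# Infection starts in patch 0 only; all values are hypothetical.
start = [(990.0, 10.0, 0.0), (1000.0, 0.0, 0.0)]
connected = run(start, 0.3, 0.1, [[0.0, 0.02], [0.01, 0.0]], 30)
isolated = run(start, 0.3, 0.1, [[0.0, 0.0], [0.0, 0.0]], 30)
print("Infected in patch 1 with migration:", round(connected[1][1], 2))
print("Infected in patch 1 without migration:", isolated[1][1])
```

In this toy setting the second patch is invaded only when the off-diagonal movement rates are nonzero, which is why their magnitude and direction matter for the model's predictions.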

Challenging epidemiological questions can be addressed by ATs. The source of infection for recombining organisms (as opposed to those organisms where genetic structure is principally clonal) can be determined. As in the Coccidioides case, independent loci can be used to estimate the relatedness between isolates and, when combined with travel patterns of infected hosts, assignments can be used to improve surveillance in nonendemic areas, leading to the identification of common source cases that may have otherwise gone undiagnosed [32] . Moreover, ATs can also provide valuable confirmation (or refutation) that a particular host is responsible for the spread of pathogens or vectors [33] .

Another key epidemiological use for ATs is in assessing the landscape determinants of disease spread. ATs make it possible to formally test previously held beliefs about the role of specific landscape features in governing the mobility of vectors, hosts, and pathogens. Just as valleys and ridges were found not to govern the movement of raccoon vectors of rabies [34], conventional wisdom on other landscape determinants of spread can give way to quantitative evidence from ATs. For this to happen, landscape factors must be rigorously characterized and included in the analysis. Simple Euclidean distance between populations has been shown to be inadequate for this purpose [3], [4], and thus alternative (non-Euclidean) distance measures that account for landscape complexity [1] must be employed, following the lead of the ecological sciences, where much has been learned using this approach [38], [39].
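One common non-Euclidean measure is least-cost distance across a resistance surface. The sketch below contrasts it with straight-line distance on a small grid using Dijkstra's algorithm; the grid, the resistance values, and the four-neighbor movement rule are hypothetical choices made for illustration, not a method used in the studies cited above.

```python
import heapq
import math


def least_cost_distance(cost, start, goal):
    """Dijkstra over a grid; entering a cell costs that cell's resistance.

    cost: 2-D list of positive resistances (None marks an impassable cell).
    Uses 4-neighbor moves and returns math.inf if the goal is unreachable.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return dist
        if dist > best.get((r, c), math.inf):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and cost[nr][nc] is not None:
                nd = dist + cost[nr][nc]
                if nd < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = nd
                    heapq.heappush(queue, (nd, (nr, nc)))
    return math.inf


# Hypothetical landscape: 1 = open habitat, 10 = a river that is costly to cross.
grid = [
    [1, 1, 10, 1, 1],
    [1, 1, 10, 1, 1],
    [1, 1, 1, 1, 1],  # a low-resistance crossing in the bottom row
]
site_a, site_b = (0, 0), (0, 4)
print("Euclidean distance:", math.dist(site_a, site_b))
print("Least-cost distance:", least_cost_distance(grid, site_a, site_b))
```

Here the two sites are 4 units apart in a straight line, but the cheapest route detours around the river and costs 8 units, so the two measures would rank population pairs differently.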

Diffusive processes are ubiquitous in infectious disease transmission [1] , and despite limited efforts to quantify these processes in the past, research interest is growing rapidly. The authors of this review are engaged in an application of ATs to Schistosoma japonicum , the parasite that causes schistosomiasis in East and Southeast Asia. This organism is subject to transport in the environment via multiple pathways [10] : parasites are carried in advective flows along canals and streams as both larvae and ova; within snail intermediate hosts, parasites are conveyed among and between aquatic and riparian habitats; and for adult worms, human and animal hosts serve as vehicles. ATs provide a powerful means to comprehensively assess the role of these diffusive processes in schistosome transmission, and when combined with landscape data, can offer insights into how anthropogenic change can modify diffusion parameters, thereby influencing transmission. High priority research questions can be addressed, such as which environmental pathways are most influential in maintaining parasite transmission in endemic areas, and which are efficient at spreading the parasite into new regions or among new vulnerable subpopulations?

ATs represent just one analytical avenue in a sophisticated suite of powerful genetic analysis tools available for such epidemiological applications, including other methods for inferring demographic parameters and for identifying genes or genomic regions involved in human diseases [24] , [40] . There is diversity even within the set of techniques for estimating migration, and thus, looking forward, comparisons among estimators will be increasingly important, both to validate methods for application to specific hypotheses and to establish confidence in estimates for a particular system.

Supporting Information

Figure S1. Probabilistic graphical model indicating the conditional dependencies (directed edges) in the Wilson and Rannala [21] method. Nodes represent observed (data; squares) and unobserved (parameters; circles) random variables. The observed variables are the vector of sampled source populations S and the matrix of multilocus genotypes of sampled specimens, X. Among the unobserved variables (parameters) are the quantities of interest in infectious disease systems, including the interpopulation migration rates in matrix m and the specific migrant ancestry of individuals in vector M.

https://doi.org/10.1371/journal.ppat.1002013.s001

Table S1. Data and parameters of the inference model implemented in Wilson and Rannala's [21] Bayesian assignment test.

https://doi.org/10.1371/journal.ppat.1002013.s002

Table S2. Descriptions of different types of genetic markers and the corresponding advantages and disadvantages when analyzed using assignment tests.

https://doi.org/10.1371/journal.ppat.1002013.s003

Text S1. Bayesian assignment tests.

https://doi.org/10.1371/journal.ppat.1002013.s004

Text S2. Genetic markers.

https://doi.org/10.1371/journal.ppat.1002013.s005

Acknowledgments

We are grateful for the assistance of Paul Brindley of George Washington University and Jessica McCoury at Emory University.

  • 15. Wright S (1969) Evolution and the genetics of populations: the theory of gene frequencies. Volume 2. Chicago: University of Chicago Press.
  • 16. Clobert J (2001) Dispersal. New York: Oxford University Press.
  • 36. Prothero RM (1965) Migrants and malaria. London: Longmans.
  • 40. Ziegler A, König I (2006) A statistical approach to genetic epidemiology: concepts and applications. Weinheim: Wiley-VCH. 335 p.
  • Research Article
  • Open access
  • Published: 18 May 2018

Using genomic relationship likelihood for parentage assignment

  • Kim E. Grashei,
  • Jørgen Ødegård &
  • Theo H. E. Meuwissen

Genetics Selection Evolution volume 50, Article number: 26 (2018)


Parentage assignment is usually based on a limited number of unlinked, independent genomic markers (microsatellites, low-density single nucleotide polymorphisms (SNPs), etc.). Classical methods for parentage assignment are exclusion-based (i.e. based on loci that violate Mendelian inheritance) or likelihood-based, assuming independent inheritance of loci. For true parent–offspring relations, genotyping errors cause apparent violations of Mendelian inheritance. Thus, the maximum proportion of such violations must be determined, which is complicated by variable call- and genotype error rates among loci and individuals. Recently, genotyping using high-density SNP chips has become available at lower cost and is increasingly used in genetics research and breeding programs. However, dense SNPs are not independently inherited, violating the assumptions of the likelihood-based methods. Hence, parentage assignment usually assumes a maximum proportion of exclusions, or applies likelihood-based methods on a smaller subset of independent markers. Our aim was to develop a fast and accurate trio parentage assignment method for dense SNP data without prior genotyping error- or call rate knowledge among loci and individuals. This genomic relationship likelihood (GRL) method infers parentage by using genomic relationships, which are typically used in genomic prediction models.

Using 50 simulated datasets with 53,427 to 55,517 SNPs, genotyping error rates of 1–3% and call rates of ~80 to 98%, GRL was found to be fast and highly (~99%) accurate for parentage assignment. An iterative approach was developed for training using the evaluation data, giving similar accuracy. For comparison, we used the Colony2 software, which assigns parentage and sibship simultaneously to increase the power of the likelihood-based method, and found that it had considerably lower accuracy than GRL. We also compared GRL with an exclusion-based method in which one of the parameters was estimated using GRL assignments. This method was slightly more accurate than GRL.

Conclusions

We show that GRL is a fast and accurate method of parentage assignment that can use dense, non-independent SNPs, with variable call rates and unknown genotyping error rates. By offering an alternative way of assigning parents, GRL is also suitable for estimating the expected proportion of inconsistent parent–offspring genotypes for exclusion-based models.

In the field of animal genetics, low-density single nucleotide polymorphisms (SNPs), microsatellites, and amplified fragment length polymorphisms (AFLP) have long been the preferred types of genomic data for parentage assignment due to their low cost [1, 2, 3]. In practice, the foundation of parentage assignment rests on exclusion- and likelihood-based methods [4]. Exclusion-based methods rely on their ability to exclude false parent–offspring combinations when the genotypes of the offspring and its candidate parents violate Mendel’s laws. These methods are often used due to their ease of interpretation, but the number of expected exclusions depends on allele frequencies in the population and on genotype call rates and error rates [5]. Exclusion-based methods also require more loci than likelihood-based methods since only genotypes with Mendelian inconsistencies are used [6]. Likelihood-based methods often calculate the likelihood ratio (LR) of the genotype of the offspring, which is the probability of the offspring’s genotype given the genotypes of the candidate parents, relative to the probability of observing the genotype in the population by chance. The LR statistic effectively gives more weight to rare alleles. Different loci are typically assumed independent, such that the total LR is the product of the LRs over all loci. Likelihood-based methods have higher power than exclusion-based methods, but their interpretation is more complicated. Both likelihood- and exclusion-based models usually assume known and homogeneous genotype error rates and independent loci, and do not account for variation in genotype call rates [5, 7, 8], all of which are important assumptions when working with high-density SNP data. For dense SNP chip data, the assumption of independent inheritance among loci is not realistic (i.e., alleles are inherited on large DNA segments), which may lead to inflated LR values when using conventional likelihood-based methods.
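The exclusion principle can be sketched for a single trio as follows. This is a minimal illustration, assuming biallelic SNP genotypes coded as allele counts 0, 1, or 2 and `None` for a missing call; the function names and the example genotypes are hypothetical, and no particular published exclusion method is reproduced here.

```python
# Allele counts each parental genotype can transmit in a gamete.
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def is_mendelian_consistent(offspring, parent1, parent2):
    """True if the offspring genotype can arise from the two parental genotypes."""
    return any(
        a + b == offspring for a in GAMETES[parent1] for b in GAMETES[parent2]
    )


def exclusion_proportion(offspring, parent1, parent2):
    """Proportion of jointly called loci that violate Mendelian inheritance.

    Each argument is a list of genotypes (0, 1, 2, or None for a missing call).
    """
    called = [
        (o, p, q)
        for o, p, q in zip(offspring, parent1, parent2)
        if None not in (o, p, q)
    ]
    if not called:
        raise ValueError("no locus is called in all three individuals")
    bad = sum(1 for o, p, q in called if not is_mendelian_consistent(o, p, q))
    return bad / len(called)


# Hypothetical genotypes at six loci.
offspring = [0, 1, 2, 1, None, 2]
parent1 = [0, 2, 1, 0, 1, 0]
parent2 = [1, 0, 2, 0, 2, 2]
print("Proportion of inconsistent loci:", exclusion_proportion(offspring, parent1, parent2))
```

Because genotyping errors also produce inconsistencies in true trios, a threshold on this proportion has to be chosen, which is the difficulty with variable error and call rates noted above.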

Parentage can also be assigned and tested by using realized genomic relationships. The interrelationship between parents governs the expected inbreeding in offspring, as well as parent–offspring relationships. Realized genomic relationships assess the average genomic similarity across loci and do not assume independence of the loci. Increasing the number of markers in the calculations increases the precision of the genomic relationships. Our aim was to study whether genomic relationships can be used to perform computationally fast and accurate parentage testing with high-density SNP data.

Residual genomic relationships

Estimates of genomic relationships require large numbers of loci [ 9 ], and their expectation is proportional to the genetic covariance between individuals. The proposed method for parentage testing is developed for trio parentage testing, i.e. using a single offspring and two parental candidates. The method uses genomic relationships estimated by VanRaden’s first method [ 10 ], in which the genomic relationship between two individuals is calculated as follows:

\[ r_{ij} = \frac{\sum_{t = 1}^{c} \left( m_{it} - 2p_{t} \right)\left( m_{jt} - 2p_{t} \right)}{2\sum_{t = 1}^{c} p_{t} \left( 1 - p_{t} \right)}, \quad (1) \]

where \(r_{ij}\) is the genomic relationship between individuals \(i\) and \(j\) , \(m_{it}\) and \(m_{jt}\) are the genotypes (coded 0, 1 or 2 for the alternative homozygous, the heterozygous, and the homozygous reference genotypes, respectively) for individuals \(i\) and \(j\) at locus \(t\) , \(p_{t}\) is the allele frequency in the population at locus \(t\) , and \(c\) is the number of loci (i.e. SNPs). Genomic relationships can be calculated even for extremely dense genomic data (even up to full sequence), and do not assume independence of the loci. Figure  1 shows the relationships in a trio consisting of an offspring and two (candidate) parents.
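As a minimal sketch of how such a relationship can be computed, the Python function below follows VanRaden's first method as it is commonly stated: cross-products of genotypes centred on \(2p_{t}\), scaled by \(2\sum_{t} p_{t}(1 - p_{t})\). The function name, the toy genotypes and the assumption that \(p_{t}\) is the frequency of the allele counted by the 0/1/2 coding are illustrative choices, and missing calls are not handled here.

```python
from typing import Sequence


def genomic_relationship(m_i: Sequence[int], m_j: Sequence[int],
                         p: Sequence[float]) -> float:
    """Relationship between two individuals from 0/1/2 genotype codes.

    m_i, m_j: genotypes at each locus (assumed complete, no missing calls).
    p: population frequency of the counted allele at each locus (assumption).
    """
    if not (len(m_i) == len(m_j) == len(p)):
        raise ValueError("all inputs must cover the same loci")
    # Numerator: cross-products of genotypes centred on their expectation 2p.
    num = sum((a - 2 * q) * (b - 2 * q) for a, b, q in zip(m_i, m_j, p))
    # Denominator: expected genotype variance summed over loci.
    den = 2 * sum(q * (1 - q) for q in p)
    return num / den


# Toy data: four loci, all with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
ind_a = [2, 0, 2, 0]
ind_b = [0, 2, 0, 2]
r_aa = genomic_relationship(ind_a, ind_a, freqs)  # self-relationship
r_ab = genomic_relationship(ind_a, ind_b, freqs)  # opposite homozygotes
```

With only four loci the values are extreme (a fully homozygous individual has a self-relationship well above 1); with tens of thousands of SNPs the estimates settle near their pedigree expectations.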

Fig. 1 A trio of offspring (O), first parent (P1) and second parent (P2). The variables near the arrows indicate genetic relationships between individuals, while the variables over P1 and P2, and below O, are the individuals’ genetic relationships to themselves, respectively. Sexes are included in the figure but are not used by the GRL method

We used Eq. ( 1 ) to estimate the genomic interrelationships between parents and offspring, i.e., the relationship of the offspring with itself ( \(r_{O,O}\) ), relationships of the two parent candidates with themselves ( \(r_{{P_{1} ,P_{1} }}\) and \(r_{{P_{2} ,P_{2} }}\) ), relationships of the offspring with both parent candidates ( \(r_{{O,P_{1} }}\) and \(r_{{O,P_{2} }}\) ), and relationships between the parent candidates ( \(r_{{P_{1} ,P_{2} }}\) ), see Fig.  1 .

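Expected genomic relationships of an offspring with its true parents (TP) are [ 11 ]:

\[ E\left( r_{O,P1} \mid TP \right) = 0.5\left( r_{P1,P1} + r_{P1,P2} \right), \qquad E\left( r_{O,P2} \mid TP \right) = 0.5\left( r_{P2,P2} + r_{P1,P2} \right). \]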

In other words, the relationship of an offspring with a parent is the average of the genomic relationship of the parent with itself and the relationship between the two parents. The expected relationship of the offspring with itself is [ 12 ]:

\[ E\left( r_{O,O} \mid TP \right) = 1 + 0.5r_{P1,P2} , \]

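where \(0.5r_{P1,P2}\) is the expected inbreeding coefficient of the offspring. Three residual relationships are defined as differences between actual and expected genomic relationships:

\[ \begin{aligned} e_{O,P1} &= r_{O,P1} - 0.5\left( r_{P1,P1} + r_{P1,P2} \right), \\ e_{O,P2} &= r_{O,P2} - 0.5\left( r_{P2,P2} + r_{P1,P2} \right), \\ e_{O,O} &= r_{O,O} - \left( 1 + 0.5r_{P1,P2} \right). \end{aligned} \]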

Inbreeding is accounted for when using the above residuals, as well as the direction of the relationships. For example, using the offspring as a candidate parent, and/or using a true parent as the offspring, will result in large residuals, i.e., realized relationships that deviate substantially from the expectations of a true parent–offspring trio.
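The residuals described above can be sketched in a few lines of Python. The function name, the dictionary keys and the idealized relationship values (1 for self, 0.5 for parent–offspring, 0 between unrelated parents) are assumptions made for illustration only; real genomic relationships scatter around these values.

```python
from typing import Dict, Tuple


def trio_residuals(r: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (e_O_P1, e_O_P2, e_O_O) for one candidate trio.

    r holds the six realized relationships of the trio, keyed
    'O,O', 'P1,P1', 'P2,P2', 'O,P1', 'O,P2' and 'P1,P2'.
    """
    # Expected offspring-parent relationship: average of the parent's
    # self-relationship and the relationship between the two parents.
    exp_o_p1 = 0.5 * (r["P1,P1"] + r["P1,P2"])
    exp_o_p2 = 0.5 * (r["P2,P2"] + r["P1,P2"])
    # Expected self-relationship: 1 plus the expected inbreeding coefficient.
    exp_o_o = 1.0 + 0.5 * r["P1,P2"]
    return (r["O,P1"] - exp_o_p1, r["O,P2"] - exp_o_p2, r["O,O"] - exp_o_o)


# Idealized true trio with unrelated, non-inbred parents: residuals are zero.
ideal = {"O,O": 1.0, "P1,P1": 1.0, "P2,P2": 1.0,
         "O,P1": 0.5, "O,P2": 0.5, "P1,P2": 0.0}

# Same three animals, but the true offspring is placed in the P1 slot and
# its true parent in the offspring slot: all residuals become clearly negative.
swapped = {"O,O": 1.0, "P1,P1": 1.0, "P2,P2": 1.0,
           "O,P1": 0.5, "O,P2": 0.0, "P1,P2": 0.5}

res_ideal = trio_residuals(ideal)
res_swapped = trio_residuals(swapped)
```

The swapped example illustrates the direction sensitivity mentioned in the text: the pairwise relationships are unchanged, yet the trio no longer matches the expectations of a true parent–offspring configuration.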

Genomic relationship likelihood (GRL)

The above residual relationships are used to calculate a genomic relationship log-likelihood using a multivariate normal density function, assuming:

\[ {\mathbf{e}} \sim N\left( {{\varvec{\upmu}},{\varvec{\Sigma}}} \right), \]

where \({\mathbf{e}} = \left[ {\begin{array}{*{20}c} {e_{O,P1} } \\ {e_{O,P2} } \\ {e_{O,O} } \\ \end{array} } \right]\) is the vector of residual relationships and \({\varvec{\upmu}} = \left[ {\begin{array}{*{20}c} {\mu_{1} } \\ {\mu_{2} } \\ {\mu_{3} } \\ \end{array} } \right]\) is a vector of the overall means for the residuals for true parent–offspring trios. In the absence of genotyping errors, the residuals are expected to be approximately normally distributed around zero ( \({\mathbf{e}}\sim N\left( {0,{\varvec{\Sigma}}} \right)\) ; see Additional file 1 : Figure S1). The central limit theorem states that the sum of many independently and identically distributed variates will be approximately normally distributed. The variates in Eq. ( 1 ) may be considered as originating from a common (albeit unknown) distribution, but not all are independent (i.e., the effective number of loci is lower than the actual number of loci). Still, given a substantial number of loci distributed over the entire genome (i.e., most of the loci are indeed independent), genomic relationships (summed over all variates) are likely to approach a normal distribution (see [ 13 ], Theorem 27.4). Plotting the residual relationships for true parent–offspring trios revealed that they were approximately normally distributed [see Additional file 1 : Figures S1, S2 and S3].

Since genotyping errors can occur in real data (and the expected residual relationship may thus deviate from 0), parameters of the distribution of residual relationships were estimated using an iterative method (see Section “Estimation of model parameters” below). Matrix \({\varvec{\Sigma}}\) is the 3 × 3 (co)variance matrix of the three residual variates in true parent–offspring trios and was also estimated using the iterative method. The genomic relationship likelihood (GRL) was defined as:

\[ {\text{GRL}} = - 0.5\left( {{\mathbf{e}} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {{\mathbf{e}} - {\varvec{\upmu}}} \right), \]

which is proportional to the natural logarithm of a multivariate normal density function. Based on (iteratively assigned) parent–offspring trios, a threshold for acceptable GRL values can be defined. In this study, we assumed that a parent–offspring trio had to have a GRL value that was within the highest 99% of the known parent–offspring GRL values, thus accepting a false negative rate of 1%.
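A minimal Python sketch of such a score is given below. It assumes that GRL is the quadratic-form part of the multivariate normal log density, \(-0.5({\mathbf{e}} - {\varvec{\upmu}})^{\prime}{\varvec{\Sigma}}^{-1}({\mathbf{e}} - {\varvec{\upmu}})\), with the normalising constant dropped; that exact form, the function names and the toy parameter values are assumptions of this example rather than details confirmed by the text. The linear system is solved directly instead of inverting \({\varvec{\Sigma}}\).

```python
from typing import List, Sequence


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def grl(e: Sequence[float], mu: Sequence[float],
        sigma: Sequence[Sequence[float]]) -> float:
    """Quadratic-form log-likelihood score (assumed form, constant dropped)."""
    d = [ei - mi for ei, mi in zip(e, mu)]
    x = solve_linear(sigma, d)  # x = Sigma^{-1} (e - mu)
    return -0.5 * sum(di * xi for di, xi in zip(d, x))


# Toy parameters: zero means and a diagonal covariance matrix.
mu_toy = [0.0, 0.0, 0.0]
sigma_toy = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.04]]
score_perfect = grl([0.0, 0.0, 0.0], mu_toy, sigma_toy)
score_off = grl([0.1, 0.0, -0.2], mu_toy, sigma_toy)
```

Under this form the score is at most zero and becomes more negative as the residual vector moves away from \({\varvec{\upmu}}\), measured in units set by \({\varvec{\Sigma}}\).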

Difference between the top two trios ( \({\mathbf{\Delta GRL}}\) )

To reduce the false positive rate and increase the true negative rate, the value of \(\Delta {\text{GRL}}\) was also assessed based on:

\[ \Delta {\text{GRL}} = {\text{GRL}}_{1} - {\text{GRL}}_{2} , \]

where \({\text{GRL}}_{1}\) \(\left( {{\text{GRL}}_{2} } \right)\) is the (second) highest GRL value achieved for an offspring across all candidate parent–offspring trios. This is analogous to the Δ statistic used in Marshall et al. [ 7 ], with more details in Appendix 1 .

In datasets where both parents of an offspring are present and no other relatives are available, \(\Delta {\text{GRL}}\) will typically be very high, since no other realistic trio exists. When other close relatives of the offspring are included among the candidate parents, \(\Delta {\text{GRL}}\) may be lower due to the potential existence of multiple “likely” false parent candidates, e.g. uncles, aunts, grandparents, siblings or descendants of the offspring. High relatedness to the offspring alone is not sufficient to obtain a high value for \({\text{GRL}}_{2}\) since the method accounts for interrelationships of the whole trio. For example, if the parent candidates consist of one true parent and one full-sib of the offspring, interrelationships of the trio will typically be inconsistent because of the high relationship between the two parental candidates, although the relationships of the offspring with itself and with the parent candidates may be “normal” (these should be elevated if the relationship among the two parent candidates is high). In cases where a parent is missing but many other close relatives of the offspring are present, \({\text{GRL}}_{1}\) can, in rare cases, exceed the threshold for \({\text{GRL}}_{1}\) -values, but then \(\Delta {\text{GRL}}\) will typically be low, since multiple highly-related candidate parents are present. Thus, thresholds for assignment must be set for both \({\text{GRL}}_{1}\) and \(\Delta {\text{GRL}}\) .
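The two-threshold decision described above can be sketched as follows. The function name, the data layout and the treatment of an offspring with only one candidate trio (where \(\Delta {\text{GRL}}\) is taken as infinite) are assumptions of this example; the default of 6.9 is the \(\Delta {\text{GRL}}\) threshold reported in this study.

```python
import math
from typing import List, Optional, Tuple

Trio = Tuple[str, str]  # identifiers of the two candidate parents


def assign_parents(scored: List[Tuple[Trio, float]],
                   grl_threshold: float,
                   delta_threshold: float = 6.9) -> Optional[Trio]:
    """Return the best parent pair, or None when either threshold fails.

    scored: (candidate pair, GRL value) for every trio tested for one offspring.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_trio, grl_1 = ranked[0]
    # With a single candidate trio there is no runner-up (assumption: delta = inf).
    grl_2 = ranked[1][1] if len(ranked) > 1 else -math.inf
    delta = grl_1 - grl_2
    if grl_1 >= grl_threshold and delta >= delta_threshold:
        return best_trio
    return None


# Clear winner: high GRL_1 and a large gap to the second-best trio.
clear = [(("sire1", "dam1"), -1.2), (("sire1", "dam2"), -40.0),
         (("sire2", "dam1"), -55.0)]
# Ambiguous: two trios with similar GRL values, e.g. close relatives present.
ambiguous = [(("sire1", "dam1"), -1.0), (("sire1", "dam3"), -3.0)]

result_clear = assign_parents(clear, grl_threshold=-8.0)
result_ambiguous = assign_parents(ambiguous, grl_threshold=-8.0)
```

The GRL threshold of -8.0 is purely illustrative; in the study it is derived from the distribution of GRL values among true trios.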

Estimation of model parameters

Estimation of the GRL-parameters, i.e. \({\varvec{\upmu}}\) , \({\varvec{\Sigma}}\) and the GRL threshold, is undertaken with an iterative method which is briefly described below. The \(\Delta {\text{GRL}}\) threshold was set to 6.9, which implies that the best parent pair should be at least about 1000 ( \(\approx e^{6.9}\) ) times more likely than the second-best parent pair. See Section 2 in Additional file 2 for more details.

Step 1: allele dropping

Random matings between individuals from the dataset are performed in silico to produce simulated offspring. For simplicity, all loci are assumed to be inherited independently. The simulated trios are then used to obtain initial estimates of the GRL parameters. A smaller subset of the loci may be used in this step.
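Allele dropping under independent inheritance can be sketched as below: each parent passes on one allele per locus, and a heterozygous parent passes on either allele with probability 0.5. The function name and the toy genotypes are hypothetical, and missing genotypes are not handled.

```python
import random
from typing import List, Sequence


def drop_alleles(parent_1: Sequence[int], parent_2: Sequence[int],
                 rng: random.Random) -> List[int]:
    """Simulate one offspring genotype vector (0/1/2 coding) from two parents.

    Loci are treated as independent, as in the allele-dropping step.
    """
    offspring = []
    for g1, g2 in zip(parent_1, parent_2):
        # A parent with genotype g transmits the counted allele with probability g/2.
        a1 = 1 if rng.random() < g1 / 2 else 0
        a2 = 1 if rng.random() < g2 / 2 else 0
        offspring.append(a1 + a2)
    return offspring


sire = [2, 2, 0, 0, 1]
dam = [0, 2, 0, 2, 1]
child = drop_alleles(sire, dam, random.Random(1))
# The first four loci are fully determined by homozygous parents;
# the last locus (both parents heterozygous) can be 0, 1 or 2.
```

Because linkage is ignored, simulated trios are somewhat idealized compared with real data, which is consistent with the note in the text that a smaller subset of loci may be used in this step.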

Step 2: assignment iteration

Trios are initially assigned using the GRL method based on the parameters estimated in Step 1. The method relies on the presence of true trios (albeit unknown) in the data. Parameters \({\varvec{\upmu}}\) and \({\varvec{\Sigma}}\) are then re-estimated using the newly assigned trios from evaluation data, and then used as the basis of the next assignment iteration. Iteration stops when the number of assignments is smaller than in the previous iteration. Thus, the GRL training procedure iteratively assigns trios while (re-)estimating the GRL-parameters until no more trios can be assigned. See Section 1 in Additional file 2 for more information about the training procedure. The parameter estimates obtained in the second-to-last iteration are considered optimal. To limit the number of plausible trios to test, only individuals with a relationship larger than 0.25 with an offspring were considered as potential parents, i.e. \(r_{O,P1} > 0.25\) and \(r_{O,P2} > 0.25\) . The GRL threshold is not re-estimated in this step.
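The re-estimation of \({\varvec{\upmu}}\) and \({\varvec{\Sigma}}\) within an iteration can be pictured as taking the sample mean and sample covariance of the residual vectors of the currently assigned trios. The sketch below assumes the unbiased (n − 1) covariance estimator; the text does not state which estimator was used, and the function name and toy residuals are hypothetical.

```python
from typing import List, Sequence, Tuple


def estimate_parameters(
    residuals: Sequence[Sequence[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Sample mean vector and covariance matrix of assigned-trio residuals."""
    n = len(residuals)
    if n < 2:
        raise ValueError("at least two assigned trios are needed")
    k = len(residuals[0])
    mu = [sum(e[i] for e in residuals) / n for i in range(k)]
    sigma = [
        [
            sum((e[i] - mu[i]) * (e[j] - mu[j]) for e in residuals) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]
    return mu, sigma


# Residual vectors (e_O_P1, e_O_P2, e_O_O) of three assigned trios (toy values).
assigned = [(0.0, 0.0, 0.0), (0.2, 0.0, -0.2), (0.1, 0.3, -0.1)]
mu_hat, sigma_hat = estimate_parameters(assigned)
```

In practice many more trios are needed for a stable 3 × 3 covariance estimate; three vectors are used here only to keep the arithmetic easy to check by hand.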

When pre-defined parameter estimates are used, the assignment process starts without estimating parameters. This is equivalent to running only the second-to-last iteration of Step 2.

Simulation study

A simulation study was conducted to investigate the strengths and weaknesses of the GRL method. QMSim [ 14 ] was used to produce simulated datasets. The initial size of the historical population was set to 500 and remained constant for 5000 generations to achieve mutation/drift equilibrium. In generation 5001, the population size was reduced to 300, of which 100 were males and 200 were females. Twenty chromosomes were simulated, each 1 Morgan long, and the number of SNPs was set such that approximately 54,000 SNPs (53,427 to 55,517) with a minor allele frequency higher than 0.05 existed in the population. The SNP mutation rate was set to 0.00003, assuming a recurrent mutation model (i.e. only two possible alleles exist). After the historical population, a recent population was simulated over five generations, with 1000 individuals per generation (5000 individuals in total). These were produced by random mating of 100 sires and 200 dams per generation, with one sire mated with two dams and each mating resulting in five recorded offspring. Of these, the last two generations were used in the parentage assignment tests. Fifty repetitions of the QMSim simulations were performed to produce 50 datasets. Genotype errors (1 and 3%) and call rates (80–100%) were added using a custom script written in the Python programming language, allowing both erroneous and missing genotypes among individuals; see Section 2 in Additional file 2 for more information.
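The error model actually used is described in Additional file 2 and is not reproduced here. Purely as an illustration of what adding genotype errors and missing calls can look like, the sketch below assumes that calls are dropped independently per locus and that a miscalled genotype is replaced by one of the two other genotype codes with equal probability. The function name and both assumptions are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def add_genotype_noise(genotypes: Sequence[int], error_rate: float,
                       call_rate: float,
                       rng: random.Random) -> List[Optional[int]]:
    """Return a copy of the genotypes with missing calls (None) and errors.

    Assumed model: each locus is missing with probability 1 - call_rate;
    a called locus is wrong with probability error_rate.
    """
    noisy: List[Optional[int]] = []
    for g in genotypes:
        if rng.random() >= call_rate:
            noisy.append(None)  # genotype not called
        elif rng.random() < error_rate:
            # Replace by one of the two other genotype codes.
            noisy.append(rng.choice([x for x in (0, 1, 2) if x != g]))
        else:
            noisy.append(g)
    return noisy


true_genotypes = [0, 1, 2, 1, 0, 2, 2, 1]
rng = random.Random(42)
clean = add_genotype_noise(true_genotypes, error_rate=0.0, call_rate=1.0, rng=rng)
all_wrong = add_genotype_noise(true_genotypes, error_rate=1.0, call_rate=1.0, rng=rng)
all_missing = add_genotype_noise(true_genotypes, error_rate=0.0, call_rate=0.0, rng=rng)
```

To mimic call rates that vary between individuals, `call_rate` could be drawn per individual from the 80 to 100% range before calling the function.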

The GRL method was implemented in C++ with an emphasis on parallel processing. The program was run in a Linux cluster environment using multiple CPUs. Tests were run using the training procedure on all (evaluation) datasets. In addition, pre-estimated parameters were obtained from some of the runs with training. The datasets were not divided into offspring and parents, and thus all true offspring and parents had the potential to be assigned parents both correctly (offspring only) and incorrectly (parents and offspring).

There are three possible outcomes of the assignment process: (1) ‘Correct’, meaning correct assignment of true parents to the unknown offspring (parents must be present), (2) ‘Incorrect’, meaning wrong candidate parents were assigned and (3) ‘No-assign’, meaning no assignment was made. These were quantified for each analysis.
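The three outcome categories can be expressed as a small classification function. The sketch below is illustrative only: the function name is hypothetical, and parent pairs are compared as unordered sets because the GRL method does not use sex information.

```python
from typing import Optional, Tuple

Pair = Tuple[str, str]


def classify_outcome(assigned: Optional[Pair],
                     true_parents: Optional[Pair]) -> str:
    """Classify one assignment as 'Correct', 'Incorrect' or 'No-assign'.

    assigned: parent pair returned by the method, or None if nothing was assigned.
    true_parents: the true pair if both parents are in the dataset, else None.
    """
    if assigned is None:
        return "No-assign"
    if true_parents is not None and set(assigned) == set(true_parents):
        return "Correct"
    # Any assignment that is not the true pair counts as incorrect, including
    # every assignment made for an individual whose parents are absent.
    return "Incorrect"


outcome_a = classify_outcome(("dam7", "sire3"), ("sire3", "dam7"))
outcome_b = classify_outcome(("sire3", "dam9"), ("sire3", "dam7"))
outcome_c = classify_outcome(None, ("sire3", "dam7"))
outcome_d = classify_outcome(("sire3", "dam9"), None)
```

For individuals without parents in the dataset, 'No-assign' is the desired outcome, which is why the results report it separately for the two groups.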

Comparison with a conventional likelihood-based method

To compare GRL with other methods, we analyzed five of the simulated datasets, arbitrarily chosen from all 50 datasets, by using the Colony2 software V2.0.6.3 [ 15 ]. Colony2 uses a likelihood-based method that jointly assigns both sibship and parentage based on a simulated annealing process [ 16 , 17 ]. This increases the assignment power compared to methods that use a single unknown individual (the offspring) and one or two candidate parents. Colony2 was run using a 1% genotype error (true and assumed). In addition, the following settings were chosen: (1) do not update allele frequencies, (2) assume no inbreeding, (3) no sibship scaling, (4) no sibship prior, (5) short run length, (6) use the pairwise likelihood score (PLS) and (7) allelic dropout rate set to zero for all markers. The ‘ParentPairs’-file produced by Colony2 was used to check accuracy of assignments. Any assignments for which mother, father or both were missing, or for which the assignment probability reported by Colony2 was less than 0.5, were categorized as a “No-assign”. Suggested parent pairs with at least one incorrect parent were categorized as “Incorrect” assignments and pairs with both parent candidates correct were categorized as “Correct” assignments.

Comparison with an exclusion-based method: the binomial exclusion method

We developed an exclusion-based method in which one of the parameters was estimated using GRL-assigned trios using custom scripts written in the R programming language. Exclusion ratios (ER) for the GRL-assigned trios were calculated as the ratio of the number of exclusions for a trio and the number of loci for which all three individuals in the trio had called genotypes. We used a binomial distribution as a basis for the new assignments, i.e. \(E\sim Bin\left( {n, p} \right)\) , where \(E\) is the number of trio exclusions, \(n\) (number of trials) is the number of calls for the trio, and \(p\) (success probability) is the median ER from the GRL assigned trios.

To limit the number of trios for binomial exclusion assignment, we used the same parent–offspring genomic relationship threshold that we used for the GRL assignments, i.e. \(r_{O,P1} > 0.25\) and \(r_{O,P2} > 0.25\) . Assignment was done in a similar manner as with GRL, using both a confidence cutoff and a \(\Delta\) -score. For more information, see Section 3 in Additional file 2. We refer to this method as the binomial exclusion method (BEM) in the text.
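The building blocks of the binomial exclusion method can be sketched in Python (the authors used R scripts, which are not shown in the text). The code below counts trio exclusions over loci where all three genotypes are called and evaluates a binomial upper-tail probability for \(E\sim Bin(n, p)\). How the confidence cutoff and \(\Delta\)-score are derived from such probabilities is described in Additional file 2 and is not reproduced here; all function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

# Alleles (0 or 1 copies of the counted allele) each genotype can transmit.
TRANSMITTABLE = {0: (0,), 1: (0, 1), 2: (1,)}


def is_exclusion(o: int, p1: int, p2: int) -> bool:
    """True if the offspring genotype cannot arise from the two parents."""
    possible = {a + b for a in TRANSMITTABLE[p1] for b in TRANSMITTABLE[p2]}
    return o not in possible


def exclusion_counts(off: Sequence[Optional[int]],
                     par1: Sequence[Optional[int]],
                     par2: Sequence[Optional[int]]) -> Tuple[int, int]:
    """Return (number of exclusions, number of loci called in all three)."""
    exclusions = calls = 0
    for o, a, b in zip(off, par1, par2):
        if o is None or a is None or b is None:
            continue  # skip loci with any missing call
        calls += 1
        exclusions += int(is_exclusion(o, a, b))
    return exclusions, calls


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(E >= k) for E ~ Bin(n, p), with 0 < p < 1, computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


offspring = [1, 2, 0, None, 0]
parent_1 = [2, 2, 0, 1, 0]
parent_2 = [0, 0, 2, 1, 0]
n_excl, n_calls = exclusion_counts(offspring, parent_1, parent_2)
exclusion_ratio = n_excl / n_calls
```

Working in log space keeps the binomial terms finite for the tens of thousands of loci on a dense SNP chip, where direct evaluation of binomial coefficients would overflow floating-point arithmetic.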

Assignment results using Colony2 are shown in Fig.  2 , and the analogous GRL- and BEM results are shown in Figs.  3 and 4 . The most noticeable differences in results between GRL- and BEM are shown in Figs.  5 and 6 . Here, both methods used training estimates from a dataset with a 3% genotype error, while the true error was 1%. Results that were similar between GRL and BEM are shown in Figures S4, S5, S6, S7, S8, S9, S10 and S11 [see Additional file 3 : Figures S4, S5, S6, S7, S8, S9, S10 and S11]. In Figures S4 (GRL) and S5 (BEM), parameters were pre-estimated at a 3% genotype error (true and assumed). Figures S6 (GRL) and S7 (BEM) show the results for a true error of 3% and an assumed error of 1%. Figures S8 (GRL) and S9 (BEM) show the results for training with a 1% error rate, and Figures S10 (GRL) and S11 (BEM) for training with a 3% error rate. Total results over all datasets are shown in Table S1 [see Additional file 4 : Table S1].

Fig. 2 Assignment results from Colony2 for individuals with (left panel) and without (right panel) available parents in the dataset. Results from five simulated datasets are averaged. The true and assumed genotype error rate was 1% for all datasets

Fig. 3 Assignment results using GRL at a 1% genotype error rate (true and assumed) for individuals with (left panel) and without (right panel) available parents in the dataset. Results from 50 simulated datasets are averaged. Parameters were pre-estimated using one arbitrarily chosen dataset with a 1% genotype error

Fig. 4 Assignment results using BEM at a 1% genotype error rate (true and assumed) for individuals with (left panel) and without (right panel) available parents in the dataset. Results from 50 simulated datasets are averaged. Parameters were pre-estimated using the GRL-assignments from one arbitrarily chosen dataset with a 1% genotype error

Fig. 5 Assignment results using GRL for a 1% true genotyping error rate but using parameter estimates from a dataset with 3% genotype errors. Individuals with (left panel) and without (right panel) available parents are present in the dataset. Results from 50 simulated datasets are averaged

Fig. 6 Assignment results using BEM for a 1% true genotyping error rate but using parameter estimates using GRL-assignments from a dataset with 3% genotype errors. Individuals with (left panel) and without (right panel) available parents are present in the dataset. Results from 50 simulated datasets are averaged

The Colony2 software was tested using a 1% genotype error rate (true and assumed). When parents were available, Colony2 had a correct assignment rate of 22.4%, a no-assign rate of 75.4% and an incorrect assignment rate of 2.2%. For individuals without parents, the incorrect assignment rate climbed to 14.7% and the (correct) no-assign rate was 85.3% (see Fig.  2 ).

Figures  3 and 4 show the comparison between GRL and BEM when parameter estimates from an arbitrarily chosen dataset were used. When parents were available in the dataset and the genotype error rate (true and assumed) was 1%, using GRL resulted in 99.5% of the individuals being correctly assigned both parents (Fig.  3 ), while 99.9% were assigned correctly with the (GRL-trained) BEM (Fig.  4 ). In both cases, no individuals with parents in the dataset were assigned incorrect parent pairs. When parents were not available, the incorrect assignment rate for GRL climbed to 0.01% for both 1% and 3% genotype error rates (Fig.  3 and Additional file 3 : Figure S4).

The most notable difference in results between GRL and BEM was found for a true genotype error rate of 1% when parameter estimates were from a dataset with a 3% error rate (Figs.  5 and 6 ). Here, GRL did not assign any trios. However, BEM assigned all trios correctly when parents were available, but incorrectly assigned 1.0% of the trios when parents were not available. When the true and assumed genotype error rates were reversed (i.e. a true error rate of 3% and an incorrectly assumed error rate of 1%), neither method correctly assigned any trios, while the GRL method incorrectly assigned 0.02% of the trios, both when parents were available and when they were missing [see Additional file 3 : Figures S6 and S7] and [see Additional file 4 : Table S1].

An alternative to assuming a set of predefined parameters is to estimate these by using the evaluation data directly. Averaged results for each dataset are shown in Figures S8 and S9 [see Additional file 3 : Figures S8 and S9] (1% genotype error) and in Figures S10 and S11 [see Additional file 3 : Figures S10 and S11] (3% genotype error). These results are very similar to the results shown in Figs.  3 and 4 (1% true and assumed error rates), and Figures S4 and S5 [see Additional file 3 : Figures S4 and S5] (3% true and assumed error rates).

Parentage assignment is mostly performed using likelihood-based models with microsatellites [ 2 , 7 ], low-density SNPs [ 1 ] or exclusion-based models [ 18 ]. However, assignment methods often impose idealized assumptions, such as known age, generation and gender of all individuals, a limited number of known parental candidates, independent markers, little or no inbreeding, no stratification of the population or sample, no biased sampling of individuals, Hardy–Weinberg equilibrium (HWE) and little or no variation in genotype error or call rates within and between samples. For GRL and BEM, we performed assignments with unknown age, generation and gender, with no assumption as to independence of markers, HWE, inbreeding, family size or family composition, and with dense (SNP) markers, closely related individuals and varying genotype error and call rate. Colony2 assumes HWE, independent markers and no inbreeding.

Residual relationships were approximately normally distributed even when genotype errors were present [see Additional file 1 : Figures S2 and S3], but with different expectations compared to genomic data without genotype errors [see Additional file 1 : Figure S1].

It did not appear to be a problem that the parent and offspring generations were unknown when using GRL and BEM. High accuracies were achieved, although individuals had numerous close relatives that were eligible as parent candidates, such as the true parents, full- and half-sibs, own offspring, uncles/aunts and nieces/nephews. Similar results were obtained when the genotype error was increased to 3%, which was used to show that the GRL and BEM work even when the genotype error rate has quite extreme values. These properties may be useful for populations with large sibling groups, such as in fish, poultry or pigs, when generations cannot be clearly differentiated, or when the genotype error or call rates vary a lot.

A strength of the GRL training procedure is that no reference dataset with known pedigree is required for training and that the training is only partly done by simulation (allele-dropping). As long as there is a sufficient number of true (but unknown) trios present for assignment, the training can proceed. The method requires a pre-defined \(\Delta {\text{GRL}}\) threshold (i.e. the minimum acceptable value). The \(\Delta {\text{GRL}}\) is (the log of) the odds for correct assignment, given that the correct trio is among the two best trios (this is nearly always the case if true parents are present). In this study, the threshold was set to 6.9, i.e., the best trio should be at least \(e^{6.9} \approx 1000\) times more likely than the second-best trio. Relaxing this assumption will increase both the true and false positive assignment rates of the model, while setting a stricter threshold will have the opposite effect.
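As a worked illustration of this log-odds interpretation (assuming that the correct trio is one of the two best trios and that both are equally probable a priori), a given \(\Delta {\text{GRL}}\) converts to a probability as follows:

\[ \frac{P\left( {\text{trio 1 correct}} \right)}{P\left( {\text{trio 2 correct}} \right)} = e^{\Delta {\text{GRL}}} , \qquad P\left( {\text{trio 1 correct}} \right) = \frac{e^{\Delta {\text{GRL}}}}{1 + e^{\Delta {\text{GRL}}}} . \]

At the threshold of 6.9, \(e^{6.9} \approx 992\), so under these assumptions the best trio would be the correct one with a probability of roughly \(992/993 \approx 0.999\). A threshold of 4.6 ( \(e^{4.6} \approx 99\) ) would correspond to a probability of about 0.99.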

In some cases, the iterative training method may fail because the initial iteration results in no assignments. This may be caused by two factors: (1) the number of loci used in the allele-dropping simulation step may be set too high (giving too idealized parent–offspring relationships compared with evaluation data), or (2) there are no true trios present in the evaluation dataset. If reducing the number of SNPs used in the allele-dropping step does not start the iteration process, the latter may be the case. During training, there is no need to estimate or assume a genotype error rate with the GRL method, as long as the training procedure is done using the evaluation dataset.

Exclusion using parent–offspring duos (i.e. offspring and a single candidate parent) or trios is a relatively simple method for parentage assignment, by identifying incorrect parents by genotypes that violate the laws of Mendelian inheritance (“exclusion genotypes”). The GRL method is a fundamentally different approach and can be used to estimate exclusion-based parameters in true parent–offspring trios (assigned by GRL). Assignment of a single parent to an offspring is also possible using a similar method as for trios, but this was not explored in this study. The training-based GRL has the advantage that it requires no prior assumption with respect to genotype error rate or expected number of exclusions.

Binomial exclusion method

Estimation of the p -parameter for the BEM was done using trios that were assigned using GRL. An alternative to using GRL-assignments is using a training dataset with genotyped trios and known pedigree. Such a training dataset would need to have a similar genotype error rate as the evaluation dataset since having a discrepancy between the true and assumed genotype error rate could lead to decreased accuracy [see Additional file 4 : Table S1]. Since pedigree information is not always reliable, we prefer to use GRL assignments (preferably using a relatively big dataset) for parameter estimation.

Comparing GRL and the binomial exclusion method with Colony2

The GRL and BEM resulted in much more accurate assignments of parents than Colony2. Parameters for Colony2 were chosen to minimize running time, so assignment accuracy may be improved by adjusting the parameters, but at the expense of time and/or computing resources required to perform the analysis. Colony2 incorrectly assumes that marker loci are independently distributed, while GRL and BEM do not. This is likely the main reason for the poor results obtained with Colony2 on these relatively dense marker datasets.

Comparing GRL with the binomial exclusion method

Using BEM resulted in a slightly higher accuracy than GRL when the genotype error assumption was correct, or when GRL-parameters were estimated using the evaluation data (Figs.  3 and 4 ) and [see Additional file 3 : Figures S4, S5, S8 and S9]. However, when pre-estimated model parameters are used, assuming a too high genotype error rate will lead to some false assignments with BEM (Fig.  6 ), and assignment failure for the GRL method (Fig.  5 ). Thus, GRL can be used when it is crucial to minimize the false-positive rate. Assuming a too low genotype error rate resulted in both methods failing to correctly assign any trios, but GRL had a small fraction (0.016%) of false assignments while BEM did not [see Additional file 4 : Table S1]. Although the success parameter ( p , see Methods) of BEM was estimated using already GRL-assigned trios, the results indicate that the two methods are somewhat complementary and can be used together to increase overall assignment accuracy.

When the assumed genotype error rate was correct (Figs.  3 and 4 ) and [see Additional file 3 : Figures S4 and S5] or when the evaluation dataset was used to estimate parameters [see Additional file 3 : Figures S8, S9, S10 and S11], nearly all the individuals were assigned correctly and there were hardly any false assignments with either method. Thus, parameters should be estimated using the available data whenever possible, which should be the case in most situations.

Using GRL with clones or duplicated DNA

A possible novel use for the GRL method is analysis of genomic data that contain possibly duplicated genomes (e.g., by sampling of clones in plants or monozygotic twins in animals, or by duplicated sampling of DNA from the same individual). Using traditional likelihood-based or exclusion-based methods, duplicated samples/clones should be removed prior to the analysis, as these may be assigned as their own parents. For the GRL method, duplication of offspring genotypes is not a problem since GRL looks at patterns in parent–offspring relationships rather than the likelihood of each single genotype. For example, if clones of a non-inbred offspring are inserted as one or both putative parents, the GRL method would expect the offspring to be highly inbred, which will not match the observed relationship of the offspring with itself, and thus yields a low GRL value. However, duplication of parental genotypes will inevitably lead to assignment failure, since two or more trios will appear equally likely.

The GRL method is a promising trio parentage assignment method which is well suited to perform parentage assignment with high accuracy on high-density SNP datasets. GRL can be applied with success on datasets with high and/or unknown genotype error rates, highly dependent marker loci, closely-related individuals, inbreeding and in some cases clones. Estimation of the GRL parameters can be done without having a pre-existing reference dataset with known parent–offspring trio combinations. In addition, GRL can be used for training of exclusion-based methods.

References

1. Heaton MP, Leymaster KA, Kalbfleisch TS, Kijas JW, Clarke SM, McEwan J, et al. SNPs for parentage testing and traceability in globally diverse breeds of sheep. PLoS One. 2014;9:e94851.

2. Waldbieser GC, Bosworth BG. A standardized microsatellite marker panel for parentage and kinship analyses in channel catfish, Ictalurus punctatus. Anim Genet. 2013;44:476–9.

3. Campbell D, Duchesne P, Bernatchez L. AFLP utility for population assignment studies: analytical investigation and empirical comparison with microsatellites. Mol Ecol. 2003;12:1979–91.

4. Jones AG, Small CM, Paczolt KA, Ratterman NL. A practical guide to methods of parentage analysis. Mol Ecol Resour. 2010;10:6–30.

5. Morrissey MB, Wilson AJ. The potential costs of accounting for genotypic errors in molecular parentage analyses. Mol Ecol. 2005;14:4111–21.

6. Strucken EM, Lee SH, Lee HK, Song KD, Gibson JP, Gondro C. How many markers are enough? Factors influencing parentage testing in different livestock populations. J Anim Breed Genet. 2016;133:13–23.

7. Marshall TC, Slate J, Kruuk LE, Pemberton JM. Statistical confidence for likelihood-based paternity inference in natural populations. Mol Ecol. 1998;7:639–55.

8. Purfield DC, McClure M, Berry DP. Justification for setting the individual animal genotype call rate threshold at eighty-five percent. J Anim Sci. 2016;94:4558–69.

9. Goddard ME, Hayes BJ, Meuwissen TH. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011;128:409–21.

10. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.

11. Falconer DS, Mackay TFC. Introduction to quantitative genetics. 4th ed. Harlow: Longman Group Ltd; 1996.

12. Malécot G. Les mathématiques de l’hérédité. Paris: Masson; 1948.

13. Billingsley P. Probability and measure. 3rd ed. New York: Wiley; 1995.

14. Sargolzaei M, Schenkel FS. QMSim: a large-scale genome simulator for livestock. Bioinformatics. 2009;25:680–1.

15. Jones OR, Wang J. COLONY: a program for parentage and sibship inference from multilocus genotype data. Mol Ecol Resour. 2010;10:551–5.

16. Wang J, Santure AW. Parentage and sibship inference from multilocus genotype data under polygamy. Genetics. 2009;181:1579–94.

17. Wang J. Computationally efficient sibship and parentage assignment from multilocus marker data. Genetics. 2012;191:183–94.

18. Hayes BJ. Efficient parentage assignment and pedigree reconstruction with dense single nucleotide polymorphism data. J Dairy Sci. 2011;94:2114–7.

Authors’ contributions

KEG wrote the software, performed the study and drafted the manuscript. JO conceived the GRL method, coordinated the whole study and contributed in writing and revising the manuscript. THEM helped finalize the theory behind the training portion of the GRL method as well as revising the manuscript critically. All authors read and approved the final manuscript.

Acknowledgements

The research leading to these results has received funding from The Research Council of Norway through both the NAERINGSPHD (Project No. 251664) and the HAVBRUK2 (Project No. 245519) programs, as well as the breeding company AquaGen AS. The authors thank Thore Egeland for helpful comments on an early version of the draft and Jinliang Wang for providing support for the Colony2 software. We also wish to thank the editors and reviewers, especially reviewer 2, whose comments led to a significant increase in the quality of the end result.

Competing interests

KEG and JO are employed by AquaGen AS. AquaGen has applied for a patent regarding the use of the GRL methodology in parentage assignment.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Consent for publication

Not applicable.

Ethics approval and consent to participate

The research leading to these results has received funding from The Research Council of Norway through the research programs NAERINGSPHD (Project No. 251664) and the HAVBRUK2 (Project No. 245519), and AquaGen AS.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations.

AquaGen AS, P.O. Box 1240, NO-7462, Trondheim, Norway

Kim E. Grashei & Jørgen Ødegård

Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432, Ås, Norway

Kim E. Grashei, Jørgen Ødegård & Theo H. E. Meuwissen

Corresponding author

Correspondence to Kim E. Grashei .

Additional files

Additional file 1: Figures S1, S2 and S3.

Residual relationships plotted for all true trios from the 50 datasets. This file contains three figures (Figures S1, S2 and S3). Residual densities for offspring to itself (top panel), offspring to real mother (mid panel) and offspring to real father (bottom panel) are shown as a continuous line in all figures. For comparison, 50,000 values were sampled from the normal distribution using the means and variances of the residuals as parameters; these are shown as a dashed line in each panel. Figure S1 shows results with no genotype error or call rate variance, Figure S2 with a 1% genotype error and a ~80 to 100% call rate, and Figure S3 with a 3% genotype error and a ~80 to 100% call rate.

Additional file 2.

Supplementary material. This file contains three sections with extended information about the GRL training procedure, call rate and genotype error simulation, and the binomial exclusion method (BEM), respectively.

Additional file 3: Figures S4, S5, S6, S7, S8, S9, S10 and S11.

Assignment results using GRL or BEM for individuals with (left panel) and without (right panel) available parents in the dataset. This file contains eight figures in which assignment results from 50 simulated datasets are averaged. Parameters were pre-estimated using one arbitrarily chosen dataset in Figures S4, S5, S6 and S7, while training was performed on each evaluation dataset in Figures S8, S9, S10 and S11. Figures S4, S6, S8 and S10 show results using GRL, while Figures S5, S7, S9 and S11 show results using BEM. Figures S4 and S5 show results when there is a 3% genotype error (true and assumed); Figures S6 and S7 use parameters pre-estimated from a dataset with a 1% genotype error, while the true genotype error in the evaluation datasets is 3%. Figures S8, S9, S10 and S11 use training on each evaluation dataset, at both 1% (Figures S8 and S9) and 3% (Figures S10 and S11) genotype error. In all figures, the call rates are ~80 to 100%.

Additional file 4: Table S1.

Summary table of total number of correct, incorrect and non-assigned trios with or without parents and genotype errors for all 50 datasets. Genotype error: either 1% or 3%, and with assumption of genotype error in parenthesis (only applicable for models that are pre-trained). Available parents: all individuals with parents available for assignment in the dataset (Yes) or where all parents are missing (No). Correct: Number of correctly assigned individuals over all 50 datasets (only applicable when parents are available). Incorrect: Number of incorrectly assigned individuals over all 50 datasets. No-assign: Number of individuals that could not be assigned parents over all 50 datasets.

Mathematical foundation for the GRL method

In this article, only the hypothesis of true parents is used for the GRL method:

We assume \({\mathbf{x}} \sim N\left( {{\varvec{\upmu}}, {\varvec{\Sigma}}} \right)\), where \({\mathbf{x}}\) is the vector of residual genomic relationships, i.e. it holds the residual values for a trio assignment. We define \(x_{1}\) as the residual vector of the most probable trio and \(x_{2}\) as that of the second most probable trio, that is \(P\left( {x_{1} |H_{1} } \right) \ge P\left( {x_{2} |H_{1} } \right)\).

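The difference \(\Delta_{\text{GRL}} = {\text{GRL}}_{1} - {\text{GRL}}_{2}\), where \({\text{GRL}}_{1}\) and \({\text{GRL}}_{2}\) refer to the best and the second best trio candidates, respectively, can be shown to be identical to the natural logarithm of the probability of observing \(x_{1}\) given \(H_{1}\) divided by the probability of observing \(x_{2}\) given \(H_{1}\). Since \({\mathbf{x}}\) is assumed to be normally distributed, the multivariate normal probability density function used is:

\[f\left( {{\mathbf{x}}|H_{1} } \right) = \left( {2\pi } \right)^{ - 3/2} \left| {\varvec{\Sigma}} \right|^{ - 1/2} \exp \left( { - \frac{1}{2}\left( {{\mathbf{x}} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {{\mathbf{x}} - {\varvec{\upmu}}} \right)} \right)\]

where \({\mathbf{x}}\) is the 3 × 1 vector of genomic residuals, \({\varvec{\upmu}}\) is the 3 × 1 vector of expected residuals and \({\varvec{\Sigma}}\) is the 3 × 3 covariance matrix. If we define \(\frac{{L_{1} }}{{L_{2} }} = \frac{{f(x_{1} |H_{1} )}}{{f(x_{2} |H_{1} )}}\) (i.e. how many times more likely \(x_{1}\) is given \(H_{1}\) compared to \(x_{2}\) given \(H_{1}\)), the normalising constants cancel and we find that:

\[\frac{{L_{1} }}{{L_{2} }} = \frac{{\exp \left( { - \frac{1}{2}\left( {x_{1} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {x_{1} - {\varvec{\upmu}}} \right)} \right)}}{{\exp \left( { - \frac{1}{2}\left( {x_{2} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {x_{2} - {\varvec{\upmu}}} \right)} \right)}}\]

If we take the natural logarithm of this ratio we get:

\[\ln \frac{{L_{1} }}{{L_{2} }} = - \frac{1}{2}\left( {x_{1} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {x_{1} - {\varvec{\upmu}}} \right) + \frac{1}{2}\left( {x_{2} - {\varvec{\upmu}}} \right)^{\prime } {\varvec{\Sigma}}^{ - 1} \left( {x_{2} - {\varvec{\upmu}}} \right) = \Delta_{\text{GRL}}\]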

The above formula shows that \(\Delta_{\text{GRL}}\) has a logarithmic point probability ratio expectation. We can compare this to the \(\Delta_{Marshall}\) test statistic which is defined as in [ 7 ], that is:

where \(H_{2}\) is defined as:

LOD 1 is defined to be the LOD-score of the most likely trio, while LOD 2 is the second most likely trio. Then:

Both \(P\left( {data |H_{1} } \right)\) and \(P\left( {data |H_{2} } \right)\) can be written as follows:

where \(P_{t} \left( {g_{C} |g_{F} ,g_{M} ,H_{1} } \right)\) is the probability of observing the offspring genotype given the father and mother genotypes under \(H_{1}\) at locus t , \(P_{t} \left( {g_{F} } \right)\) is the probability of observing the father genotype at locus \(t\) , \(P_{t} \left( {g_{M} } \right)\) is the probability of observing the mother genotype at locus \(t\) , \(P_{t} \left( {g_{C} |H_{2} } \right)\) is the probability of observing the offspring genotype under \(H_{2}\) at locus \(t\) and \(c\) is the number of loci. Since \(LR = \frac{{P(data|H_{1} )}}{{P(data|H_{2} )}}\) , we can simplify \(LR\) to be:

Since \(LR_{1}\) is the likelihood ratio of the most likely trio and \(LR_{2}\) is the likelihood ratio of the second most likely trio (defined above), we can write \(LR_{1}\) and \(LR_{2}\) as:

where \(g_{{F_{1} }}\) and \(g_{{M_{1} }}\) are the genotypes of the father and mother in the most likely trio at locus \(t\) , respectively, and \(g_{{F_{2} }}\) and \(g_{{M_{2} }}\) are the genotypes of the father and mother at locus \(t\) in the second most likely trio, respectively. Since the same offspring is used in both trios, \(g_{C}\) is the same for both \(LR_{1}\) and \(LR_{2}\) for locus \(t\) .

Inserting \(LR_{1}\) and \(LR_{2}\) into the \(\Delta_{Marshall}\) -formula above we get:

where the explanation for \({\text{g}}_{C}\) , \({\text{g}}_{{F_{1} }} ,{\text{g}}_{{M_{1} }}\) , \({\text{g}}_{{F_{2} }} ,{\text{g}}_{{M_{2} }}\) is the same as above, while \({\mathbf{g}}_{C}\) , \({\mathbf{g}}_{{F_{1} }}\) , \({\mathbf{g}}_{{M_{1} }}\) , \({\mathbf{g}}_{{F_{2} }}\) and \({\mathbf{g}}_{{M_{2} }}\) are the genotypes for the offspring (or child), for the most probable father and mother and for the second most probable father and mother, respectively, over all loci in vector notation. The \(\Delta_{Marshall}\) method only uses the probability of observing the child genotypes given that \(F_{1}\) and \(M_{1}\) , or \(F_{2}\) and \(M_{2}\) are the true parents. Because the offspring is the same in both trios, the \(H_{2}\) terms cancel in the ratio, so the information in the \(H_{2}\) hypothesis is not used. This makes the \(\Delta_{Marshall}\) method similar to \(\Delta_{GRL}\) , as can be seen when the two method definitions are compared:

Both methods produce an estimated logarithmic ratio of the probability that \(C\) is the child of the two most probable parent candidates versus the probability that \(C\) is the child of the two second most probable parent candidates, hence the results produced by the two methods can be considered analogous.
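The displayed equations for the \(\Delta_{Marshall}\) part of this derivation did not survive in the text above. The following is a hedged reconstruction from the verbal definitions only, assuming that the LOD score is the natural logarithm of \(LR\) and that loci are treated as independent; it is a sketch of the argument, not the authors' typeset equations.

\[P\left( {data|H_{1} } \right) = \prod\limits_{t = 1}^{c} {P_{t} \left( {g_{C} |g_{F} ,g_{M} ,H_{1} } \right)P_{t} \left( {g_{F} } \right)P_{t} \left( {g_{M} } \right)} ,\quad P\left( {data|H_{2} } \right) = \prod\limits_{t = 1}^{c} {P_{t} \left( {g_{C} |H_{2} } \right)P_{t} \left( {g_{F} } \right)P_{t} \left( {g_{M} } \right)}\]

\[LR = \prod\limits_{t = 1}^{c} {\frac{{P_{t} \left( {g_{C} |g_{F} ,g_{M} ,H_{1} } \right)}}{{P_{t} \left( {g_{C} |H_{2} } \right)}}}\]

\[\Delta_{Marshall} = {\text{LOD}}_{1} - {\text{LOD}}_{2} = \ln \frac{{LR_{1} }}{{LR_{2} }} = \sum\limits_{t = 1}^{c} {\ln \frac{{P_{t} \left( {g_{C} |g_{{F_{1} }} ,g_{{M_{1} }} ,H_{1} } \right)}}{{P_{t} \left( {g_{C} |H_{2} } \right)}}} - \sum\limits_{t = 1}^{c} {\ln \frac{{P_{t} \left( {g_{C} |g_{{F_{2} }} ,g_{{M_{2} }} ,H_{1} } \right)}}{{P_{t} \left( {g_{C} |H_{2} } \right)}}} = \sum\limits_{t = 1}^{c} {\ln \frac{{P_{t} \left( {g_{C} |g_{{F_{1} }} ,g_{{M_{1} }} ,H_{1} } \right)}}{{P_{t} \left( {g_{C} |g_{{F_{2} }} ,g_{{M_{2} }} ,H_{1} } \right)}}}\]

Under these assumptions the parental genotype probabilities cancel within each \(LR\), and the \(P_{t} \left( {g_{C} |H_{2} } \right)\) terms cancel between \(LR_{1}\) and \(LR_{2}\) because the offspring is the same in both trios. What remains is a log ratio of two probabilities evaluated under \(H_{1}\), which has the same form as \(\Delta_{\text{GRL}} = \ln \left[ {f(x_{1} |H_{1} )/f(x_{2} |H_{1} )} \right]\).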

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Grashei, K.E., Ødegård, J. & Meuwissen, T.H.E. Using genomic relationship likelihood for parentage assignment. Genet Sel Evol 50, 26 (2018). https://doi.org/10.1186/s12711-018-0397-7

Received: 16 October 2017

Accepted: 04 May 2018

Published: 18 May 2018

DOI: https://doi.org/10.1186/s12711-018-0397-7

  • Genomic Relationship
  • Parentage Assignment
  • Genotyping Error Rate
  • Likelihood-based Methods
  • Parent-offspring Trios

Genetics Selection Evolution

ISSN: 1297-9686

  • Published: 10 March 2021

Maximum likelihood parentage assignment using quantitative genotypes

  • Matthew Gray Hamilton (ORCID: orcid.org/0000-0001-8098-8845)

Heredity volume 126, pages 884–895 (2021)

  • Agricultural genetics
  • Animal breeding
  • Genetic markers
  • Plant breeding

The cost of parentage assignment precludes its application in many selective breeding programmes and molecular ecology studies, and/or limits the circumstances or number of individuals to which it is applied. Pooling samples from more than one individual, and using appropriate genetic markers and algorithms to determine parental contributions to pools, is one means of reducing the cost of parentage assignment. This paper describes and validates a novel maximum likelihood (ML) parentage-assignment method, that can be used to accurately assign parentage to pooled samples of multiple individuals—previously published ML methods are applicable to samples of single individuals only—using low-density single nucleotide polymorphism (SNP) ‘quantitative’ (also referred to as ‘continuous’) genotype data. It is demonstrated with simulated data that, when applied to pools, this ‘quantitative maximum likelihood’ method assigns parentage with greater accuracy than established maximum likelihood parentage-assignment approaches, which rely on accurate discrete genotype calls; exclusion methods; and estimating parental contributions to pools by solving the weighted least squares problem. Quantitative maximum likelihood can be applied to pools generated using either a ‘pooling-for-individual-parentage-assignment’ approach, whereby each individual in a pool is tagged or traceable and from a known and mutually exclusive set of possible parents; or a ‘pooling-by-phenotype’ approach, whereby individuals of the same, or similar, phenotype/s are pooled. Although computationally intensive when applied to large pools, quantitative maximum likelihood has the potential to substantially reduce the cost of parentage assignment, even if applied to pools comprised of few individuals.

Introduction

This paper describes modifications to well-established maximum likelihood (ML) parentage-assignment methods that allow parentage to be assigned to pooled samples of multiple individuals; previously published ML methods are applicable to samples of single individuals only. This novel approach uses quantitative genotype data (also referred to as continuous genotype data; Clark et al. 2019) from low-density single nucleotide polymorphism (SNP) panels (Henshall et al. 2014). Quantitative genotype data comprise continuous numerical genotypes, reflecting the probabilities of all possible genotype classes (i.e., ordered allele combinations or unordered allele counts), in the form of genotype probability matrices or vectors. The use of quantitative genotype data for parentage assignment negates the need to call discrete genotype classes (Kalinowski et al. 2010; Marshall et al. 1998; Meagher and Thompson 1986). This is particularly important when assigning parentage to samples comprised of multiple individuals (i.e., ‘pools’). Genotype data from pools present like polyploid data and, as for polyploids, calling discrete genotypes is prone to error due to the large number of possible genotype classes at each SNP (Clark et al. 2019; Rahman et al. 2015).

Genetic markers are used for individual parentage assignment in pedigree-based selective breeding programmes: (i) where tracking identities over the life of individuals is difficult, expensive or not possible, a circumstance common in aquaculture species (Henshall et al. 2014; Kinghorn et al. 2010; Vandeputte and Haffray 2014); (ii) where only one parent is known with certainty, such as in the case of multiple-sire joining in livestock (Henderson 1988), polycross families in trees (Burdon and Shelbourne 1971) and aggregated full-sib families in aquaculture (Hamilton et al. 2009); or (iii) for quality-control/identity-recovery purposes (Grattapaglia et al. 2014; Hansen and Kjaer 2006). Parentage assignment is also widely used in molecular ecology, including the study of conservation biology, dispersal and recruitment patterns, quantitative genetics and sexual selection (Flanagan and Jones 2019; Jones et al. 2010). A variety of genetic markers have been adopted for parentage assignment in selective breeding programmes and molecular ecology (Flanagan and Jones 2019; Jones et al. 2010). However, SNPs are increasingly used for this purpose, due to their capacity for high-speed high-throughput screening, low mutation rate, low genotyping error rate, high prevalence in the genome, and decreasing cost of development and implementation (Hauser et al. 2011; Liu et al. 2016). A range of assays, platforms and algorithms are available for low-density SNP genotyping. Discrete genotypes are most commonly called using estimates of signal intensity (or area) for two axes, X and Y, corresponding to alleles A and B, from ‘intensity-based’ assays (Henshall et al. 2014; Rahman et al. 2015; Semagn et al. 2014), or using counts of allele reads from genotyping-by-sequencing (GBS) assays (Clark et al. 2019).

Although declining, the cost of parentage assignment per individual precludes its application in many selective breeding programmes, and/or limits the circumstances or number of individuals to which it is applied. For example, further reduction in the cost of parentage assignment per individual would potentially allow: (i) pedigree-based selective breeding in additional species; (ii) an increase in the number of families and individuals able to be generated, retained and measured in current selective breeding programmes, with associated increases in the accuracy of selection; (iii) an increase in the number of environments across which progeny tests can feasibly be conducted; and/or (iv) the genetic analysis of additional traits (Bell et al. 2017 ; Henshall et al. 2014 ; Kinghorn et al. 2010 ; Vandeputte and Haffray 2014 ). Pooling samples from more than one individual, and using appropriate genetic markers and algorithms to determine parental contributions to pools, are means of further reducing the cost of parentage assignment in selective breeding programmes and molecular ecology studies (Henshall et al. 2014 ; Kinghorn et al. 2010 ).

Two distinct approaches to pool construction and parentage assignment are addressed in this paper: (i) ‘pooling-for-individual-parentage-assignment’, whereby each individual in a pool is tagged or traceable and from a known and mutually exclusive set of possible parents; and (ii) ‘pooling-by-phenotype’, whereby individuals of the same, or similar, phenotype/s are pooled (Bell et al. 2017 ; Henshall et al. 2012 ; Kinghorn et al. 2010 ) into bins (i.e. classes) of, ideally, equal width to minimise heterogeneity in measurement error. Furthermore, a novel ML parentage-assignment method using quantitative genotypes that can be used to accurately assign parentage to pools is described. This approach uses the same rationale underpinning ML parentage assignment using discrete genotypes (Kalinowski et al. 2010 ; Marshall et al. 1998 )—in that it determines the likelihood of obtaining a certain pool (or individual offspring) genotype, given a set of parental genotypes—but uses quantitative genotypes in the place of discrete genotypes. ML parentage assignment using quantitative genotypes (herein referred to as ‘quantitative ML’) is validated, using simulated data, and compared with three alternative methods—ML parentage assignment using discrete genotypes (‘discrete ML’); exclusion (Chakraborty et al. 1974 ); and solving the weighted least squares problem (Henshall et al. 2014 ; Kinghorn et al. 2010 ).

Materials and methods

Computation of quantitative genotypes using intensity-based assays.

For intensity-based assays, a method to compute quantitative genotypes, extended from Henshall et al. ( 2014 ), is outlined in Fig. 1 , with a worked example provided in Supplementary Materials 1 . Methods to compute quantitative genotypes (Fig. 1 )—that is, the parent genotype probability matrices ( G ij ) and pool unordered genotype probability vectors ( g kj )—using GBS are detailed elsewhere (Clark et al. 2019 ).

Fig. 1

Dark grey boxes represent samples, light grey boxes represent data inputs from intensity-based assays and unshaded boxes represent scalars, vectors and matrices derived from data inputs. Note that a ‘pool’ may be comprised of one individual (dashed arrow).

For intensity-based SNP assays, the mean and standard deviation of allelic proportions for each genotype of each SNP (i.e., individual allelic proportion parameters —refer to Fig. 1 of Henshall et al. 2014 —and pool allelic proportion parameters ) are required to account for normally distributed random errors in allelic proportions in the computation of parent genotype probability matrices and pool unordered genotype probability vectors (Fig. 1 ). These are estimated from individual allele intensities and individual genotype calls from tissue and DNA samples of individuals (i.e., not pools). Individual samples (i.e. individual parent samples or non-parent individual samples ; Fig. 1 ) should be sourced from the same, or a closely related, population to that for which parentage assignment is being undertaken, so as to maximise the probability that data for relevant loci and alleles are captured.

Estimation of individual allelic proportion parameters

Herein the approach and notation used to compute quantitative genotypes from intensity-based assays, accounting for normally distributed random errors in allelic proportions, is extended from Henshall et al. ( 2014 ). Accordingly, for a given individual and SNP, the individual allelic proportion ( p ij ) takes a value between zero and one, corresponding to the polar coordinate range from 0 to \(\frac{\pi }{2}\) (Henshall et al. 2014 ):

where p ij is the individual allelic proportion (for details refer to the Appendix of Henshall et al. 2014 ), and a 1 ij and a 2 ij are the individual allelic intensities for allele A and B for individual sample i and SNP j (adjusted for ‘uncertainty’ where applicable—refer to Eq. (1) of Henshall et al. 2014 ). Conceptually, individual allelic intensities should not be less than zero and individual allelic proportions should range between 0 and 1 but minor deviations outside these parameter boundaries can be accommodated by Eq. (1). Note that software associated with some genotyping platforms directly output a variable that is conceptually equivalent to individual (or pool) allelic proportion (e.g., ‘B-allele frequency’, refer to Bell et al. 2017 ).
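As an illustration of the mapping described for Eq. (1), the sketch below assumes that the allelic proportion is the polar angle of the intensity pair, arctan(a2 / a1), rescaled from the range 0 to π/2 onto 0 to 1. This form is inferred from the description above rather than copied from the original equation, and the function name `allelic_proportion` is hypothetical.

```python
import math


def allelic_proportion(a1, a2):
    """Map allele intensities (a1 for allele A, a2 for allele B) to a proportion.

    Assumed form: arctan(a2 / a1) divided by pi/2, so that a pure A signal
    gives 0 and a pure B signal gives 1. Slightly negative intensities give
    values slightly outside [0, 1], as the text allows.
    """
    if a1 == 0:
        if a2 == 0:
            return float("nan")  # no signal at all: the angle is undefined
        # Pure B signal: the angle is pi/2 (or -pi/2 if a2 is negative).
        angle = math.copysign(math.pi / 2, a2)
    else:
        angle = math.atan(a2 / a1)
    return angle / (math.pi / 2)


if __name__ == "__main__":
    for a1, a2 in [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]:
        print(a1, a2, allelic_proportion(a1, a2))
```

Where a genotyping platform already reports a B-allele frequency or an equivalent quantity, that value could be used directly in place of this function.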

For samples from diploid individuals, discrete genotype classes can be reliably called using proprietary or other software appropriate to a given genotyping platform (Henshall et al. 2014 ). Individual allelic proportion parameters , means ( \(\overline x _{AAj}\) , \(\overline x _{ABj}\) and \(\overline x _{BBj}\) ) and standard deviations ( s AAj , s ABj and s BBj ) by genotype, can then be estimated for each SNP from a sample of individuals. These individual allelic proportion parameters are specific to the genotyping assay, platform and method used.

Estimation of pool allelic proportion parameters

Pool allelic proportion parameters for each SNP can be approximated from individual allelic proportion parameters , as follows. First, means of allelic proportions for homozygous A ( \(\overline x _{jA}\) ) and homozygous B ( \(\overline x _{jB}\) ) genotypes may be assumed to equal \(\overline x _{jAA}\) and \(\overline x _{jBB}\) , respectively, and s jA and s jB to equal s jAA and s jBB , respectively. Second, mean and standard deviation of allelic proportion for the heterozygous genotypes with equal counts of A and B alleles may be assumed to equal \(\overline x _{jAB}\) and s jAB , respectively. Finally, means and standard deviations of allelic proportion for heterozygous genotypes with unequal counts of A and B alleles may be weighted according to allele counts. For example, for a pool size of three (i.e., three individuals in a pool): \(\overline x _{jAAAAAB} \approx \frac{2}{3}\overline x _{jAA} + \frac{1}{3}\overline x _{jAB}\) and \(s_{jAAAAAB} \approx \sqrt {\frac{2}{3}s_{jAA}^2 + \frac{1}{3}s_{jAB}^2}\) , \(\overline x _{jAAAABB} \approx \frac{1}{3}\overline x _{jAA} + \frac{2}{3}\overline x _{jAB}\) and \(s_{jAAAABB} \approx \sqrt {\frac{1}{3}s_{jAA}^2 + \frac{2}{3}s_{jAB}^2}\) , \(\overline x _{jAABBBB} \approx \frac{2}{3}\overline x _{jAB} + \frac{1}{3}\overline x _{jBB}\) and \(s_{jAABBBB} \approx \sqrt {\frac{2}{3}s_{jAB}^2 + \frac{1}{3}s_{jBB}^2}\) ; and \(\overline x _{jABBBBB} \approx \frac{1}{3}\overline x _{jAB} + \frac{2}{3}\overline x _{jBB}\) and \(s_{jABBBBB} \approx \sqrt {\frac{1}{3}s_{jAB}^2 + \frac{2}{3}s_{jBB}^2}\) . Note that only parameters for unordered allele combinations (i.e., ‘unordered genotypes’—genotypes corresponding to allele counts) are required. For example, the unordered genotype AAAAAB corresponds to the ordered genotypes of AAAAAB, AAAABA, AAABAA, AABAAA, ABAAAA and BAAAAA.

Alternative approaches to estimating pool allelic proportion parameters can be conceived. For example, they could be estimated directly from pools with known SNP genotypes.
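The weighting described above for a pool size of three can be written as a small function. The sketch below generalises the worked example to any pool size n by interpolating between AA and AB when the count of B alleles is at most n, and between AB and BB otherwise. That generalisation is an assumption that reproduces the n = 3 weights given in the text; the function name and the dictionary layout are hypothetical.

```python
import math


def pool_allelic_proportion_params(n, mean, sd):
    """Approximate (mean, sd) of allelic proportion for each unordered pool genotype.

    n    : pool size (number of diploid individuals in the pool)
    mean : dict with keys "AA", "AB", "BB" (individual means for one SNP)
    sd   : dict with keys "AA", "AB", "BB" (individual standard deviations)
    Returns a list indexed by the number of B alleles in the pool (0..2n).
    """
    params = []
    for b in range(2 * n + 1):
        if b <= n:
            low, high, w_high = "AA", "AB", b / n
        else:
            low, high, w_high = "AB", "BB", (b - n) / n
        w_low = 1.0 - w_high
        m = w_low * mean[low] + w_high * mean[high]
        # Variances (not standard deviations) are weighted, then the root is taken.
        s = math.sqrt(w_low * sd[low] ** 2 + w_high * sd[high] ** 2)
        params.append((m, s))
    return params


if __name__ == "__main__":
    means = {"AA": 0.0, "AB": 0.5, "BB": 1.0}
    sds = {"AA": 0.03, "AB": 0.06, "BB": 0.03}
    for b, (m, s) in enumerate(pool_allelic_proportion_params(3, means, sds)):
        print(b, round(m, 4), round(s, 4))
```

For n = 3 the entry for one B allele (AAAAAB) is two thirds AA plus one third AB, matching the example in the text.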

Computing quantitative genotypes

For a given parent and SNP, each element of the parent genotype probability matrix ( G ij ) represents the probability that the true genotype is equal to the corresponding ordered genotype. Adopting the approach of Henshall et al. ( 2014 ), for each parent i and SNP j a quantitative genotype probability matrix ( G ij ) can be computed:

where λ ij AA is the height of the N ( \(\overline x _{j_{AA}},s_{j_{AA}}\) ) distribution at x  =  p ij , λ ijAB and λ ij BA are the height of the N( \(\overline x _{j_{AB}},s_{j_{AB}}\) ) distribution at x  =  p ij , λ ijBB is the height of the N( \(\overline x _{j_{BB}},s_{j_{BB}}\) ) distribution at x  =  p ij and Σλ ij is the sum of λ ijAA , λ ijAB and λ ijBB .
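A minimal sketch of how such a matrix might be computed is given below. It assumes that the heterozygote density is split equally between the ordered genotypes AB and BA, and that all entries are divided by the sum of the three densities so that the matrix sums to one. This layout follows from the definitions above but is an assumption, as the original equation is not reproduced here; the function name is hypothetical.

```python
from statistics import NormalDist


def parent_genotype_matrix(p, mean, sd):
    """2 x 2 matrix of ordered genotype probabilities [[AA, AB], [BA, BB]].

    p        : allelic proportion observed for this parent and SNP
    mean, sd : dicts keyed by "AA", "AB", "BB" with the individual allelic
               proportion parameters for this SNP
    """
    lam_aa = NormalDist(mean["AA"], sd["AA"]).pdf(p)
    lam_ab = NormalDist(mean["AB"], sd["AB"]).pdf(p)
    lam_bb = NormalDist(mean["BB"], sd["BB"]).pdf(p)
    total = lam_aa + lam_ab + lam_bb  # may underflow to 0 if p is far from every mean
    half_het = 0.5 * lam_ab / total   # assumed: AB and BA each take half the density
    return [[lam_aa / total, half_het],
            [half_het, lam_bb / total]]


if __name__ == "__main__":
    means = {"AA": 0.0, "AB": 0.5, "BB": 1.0}
    sds = {"AA": 0.05, "AB": 0.05, "BB": 0.05}
    print(parent_genotype_matrix(0.45, means, sds))
```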

Extending the method of Henshall et al. (2014), for each pool and SNP, a pool unordered genotype probability vector ( g kj ), each element of which represents the probability that the true genotype is equal to the corresponding unordered genotype, can be computed by modifying Eq. (2) to accommodate the additional genotypes possible in pools:

where g kj is a vector with 2 n + 1 elements, corresponding to the number of possible unordered genotypes for pool k and SNP j , n is the pool size, and ° is the Hadamard or entrywise product. The elements λ kj 1 to λ kj 2 n +1 are computed as the heights of the corresponding N ( \(\overline x _{j_1},s_{j_1}\) ) to N ( \(\overline x _{j_{2n + 1}},s_{j_{2n + 1}}\) ) distributions at x = p kj (where p kj is the allelic proportion for pool k , computed according to Eq. (1), replacing i with k ). The elements \({\uprho}_1^{ - 1}\) to \({\uprho}_{2n + 1}^{ - 1}\) are computed as the inverse of the number of possible ordered genotypes for the unordered genotypes 1 to 2 n + 1. For example, in the case of two individuals per pool ( n = 2): λ kj 1 = λ kjAAAA = height of the \({\mathrm{N}}\left( {\overline x _{j_{AAAA}},s_{j_{AAAA}}} \right)\) distribution at x = p kj and ρ 1 = ρ AAAA = 1 (the one ordered genotype is AAAA); λ kj 2 = λ kjAAAB = height of the \({\mathrm{N}}\left( {\overline x _{j_{AAAB}},s_{j_{AAAB}}} \right)\) distribution at x = p kj and ρ 2 = ρ AAAB = 4 (the four ordered genotypes are AAAB, AABA, ABAA and BAAA); λ kj 3 = λ kjAABB = height of the \({\mathrm{N}}\left( {\overline x _{j_{AABB}},s_{j_{AABB}}} \right)\) distribution at x = p kj and ρ 3 = ρ AABB = 6 (the six ordered genotypes are AABB, ABAB, BAAB, ABBA, BABA and BBAA); λ kj 4 = λ kjABBB = height of the \({\mathrm{N}}\left( {\overline x _{j_{ABBB}},s_{j_{ABBB}}} \right)\) distribution at x = p kj and ρ 4 = ρ ABBB = 4 (the four ordered genotypes are ABBB, BABB, BBAB and BBBA); and λ kj 5 = λ kjBBBB = height of the \({\mathrm{N}}\left( {\overline x _{j_{BBBB}},s_{j_{BBBB}}} \right)\) distribution at x = p kj and ρ 5 = ρ BBBB = 1 (the one ordered genotype is BBBB).
Note that in the case of one individual per pool ( n  = 1), the elements of a pool unordered genotype probability vector are equal to the elements of the upper triangle of a corresponding quantitative genotype probability matrix for an individual offspring (i.e. g kj is equivalent to the upper triangle of G oj from Henshall et al. 2014 , expressed as vector). Where data for a SNP are missing in a pool, values of g kj can be assumed to equal the expected unordered transmission vector ( n nj ), where n nj equals t cj , assuming t ij equals f j for all 2 n parents (detailed below).
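The counts of ordered genotypes listed above (1, 4, 6, 4, 1 for n = 2) are binomial coefficients, which makes the pool vector straightforward to sketch. The code below assumes that the vector is the entrywise product of the density heights and the inverse counts, divided by the sum of the heights. That normalisation is an assumption, chosen because it reproduces the upper triangle of the individual genotype probability matrix when n = 1; the function names are hypothetical.

```python
from math import comb
from statistics import NormalDist


def ordered_genotype_counts(n):
    """Number of ordered genotypes per unordered genotype, indexed by B-allele count."""
    return [comb(2 * n, b) for b in range(2 * n + 1)]


def pool_unordered_genotype_vector(p, params):
    """Pool genotype vector with 2n + 1 elements for one pool and SNP.

    p      : allelic proportion observed for the pool
    params : list of (mean, sd) per unordered genotype, ordered by the
             number of B alleles from 0 to 2n
    """
    two_n = len(params) - 1
    lam = [NormalDist(m, s).pdf(p) for m, s in params]
    rho = ordered_genotype_counts(two_n // 2)
    total = sum(lam)  # assumed normalising constant (sum of density heights)
    return [l / (r * total) for l, r in zip(lam, rho)]


if __name__ == "__main__":
    print(ordered_genotype_counts(2))
    individual = [(0.0, 0.05), (0.5, 0.05), (1.0, 0.05)]
    print(pool_unordered_genotype_vector(0.48, individual))
```

Under this assumed scaling each element is the probability of a single ordered genotype within the class, so the elements multiplied by their counts sum to one.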

ML parentage assignment using quantitative genotypes

Required inputs to undertake quantitative ML are the parent genotype probability matrices ( G ij ) and the pool unordered genotype probability vectors ( g kj ), described above; and a list of possible parental combinations in the pool , an assumed genotype replacement error rate ( \(\widehat {\varepsilon _{nj}}\) ) and a critical Δ log odds (LOD) value (Fig. 2 ; Supplementary Material 1 ). The list of possible parental combinations in the pool may comprise all combinations of all possible parents but in many circumstances a shorter list can be considered, substantially reducing the computational burden. For example, in the case of pooling-for-individual-parentage-assignment, pools are constructed such that the list of possible parental combinations in a pool must include two, and only two, parents from each known and mutually exclusive set of possible parents contributing to the pool. The assumed genotype replacement error rate can theoretically range between zero and one but is generally small (e.g., <0.01). Critical Δ LOD represents a pedigree assignment acceptance threshold corresponding to a desired level of confidence in assignment (i.e., a predetermined acceptable proportion of erroneous parentage assignments)—the greater the critical Δ LOD, the lower the erroneous assignment rate in accepted assignments, but the greater the number of falsely rejected assignments (Harrison et al. 2013 ; Jones et al. 2010 ). Although computationally burdensome, critical Δ LOD is most appropriately determined by simulating populations of parents and pools ( Jones et al. 2010 ; Kalinowski et al. 2010 ; Marshall et al. 1998 ) , given an assumed genotype replacement error rate and possible parental combinations in pools .

Fig. 2

Light grey boxes represent data inputs and unshaded boxes represent scalars, vectors and matrices derived from data inputs. Note that the expected allele frequency vector may be estimated from the parent transmission vector or other sources (dashed arrow).

Computation of the LOD score

Intermediate vectors required to compute the LOD score (Henshall et al. 2014 ; Meagher and Thompson 1986 ), and ultimately assign parentage, include the parent allele transmission vectors ( t ij ), parent combination ordered transmission vectors ( q cj ) and parent combination unordered transition vectors ( t cj ; Fig. 2 ). For a given parent, i , and SNP, j , each element of the 2 × 1 parent allele transmission vector represents the probability that the corresponding allele (i.e., A or B) will be transmitted to its progeny. The parent allele transmission vector ( t ij ) for parent i and SNP j can be computed as (Henshall et al. 2014 ):

When marker data for a SNP from a parent is missing, the corresponding parent allele transmission vector may be assumed to equal the vector of expected allele frequencies transmitted to pools for SNP j ( f j ). The expected allele frequencies are most simply estimated by computing the average of parental t ij vectors, excluding missing data, for SNP j (Fig. 2 ). However, less generic, but more precise, approaches to the computation of expected allele frequencies can be conceived. For example, if expected parental contributions to pools are known, expected allele frequencies could be computed separately for each pool as the average of parental t ij vectors weighted by their expected parental contribution to the pool.
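The two quantities described above can be sketched as follows. The transmission vector is taken here to be the probability of the homozygote plus half the probability of the two heterozygous ordered genotypes, which follows from Mendelian segregation but is stated as an assumption because the original equation is not reproduced in the text. Missing parents are represented by `None`; both function names are hypothetical.

```python
def transmission_vector(G):
    """[P(transmit A), P(transmit B)] from a 2 x 2 ordered genotype probability matrix.

    G is laid out as [[AA, AB], [BA, BB]].
    """
    het = G[0][1] + G[1][0]
    return [G[0][0] + 0.5 * het, G[1][1] + 0.5 * het]


def expected_allele_frequencies(t_vectors):
    """Average of parental transmission vectors for one SNP, skipping missing data."""
    observed = [t for t in t_vectors if t is not None]
    if not observed:
        raise ValueError("no parent has data for this SNP")
    k = len(observed)
    return [sum(t[0] for t in observed) / k, sum(t[1] for t in observed) / k]


if __name__ == "__main__":
    t_het = transmission_vector([[0.0, 0.5], [0.5, 0.0]])  # confident heterozygote
    t_hom = transmission_vector([[1.0, 0.0], [0.0, 0.0]])  # confident AA homozygote
    f_j = expected_allele_frequencies([t_hom, None, t_het])
    print(t_het, t_hom, f_j)
```

A parent with missing data would then be given `f_j` as its transmission vector for that SNP.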

For each possible combination of parents, c, and SNP, j, each element of the parent combination ordered transmission vector ( q cj ; Fig. 2), which has \(2^{2n}\) elements, represents the probability that the corresponding ordered genotype will be transmitted to progeny. The parent combination ordered transmission vector can be computed as the Kronecker product, ⊗, of t 1 j ,…, t 2 nj . That is, \({\mathbf{q}}_{cj} = {\mathbf{t}}_{1j} \otimes {\mathbf{t}}_{2j} \otimes \cdots \otimes {\mathbf{t}}_{2nj}\),

where 2n is the number of parents for possible parent combination c .

It is noteworthy that the length of q cj , \(2^{2n}\), increases exponentially with 2 n (i.e., twice the pool size). However, q cj vectors can be consolidated into parent combination unordered transition vectors ( t cj ) of length 2 n + 1, substantially reducing the computational burden. For example, if the pool size is three ( n = 3), elements corresponding to the ordered genotypes AAAAAB, AAAABA, AAABAA, AABAAA, ABAAAA, and BAAAAA in q cj can be summed, with the result becoming the element in t cj corresponding to the unordered genotype AAAAAB.
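The Kronecker product and the consolidation step can be sketched in a few lines of standard-library Python. The function names are hypothetical, and each parent is assumed to be represented by a two-element list holding the probabilities of transmitting A and B.

```python
from itertools import product
from math import prod


def ordered_transmission_vector(t_vectors):
    """Kronecker product of 2n two-element transmission vectors (length 2**(2n))."""
    # itertools.product varies the last parent fastest, matching Kronecker ordering.
    return [prod(combo) for combo in product(*t_vectors)]


def unordered_transmission_vector(t_vectors):
    """Sum ordered-genotype probabilities by number of B alleles (length 2n + 1)."""
    two_n = len(t_vectors)
    out = [0.0] * (two_n + 1)
    for alleles in product((0, 1), repeat=two_n):  # 0 = allele A, 1 = allele B
        prob = prod(t[a] for t, a in zip(t_vectors, alleles))
        out[sum(alleles)] += prob
    return out


if __name__ == "__main__":
    parents = [[0.5, 0.5], [1.0, 0.0]]  # a heterozygote and an AA homozygote
    print(ordered_transmission_vector(parents))    # order: AA, AB, BA, BB
    print(unordered_transmission_vector(parents))  # order: 0, 1 or 2 B alleles
```

This version still enumerates every ordered genotype, so it only removes the storage cost of the long vector; the number of terms visited remains exponential in 2n.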

In the implementation of previously described ML parentage-assignment methods, genotyping errors are accounted for by assuming they represent the replacement of true genotypes with genotypes selected at random under Hardy–Weinberg assumptions in a small proportion of SNPs (i.e., the assumed genotype replacement error rate ) (Henshall et al. 2014 ; Jones et al. 2010 ; Kalinowski et al. 2010 ; Marshall et al. 1998 ). Such errors can be similarly accounted for in quantitative ML by adjusting the parent combination unordered transition vectors ( t cj ) and pool unordered genotype probability vectors ( g kj ):

where \(\widehat {\varepsilon _{nj}}\) is the assumed genotype replacement error rate for pool size n and SNP j .

For each pool with genotype probability vector \({\boldsymbol{g}}_ \ast ^{{\boldsymbol{kj}}}\) , the likelihood of each possible parent combination and pool duo for each SNP can be computed as:

To compute the likelihood ratio \(\left( {\frac{{L^{\left( {ck} \right)j}}}{{L^{cj}}}} \right)\) , the denominator is computed as the likelihood under the null hypothesis that the individuals in the pool are unrelated to the possible parent combination in question:

The likelihood ratios can then be combined across all markers as the sum of their natural logs, herein referred to as the LOD score:

\(LOD^{ck} = \sum\nolimits_{j = 1}^m {\ln \left( {\frac{{L^{\left( {ck} \right)j}}}{{L^{cj}}}} \right)} \)

where m is the number of SNPs.
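A minimal sketch of the summation, assuming the per-SNP likelihoods under the alternative and null hypotheses have already been computed and are strictly positive (as they would be once a non-zero genotype replacement error rate is applied). The function name `lod_score` is hypothetical.

```python
import math
from typing import Sequence


def lod_score(lik_alternative: Sequence[float], lik_null: Sequence[float]) -> float:
    """Sum over SNPs of the natural log of the per-SNP likelihood ratio."""
    if len(lik_alternative) != len(lik_null):
        raise ValueError("one likelihood per SNP is required under each hypothesis")
    return sum(math.log(a / b) for a, b in zip(lik_alternative, lik_null))
```

A positive score indicates that, across the m SNPs, the data are better explained by the parent combination in question than by unrelated parents.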

Parentage assignment

The most likely parental combination for each pool can be identified as that with the greatest LOD ck . The Δ LOD for each parent in the most likely parental combination can then be computed as the difference between the LOD ck for the most likely parental combination and the maximum LOD ck of those parental combinations that are identical to the most likely combination except for the parent in question. If a parent’s Δ LOD is greater than the critical Δ LOD, it can be accepted as correctly assigned.
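The Δ LOD rule can be sketched as follows for a single pool. The representation of a parent combination as a tuple of parent identifiers, the function name `delta_lods`, and the use of `None` where no rival combination exists are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Combo = Tuple[str, ...]  # one possible parent combination, as parent identifiers


def delta_lods(lod_by_combo: Dict[Combo, float]) -> Tuple[Combo, Dict[str, Optional[float]]]:
    """Return the most likely combination and the delta-LOD of each parent in it."""
    best = max(lod_by_combo, key=lod_by_combo.get)
    best_count = Counter(best)
    result: Dict[str, Optional[float]] = {}
    for parent in sorted(set(best)):
        # Parents that a rival combination must share with the best one.
        shared = best_count - Counter([parent])
        rivals = [
            lod
            for combo, lod in lod_by_combo.items()
            if len(combo) == len(best)
            and Counter(combo) != best_count
            and not (shared - Counter(combo))  # rival contains all shared parents
        ]
        result[parent] = lod_by_combo[best] - max(rivals) if rivals else None
    return best, result
```

Each parent's value would then be compared with the chosen critical Δ LOD, and assignments falling below it rejected.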

An R package (R Core Team 2020 ) entitled ‘SNPpools’, available at https://github.com/mghamilton/SNPpools , was developed to implement and validate, with simulations, the quantitative ML method. Simulations were conducted for pool sizes of one (i.e., samples of individuals), two and three using the sim.parent.assign.fun function of the SNPpools package.

Generation of simulated datasets for validation and comparison

Two pedigree structures (‘Pedigree 1’ and ‘Pedigree 2’) were simulated. Pedigree 1 was constructed assuming eight dams, each crossed with one of eight mutually exclusive sets of ten sires (i.e., 80 sires in total). This pedigree structure is akin to the case of multiple-sire joining in livestock (Henderson 1988 ), poly cross families in trees (Burdon and Shelbourne 1971 ) and aggregated full-sib families in aquaculture (Hamilton et al. 2009 ), where one parent is known and the other is known to be one of a finite set of possible parents. Pedigree 2 was constructed assuming 40 sires and 40 dams each crossed with one individual of the opposite sex to produce 40 full-sib families. This pedigree structure represents a mating design applicable to species where half-sib families are difficult to generate, such as shrimp (Dai et al. 2020 ). In both pedigree structures, all parents were assumed to be unrelated (scenarios involving half the number of possible parents/families and related parents are explored in Supplementary Materials 2 and 3 , respectively).

Pedigree 1 pools were constructed to allow ‘pooling-for-individual-parentage assignment’ and for Pedigree 2 pools were constructed assuming a ‘pooling-by-phenotype’ strategy. That is, for Pedigree 1, pools were constructed ensuring that the progeny of each dam was represented no more than once in each pool. For Pedigree 2, no constraints on the ancestry of individuals placed in pools were applied—that is, individuals from families were allocated to pools at random. For each pooling strategy and pool size, parentage was assigned to the progeny of 9600 ‘unknown parents’.

For each simulated sample (parent or pool) and 100 SNP, individual allele intensities were generated (scenarios involving 50 SNP and 200 SNP are explored in Supplementary Materials 4 and 5 , respectively). First, ‘true’ genotypes were randomly generated for parents assuming that SNPs were biallelic, SNPs were in linkage equilibrium (i.e., inherited independently) and the allele frequency for all SNPs was 0.5 (a scenario in which the allele frequency for all SNPs was 0.4 is explored in Supplementary Materials 6 ). Second, true genotypes for progeny were then generated from true parental genotypes assuming alleles were inherited according to Mendelian principles. Finally, pool genotypes were generated by concatenating the true genotypes of individuals in the pools (e.g., a pool of two individuals with genotypes AB and BB had a pool genotype of ABBB).
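The three steps above (random parental genotypes, Mendelian inheritance, and concatenation into pool genotypes) might be sketched as below. This is not the `sim.parent.assign.fun` function of SNPpools; the function names and the representation of a genotype as a two-letter string per SNP are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def random_parent_genotype(n_snps: int, rng: random.Random, freq_a: float = 0.5) -> List[str]:
    """Draw independent biallelic SNP genotypes for an unrelated parent."""
    def allele() -> str:
        return "A" if rng.random() < freq_a else "B"
    return [allele() + allele() for _ in range(n_snps)]


def progeny_genotype(sire: Sequence[str], dam: Sequence[str], rng: random.Random) -> List[str]:
    """Mendelian inheritance: one randomly chosen allele from each parent per SNP."""
    return [rng.choice(s) + rng.choice(d) for s, d in zip(sire, dam)]


def pool_genotype(individuals: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate the genotypes of pooled individuals SNP by SNP."""
    return ["".join(per_snp) for per_snp in zip(*individuals)]
```

For instance, pooling individuals with genotypes AB and BB at a SNP gives the pool genotype ABBB, as in the example in the text. Allele order within a genotype string carries no meaning here.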

To account for lab-based genotyping errors, each SNP for each individual and pool was randomly categorised as correct (with probability 1 −  ε j ) or erroneous (with probability ε j ), where ε j is the true genotype replacement error rate for SNP j (Henshall et al. 2014 ). If categorised as erroneous, a random ‘observed genotype’, which may differ from the ‘true genotype’, was generated. For the purpose of simulation the true genotype replacement error rate was assumed to equal 0.01—a somewhat large and thus conservative value—for all SNPs and samples.
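A hedged sketch of the genotype replacement step follows. It assumes that an erroneous genotype is redrawn allele by allele at a given allele frequency (i.e., under Hardy–Weinberg proportions for an individual), which is one reading of the description above; the function name and string representation are hypothetical.

```python
import random
from typing import List, Sequence


def apply_replacement_error(
    genotypes: Sequence[str],
    error_rate: float,
    rng: random.Random,
    freq_a: float = 0.5,
) -> List[str]:
    """Replace each true genotype, with probability error_rate, by a random one.

    The replacement has the same number of alleles as the true genotype and
    may, by chance, equal it. Replaced genotypes are returned in sorted
    (unordered) form, e.g. "AB" rather than "BA".
    """
    observed = []
    for genotype in genotypes:
        if rng.random() < error_rate:
            alleles = ["A" if rng.random() < freq_a else "B" for _ in genotype]
            genotype = "".join(sorted(alleles))
        observed.append(genotype)
    return observed
```

With `error_rate=0.01`, roughly one SNP in a hundred per sample would be redrawn, matching the rate assumed in the simulations.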

For each parent and SNP, individual allelic proportions ( p ij ) were generated as random normal variables:

where μ jq is the mean of the individual allelic proportion for the observed genotype of parent sample i , SNP j and genotype q ; and σ jq is the corresponding standard deviation. Assumed true individual allelic proportion parameters for all SNPs were as follows: μ AA  = 0.05, μ AB  = 0.50, μ BB  = 0.95 and σ AA  =  σ AB  =  σ BB  = 0.05 (scenarios in which the standard deviations, σ AA , σ AB and σ BB , were assumed to equal 0.01 and 0.20 are explored in Supplementary Materials 7 and 8 , respectively). Individual allele intensities were subsequently calculated as follows:

Pool allelic proportions ( p kj ) and pool allele intensities ( a 1 kj and a 2 kj ) were correspondingly computed.
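Drawing an individual allelic proportion from its genotype-specific normal distribution can be sketched as follows. The means are those quoted in the text; a standard deviation of 0.05 for every genotype class is an assumption here, and the conversion from proportions to allele intensities is deliberately left out because its formula is not reproduced above. The names `MEANS`, `SDS` and `allelic_proportion` are illustrative.

```python
import random
from typing import Dict

# Means as quoted in the text; a common SD of 0.05 for all classes is assumed.
MEANS: Dict[str, float] = {"AA": 0.05, "AB": 0.50, "BB": 0.95}
SDS: Dict[str, float] = {"AA": 0.05, "AB": 0.05, "BB": 0.05}


def allelic_proportion(
    observed_genotype: str,
    rng: random.Random,
    means: Dict[str, float] = MEANS,
    sds: Dict[str, float] = SDS,
) -> float:
    """Draw one allelic proportion from the normal distribution of its genotype class."""
    return rng.gauss(means[observed_genotype], sds[observed_genotype])
```

Note that a normal draw is not constrained to the interval from 0 to 1; the sketch makes no attempt to truncate it.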

For each simulated parent and pool, each SNP was randomly categorised as missing (with probability 0.1) or present (with probability 0.9). For those SNPs categorised as missing, allele intensities (i.e., a 1 ij , a 2 ij , a 1 kj and/or a 2 kj ) were deleted (scenarios in which all SNP from some parents were categorised as missing are explored in Supplementary Materials 9 ).

For each simulated pool, quantitative ML was undertaken considering only possible parental combinations, and the assigned parents were compared with the true parents of each pool. For comparison, parentage assignment was also undertaken for each pool by extending three previously described methods (Henshall et al. 2014 ). First, an ML parentage-assignment approach using discrete genotypes was implemented. This method, ‘discrete ML’, was equivalent to quantitative ML, except that the quantitative parent genotype probability matrix ( G ij ) was replaced with a discrete genotype probability matrix, D ij , where \({\boldsymbol{D}}^{{\boldsymbol{ij}}} = \left[ {\begin{array}{*{20}{c}} 1 & 0 \\ 0 & 0 \end{array}} \right],\left[ {\begin{array}{*{20}{c}} 0 & {1/2} \\ {1/2} & 0 \end{array}} \right]\) , or \(\left[ {\begin{array}{*{20}{c}} 0 & 0 \\ 0 & 1 \end{array}} \right]\) . To compute D ij , elements of the corresponding G ij that were greater than 0.98 for homozygous genotypes (i.e., diagonal elements corresponding to genotypes AA and BB), or greater than 0.49 for heterozygous genotypes (i.e., off-diagonal elements corresponding to genotypes AB and BA), were replaced with 1 and 1/2, respectively, with all other elements set to zero (Henshall et al. 2014 ). In addition, also using a threshold of 0.98, the pool unordered genotype probability vector g kj was replaced with d kj . Secondly, an exclusion method was applied. Using this method, the number of genotype mismatches, computed as the number of SNPs where \({\boldsymbol{t}}^{{\boldsymbol{cj}}}{\boldsymbol{d}}^{{\boldsymbol{kj}}^{\prime} } = 0\) , was calculated for each pool and possible parental combination. The possible parental combination with the lowest proportion of mismatched SNPs (and lowest standard error of proportion, where multiple combinations with the same proportion were identified) was then identified. 
Finally, parental contributions to pools were estimated by solving the ‘weighted least squares problem’, as detailed in Henshall et al. ( 2014 ). The possible parental combination resulting in the minimum sum of squared difference from these estimated parental contributions was then assigned to the pool. Simple worked examples of the ‘quantitative ML’, ‘discrete ML’, ‘exclusion’ and ‘least squares’ methods of parentage assignment are provided in Supplementary Materials 1 .
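The exclusion count described above can be illustrated with the sketch below, which calls a discrete pool genotype at the 0.98 threshold and flags a mismatch wherever the called genotype has zero probability under the parent combination. Treating SNPs with no element above the threshold as uncalled, and skipping them, is an assumption of this example; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def call_discrete(g: Sequence[float], threshold: float = 0.98) -> Optional[List[float]]:
    """Return a 0/1 indicator vector if exactly one element exceeds the threshold, else None."""
    d = [1.0 if p > threshold else 0.0 for p in g]
    return d if sum(d) == 1.0 else None


def count_mismatches(
    t_by_snp: Sequence[Sequence[float]],
    g_by_snp: Sequence[Sequence[float]],
    threshold: float = 0.98,
) -> Tuple[int, int]:
    """Count SNPs at which the called pool genotype is impossible for the parent combination.

    Returns (number of mismatches, number of SNPs with a discrete call).
    """
    mismatches = 0
    called = 0
    for t, g in zip(t_by_snp, g_by_snp):
        d = call_discrete(g, threshold)
        if d is None:  # no confident call at this SNP
            continue
        called += 1
        if sum(ti * di for ti, di in zip(t, d)) == 0:
            mismatches += 1
    return mismatches, called
```

The proportion of mismatched SNPs for a combination is then the first returned value divided by the second.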

Simulations revealed that not only can the quantitative ML method be used to assign parentage to pools, but the quantitative ML method was more accurate than discrete ML and exclusion under every scenario examined and was generally more accurate than the least squares method (Tables 1 and 2 ; Figs. 3 and 4 ; Supplementary Materials 10 ). This indicates that the quantitative ML method may be adopted to reduce the cost and/or increase the accuracy of parentage assignment in selective breeding programmes and molecular ecology studies in many circumstances.

Fig. 3. Histograms of Δ LOD for pools of one ( a , b ), two ( c , d ) and three ( e , f ) individuals, generated using a pooling-for-individual-parentage-assignment approach. Results using quantitative maximum likelihood ( a , c and e ) and discrete maximum likelihood ( b , d and f ) parentage assignment are shown; correctly assigned individuals are in grey and incorrectly assigned individuals are represented as a black line. Where applicable, the critical Δ LOD value to achieve a 99% correct assignment rate is indicated by an arrow with the percentage of rejected assignments in parentheses.

Fig. 4. Histograms of Δ LOD for pools of one ( a , b ), two ( c , d ) and three ( e , f ) individuals, generated using a pooling-by-phenotype approach. Results using quantitative maximum likelihood ( a , c and e ) and discrete maximum likelihood ( b , d and f ) parentage assignment are shown; correctly assigned individuals are in grey and incorrectly assigned individuals are represented as a black line. Where applicable, the critical Δ LOD value to achieve a 99% correct assignment rate is indicated by an arrow with the percentage of rejected assignments in parentheses.

In simulations, quantitative ML assigned parentage for pools of one (i.e., assignment of parentage to individuals) without error under both the pooling-for-individual-parentage-assignment and pooling-by-phenotype scenarios (Tables 1 and 2 ; Figs. 3 and 4 ). For pools of two, correct assignment rates were very high (0.997 and 1.000, respectively), and for pools of three, the correct assignment rate was reduced to 0.917 under pooling-for-individual-parentage-assignment but remained high under the pooling-by-phenotype scenario (0.982). For pools of three, the correct assignment rate was notably poor for methods relying on discrete genotype assignments: 0.175 and 0.106 for discrete ML, and 0.048 and 0.157 for exclusion. However, discrete ML and exclusion are highly dependent on the accuracy of discrete genotype calls and our results, although indicating that quantitative ML is superior to these methods, are only relevant to the approach to genotype calling adopted in the simulations. Furthermore, using quantitative ML, it was necessary to reject a substantial percentage of assignments (31%; Fig. 3e ) to achieve a 99% correct assignment rate (critical Δ LOD = 4.96) for pools of three using a pooling-for-individual-parentage-assignment approach. This highlights the inherent limitations of using low-density SNP panels to assign parentage to large pools.

Quantitative ML can be used to assign parentage to pooled samples using low-density SNP data and simulations showed it to be more accurate in assigning parentage to pools than other approaches—discrete ML (Kalinowski et al. 2010 ; Marshall et al. 1998 ); exclusion (Chakraborty et al. 1974 ), and solving the weighted least squares problem (Henshall et al. 2014 ; Kinghorn et al. 2010 ). In addition, unlike exclusion and solving the weighted least squares problem, ML parentage assignment allows a desired level of confidence in assignment to be specified, by defining a critical Δ LOD value below which assignments are rejected (Figs. 3 c, e and 4e ).

Two circumstances where implementation of parentage assignment to pools is applicable were identified— pooling-for-individual-parentage-assignment and pooling-by-phenotype. In the case of pooling-for-individual-parentage assignment applied to selective breeding programmes, individuals in each pool are tagged or traceable and are from a known and mutually exclusive set of possible parents. This approach to pooling makes the construction of the additive relationship matrix for identifiable (e.g., tagged) individuals possible and is particularly suited to reconstructing full-sib pedigree from multiple-sire joinings in livestock (Henderson 1988 ), poly cross families in trees (Burdon and Shelbourne 1971 ) and aggregated full-sib families in aquaculture (Hamilton et al. 2009 ). However, it is also suitable for the pooling of samples from different rounds of selection or mating, different selective breeding populations (e.g., different hatcheries or seed orchards) or selective breeding programmes involving related species with common SNPs (Hamilton et al. 2019a ; Hamilton et al. 2019b ), where parents in each are mutually exclusive. Furthermore, it is conceivable that management of selective breeding populations could be altered to allow the adoption of pooling-for-individual-parentage-assignment. For example, in aquaculture selective breeding programmes where there are, in some circumstances, limited facilities to maintain families in separate tanks or hapas prior to tagging, families could be replicated across multiple tanks or hapas each containing multiple families in an orthogonal design so as to allow subsequent pooling-for-individual-parentage-assignment and the partitioning of common rearing environment and genetic effects in genetic analyses. 
In molecular ecology, pooling-for-individual-parentage assignment is applicable in circumstances where sets of mutually exclusive groups of parents can be identified (e.g., samples from different geographical areas, between which gene flow in a single generation is not possible).

In the second circumstance, pooling-by-phenotype, individuals are allocated to bins (i.e., classes) according to their phenotype, from which pools are then drawn. The primary limitation of pooling-by-phenotype is that genotypes are not assigned to tagged individuals, making the re-identification of candidate parents difficult. However, pooling-by-phenotype does not preclude the adoption of a modified ‘walk-back selection’ approach (Sonesson 2005 ), in which individuals with desirable phenotypes are tagged and individually assigned parentage. Furthermore, it can be adopted as a means of estimating additive co/variances in a cost-effective manner, to examine genotype-by-environment interaction and/or increase the accuracy of estimated breeding values for related individuals (Burdon 1977 ; Henderson and Quaas 1976 ). Estimating genetic additive co/variances using a pooling-by-phenotype approach is most simply achieved by generating dummy identifiers for individuals in pools, allowing an additive relationship matrix to be computed using established methods (Henderson 1975 ; Henderson and Quaas 1976 ). A further drawback of pooling-by-phenotype in circumstances where multiple traits are measured is that multiple-trait bins must be generated, each containing only those individuals with phenotypes within a specified phenotypic range for each trait (Bell et al. 2017 ). This potentially requires the range of phenotypic values in any one pool (i.e., bin width) for any one trait to be large, resulting in less precise phenotypes.

The ability to assign parentage using ML, whether applied to pools or individuals (Jones et al. 2010 ), is a function of the number of possible parents (Supplementary Materials 2 and 3 ) and the extent to which their contributions to progeny are known; the extent to which parents are genotyped (Supplementary Materials 4 ); the degree of relatedness among individuals (Supplementary Materials 5 ); the number (Supplementary Materials 3 and 4 ), information content, linkage disequilibrium (LD) and neutrality of SNPs (Holman et al. 2017 ); and the extent of SNP expression and genotyping accuracy/error (Anderson and Garza 2006 ; Liu et al. 2016 ; Weinman et al. 2015 ) (Supplementary Materials 5 and 6 ). Furthermore, in the application of quantitative ML to pools, variation in DNA contributions among individuals in pools (e.g., due to unequal tissue contributions and/or differences in DNA amplification) reduces the accuracy of genotyping and parentage assignment (Barratt et al. 2002 ; Bell et al. 2017 ; Kinghorn et al. 2010 ). Variation in DNA contributions can be minimised by pooling samples after DNA extraction, rather than pooling tissue samples prior to DNA extraction. However, pooling DNA rather than tissue substantially increases DNA extraction costs and, if the benefits of parentage assignment to pools are to be fully realised, pools must be constructed in a logistically sensible and cost-effective fashion. Furthermore, before application to pools in a new population or circumstance, quantitative ML should be validated by applying the method to pools of individuals with known pedigree using the SNPs, species, assay and platform to be applied.

Although the cost of pedigree assignment per individual can be reduced by increasing pool size, it must be recognised that, in the application of quantitative ML, this increases both the percentage of individuals with Δ LOD values below the critical Δ LOD (and thus the number of rejected assignments) and the computational burden. Quantitative ML applied to pools is computationally intensive for large pools, as the number of possible combinations of parents increases exponentially with pool size. It has been shown herein that the quantitative ML method can be practically implemented for pools of three: all simulations were conducted on a personal computer. However, access to high performance computing facilities is likely to be necessary if the method is to be applied to larger pool sizes.
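To give a rough sense of this growth, the sketch below counts candidate combinations under two simplifying assumptions that are not taken from the paper's implementation: under random pooling, a combination is an unordered selection (with repetition) of full-sib families; and when each pooled individual has a known dam, each has an independent set of candidate sires. It also reports the length of the ordered transmission vector for a pool of n individuals. All function names are hypothetical.

```python
from math import comb


def combinations_random_pooling(n_families: int, pool_size: int) -> int:
    """Unordered selections of pool_size families, repetition allowed (assumed model)."""
    return comb(n_families + pool_size - 1, pool_size)


def combinations_known_dams(sires_per_dam: int, pool_size: int) -> int:
    """One independent choice of sire per pooled individual (assumed model)."""
    return sires_per_dam ** pool_size


def ordered_vector_length(pool_size: int) -> int:
    """Length of the ordered transmission vector for 2n parents, i.e. 2**(2n)."""
    return 2 ** (2 * pool_size)
```

Under these assumptions, 40 families give 40, 820 and 11,480 candidate combinations for pools of one, two and three, and ten candidate sires per dam give 10, 100 and 1000.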

Genotype data from pools resemble polyploid data. In both cases, calling of discrete genotypes is prone to error due to the large number of possible genotype classes at each SNP (Clark et al. 2019 ; Rahman et al. 2015 ), making accurate parentage assignment difficult using discrete ML and exclusion methods (Flanagan and Jones 2019 ; Spielmann et al. 2015 ; Wang and Scribner 2014 ). Accordingly, generalisation of the quantitative ML approach to polyploids, with differing modes of inheritance (Clark et al. 2019 ), would likely improve the accuracy of ML parentage assignment and merits further investigation; quantitative genotypes have already been adopted to increase the power of genomic prediction (de Bem Oliveira et al. 2019 ) and genome-wide association studies (GWAS; Grandke et al. 2016 ) in polyploid populations. Modification of quantitative ML and pooling to circumstances where parental genotypes are largely unknown, such as kinship analysis (Hamilton et al. 2019a ; Hamilton et al. 2019b ), also warrants further development.

In conclusion, quantitative ML can be used to assign parentage to individuals and pools, to a desired level of confidence, using low-density SNP data. Moreover, parentage is assigned with greater accuracy using quantitative ML than by discrete ML (Kalinowski et al. 2010 ; Marshall et al. 1998 ); exclusion (Chakraborty et al. 1974 ), or solving the weighted least squares problem (Henshall et al. 2014 ; Kinghorn et al. 2010 ). The method is applicable to pools constructed using pooling-for-individual-parentage-assignment or pooling-by-phenotype approaches and has the potential to substantially reduce the cost of parentage assignment, even if applied to pools comprised of few individuals. However, before application in applied breeding programmes quantitative ML should be validated using pools of known pedigree and tissue/DNA contributions, using the SNPs, species, assay and platform to be applied; and the inherent limitations of using low-density SNP panels to assign parentage to large pools must be recognised. Generalisation of the quantitative ML approach to polyploids and kinship analysis applications warrants further investigation.

Data availability

An R package (R Core Team 2020 ) entitled ‘SNPpools’, available at https://github.com/mghamilton/SNPpools , was developed to implement and validate with simulations the quantitative ML method.

Anderson EC, Garza JC (2006) The power of single-nucleotide polymorphisms for large-scale parentage inference. Genetics 172(4):2567–2582

Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG (2002) Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann Hum Genet 66(5-6):393–405

Bell AM, Henshall JM, Porto-Neto LR, Dominik S, McCulloch R, Kijas J et al. (2017) Estimating the genetic merit of sires by using pooled DNA from progeny of undetermined pedigree. Genet Sel Evol 49:ARTN 28

Burdon R, Shelbourne C (1971) Breeding populations for recurrent selection: Conflicts and possible solutions. N Z J Sci 1:174–193

Burdon RD (1977) Genetic correlation as a concept for studying genotype-environment interaction in forest tree breeding. Silvae Genet 26(5/6):168–175

Chakraborty R, Shaw M, Schull WJ (1974) Exclusion of paternity: the current state of the art. Am J Hum Genet 26(4):477

Clark LV, Lipka AE, Sacks EJ (2019) polyRAD: genotype calling with uncertainty from sequencing data in polyploids and diploids. G3 9(3):663–673

Dai P, Kong J, Liu J, Lu X, Sui J, Meng X et al. (2020) Evaluation of the utility of genomic information to improve genetic evaluation of feed efficiency traits of the Pacific white shrimp Litopenaeus vannamei . Aquaculture 527:735421

de Bem Oliveira I, Resende Jr MFR, Ferrao LFV, Amadeu RR, Endelman JB, Kirst M et al. (2019) Genomic prediction of autotetraploids; influence of relationship matrices, allele dosage, and continuous genotyping calls in phenotype prediction. G3 9(4):1189–1198

Flanagan SP, Jones AG (2019) The future of parentage analysis: From microsatellites to SNPs and beyond. Mol Ecol 28(3):544–567

Grandke F, Singh P, Heuven HC, de Haan JR, Metzler D (2016) Advantages of continuous genotype values over genotype classes for GWAS in higher polyploids: a comparative study in hexaploid chrysanthemum. BMC Genom 17:672

Grattapaglia D, Diener PSD, dos Santos GA (2014) Performance of microsatellites for parentage assignment following mass controlled pollination in a clonal seed orchard of loblolly pine (Pinus taeda L.). Tree Genet Genomes 10(6):1631–1643

Hamilton MG, Kube PD, Elliott NG, McPherson LJ, Krsinich A (2009) Development of a breeding strategy for hybrid abalone. Proc Assoc Adv Anim Breed Genet 18:350–353

Hamilton MG, Mekkawy W, Benzie JAH (2019a) Sibship assignment to the founders of a Bangladeshi Catla catla breeding population. Genet Sel Evol 51(1):17

Hamilton MG, Mekkawy W, Kilian A, Benzie JAH (2019b) Single nucleotide polymorphisms (SNPs) reveal sibship among founders of a Bangladeshi rohu ( Labeo rohita ) breeding population. Front Genet. 10:597

Hansen OK, Kjaer ED (2006) Paternity analysis with microsatellites in a Danish Abies nordmanniana clonal seed orchard reveals dysfunctions. Can J For Res 36(4):1054–1058

Harrison HB, Saenz-Agudelo P, Planes S, Jones GP, Berumen ML (2013) On minimizing assignment errors and the trade-off between false positives and negatives in parentage analysis. Mol Ecol 22(23):5738–5742

Hauser L, Baird M, Hilborn R, Seeb LW, Seeb JE (2011) An empirical comparison of SNPs and microsatellites for parentage and kinship assignment in a wild sockeye salmon ( Oncorhynchus nerka ) population. Mol Ecol Resour 11(Suppl 1):150–161

Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31(2):423–447

Henderson CR (1988) Use of an average numerator relationship matrix for multiple-sire joining. J Anim Sci 66(7):1614–1621

Henderson CR, Quaas RL (1976) Multiple trait evaluation using relatives’ records. J Anim Sci 43(6):1188–1197

Henshall JM, Dierens L, Sellars MJ (2014) Quantitative analysis of low-density SNP data for parentage assignment and estimation of family contributions to pooled samples. Genet Sel Evol 46:ARTN 51

Henshall JM, Hawken RJ, Dominik S, Barendse W (2012) Estimating the effect of SNP genotype on quantitative traits from pooled DNA samples. Genet Sel Evol 44:ARTN 12

Holman LE, Onoufriou A, Hillestad B, Johnston IA (2017) A workflow used to design low density SNP panels for parentage assignment and traceability in aquaculture species and its validation in Atlantic salmon. Aquaculture 476:59–64

Jones AG, Small CM, Paczolt KA, Ratterman NL (2010) A practical guide to methods of parentage analysis. Mol Ecol Resour 10(1):6–30

Kalinowski ST, Taper ML, Marshall TC (2010) Corrigendum: revising how the computer program CERVUS accommodates genotyping error increases success in paternity assignment (vol 16, pg 1099 2007). Mol Ecol 19(7):1512–1512

Kinghorn BP, Bastiaansen JWM, Ciobanu DC, van der Steen HAM (2010) Quantitative genotyping to estimate genetic contributions to pooled samples and genetic merit of the contributing entities. Acta Agr Scand a 60(1):3–12

Liu S, Palti Y, Gao G, Rexroad CE (2016) Development and validation of a SNP panel for parentage assignment in rainbow trout. Aquaculture 452:178–182

Marshall TC, Slate J, Kruuk LEB, Pemberton JM (1998) Statistical confidence for likelihood-based paternity inference in natural populations. Mol Ecol 7(5):639–655

Meagher TR, Thompson E (1986) The relationship between single parent and parent pair genetic likelihoods in genealogy reconstruction. Theor Popul Biol 29(1):87–106

R Core Team (2020) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria

Rahman A, Hellicar A, Smith D, Henshall JM (2015) Allele frequency calibration for SNP based genotyping of DNA pools: A regression based local-global error fusion method. Comput Biol Med 61:48–55

Semagn K, Babu R, Hearne S, Olsen M (2014) Single nucleotide polymorphism genotyping using Kompetitive Allele Specific PCR (KASP): overview of the technology and its application in crop improvement. Mol Breed 33(1):1–14

Sonesson AK (2005) A combination of walk-back and optimum contribution selection in fish: a simulation study. Genet Sel Evol 37(6):587–599

Spielmann A, Harris SA, Boshier DH, Vinson CC (2015) orchard: paternity program for autotetraploid species. Mol Ecol Resour 15(4):915–920

Vandeputte M, Haffray P (2014) Parentage assignment with genomic markers: a major advance for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals. Front Genet 5:ARTN 432

Wang J, Scribner KT (2014) Parentage and sibship inference from markers in polyploids. Mol Ecol Resour 14(3):541–553

Weinman LR, Solomon JW, Rubenstein DR (2015) A comparison of single nucleotide polymorphism and microsatellite markers for analysis of parentage and kinship in a cooperatively breeding bird. Mol Ecol Resour 15(3):502–511

Acknowledgements

This work was supported by the CSIRO Agriculture and Food project ‘Genomics platforms to assist applied aquaculture breeding’ (AgSIP53). John Henshall shared his R scripts relating to quantitative analysis of low-density SNP data for parentage assignment and estimation of family contributions to pooled samples—code from these scripts was not used in the SNPpools package or for simulations but was used to further the author’s understanding of the methods presented in Henshall et al. ( 2014 ). Harry King, Peter Kube, James Kijas, Klara Verbyla, Sonja Dominik shared their insights into the potential application of SNP pooling in selective breeding programmes. James Kijas assisted with comments on draft versions of the manuscript. The CGIAR Research Program on Fish Agrifood Systems (FISH), led by WorldFish and supported by contributors to the CGIAR Trust Fund, financially supported completion of the manuscript subsequent to the author’s departure from CSIRO.

Author information

Authors and affiliations.

CSIRO Aquaculture, CSIRO Agriculture and Food, Castray Esplanade, Hobart, TAS, Australia

Matthew Gray Hamilton

WorldFish, Bayan Lepas, Penang, Malaysia

Corresponding author

Correspondence to Matthew Gray Hamilton .

Ethics declarations

Conflict of interest.

The author declares no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Associate editor: Jinliang Wang

Supplementary information

Supplementary Materials 1 (worked examples of parentage assignment methods), Supplementary Materials 2, Supplementary Materials 3, Supplementary Materials 4, Supplementary Materials 5, Supplementary Materials 6, Supplementary Materials 7, Supplementary Materials 8, Supplementary Materials 9, Supplementary Materials 10.

About this article

Cite this article.

Hamilton, M.G. Maximum likelihood parentage assignment using quantitative genotypes. Heredity 126 , 884–895 (2021). https://doi.org/10.1038/s41437-021-00421-0

Received : 11 June 2020

Revised : 22 February 2021

Accepted : 23 February 2021

Published : 10 March 2021

Issue Date : June 2021

DOI : https://doi.org/10.1038/s41437-021-00421-0

Using genomic relationship likelihood for parentage assignment

Affiliations.

  • 1 AquaGen AS, P.O. Box 1240, NO-7462, Trondheim, Norway. [email protected].
  • 2 Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432, Ås, Norway. [email protected].
  • 3 AquaGen AS, P.O. Box 1240, NO-7462, Trondheim, Norway.
  • 4 Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432, Ås, Norway.
  • PMID: 29776335
  • PMCID: PMC5960170
  • DOI: 10.1186/s12711-018-0397-7

Background: Parentage assignment is usually based on a limited number of unlinked, independent genomic markers (microsatellites, low-density single nucleotide polymorphisms (SNPs), etc.). Classical methods for parentage assignment are exclusion-based (i.e. based on loci that violate Mendelian inheritance) or likelihood-based, assuming independent inheritance of loci. For true parent-offspring relations, genotyping errors cause apparent violations of Mendelian inheritance. Thus, the maximum proportion of such violations must be determined, which is complicated by variable call- and genotype error rates among loci and individuals. Recently, genotyping using high-density SNP chips has become available at lower cost and is increasingly used in genetics research and breeding programs. However, dense SNPs are not independently inherited, violating the assumptions of the likelihood-based methods. Hence, parentage assignment usually assumes a maximum proportion of exclusions, or applies likelihood-based methods on a smaller subset of independent markers. Our aim was to develop a fast and accurate trio parentage assignment method for dense SNP data without prior genotyping error- or call rate knowledge among loci and individuals. This genomic relationship likelihood (GRL) method infers parentage by using genomic relationships, which are typically used in genomic prediction models.

Results: Using 50 simulated datasets with 53,427 to 55,517 SNPs, genotyping error rates of 1-3% and call rates of ~80 to 98%, GRL was found to be fast and highly (~99%) accurate for parentage assignment. An iterative approach was developed for training using the evaluation data, giving similar accuracy. For comparison, we used the Colony2 software, which assigns parentage and sibship simultaneously to increase the power of the likelihood-based method, and found that it has considerably lower accuracy than GRL. We also compared GRL with an exclusion-based method in which one of the parameters was estimated using GRL assignments. This method was slightly more accurate than GRL.

Conclusions: We show that GRL is a fast and accurate method of parentage assignment that can use dense, non-independent SNPs, with variable call rates and unknown genotyping error rates. By offering an alternative way of assigning parents, GRL is also suitable for estimating the expected proportion of inconsistent parent-offspring genotypes for exclusion-based models.
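The exclusion-based idea mentioned in the Background (rejecting a candidate parent when too many loci violate Mendelian inheritance) can be illustrated with a minimal sketch. It is not the GRL method itself. It assumes genotypes coded as 0/1/2 allele counts with -1 for missing calls, and counts opposing homozygotes between one candidate parent and one offspring. The function names and the `max_rate` value are hypothetical and chosen only for illustration.

```python
import numpy as np

def opposing_homozygote_rate(parent, offspring):
    """Fraction of jointly called loci where the two genotypes are opposite homozygotes.

    Genotypes are coded as 0, 1 or 2 copies of an allele; -1 marks a missing call.
    """
    parent = np.asarray(parent)
    offspring = np.asarray(offspring)
    called = (parent >= 0) & (offspring >= 0)
    # A 0/2 or 2/0 pair cannot occur between a true parent and offspring
    # unless there is a genotyping error.
    conflicts = called & (np.abs(parent - offspring) == 2)
    n_called = int(called.sum())
    return conflicts.sum() / n_called if n_called else float("nan")

def retain_candidates(offspring, candidates, max_rate):
    """Names of candidate parents whose conflict rate does not exceed max_rate."""
    return [name for name, genotype in candidates.items()
            if opposing_homozygote_rate(genotype, offspring) <= max_rate]

offspring = [0, 1, 2, 2, -1, 0]
candidates = {
    "sire_A": [0, 1, 2, 1, 2, 1],  # no opposing homozygotes
    "sire_B": [2, 1, 0, 0, 2, 2],  # opposing homozygotes at 4 of 5 called loci
}
# max_rate is the tolerated proportion of violations (illustrative value).
retained = retain_candidates(offspring, candidates, max_rate=0.02)
```

The difficulty the abstract points to is visible in `max_rate`: the tolerated proportion has to be chosen in advance, and a suitable value depends on error and call rates that vary among loci and individuals.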

Publication types

  • Research Support, Non-U.S. Gov't
  • Computational Biology / methods*
  • Computer Simulation
  • Databases, Genetic
  • Genotyping Techniques / veterinary*
  • Likelihood Functions
  • Polymorphism, Single Nucleotide*

Grants and funding

  • 251664/Norges Forskningsråd/International
  • 245519/Norges Forskningsråd/International
  • 6141/AquaGen AS/International


International Conference on Advanced Unmanned Aerial Systems

ICAUAS 2023: Advances and Challenges in Advanced Unmanned Aerial Systems pp 123–131

A Multi-stage Target Assignment Method Based on Improved Genetic Algorithm

  • Tianyan Zhou,
  • Ruoming An,
  • Changsheng Gao &
  • Yuqing Li
  • First Online: 01 February 2024


Part of the Springer Aerospace Technology book series (SAT)

Aiming at the different importance of target assignment tasks in each stage of multi-stage air attack operations, an improved genetic algorithm is designed. The decimal one-dimensional multi-segment chromosome is used to represent the multi-stage target assignment scheme. By setting the differential crossover and mutation probability for each segment of the chromosome, the opportunity to find the optimal assignment scheme in the operation stage with higher assignment quality requirements is improved. At the same time, the oscillation of the assignment scheme in the secondary operation stage with lower assignment quality requirements is reduced, and the performance of the target assignment algorithm is effectively improved. Finally, the simulation example shows that the improved genetic algorithm can better solve the problem of target assignment in multi-stage air strike operations.

  • Multi-stage target assignment
  • Genetic algorithm
  • Dynamic target assignment
  • Crossover and mutation
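The segment-wise operators described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: each gene is taken to hold the decimal index of the target assigned to one weapon, each stage occupies one contiguous segment of the chromosome, and the probability values, function names, and problem size are hypothetical.

```python
import random

def segment_crossover(parent_a, parent_b, segments, p_cross, rng):
    """Single-point crossover applied independently inside each segment."""
    child_a, child_b = list(parent_a), list(parent_b)
    for (start, end), p in zip(segments, p_cross):
        if end - start > 1 and rng.random() < p:
            cut = rng.randrange(start + 1, end)
            # Swap the tails of this segment only; other stages are untouched.
            child_a[cut:end], child_b[cut:end] = child_b[cut:end], child_a[cut:end]
    return child_a, child_b

def segment_mutation(chromosome, segments, p_mut, n_targets, rng):
    """Reassign each gene to a random target with its segment's mutation probability."""
    mutated = list(chromosome)
    for (start, end), p in zip(segments, p_mut):
        for i in range(start, end):
            if rng.random() < p:
                mutated[i] = rng.randrange(n_targets)
    return mutated

# Two stages with three weapons each and four possible targets (all assumed).
segments = [(0, 3), (3, 6)]
p_cross = [0.9, 0.3]    # assumed: larger probability for the first stage
p_mut = [0.2, 0.02]     # assumed: smaller probability for the second stage
rng = random.Random(0)
parent_a = [0, 1, 2, 3, 0, 1]
parent_b = [3, 2, 1, 0, 1, 2]
child_a, child_b = segment_crossover(parent_a, parent_b, segments, p_cross, rng)
child_a = segment_mutation(child_a, segments, p_mut, n_targets=4, rng=rng)
```

Giving the first segment larger probabilities than the second is one plausible reading of the abstract: it searches harder in the stage with higher quality requirements and keeps the secondary stage more stable. The abstract does not report the actual values.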



Author information

Authors and affiliations.

School of Astronautics, Harbin Institute of Technology, Harbin, China

Tianyan Zhou, Ruoming An, Changsheng Gao & Yuqing Li


Corresponding author

Correspondence to Yuqing Li .

Editor information

Editors and affiliations.

International Center for Applied Mechanics, Xi’an Jiaotong University, Xi’an, Shaanxi, China

School of Aerospace Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, China

School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, China

Xiaodong He

Space Engineering Design Laboratory, York University, Toronto, ON, Canada

Zhenghong Zhu


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter.

Zhou, T., An, R., Gao, C., Li, Y. (2024). A Multi-stage Target Assignment Method Based on Improved Genetic Algorithm. In: Liu, Z., Li, R., He, X., Zhu, Z. (eds) Advances and Challenges in Advanced Unmanned Aerial Systems. ICAUAS 2023. Springer Aerospace Technology. Springer, Singapore. https://doi.org/10.1007/978-981-99-8045-1_10


DOI : https://doi.org/10.1007/978-981-99-8045-1_10

Published : 01 February 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-99-8044-4

Online ISBN : 978-981-99-8045-1

eBook Packages : Intelligent Technologies and Robotics (R0)


Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations.

Roseann E. Peterson

1. Virginia Institute for Psychiatric and Behavioral Genetics, Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, 23298, USA.

Karoline Kuchenbaecker

2. Division of Psychiatry, University College of London, London W1T 7NF, UK; UCL Genetics Institute, University College London, London WC1E 6BT, UK.

Raymond K. Walters

3. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA 02114 USA; Department of Medicine, Harvard Medical School, Boston MA 02115 USA.

Chia-Yen Chen

Alice B. Popejoy

4. Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA 94305, USA.

Sathish Periyasamy

5. Queensland Brain Institute, The University of Queensland, Brisbane, Queensland 4072, Australia; Queensland Centre for Mental Health Research, The University of Queensland, Brisbane, Queensland 4072, Australia.

Conrad Iyegbe

6. Department of Psychosis Studies, Institute of Psychiatry, London, SE5 8AF, United Kingdom.

Rona J. Strawbridge

7. Institute of Health and Wellbeing, University of Glasgow, Glasgow, G12 8RZ, UK; Department of Medicine Solna, Karolinska Institute, Stockholm, SE17176, Sweden.

Leslie Brick

8. Department of Psychiatry and Human Behavior, Warren Alpert Medical School of Brown University, Providence, RI 02906, USA.

Caitlin Carey

Alicia Martin, Jacquelyn L. Meyers

9. Department of Psychiatry, State University of New York Downstate Medical Center, Brooklyn, NY 11203, USA.

10. Department of Psychology, Arizona State University, Tempe, AZ, USA, 85281.

Junfang Chen

11. Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, 68159, Germany.

Alexis C. Edwards

Allan Kalungi

12. Mental Health Project of MRC/UVRI and LSHTM Uganda Research Unit, P.O Box 49, Entebbe Uganda.

Nastassja Koen

13. Department of Psychiatry and Mental Health, University of Cape Town, Cape Town, 7925, South Africa; South African Medical Research Council Unit on Risk and Resilience in Mental Disorders, Cape Town, South Africa; Global Initiative for Neuropsychiatric Genetics Education in Research, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.

Lerato Majara

14. MRC Human Genetics Research Unit, Division of Human Genetics, Department of Pathology, Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, 7925, South Africa; Global Initiative for Neuropsychiatric Genetics Education in Research, Harvard T.H. Chan School of Public Health, Department of Epidemiology, Boston, MA, 02115, USA.

Emanuel Schwarz

Jordan Smoller

15. Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA.

Eli A. Stahl

16. Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Patrick Sullivan

17. Karolinska Institutet, Medical Epidemiology and Biostatistics, Stockholm, SE; University of North Carolina at Chapel Hill, Genetics and Psychiatry, Chapel Hill, NC, USA.

Evangelos Vassos

18. Social, Genetic & Developmental Psychiatry Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, SE5 8AF, UK.

Bryan Mowry

Miguel Prieto

19. Department of Psychiatry, Faculty of Medicine, Universidad de los Andes, Santiago 7620001, Chile; Mental Health Service, Clínica Universidad de los Andes, Santiago 7620001, Chile.

Alfredo Cuellar-Barboza

20. Department of Psychiatry, University Hospital and School of Medicine, Universidad Autonoma de Nuevo Leon, Monterrey, Mexico; Department of Psychiatry and Psychology, Mayo Clinic, Rochester, MN, USA.

Tim B. Bigdeli

Howard J. Edenberg

21. Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA.

Hailiang Huang

Laramie E. Duncan

22. Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA 94305, USA.


Genome-wide association studies (GWAS) have focused primarily on populations of European descent, but it is essential that diverse populations become better represented. Increasing diversity among study participants will advance our understanding of genetic architecture in all populations and ensure that genetic research is broadly applicable. To facilitate and promote research in multi-ancestry and admixed cohorts, we outline key methodological considerations and highlight opportunities, challenges, solutions, and areas in need of development. Despite the perception that analyzing genetic data from diverse populations is difficult, it is scientifically and ethically imperative, and there is an expanding analytical toolbox to do it well.

A disproportionate majority (>78%) of participants in published genome-wide association studies ( GWAS ) are of European descent ( Popejoy and Fullerton, 2016 ; Sirugo et al., 2019 ), with 71.8% of these individuals having been recruited from just three countries: the United States, the United Kingdom, and Iceland ( Mills and Rahal, 2019 ). Studies of major psychiatric disorders are no exception, having focused largely on populations of European ancestry ( Figure 1 ). Conducting GWAS in individuals of European ancestry was a practical starting point given the availability of samples and limited funding, genotyping technologies, and analytic methods. However, there is now widespread acknowledgement of the need for more diverse samples and for improved analytic methods. Broadening diversity of studied populations will improve the effectiveness of genomic medicine by expanding the scope of known human genomic variation and bolstering our understanding of disease etiology. Consensus in the field points to many benefits of increased representation of more diverse populations for locus discovery, fine-mapping, polygenic risk scores, and addressing existing health disparities ( Duncan et al., 2018 ; Hindorff et al., 2018a ; Lam et al., 2018 ; Martin et al., 2019 ; Walters et al., 2018 ).

Figure 1. Participant numbers were extracted from the largest consortium publication(s) for each psychiatric disorder and are shown as fractions of the total sample size for each disorder. Note: Sample sizes are given in parentheses. Numbers reflect cases and controls combined. MD=major depression (490,999), SCZ=schizophrenia (205,661), PTSD=post-traumatic stress disorder (188,932), BIP=bipolar disorder (51,710), ADHD=attention deficit hyperactivity disorder (55,230), AUT=autism (46,350), AD=alcohol dependence (52,848), AN=anorexia (14,477). *For schizophrenia, the African American samples from an earlier publication (2009, International Schizophrenia Consortium) were not included in the most recent PGC schizophrenia publication (2014). Ancestry information for each participant was based on principal components analysis of genetic data. See Supplemental Table S1 for consortium studies and references.

With increasing representation of global populations in GWAS, there is an opportunity for advanced methods development and a need for consensus “best practices” for analyzing the emerging complex datasets. Here, we provide background on the scientific and ethical importance of including underrepresented groups in genetics research and offer guidance for whole-genome analysis of ancestrally diverse study cohorts. We summarize currently available resources and make recommendations for avoiding practices that could lead to false-positives, loss of statistical power, or misinterpretation of results. Because this primer represents a collaborative product of the Cross-Population Special Interest Group of the Psychiatric Genomics Consortium (PGC) ( https://www.med.unc.edu/pgc/cross-population/ ), we have framed our discussion within the context of psychiatric genetics. Nevertheless, the points and recommendations outlined herein are applicable to any complex biomedical phenotype.

Genetic ancestry is estimated from DNA and provides information about shared demographic history at the population level. Individuals with similar ancestral origins have shared genomic signatures due to migration of common ancestors, mutations and recombination, genetic drift, and natural selection. These processes yield differences in allele frequencies and linkage disequilibrium (LD) patterns across populations (Barrett and Cardon, 2006; International HapMap Consortium, 2005) that must be properly addressed to avoid false positive genetic findings. In addition to ancestral diversity, the current lack of racial and ethnic diversity, which are related to but distinct from ancestry (see Box 1), hinders the development of more complete etiological models (Banda et al., 2015; Medina-Gomez et al., 2015; Race, Ethnicity, and Genetics Working Group, 2005). In complex disease research, race and ethnicity can provide information about social, cultural, and environmental factors that affect risk for disease, including having a lived experience of social injustice. Given that these socio-cultural measures are often inappropriately used as a proxy for genetic ancestry, researchers and clinicians should be careful to distinguish among them in order to tease apart specific biological, environmental, and social determinants of health.

Box 1. Race, Ethnicity, and Ancestry: Interpretation and Relevance for Genetic Diversity.

Inclusion of diverse study participants in genomics research has yielded important scientific insights for a range of human traits and diseases. The resolution of fine-mapping improves through cross-ancestry analysis ( Wojcik et al., 2019 ). Estimates of effect-sizes derived from cohorts of diverse ancestries tend to be more accurate than from those of a single ancestry ( Li and Keating, 2014 ). Genetic risk prediction attenuates with increasing divergence between the discovery and target populations, indicating that polygenic risk scores ( PRS ) based on Eurocentric GWAS are not equally predictive when applied to non-European populations ( Duncan et al., 2018 ; Martin et al., 2019 ). Conversely, constructing individual-level scores from cross-ancestry meta-analysis results improves overall prediction ( Grinde et al., 2019 ; Márquez-Luna et al., 2017 ).

Besides the strong scientific justifications for broader inclusion, there are important ethical, legal, and public health reasons for bolstering diversity in genomics ( Hindorff et al., 2018b ). Understanding how genetic risk and social inequities interact to influence disparities in disease risk and outcomes will be critical to improving public health.

Moreover, while integration of genomics into healthcare has the potential to improve disease prediction and optimize treatments, a lack of diversity will limit the utility of precision medicine efforts: individuals of non-European descent are more likely to receive ambiguous test results from genetic screening (e.g., variants of unknown or uncertain significance) ( Petrovski and Goldstein, 2016 ) and false positive diagnoses ( Manrai et al., 2016 ). There is also a higher chance of false negative diagnoses in individuals from ancestral backgrounds that are not well represented in clinical databases, due to missing information about additional disease-causing variants currently not on testing panels ( Minster et al., 2016 ; Moltke et al., 2014 ; Wheeler et al., 2017 ). Similarly, the potential benefits of pharmacogenetics cannot be fully realized until there is equitable representation across ancestries, as some therapeutics may be more effective and/or safer in certain populations because of differences in allele frequency, effect size, and penetrance of variants associated with drug metabolism ( Roden et al., 2011 ). Here, we provide an accessible framework for analyzing these data, while acknowledging that there are several important methodological areas in need of further development. Key terminology is bolded and defined in Box 2 .

Box 2. General Terminology.

Methodological Considerations

In the analysis of multi-ancestry datasets, a significant concern is false positive genetic signals due to inflated test statistics from population stratification, which occurs when disease prevalence and allelic frequency differences are correlated within or between study cohorts (Marchini et al., 2004). Two typical strategies exist for addressing this challenge while analyzing samples from multiple major/admixed populations: (1) empirically assign samples to major continental and/or admixed populations using genome-wide data, analyze each population separately, and conduct cross-ancestry meta-analysis (stratified meta-analysis approach), and (2) analyze samples from multiple populations together, most commonly with a mixed model (joint mixed model approach). The choice between these approaches is perhaps the most broadly impactful decision currently facing analysts of genome-wide data from multiple populations, since it affects methodological considerations in all analysis steps, from quality control, to reference alignment in imputation, to the association model, to the suitability of results for secondary analyses. We highlight elements of GWAS where the choice between the stratified meta-analysis and joint mixed model approaches is particularly salient. Figure 2 shows a general workflow for each approach.

Figure 2. This flowchart depicts the general analysis framework for genome-wide association studies of participants with diverse ancestral backgrounds. Note: boxes with red headers indicate analyses done in samples with diverse ancestral backgrounds and blue denotes analysis done within samples in major population groups. The left path shows a strategy for the stratified meta-analysis approach and the right path shows steps for the joint mixed model approach [see Supplemental Table 2 for more detailed quality control (QC) considerations].

Genotyping Technologies

Most genome-wide DNA microarrays were designed for individuals of European ancestry. The differences in LD structure and allele frequency among populations can lead to significantly worse coverage for other ancestry groups. For example, at imputation accuracy r² > 0.8, the Affymetrix UK Biobank array covers 84% of the variants that have minor allele frequencies (MAF) > 1% in samples of European ancestry but only 46% of those for samples of African ancestry (Nelson et al., 2017). The large genetic diversity in African populations means that a larger number of variants are needed on arrays in order to provide similar coverage as in other populations (Barrett and Cardon, 2006). To address this issue, some groups, such as China Kadoorie Biobank (Chen et al., 2011), have designed population-specific arrays. Multi-ancestry arrays, such as the Multi-Ethnic Global Array (MEGA), Global Screening Array (GSA), and the H3Africa array (Mulder et al., 2018), were designed based on panels with more diverse ancestries, and are therefore recommended. An alternative strategy is to sequence whole genomes; low-depth sequencing has received recent attention for application in diverse samples due to cost-effectiveness and higher coverage with acceptable error rates (Gilly et al., 2018; Peterson et al., 2017a; see Rare Variants).

Quality Control

Quality control (QC) of GWAS data aims to remove low quality data and technical artifacts in order to reduce the risk of false positive associations. In diverse ancestry cohorts, the main issue is that many common QC criteria assume the sample comes from a homogeneous population. Applying standard QC procedures without adjustment for population structure leads to the erroneous removal of too many variants and samples from minority subgroups and admixed samples, reducing statistical power.

QC criteria that are dependent on population allele frequencies can generally be adapted for application in diverse cohorts by either stratifying the cohort into major populations prior to filtering (the stratified meta-analysis approach) or by adjusting the QC measure to allow for varying allele frequencies (the joint mixed model approach; see Figure 2 ). For example, individuals are often removed based on excess autosomal heterozygosity, as a potential indication of sample contamination, but the standard heterozygosity statistic assumes each variant’s expected allele frequency is constant across individuals. In diverse cohorts, regressing this heterozygosity statistic on principal components prior to identifying outliers can avoid excessive exclusions of individuals from subgroups in the cohort. Step-by-step considerations for common QC criteria, including sample QC workflows for the stratified meta-analysis and joint mixed model approaches, are given in Supplemental Methods I (see also Supplemental Table S2 , Supplemental Figure 1 ). In addition to these pre-imputation QC steps, post-imputation QC steps should also consider ancestry (see Imputation ).
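The heterozygosity adjustment described above can be sketched in a few lines. This is a minimal illustration rather than the procedure from the Supplemental Methods: it assumes a per-sample heterozygosity statistic and principal components have already been computed, fits an ordinary least-squares regression, and flags samples whose residual lies more than `n_sd` standard deviations from the mean. The function name, the threshold, and the simulated data are all assumptions.

```python
import numpy as np

def heterozygosity_outliers(het, pcs, n_sd=3.0):
    """Flag samples whose heterozygosity is extreme after regressing out the PCs."""
    het = np.asarray(het, dtype=float)
    pcs = np.asarray(pcs, dtype=float)
    design = np.column_stack([np.ones(len(het)), pcs])  # intercept plus PCs
    coef, *_ = np.linalg.lstsq(design, het, rcond=None)
    resid = het - design @ coef
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.abs(z) > n_sd, resid

# Simulated cohort: two ancestry clusters whose heterozygosity differs with PC1.
rng = np.random.default_rng(0)
pc1 = np.concatenate([rng.normal(-1.0, 0.1, 150), rng.normal(1.5, 0.1, 50)])
het = 0.30 + 0.02 * pc1 + rng.normal(0.0, 0.002, 200)
het[0] += 0.03  # one sample with excess heterozygosity, e.g. contamination
flags, resid = heterozygosity_outliers(het, pc1[:, None])
```

In this toy cohort an unadjusted cutoff would be driven largely by the difference between the two clusters, whereas the residual-based cutoff isolates the single contaminated sample.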

Inferring Population Structure

Estimating the genetic population structure of a cohort typically serves two primary goals in GWAS: 1) to characterize the ancestral diversity of the cohort as a descriptive measure and 2) to provide a quantitative estimate of population structure that can be used in QC and in GWAS association models to reduce the risk of false positives. We focus here on use for description and QC, and later discuss methods for controlling for population structure (see Genome-wide Association ).

For cohorts with diverse ancestral backgrounds, we can estimate population structure based on genome-wide data. Currently the most common tool for estimating continuous population structure is principal component analysis (PCA); a listing of other approaches is included in Supplemental Methods II. PCA is a statistical method for reducing the complexity of high-dimensional data (e.g., thousands of measured variants across the genome) into orthogonal axes (principal components, PCs) that explain the largest fraction of variability in the data. The spread of data across these axes provides a visual guide to sub-structure among samples; when data points are estimated from each individual’s genetic markers, the PCs illustrate population structure. These PCs can be computed within the cohort, or can be estimated from an external reference (e.g., the 1000 Genomes Project (1KGP); Sudmant et al., 2015) and the GWAS sample can be projected onto the PC axes to allow comparison with the ancestries of known reference populations (Peterson et al., 2017b). However, the latter approach can be limited by the number and diversity of populations represented on the reference panel, highlighting the need for many additional diverse population references to be generated. PCs may also be used to control for ancestry structure in other QC metrics (see Quality Control and Supplemental Table S2).
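The reference-based PCA described in this paragraph can be sketched as follows. This is a minimal illustration on simulated genotypes, not a production pipeline: it assumes 0/1/2 genotype coding, standardizes each variant by its reference allele frequency (a common convention, assumed here), and obtains the axes from a singular value decomposition. The function names and simulated populations are hypothetical.

```python
import numpy as np

def fit_reference_pca(ref_genotypes, n_components=2):
    """Estimate PC axes from a reference genotype matrix (samples x variants)."""
    G = np.asarray(ref_genotypes, dtype=float)
    mean = G.mean(axis=0)
    p = mean / 2.0                            # reference allele frequency
    scale = np.sqrt(2.0 * p * (1.0 - p))
    keep = scale > 0                          # drop monomorphic variants
    Z = (G[:, keep] - mean[keep]) / scale[keep]
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return {"keep": keep, "mean": mean[keep], "scale": scale[keep],
            "loadings": vt[:n_components].T}

def project(genotypes, model):
    """Project samples onto the reference PC axes using reference frequencies."""
    G = np.asarray(genotypes, dtype=float)[:, model["keep"]]
    return ((G - model["mean"]) / model["scale"]) @ model["loadings"]

# Two simulated reference populations with diverged allele frequencies.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=500), 0.05, 0.95)
ref = np.vstack([rng.binomial(2, freq_a, size=(50, 500)),
                 rng.binomial(2, freq_b, size=(50, 500))])
model = fit_reference_pca(ref, n_components=2)
ref_scores = project(ref, model)
# Study samples drawn from population A fall on the same side of PC1.
study = rng.binomial(2, freq_a, size=(10, 500))
study_scores = project(study, model)
```

Projection reuses the reference means and scales, so study samples are placed on axes defined entirely by the reference panel. This is also why a panel lacking a relevant population cannot position such samples meaningfully.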

This sample-wide estimation and visualization of genetic ancestry can be used to empirically assign genetically similar samples into more homogeneous groups. This assignment is necessary for the stratified meta-analysis approach to GWAS of diverse cohorts, and is intended to reduce the risk of false positive genetic signals due to inflated test statistics from population stratification. Assigning samples to more homogeneous groups for analysis reduces stratification by limiting the degree of population structure remaining in the sample. Samples with a specific admixture can be assigned into their own major ancestral group, instead of being excluded from the analysis or forced into other ancestry groups, provided there are adequate numbers of individuals in the sample with comparable admixed backgrounds. However, it is often the case that genomic outliers (which tend to be from under-represented or admixed backgrounds) might need to be excluded if there is an insufficient number of other individuals who fall into a similar cluster. These assignment methods will not provide, and are not intended to provide, detailed ancestral background information for each individual. Rather, they provide a working solution to reduce false positives due to population stratification (Hellwege et al., 2017). We stress that sample group assignment and identifying appropriate reference population panels can be difficult, particularly for admixed ancestry, thus requiring careful inspection of data and methods (Medina-Gomez et al., 2015).
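One simple way to make such an empirical assignment concrete is a nearest-centroid rule in PC space with a distance cutoff. The text does not prescribe a specific rule, so this is an assumption made for illustration, as are the function name, the `max_dist` value, and the toy coordinates. Samples farther than the cutoff from every reference centroid are left unassigned, mirroring the genomic outliers discussed above.

```python
import numpy as np

def assign_groups(sample_pcs, ref_pcs, ref_labels, max_dist):
    """Assign each sample to the nearest reference centroid, or None if too far."""
    sample_pcs = np.asarray(sample_pcs, dtype=float)
    ref_pcs = np.asarray(ref_pcs, dtype=float)
    ref_labels = np.asarray(ref_labels)
    groups = sorted(set(ref_labels.tolist()))
    centroids = np.array([ref_pcs[ref_labels == g].mean(axis=0) for g in groups])
    # Euclidean distance from every sample to every group centroid.
    dists = np.linalg.norm(sample_pcs[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return [groups[j] if dists[i, j] <= max_dist else None
            for i, j in enumerate(nearest)]

ref_pcs = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.1],
           [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]
ref_labels = ["A", "A", "A", "B", "B", "B"]
samples = [[0.1, 0.0], [5.0, 5.1], [2.5, 2.5]]
assigned = assign_groups(samples, ref_pcs, ref_labels, max_dist=1.0)
```

The third sample lies between the two clusters and is left unassigned. In practice the cutoff and the resulting groups would need visual inspection, as the text stresses for admixed samples.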

Imputation and Population Reference Panels

GWAS arrays genotype a portion of common variation. Genotype imputation is a cost-effective computational approach for inferring genotypes or genotype probabilities at variants that have not been directly genotyped on GWAS arrays, based on comparisons to genetic data from external reference samples. Imputation increases the number of markers available for association testing and can harmonize cohorts genotyped on different arrays for meta-analysis.

Imputation accuracy relies on having an appropriate reference panel that includes haplotypes from the population studied. Matching alleles and allele frequencies in the study cohort with reference panels as part of pre-imputation QC also relies on using reference data from a matched ancestral background. Reference panels with better coverage of haplotypes from the population of the genotyped cohort will yield a greater number of well-imputed variants for GWAS, especially among lower frequency variants ( Ahmad et al., 2017 ; Howie et al., 2012 ). Table 1 lists major imputation panels that are currently publicly available. We note that although many ongoing projects are aiming at more diverse populations ( Supplemental Table S3 ), additional efforts in more populations are needed to expand the diversity of imputation reference panels ( Kelleher et al., 2018 ).

Current imputation methods are summarized in Supplemental Methods III . Joint imputation using the largest applicable reference panel is expected to perform at least as well as subsetting that reference panel to match the target population ( Ahmad et al., 2017 ; Howie et al., 2012 ), possibly due to maintaining a larger sample size for phasing. Use of the same reference panel for all cohorts also avoids potential confounding with varying imputation quality. However, it may be necessary to consider imputation quality separately within subsets of individuals even if the samples are jointly imputed since imputation accuracy for a variant may vary widely across individuals of different ancestries.

Genome-wide Association

The core of GWAS analysis is testing the association between each variant and a target phenotype. As noted, a primary consideration for association testing in diverse cohorts is whether to stratify samples into major population groups or to analyze the full cohort jointly (assuming imputation was also done jointly). In either case, the major concern is proper control of population stratification to ensure that observed associations reflect genetic effects of each locus rather than correlations with ancestry.

Joint analysis using a mixed model approach is attractive because all participants are included irrespective of ancestry. Ideally, mixed model approaches control for population stratification by modelling distant relatedness between individuals due to ancestry ( Sul et al., 2018 ; Wojcik et al., 2019 ). Several implementations exist and some are listed in Supplementary Methods Section IV and Supplementary Table S4 . Mixed models may yield greater statistical power, both through increased sample size and by controlling for the variance explained by the genetic relatedness between individuals (i.e., a random effects component; ( Loh et al., 2018 )). However, there is evidence that basic mixed models may not fully control for population structure in diverse cohorts, especially if there is an environmental component to phenotypic associations with ancestry beyond the modelled genetic relatedness ( Conomos et al., 2018 ; Heckerman et al., 2016 ; Zhang and Pan, 2015 ). Non-genetic factors such as environmental exposures may be correlated with ancestry due to a shared local environment (familial or community effects) or due to the relationship between ancestry and socio-cultural factors such as race and ethnicity. More methodological development is needed before mixed models or other strategies for joint GWAS of a diverse cohort can be confidently recommended as robust.
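The genetic relatedness that enters a mixed model as the covariance of the random effect is commonly summarized in a genomic relationship matrix (GRM). The text does not give a formula, so the sketch below uses one widely used standardized form as an assumption: each variant is centered and scaled by its sample allele frequency, and relationships are averaged over variants. The function name and simulated genotypes are hypothetical.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """GRM from a samples x variants matrix of 0/1/2 allele counts."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                  # sample allele frequencies
    keep = (p > 0) & (p < 1)                  # drop monomorphic variants
    Z = (G[:, keep] - 2.0 * p[keep]) / np.sqrt(2.0 * p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / keep.sum()

rng = np.random.default_rng(3)
genotypes = rng.binomial(2, rng.uniform(0.1, 0.9, size=300), size=(20, 300))
grm = genomic_relationship_matrix(genotypes)
```

Because the allele frequencies are pooled across the whole cohort, entries of this matrix reflect ancestry differences as well as recent relatedness in a diverse sample. That is how the mixed model absorbs population structure, and it is also why environmental effects correlated with ancestry may not be fully captured.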

When stratifying by population background, covariates such as PCs should still be used to correct for population stratification. Conventional linear or logistic regression with these covariates can be used for association testing as long as QC included exclusion of related individuals; mixed models or other alternatives with PC covariates may be applied in family-based samples stratified by ancestry (Walters et al., 2018). Computing these PCs separately within each ancestry subset instead of the full study ensures better control for residual structure specific to that subset (e.g., fine structure, genotyping/technical artifacts), but at the cost of potentially reduced control for stratification related to population structure shared across subsets (Patterson et al., 2006). For analyses of admixed or multi-ancestry cohorts, PCs may still be included in the regression but additional covariates may be required to control for stratification that is not linear in PCA space (Conomos et al., 2018; Heckerman et al., 2016; Zhang and Pan, 2015). For example, race and ethnicity are often correlated with socio-economic status and other environmental risk factors for disease. Self-reported ethnicity or other variables that capture trait heterogeneity on the basis of socio-cultural factors may also be appropriate to consider as covariates in those instances (Banda et al., 2015; Medina-Gomez et al., 2015). Directly controlling for local ancestry tracts in variant-level association analyses may further improve power and reduce false positives in admixed samples (Li and Keating, 2014).
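As a minimal sketch of why PC covariates matter, the simulation below tests a variant that has no true effect but differs in frequency between two groups that also differ in mean phenotype. Everything here is an illustrative assumption rather than a recommended pipeline: the function name `ols_effect`, the simulated allele frequencies, and the use of a single noisy ancestry axis `pc1` as a stand-in for PCs that would normally be computed from genome-wide genotypes.

```python
import numpy as np

def ols_effect(genotype, phenotype, covariates=None):
    """Return (beta, standard error) for the genotype term of a linear regression."""
    n = len(phenotype)
    columns = [np.ones(n), genotype]
    if covariates is not None:
        columns.append(covariates)  # 1-D (n,) or 2-D (n, k) covariate array
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])

# Simulated stratified cohort (all values hypothetical).
rng = np.random.default_rng(0)
n = 2000
pop = rng.integers(0, 2, size=n)             # two ancestry groups
freq = np.where(pop == 1, 0.6, 0.1)          # allele frequency differs by group
genotype = rng.binomial(2, freq).astype(float)
phenotype = 1.0 * pop + rng.normal(size=n)   # group effect only, no variant effect
pc1 = pop + rng.normal(scale=0.05, size=n)   # stand-in for a leading ancestry PC

beta_raw, se_raw = ols_effect(genotype, phenotype)
beta_adj, se_adj = ols_effect(genotype, phenotype, pc1)
print(f"unadjusted beta = {beta_raw:.3f} (SE {se_raw:.3f})")
print(f"PC-adjusted beta = {beta_adj:.3f} (SE {se_adj:.3f})")
```

In this toy setting the unadjusted estimate reflects the correlation between genotype and group membership, and including the ancestry axis removes most of it. Real cohorts with continuous admixture or non-linear structure may need the additional covariates discussed above.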

The meta-analysis approach, combining separate analyses of samples stratified by similar genetic background, currently has several pragmatic advantages. First, computational pipelines developed for single-ancestry analyses can be used for each cohort. Separate analysis also naturally provides ancestry-specific results, which may be valuable for secondary analyses including PRS ( Bulik-Sullivan et al., 2015 ; Lam et al., 2018 ). Reduced environmental variability within a subset may also improve power. On the other hand, splitting each cohort may be challenging due to continuous gradients of admixture or small sample sizes within an ancestry group. This loss of information from excluding individuals from diverse genomic backgrounds is a missed opportunity for discovery and validation of GWAS findings, and thus additional approaches need to be developed and leveraged.

Meta-analysis of GWAS Summary Statistics

Traditional meta-analytic approaches for GWAS rely on fixed-effects models that assume a given variant has the same true marginal effect size across all studies. This assumption is likely to be violated in meta-analyses across diverse cohorts. Even when the causal genetic effect of a variant is constant across populations, as seems common in cross-ancestry GWAS to date ( Huang et al., 2017 ; Lam et al., 2018 ), marginal effect sizes may show heterogeneity when LD structures are different. Further heterogeneity across cohorts from different populations may arise due to differences in genetic background (e.g., gene × gene interactions) and/or environmental context (e.g., gene × environment interactions), as well as differences in study design (e.g., imputation artifacts, phenotyping). As a result, it is generally appropriate to model this cross-cohort heterogeneity in meta-analysis by using a random effects or trans-ancestral meta-analysis model ( Supplementary Methods Section 5 , Supplementary Table S4 ).
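The contrast between the two model classes can be sketched for a single variant as follows. This is an illustration under stated assumptions, not the specific methods listed in the supplement: the random-effects estimate uses the DerSimonian-Laird moment estimator as one common choice, and the per-cohort effect sizes and standard errors are invented.

```python
import numpy as np

def fixed_effects(betas, ses):
    """Inverse-variance weighted fixed-effects estimate and its standard error."""
    w = 1.0 / ses**2
    beta = np.sum(w * betas) / np.sum(w)
    return beta, np.sqrt(1.0 / np.sum(w))

def random_effects_dl(betas, ses):
    """DerSimonian-Laird random-effects estimate.

    Returns (beta, se, tau2, Q), where tau2 is the estimated between-cohort
    variance and Q is Cochran's heterogeneity statistic.
    """
    w = 1.0 / ses**2
    beta_fe, _ = fixed_effects(betas, ses)
    q = np.sum(w * (betas - beta_fe) ** 2)
    k = len(betas)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = 1.0 / (ses**2 + tau2)
    beta = np.sum(w_re * betas) / np.sum(w_re)
    return beta, np.sqrt(1.0 / np.sum(w_re)), tau2, q

# Hypothetical marginal effects of one variant in three ancestry-stratified cohorts.
betas = np.array([0.10, 0.02, 0.15])
ses = np.array([0.02, 0.02, 0.03])

beta_fe, se_fe = fixed_effects(betas, ses)
beta_re, se_re, tau2, q = random_effects_dl(betas, ses)
print(f"fixed effects:  beta = {beta_fe:.4f}, SE = {se_fe:.4f}")
print(f"random effects: beta = {beta_re:.4f}, SE = {se_re:.4f}, tau^2 = {tau2:.5f}, Q = {q:.2f}")
```

When the cohort estimates disagree by more than their standard errors would predict, the estimated between-cohort variance is positive and the random-effects standard error widens accordingly. When the estimates agree, the two models coincide.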

Fine-mapping

A trait-associated locus from GWAS typically implicates a large genomic region with many variants of similar significance. This set may contain a few causal variants, while the association of other variants is driven by their LD with the causal one(s). Fine-mapping refines GWAS loci to a smaller set of likely causal variants to facilitate interpretation and follow-up studies ( Schaid et al., 2018 ). Fine-mapping studies in samples of European ancestry have made important advances, with some loci resolved even to single-variant resolution ( Huang et al., 2017 ; Mahajan et al., 2018 ). Because fine-mapping assumes the causal variant(s) have been observed, non-European populations face a unique challenge due to the lack of representation of many variants as a result of incomplete sampling from these populations, suboptimal chip design, and limited imputation performance.

Combining samples across ancestries has an advantage for fine-mapping: LD patterns that differ across populations can improve resolution, assuming that many causal variants are shared across populations, which has been shown to be true for some traits, including schizophrenia (Lam et al., 2018; Marigorta and Navarro, 2013; Wojcik et al., 2019). Non-causal variants tagging the causal variants have different marginal effects across populations if LD is different, thus allowing the causal variant to be distinguished from non-causal variants. Furthermore, in certain populations (e.g., African), LD blocks are generally smaller, so fewer non-causal variants will tag the causal variants, improving the resolution of fine-mapping (International HapMap Consortium, 2005; Schaid et al., 2018).
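The reasoning can be made explicit with a simplified relation. The notation is introduced here for illustration only, and it assumes standardized genotypes, an additive model, and a single causal variant $c$ whose effect $\beta_c$ is shared across populations. Under those assumptions, the expected marginal effect at a tagging variant $j$ in population $k$ is the causal effect scaled by the LD correlation $r_{jc}^{(k)}$ between the two variants in that population:

```latex
\[
  \mathbb{E}\!\left[\hat{\beta}_j^{(k)}\right] \;=\; r_{jc}^{(k)}\,\beta_c ,
  \qquad
  \mathbb{E}\!\left[\hat{\beta}_c^{(k)}\right] \;=\; \beta_c .
\]
```

If, for example, $r_{jc}^{(1)} = 0.9$ in one population and $r_{jc}^{(2)} = 0.3$ in another, the tagging variant shows a threefold difference in marginal effect while the causal variant shows none, which is the signal that cross-population fine-mapping exploits.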

Most fine-mapping algorithms ( Huang et al., 2017 ; Schaid et al., 2018 ) can be applied to samples from multiple ancestries combined through meta-analysis. However, this strategy does not take full advantage of genomic diversity across populations. An alternate Bayesian fine-mapping strategy ( Lam et al., 2018 ) more precisely mapped the schizophrenia genetic associations through explicitly modeling diversity in LD between East Asian and European samples. This approach works on a presumption that the causal variants and their effect sizes are identical across populations, which is not always true. PAINTOR ( Kichaev and Pasaniuc, 2015 ) relaxes this presumption by allowing the effect size to vary across populations, although the causal variant still needs to be the same. Fine-mapping methods will benefit from continued development that appropriately models LD and relies on fewer assumptions.

Polygenic Risk Scores in Diverse Populations

PRS are individual-level estimates of the relative genetic contribution to a phenotype, computed for each genotyped individual in a target sample based on GWAS results from a discovery sample. PRS are useful for validating GWAS results in external cohorts and have the potential to provide individualized risk prediction from genetic data ( Khera et al., 2018 ; Martin et al., 2019 ). The predictive value of PRS profiling depends both on the statistical power of the discovery (training) dataset— specifically, enrichment in the genome-wide distribution of association test statistics that is attributable to aggregate, additive genetic effects — and the relevant characteristics of the target (testing) dataset.
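In its simplest form, a PRS is a weighted sum of effect-allele counts, with weights taken from the discovery GWAS. The sketch below assumes that alleles have already been aligned so that each dosage counts the same effect allele used in the discovery study, and it omits variant selection steps such as LD clumping or p-value thresholding. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: array of shape (n_individuals, n_variants), values in [0, 2]
    weights: array of shape (n_variants,), per-allele effects from a discovery GWAS
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.ndim != 2 or dosages.shape[1] != weights.shape[0]:
        raise ValueError("dosages must be (n_individuals, n_variants) matching weights")
    return dosages @ weights

# Two individuals, three variants (toy values).
dosages = [[0, 1, 2],
           [2, 0, 1]]
weights = [0.10, -0.20, 0.05]
scores = polygenic_score(dosages, weights)
print(scores)
```

Because the weights and the variants that tag causal effects are specific to the discovery sample, the same computation applied to a genetically distant target sample can yield a less predictive score.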

In particular, PRS accuracy is also a function of recent human demographic history, such that a greater proportion of phenotypic variance is explainable in target populations that are genetically more similar to the population studied in the discovery GWAS. Stated another way, with increasing genetic “distance” between the discovery and target datasets, there is often attenuation of polygenic predictive value. Furthermore, because most participants in large GWAS have been broadly European ( Figure 1 ), most PRS currently perform best in target samples of European ancestries, with markedly worse performance in other populations, especially in individuals of African descent ( Duncan et al., 2018 ; Martin et al., 2019 ).

A practical question is how to construct polygenic scores for recently admixed individuals or individuals who are genetically distant from those in the largest existing GWAS. Use of trans-ancestry meta-analytic results to weight alleles can increase prediction accuracy ( Grinde et al., 2019 ), and MultiPred is an approach that combines PRS based on European training data with PRS based on training data from the target population ( Márquez-Luna et al., 2017 ). Current methods development is focused on improving handling of allele frequency differences and LD within and across populations. Given current limitations in understanding similarities and differences in polygenic risk across populations, caution is advised in interpreting differences in PRS across ancestries ( Novembre and Barton, 2018 ).

Heritability and Genetic Correlation

GWAS can provide insights into the genetic architecture of human traits, including SNP heritability and genetic correlation . Several methods have been proposed for estimating these parameters from genotype data ( Supplemental Table S4 ; Supplemental Methods Section V ), but estimation and interpretation of these quantities is more challenging in diverse populations. Heritability estimates may differ between populations due to variation in both environmental factors and population genetic forces. Cross-population differences in phenotype measurement ( Section XI ) may further complicate interpretation. In evaluating shared genetic variance across populations, genetic correlation between groups can be defined either as the correlation of allelic effect sizes (genetic-effect correlation) or the correlation of the relative contribution to total phenotypic variance (genetic-impact correlation), and for all variants or for common variants present in a study. Each value is potentially informative, but divergence in allele frequencies and LD patterns between populations will lead to differences between these parameters ( Galinsky et al., 2019 ).
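The distinction between the two correlation definitions can be illustrated with known per-allele effects. The sketch below is not an estimator (in practice these quantities are estimated from summary statistics and LD). It also rests on an assumption about the definition: that genetic impact scales each per-allele effect by the allele's standard deviation, sqrt(2p(1-p)), in each population. All effect sizes and frequencies are invented.

```python
import numpy as np

def genetic_effect_correlation(beta1, beta2):
    """Correlation of per-allele effect sizes between two populations."""
    return np.corrcoef(beta1, beta2)[0, 1]

def genetic_impact_correlation(beta1, beta2, freq1, freq2):
    """Correlation of effects scaled by allele standard deviation sqrt(2p(1-p)).

    Assumption: this scaling stands in for each variant's relative
    contribution to phenotypic variance in its own population.
    """
    scaled1 = beta1 * np.sqrt(2 * freq1 * (1 - freq1))
    scaled2 = beta2 * np.sqrt(2 * freq2 * (1 - freq2))
    return np.corrcoef(scaled1, scaled2)[0, 1]

# Identical per-allele effects, but divergent allele frequencies (toy values).
beta = np.array([0.20, -0.10, 0.30, 0.05])
freq_pop1 = np.array([0.50, 0.10, 0.30, 0.40])
freq_pop2 = np.array([0.05, 0.50, 0.30, 0.10])

rho_effect = genetic_effect_correlation(beta, beta)
rho_impact = genetic_impact_correlation(beta, beta, freq_pop1, freq_pop2)
print(f"genetic-effect correlation = {rho_effect:.3f}")
print(f"genetic-impact correlation = {rho_impact:.3f}")
```

Even with perfectly shared per-allele effects, divergent allele frequencies pull the impact correlation below one, which is why the two parameters should be reported and interpreted separately.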

As detailed in the supplement, most common methods for estimating SNP heritability and genetic correlation either require modification or may not be suitable for use in multi-ancestry studies. Methods relying on relatedness estimation (e.g., genomic relatedness matrix restricted maximum likelihood; GREML) require estimation methods robust to population structure ( Conomos et al., 2018 ; Thornton et al., 2012 ), and methods modelling LD (e.g., LD Score regression; LDSC) require either ancestry-matched reference panels or individual level data for LD calculations ( Luo et al., 2018 ). Ancestry-matched reference panels, along with the large GWAS sample sizes required for robust estimation using these methods, may be especially challenging to acquire for studies in underrepresented or admixed groups.

Beyond these most common methods, local ancestry tracts in admixed population samples can be leveraged to estimate heritability ( Zaitlen et al., 2014 ) and both genetic-effect and genetic-impact correlations of observed variants can be estimated using Popcorn ( Brown et al., 2016 ) if LD information is available and the two populations are relatively homogeneous. Recent studies estimating cross-ancestry genetic effect correlations have found moderate to high correlations for most phenotypes ( Bigdeli et al., 2017 ; Brick et al., 2019 ; Lam et al., 2018 ). The extent to which these cross-ancestry genetic correlations reflect consistent effects at any particular locus remains a question for fine mapping analyses.

Rare Variant Association Analysis

Rare SNPs and structural variants have been implicated in complex disease (Bomba et al., 2017). Due to their more recent origin, rare variants tend to be more geographically clustered and can be population specific. They can also be particularly important from both clinical and biological perspectives because some confer a large increase in disease risk. However, there is severely limited power to identify trait associations of individual rare variants. Therefore, aggregation methods such as burden tests, variance-component tests, and hybrid tests have been developed to test the combined effect of several variants. Using this approach, variants can be combined within genes or regulatory genetic elements (Gilly et al., 2018; Kuchenbaecker and Appel, 2018). Ancestry groups may carry different driving variants at the same locus, as demonstrated by the association of different functional variants in ADH1B with alcohol use disorder in African Americans compared with European and Asian Americans (Edenberg and McClintick, 2018). Aggregate tests can therefore be particularly suitable for projects involving different ancestral groups: because they focus on functional units rather than individual variants, it is not necessary to observe the same variants or frequencies across cases. Meta-analysis methods have been developed that are able to encompass heterogeneous genetic effects across studies and are applicable to cross-ancestry meta-analysis (Lee et al., 2013; Tang and Lin, 2015).
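A burden test, the simplest of the aggregation methods named above, can be sketched as collapsing rare alleles within a unit such as a gene into one score per individual and regressing the phenotype on that score. The example below is a toy illustration with simulated data. It assumes genotypes are coded as minor-allele counts, uses an arbitrary frequency threshold of 1%, and omits the covariates and variant weights that applied analyses would include.

```python
import numpy as np

def burden_scores(genotypes, maf_threshold=0.01):
    """Sum minor-allele counts over rare variants; returns (scores, rare_mask)."""
    maf = genotypes.mean(axis=0) / 2.0
    rare_mask = (maf > 0) & (maf < maf_threshold)
    return genotypes[:, rare_mask].sum(axis=1), rare_mask

def burden_test(scores, phenotype):
    """Linear regression of phenotype on burden score; returns (beta, z)."""
    n = len(phenotype)
    X = np.column_stack([np.ones(n), scores])
    coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ coef
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], coef[1] / se

# Simulated gene: 20 rare variants (10 of them causal) plus 1 common variant.
rng = np.random.default_rng(1)
n = 5000
rare = rng.binomial(2, np.full(20, 0.003), size=(n, 20))
common = rng.binomial(2, 0.3, size=(n, 1))
genotypes = np.hstack([rare, common]).astype(float)
phenotype = 0.8 * rare[:, :10].sum(axis=1) + rng.normal(size=n)

scores, mask = burden_scores(genotypes)
beta, z = burden_test(scores, phenotype)
print(f"variants aggregated: {mask.sum()}, burden beta = {beta:.3f}, z = {z:.2f}")
```

Because the score is defined at the level of the gene, two cohorts carrying entirely different rare variants in the same gene can still contribute comparable test statistics.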

Association testing for rare variants is particularly sensitive to population stratification, and adjusting for fine-scale patterns of population stratification can be difficult with traditional methods ( Zhang et al., 2013 ). In simulation studies, adjusting for PCs failed to fully control inflation for collapsing and variance-component methods ( Persyn et al., 2018 ). Mixed effects models that have been developed for related samples might improve on this ( Jiang and McPeek, 2014 ). However, this area requires further methods development.

Non-Genetic Contributors to Trait Variability

Diversity in social, cultural, and environmental factors also affects disease risk and can contribute to confounding in genetic studies. In the case of complex traits with strong environmental influences, such as psychiatric conditions, accounting for non-genetic contributors to disease is especially important. Unfortunately, measurement of environmental factors can be difficult, so proxy measures such as zip code or insurance status can be used to model non-genetic risk factors such as air quality or accessibility to quality health care. PCs calculated from genotypes can control for population structure due to genetic relatedness, but this approach alone may not capture the social and environmental factors that are encompassed in self-reported “race” and “ethnicity”, even though these measures can be correlated with genetic ancestry. Self-reported measures of diversity can help in the modeling of societal determinants of health, such as increased stress due to the experience of racism and inequality and related variability in environmental factors (e.g., socio-economic status) that affect disease risk. However, the reliance on race and ethnicity as proxy variables for environmental effects or in order to control for population structure may be inappropriate. Better understanding and measurement of causal environmental risk factors is critical in order to advance discovery methods beyond these over-simplified and potentially harmful constructs of non-genetic contributors to trait variability.

Investigating complex traits in diverse populations, especially when samples are pooled from different research sites or cultural contexts, requires consistency and equivalence in the underlying construct and assessment measures across groups. Differences and variability in phenotypic measurement between study sites and populations may affect both gene discovery and the transferability of genetic findings between populations. Most psychiatric classification systems and diagnostic measures have been developed and validated in individuals from industrialized, Western societies ( Henrich et al., 2010 ). This presents a substantial challenge for global and cross-cultural collaborations. Investigations into cross-cultural differences in the prevalence of major depression, for instance, have suggested that although there is a shared underlying disorder construct across groups ( Kendler et al., 2015 ; Simon et al., 2002 ), individuals may differ culturally in terms of the level of symptomatology reached prior to seeking help ( Simon et al., 2002 ). The inclusion and consideration of diverse populations in the development, validation, and deployment of diagnostic measures used in genetic studies is therefore critical for ensuring an unbiased picture of disease etiology ( Supplemental Methods VII ).

Despite known large effects of environmental exposures on complex disease risk, there have been limited efforts to incorporate these factors into large-scale genetic studies. Appropriate modeling of the environment is especially critical when a phenotype or trait of interest is influenced by gene-by-environment interactions (GxE) . That is, genetic risk factors not only alter average risk but also influence sensitivity to the effects of environmental adversities. However, the majority of GxE studies have been underpowered and conducted using samples of primarily European descent, which limits the assessment of GxE and thereby the identification of modifiable targets for intervention and prevention among understudied groups ( Duncan et al., 2014 ). We note that the statistical definition of GxE depends on the choice of modelling on an additive or multiplicative scale ( Kendler and Gardner, 2010 ). Greater representation of diverse individuals is critically needed in order to increase our understanding of how the interrelated contributions of genes and environment vary across social and cultural groups, and how these factors may interact.

Perspectives and Recommendations

The lack of diversity in genetic studies is problematic for a variety of ethical and scientific reasons. Continued reliance on samples that only represent a fraction of genomic, socio-cultural, and environmental diversity limits our understanding of disease biology and may ultimately contribute to widening global health disparities. Greater ancestral diversity in study samples has the potential to accelerate the discovery of causal risk variants and is critical for a greater understanding of the biological causes of disease, including gene-by-environment interactions. In this Primer, we have highlighted the challenges and benefits of working with diverse populations, recommended practices based on current methods, and have noted specific areas that are in need of further methodological development ( Box 3 ). In summarizing progress, remaining challenges, and requisite next steps, we consider three main domains: 1) researcher participation, 2) data resources, and 3) analytic methods.

Common pitfalls, recommendations, and methods in need of development.

Researcher participation.

It is essential that cross-population research is carried out with careful consideration of its ethical, legal, and social implications (ELSI). This includes an ethos of trust-building, transparency, bi-directional knowledge sharing, and community engagement. This is especially true in low- and middle-income country (LMIC) settings and in work with minority groups – contexts in which mistrust of researchers is warranted given historical mistreatment and ethical violations. As there is no single overarching legislative framework that covers this area, we draw attention to literature that (i) articulates key issues (e.g., consent-taking, data-sharing, sample governance, equal partnership, capacity building, community engagement, participants’ advisory boards; Akinhanmi et al., 2018; Claw et al., 2018; Parker and Kwiatkowski, 2016) and (ii) proposes effective working solutions to them (Beaton et al., 2017; Campbell et al., 2017; de Vries et al., 2015). Additionally, there is a need to overcome traditional barriers to research empowerment for under-represented groups. H3ABioNet ( https://www.h3abionet.org/ ), GINGER ( https://ginger.sph.harvard.edu/ ), AMARI ( https://amari-africa.org/ ), MIND ( https://minds-uf.org/ ), and BRAIN ( https://advance.washington.edu/brains ) are examples of initiatives that embed the targeted delivery of skills and training within broader programs of research. Additional funding mechanisms that support such an approach would be particularly beneficial.

Data Resources

There is a critical need for extensive collaborative efforts to generate large-scale discovery cohorts of diverse ancestry. Limited diversity in genetics research is a major factor limiting our ability to address important scientific questions. The 1KGP (Sudmant et al., 2015) serves as one of the most widely used resources in genetics research, but expanding those reference panels is a priority. Here, we provide a selected catalogue of extant and emerging sources of whole-genome sequence data (Table 1 and Supplemental Table S3) to facilitate improved matching of diverse study cohorts to appropriate reference panels. Notably, some sources of non-European data are under-utilized, such as minority groups within the UK Biobank. Although diverse ancestry groups account for only about 5% of these data, that fraction amounts to over 35,000 samples of non-European and admixed ancestry (Bycroft et al., 2018), yet only 7.3% of publications since 2008 that used these data included any of these diverse samples. Thus, there are opportunities to make better use of these and other existing resources.

Additionally, substantial efforts are needed for efficient and ethical international sample and data sharing. This is an issue under active debate, as countries have different approaches to weighing concerns about the privacy of individuals against the collective benefits of science, and the regulatory landscape of individual-level genotype data has been uneven. For example, while the UK allows open access of individual-level genotype data with a valid scientific proposal, other countries, such as Denmark, Iceland, and China, tightly regulate the sharing of such data. Some GWAS consortia, including the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) and Social Science Genetic Association Consortium (SSGAC), overcame these regulatory challenges using essentially a “federated sharing model” ( Fiume et al., 2019 ). Without sharing individual-level genotype data, a study in these consortia follows the prespecified analytic protocol and contributes its summary statistics to the meta-analysis, allowing the participation of studies that do not have permission to share individual-level data. Researchers should be aware of such options and restrictions, and we recommend regular review of policies as scientific advances may change the ground on which they are based. The practice of sharing summary statistics is increasingly important, and facilitates meta-analyses and other secondary analyses like polygenic risk scoring and estimation of cross-trait genetic correlations. Journals and funding agencies should require sharing of summary statistics whenever it is ethically and legally possible.

Future directions for improving analytic methods

Many of the analytic challenges involved in genetic studies of diverse populations ( Box 3 ) can be addressed by recent advances in methodologies. We reflect on two key issues that remain unresolved and are likely to be beneficial directions for methodological development: 1) the division of individuals into major population groups for analysis and 2) the extension of common secondary analyses of GWAS results to accommodate results from cross-population studies.

A primary question currently faced in genetic analyses of diverse cohorts is whether to follow a ‘combining’ approach (analyzing all individuals together, regardless of ancestry) or a ‘stratifying’ approach (dividing the cohort into major population groups for separate analysis, followed by cross-ancestry meta-analysis; Figure 2). Concerns regarding joint analysis methods (e.g., mixed models) include inadequate control for confounding population stratification and the limited options for secondary analyses such as polygenic risk scoring and genetic correlation estimates. To the extent that stratifying individuals into major population groups remains a feature of cross-population analyses, future methods and theoretical work may continue to refine standards for how best to assign individuals to more homogeneous groups. The best solution currently available combines a priori analysis plans, exploratory examination of the data, and the involvement of collaborators with expertise in analyzing globally representative datasets. Future work will benefit from increasing diversity in reference panels, formalizing how major populations should be defined for the purposes of genetic analyses, and evaluating the performance of such methods. Continued methodological work should help resolve the tension between these approaches, clarifying if and when stratifying samples is necessary and providing improved methods for joint analysis of diverse cohorts that address population stratification.

Many post-GWAS statistical methods have limited portability to association results from diverse and admixed populations, due to complexities with LD patterns. Caution should be taken in the downstream analysis of cross-population GWAS meta-analyses, as many common approaches such as gene-based testing (e.g., MAGMA ( de Leeuw et al., 2015 )), heritability and genetic correlation estimation (e.g., LD Score regression ( Bulik-Sullivan et al., 2015 )), and predicted gene expression (e.g., S-PrediXcan ( Barbeira et al., 2018 )) rely on external reference panels that may not be compatible with the ‘combining’ approach. Even methodologies such as Popcorn ( Brown et al., 2016 ) that are specifically designed for cross-population analyses typically assume single-population summary statistics as input. Furthermore, it is unclear whether annotations of GWAS results based on observed associations in external studies (e.g., gene expression, Hi-C contacts, methylation) may also need to evaluate population specificity or include diverse samples to improve generalizability across populations. For example, 85% of GTEx eQTL annotations are from individuals of European ancestry ( GTEx Consortium, 2013 ) and other functional genomics resources may be similarly limited.

The above-described methods of cross-population aggregation and comparison rely on an assumption that complex diseases are phenotypically similar across global populations and that measurement of such disorders is culturally unbiased. Given that we know these assumptions are not always accurate, the best practical steps are to be aware of potential phenotypic and environmental differences across populations and involve multi-disciplinary teams with expertise in global societal determinants of health and cultural competency. Suitable methods – such as those that account for cultural context of phenotype ascertainment and GxE – should then be developed and implemented to more precisely measure and treat disorders across cultures.

There is a growing need for investment in policies and practices to support the inclusion of diverse research participants and thus maximize the global potential of genetics research and precision medicine. Broadening participation of both study populations and researchers from many regions of the globe, and LMIC in particular, will likely be tremendously beneficial. Within the arenas of available data and analytic methods, short-term goals include improved sharing and openness of data. Longer-term goals include identifying ways in which the complex practical, cultural, social, legal, and ethical issues inhibiting sample collection from under-represented populations are best resolved. Early, frequent, and meaningful engagement of stakeholders from diverse patient groups and communities, multi-disciplinary investigators including those with expertise in community-based participatory research, research institutions, scientific editors and reviewers, and funding agencies will be critical to the success of these short- and long-term objectives towards fostering an environment of inclusive research. Knowing that the lack of representation of diverse populations in genetics research will hinder our understanding of disease etiology, it is clear that this is an important growth area, both ethically and scientifically, for genomics research.

Listing of currently available imputation reference panels.

Supplementary Material

Acknowledgements

The authors acknowledge the support of and helpful discussions with many members of the Psychiatric Genomics Consortium (PGC), which is supported by the National Institutes of Health (NIH) grants U01 MH109528, MH109539, MH109539, MH109536, MH109501, MH109514, MH109499, MH109532. REP is supported by NIH K01 grant MH113848. KK is supported by Wellcome Trust grant 212360/Z/18/Z. RKW is supported by NIH U01 MH094432. ABP is supported by a Postdoctoral Fellowship from the Stanford Center for Computational, Evolutionary, and Human Genomics (CEHG). RJS is supported by a UKRI Innovation-HDR-UK Fellowship (MR/S003061/1). ARM is supported by NIH grant K99MH117229. MLP is supported in part by grant CONICYT FONDECYT 1181365. HH is supported by NIH K01DK114379, R21AI139012, and the Stanley Center for Psychiatric Research. LD was supported by UL1 TR001085 and Stanford Department of Psychiatry and Behavioral Sciences.

Supplemental Material:

Includes Supplemental Methods I-VII and Supplemental Tables S1–S4.

  • Ahmad M, Sinha A, Ghosh S, Kumar V, Davila S, Yajnik CS, and Chandak GR (2017). Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy. Sci. Rep 7, 6733.
  • Akinhanmi MO, Biernacka JM, Strakowski SM, McElroy SL, Balls Berry JE, Merikangas KR, Assari S, McInnis MG, Schulze TG, LeBoyer M, et al. (2018). Racial disparities in bipolar disorder treatment and research: a call to action. Bipolar Disord 20, 506–514.
  • Banda Y, Kvale MN, Hoffmann TJ, Hesselson SE, Ranatunga D, Tang H, Sabatti C, Croen LA, Dispensa BP, Henderson M, et al. (2015). Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics 200, 1285–1295.
  • Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, Torstenson ES, Shah KP, Garcia T, Edwards TL, et al. (2018). Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun 9, 1825.
  • Barrett JC, and Cardon LR (2006). Evaluating coverage of genome-wide association studies. Nat. Genet 38, 659–662.
  • Beaton A, Hudson M, Milne M, Port RV, Russell K, Smith B, Toki V, Uerata L, Wilcox P, Bartholomew K, et al. (2017). Engaging Māori in biobanking and genomic research: a model for biobanks to guide culturally informed governance, operational, and community engagement activities. Genet. Med 19, 345–351.
  • Bigdeli TB, Ripke S, Peterson RE, Trzaskowski M, Bacanu S-A, Abdellaoui A, Andlauer TFM, Beekman ATF, Berger K, Blackwood DHR, et al. (2017). Genetic effects influencing risk for major depressive disorder in China and Europe. Transl. Psychiatry 7, e1074.
  • Bomba L, Walter K, and Soranzo N (2017). The impact of rare and low-frequency genetic variants in common disease. Genome Biol 18, 77.
  • Brick LA, Keller MC, Knopik VS, McGeary JE, and Palmer RHC (2019). Shared additive genetic variation for alcohol dependence among subjects of African and European ancestry. Addict. Biol 24, 132–144.
  • Brown BC, Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye CJ, Price AL, and Zaitlen N (2016). Transethnic Genetic-Correlation Estimates from Summary Statistics. Am. J. Hum. Genet 99, 76–88.
  • Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL, and Neale BM (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet 47, 291–295.
  • Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209.
  • Campbell MM, Susser E, Mall S, Mqulwana SG, Mndini MM, Ntola OA, Nagdee M, Zingela Z, Van Wyk S, and Stein DJ (2017). Using iterative learning to improve understanding during the informed consent process in a South African psychiatric genomics study. PLoS One 12, e0188466.
  • Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, Li L, and China Kadoorie Biobank (CKB) collaborative group (2011). China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol 40, 1652–1666.
  • Claw KG, Anderson MZ, Begay RL, Tsosie KS, Fox K, Garrison NA, and Summer internship for INdigenous peoples in Genomics (SING) Consortium (2018). A framework for enhancing ethical genomic research with Indigenous communities. Nat. Commun 9, 2957.
  • Conomos MP, Reiner AP, McPeek MS, and Thornton TA (2018). Genome-Wide Control of Population Structure and Relatedness in Genetic Association Studies via Linear Mixed Models with Orthogonally Partitioned Structure (bioRxiv).
  • Duncan LE, Pollastri AR, and Smoller JW (2014). Mind the gap: why many geneticists and psychological scientists have discrepant views about gene-environment interaction (G×E) research . Am. Psychol 69 , 249–268. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Duncan LE, Shen H, Gelaye B, Ressler KJ, Feldman MW, Peterson RE, and Domingue BW (2018). Analysis of Polygenic Score Usage and Performance in Diverse Human Populations . [ Google Scholar ]
  • Edenberg HJ, and McClintick JN (2018). Alcohol Dehydrogenases, Aldehyde Dehydrogenases, and Alcohol Use Disorders: A Critical Review . Alcohol. Clin. Exp. Res 42 , 2281–2297. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Fiume M, Cupak M, Keenan S, Rambla J, de la Torre S, Dyke SOM, Brookes AJ, Carey K, Lloyd D, Goodhand P, et al. (2019). Federated discovery and sharing of genomic data using Beacons . Nat. Biotechnol 37 , 220–224. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Galinsky KJ, Reshef YA, Finucane HK, Loh P-R, Zaitlen N, Patterson NJ, Brown BC, and Price AL (2019). Estimating cross-population genetic correlations of causal effect sizes . Genet. Epidemiol 43 , 180–188. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Gilly A, Suveges D, Kuchenbaecker K, Pollard M, Southam L, Hatzikotoulas K, Farmaki A-E, Bjornland T, Waples R, Appel EVR, et al. (2018). Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits . Nat. Commun 9 , 4674. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Grinde KE, Qi Q, Thornton TA, Liu S, Shadyab AH, Chan KHK, Reiner AP, and Sofer T (2019). Generalizing polygenic risk scores from Europeans to Hispanics/Latinos . Genet. Epidemiol 43 , 50–62. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • GTEx Consortium (2013). The Genotype-Tissue Expression (GTEx) project . Nat. Genet 45 , 580–585. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Heckerman D, Gurdasani D, Kadie C, Pomilla C, Carstensen T, Martin H, Ekoru K, Nsubuga RN, Ssenyomo G, Kamali A, et al. (2016). Linear mixed model for heritability estimation that explicitly addresses environmental variation . Proc. Natl. Acad. Sci. U. S. A 113 , 7377–7382. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hellwege JN, Keaton JM, Giri A, Gao X, Velez Edwards DR, and Edwards TL (2017). Population Stratification in Genetic Association Studies . Curr. Protoc. Hum. Genet 95 , 1.22.1–1.22.23. [ Google Scholar ]
  • Henrich J, Heine SJ, and Norenzayan A (2010). The weirdest people in the world? Behav. Brain Sci 33 , 61–83; discussion 83–135. [ PubMed ] [ Google Scholar ]
  • Hindorff LA, Bonham VL, Brody LC, Ginoza MEC, Hutter CM, Manolio TA, and Green ED (2018a). Prioritizing diversity in human genomics research . Nat. Rev. Genet 19 , 175–185. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hindorff LA, Bonham VL, and Ohno-Machado L (2018b). Enhancing diversity to reduce health information disparities and build an evidence base for genomic medicine . Per. Med 15 , 403–412. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Howie B, Fuchsberger C, Stephens M, Marchini J, and Abecasis GR (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing . Nat. Genet 44 , 955–959. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Huang H, Fang M, Jostins L, Umićević Mirkov M, Boucher G, Anderson CA, Andersen V, Cleynen I, Cortes A, Crins F, et al. (2017). Fine-mapping inflammatory bowel disease loci to single-variant resolution . Nature 547 , 173–178. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • International HapMap Consortium (2005). A haplotype map of the human genome . Nature 437 , 1299–1320. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Jiang D, and McPeek MS (2014). Robust rare variant association testing for quantitative traits in samples with related individuals . Genet. Epidemiol 38 , 10–20. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kelleher J, Wong Y, Albers P, Wohns AW, and McVean G (2018). Inferring the ancestry of everyone . [ Google Scholar ]
  • Kendler KS, and Gardner CO (2010). Interpretation of interactions: guide for the perplexed . Br. J. Psychiatry 197 , 170–171. [ PubMed ] [ Google Scholar ]
  • Kendler KS, Aggen SH, Li Y, Lewis CM, Breen G, Boomsma DI, Bot M, Penninx BWJH, and Flint J (2015). The similarity of the structure of DSM-IV criteria for major depression in depressed women from China, the United States and Europe . Psychol. Med 45 , 1945–1954. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, et al. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations . Nat. Genet 50 , 1219. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kichaev G, and Pasaniuc B (2015). Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies . Am. J. Hum. Genet 97 , 260–271. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kuchenbaecker K, and Appel EVR (2018). Assessing Rare Variation in Complex Traits . Methods Mol. Biol 1793 , 51–71. [ PubMed ] [ Google Scholar ]
  • Lam M, Chen C-Y, Li Z, Martin A, Bryois J, Ma X, Gaspar H, Ikeda M, Benyamin B, Brown B, et al. (2018). Comparative genetic architectures of schizophrenia in East Asian and European populations . [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lee S, Teslovich TM, Boehnke M, and Lin X (2013). General framework for meta-analysis of rare variants in sequencing association studies . Am. J. Hum. Genet 93 , 42–53. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • de Leeuw CA, Mooij JM, Heskes T, and Posthuma D (2015). MAGMA: generalized gene-set analysis of GWAS data . PLoS Comput. Biol 11 , e1004219. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Li YR, and Keating BJ (2014). Trans-ethnic genome-wide association studies: advantages and challenges of mapping in diverse populations . Genome Med . 6 , 91. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Loh P-R, Kichaev G, Gazal S, Schoech AP, and Price AL (2018). Mixed-model association for biobank-scale datasets . Nat. Genet 50 , 906–908. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Luo Y, Li X, Wang X, Gazal S, Mercader JM, Neale B, Florez JC, Auton A, Price A, Finucane HK, et al. (2018). Estimating heritability of complex traits in admixed populations with summary statistics . [ Google Scholar ]
  • Mahajan A, Taliun D, Thurner M, Robertson NR, Torres JM, Rayner NW, Payne AJ, Steinthorsdottir V, Scott RA, Grarup N, et al. (2018). Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps . Nat. Genet 50 , 1505–1513. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Manrai AK, Funke BH, Rehm HL, Olesen MS, Maron BA, Szolovits P, Margulies DM, Loscalzo J, and Kohane IS (2016). Genetic Misdiagnoses and the Potential for Health Disparities . N. Engl. J. Med 375 , 655–665. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Marchini J, Cardon LR, Phillips MS, and Donnelly P (2004). The effects of human population structure on large genetic association studies . Nat. Genet 36 , 512–517. [ PubMed ] [ Google Scholar ]
  • Marigorta UM, and Navarro A (2013). High trans-ethnic replicability of GWAS results implies common causal variants . PLoS Genet . 9 , e1003566. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Márquez-Luna C, Loh P-R, South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium, and Price AL (2017). Multiethnic polygenic risk scores improve risk prediction in diverse populations . Genet. Epidemiol 41 , 811–823. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, and Daly MJ (2019). Clinical use of current polygenic risk scores may exacerbate health disparities . Nat. Genet 51 , 584–591. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Medina-Gomez C, Felix JF, Estrada K, Peters MJ, Herrera L, Kruithof CJ, Duijts L, Hofman A, van Duijn CM, Uitterlinden AG, et al. (2015). Challenges in conducting genome-wide association studies in highly admixed multi-ethnic populations: the Generation R Study . Eur. J. Epidemiol 30 , 317–330. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Mersha TB, and Abebe T (2015). Self-reported race/ethnicity in the age of genomic research: its potential impact on understanding health disparities . Hum. Genomics 9 , 1. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Mills MC, and Rahal C (2019). A scientometric review of genome-wide association studies . Commun Biol 2 , 9. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Minster RL, Hawley NL, Su C-T, Sun G, Kershaw EE, Cheng H, Buhule OD, Lin J, Reupena MS, Viali S ‘itea, et al. (2016). A thrifty variant in CREBRF strongly influences body mass index in Samoans . Nat. Genet 48 , 1049–1054. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Moltke I, Grarup N, Jørgensen ME, Bjerregaard P, Treebak JT, Fumagalli M, Korneliussen TS, Andersen MA, Nielsen TS, Krarup NT, et al. (2014). A common Greenlandic TBC1D4 variant confers muscle insulin resistance and type 2 diabetes . Nature 512 , 190–193. [ PubMed ] [ Google Scholar ]
  • Mulder N, Abimiku A ‘le, Adebamowo SN, de Vries J, Matimba A, Olowoyo P, Ramsay M, Skelton M, and Stein DJ (2018). H3Africa: current perspectives . Pharmgenomics. Pers. Med 11 , 59–66. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Nelson SC, Romm JM, Doheny KF, Pugh EW, and Laurie CC (2017). Imputation-Based Genomic Coverage Assessments of Current Genotyping Arrays: Illumina HumanCore, OmniExpress, Multi-Ethnic global array and sub-arrays, Global Screening Array, Omni2.5M, Omni5M, and Affymetrix UK Biobank
  • Novembre J, and Barton NH (2018). Tread Lightly Interpreting Polygenic Tests of Selection . Genetics 208 , 1351–1355. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Parker M, and Kwiatkowski DP (2016). The ethics of sustainable genomic research in Africa . Genome Biol . 17 , 44. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Persyn E, Redon R, Bellanger L, and Dina C (2018). The impact of a fine-scale population stratification on rare variant association test results . PLoS One 13 , e0207677. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Peterson RE, Cai N, Bigdeli TB, Li Y, Reimers M, Nikulova A, Webb BT, Bacanu S-A, Riley BP, Flint J, et al. (2017a). The Genetic Architecture of Major Depressive Disorder in Han Chinese Women . JAMA Psychiatry 74 , 162–168. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Peterson RE, Edwards AC, Bacanu S-A, Dick DM, Kendler KS, and Webb BT (2017b). The utility of empirically assigning ancestry groups in cross-population genetic studies of addiction . Am. J. Addict 26 , 494–501. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Petrovski S, and Goldstein DB (2016). Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine . Genome Biol . 17 , 157. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Popejoy AB, and Fullerton SM (2016). Genomics is failing on diversity . Nature 538 , 161–164. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Race Ethnicity, and Genetics Working Group (2005). The use of racial, ethnic, and ancestral categories in human genetics research . Am. J. Hum. Genet 77 , 519–532. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Roden DM, Wilke RA, Kroemer HK, and Stein CM (2011). Pharmacogenomics: the genetics of variable drug responses . Circulation 123 , 1661–1670. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Schaid DJ, Chen W, and Larson NB (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping . Nat. Rev. Genet 19 , 491–504. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Simon GE, Goldberg DP, Von Korff M, and Ustün TB (2002). Understanding cross-national differences in depression prevalence . Psychol. Med 32 , 585–594. [ PubMed ] [ Google Scholar ]
  • Sirugo G, Williams SM, and Tishkoff SA (2019). The Missing Diversity in Human Genetic Studies . Cell 177 , 26–31. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH-Y, et al. (2015). An integrated map of structural variation in 2,504 human genomes . Nature 526 , 75–81. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Sul JH, Martin LS, and Eskin E (2018). Population structure in genetic studies: Confounding factors and mixed models . PLoS Genet . 14 , e1007309. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Tang Z-Z, and Lin D-Y (2015). Meta-analysis for Discovering Rare-Variant Associations: Statistical Methods and Software Programs . Am. J. Hum. Genet 97 , 35–53. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Thornton T, Tang H, Hoffmann TJ, Ochs-Balcom HM, Caan BJ, and Risch N (2012). Estimating kinship in admixed populations . Am. J. Hum. Genet 91 , 122–138. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • de Vries J, Tindana P, Littler K, Ramsay M, Rotimi C, Abayomi A, Mulder N, and Mayosi BM (2015). The H3Africa policy framework: negotiating fairness in genomics . Trends Genet . 31 , 117–119. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Walters RK, Polimanti R, Johnson EC, McClintick JN, Adams MJ, Adkins AE, Aliev F, Bacanu S-A, Batzler A, Bertelsen S, et al. (2018). Transancestral GWAS of alcohol dependence reveals common genetic underpinnings with psychiatric disorders . Nat. Neurosci 21 , 1656–1669. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Wheeler E, Leong A, Liu C-T, Hivert M-F, Strawbridge RJ, Podmore C, Li M, Yao J, Sim X, Hong J, et al. (2017). Impact of common genetic determinants of Hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: A transethnic genome-wide meta-analysis . PLoS Med . 14 , e1002383. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Wojcik GL, Graff M, Nishimura KK, Tao R, and Haessler J (2019). Genetic analyses of diverse populations improves discovery for complex traits . Nature . [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Zaitlen N, Pasaniuc B, Sankararaman S, Bhatia G, Zhang J, Gusev A, Young T, Tandon A, Pollack S, Vilhjálmsson BJ, et al. (2014). Leveraging population admixture to characterize the heritability of complex traits . Nat. Genet 46 , 1356–1362. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Zhang Y, and Pan W (2015). Principal component regression and linear mixed model in association analysis of structured samples: competitors or complements? Genet. Epidemiol 39 , 149–155. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Zhang Y, Shen X, and Pan W (2013). Adjusting for population stratification in a fine scale with principal components and sequencing data . Genet. Epidemiol 37 , 787–801. [ PMC free article ] [ PubMed ] [ Google Scholar ]

COMMENTS

  1. Genetic assignment methods

    Genetic assignment methods are a set of powerful statistical methods that are used to determine the relationship between individuals and populations. [1] The general principle behind them is to use multilocus genotypes to assign reference populations as origins of the individuals. [2]

  2. Genetic Assignment Methods for Gaining Insight into the Management of

    Genetic assignment methods are a set of powerful statistical approaches useful for establishing population membership of individuals.

  3. Assignment methods: matching biological questions with appropriate

    Assignment methods, which use genetic information to ascertain population membership of individuals or groups of individuals, have been used in recent years to study a wide range of evolutionary and ecological processes.

  4. Genetic assignment of individuals to source populations ...

    By using genetic assignment methods, individuals with unknown genetic origin can be assigned to source populations. This knowledge is necessary in studying many key questions in ecology, evolution and conservation.

  5. Genetic assignment methods for the direct, real‐time estimation of

    Genetic assignment methods for the direct, real‐time estimation of migration rate: a simulation‐based exploration of accuracy and power. David Paetkau, Department of Zoology and Entomology, University of Queensland, St. Lucia, QLD 4072, Australia; Wildlife Genetics International, Box 274, Nelson, BC V1L 5P9, Canada.

  6. PDF Assignment methods: matching biological questions with appropriate

    Assignment method (AM): any of several related statistical methods that use genetic information to ascertain population membership of individuals or groups of individuals (Table 1). Assignment test (AT): a statistical test of the hypothesis that the multilocus genotype of an individual in question arose from a particular population (Box 1).

  7. Visualizations for Genetic Assignment Analyses using the ...

    Summary. We propose a method for visualizing genetic assignment data by characterizing the distribution of genetic profiles for each candidate source population. This method enhances the assignment method of Rannala and Mountain (1997) by calculating appropriate graph positions for individuals for which some genetic data are missing.

  8. Molecular Ecology

    Genetic assignment methods use genotype likelihoods to draw inference about where individuals were or were not born, potentially allowing direct, real-time estimates of dispersal.

  9. Visualizations for genetic assignment analyses using the saddlepoint

    Summary. We propose a method for visualizing genetic assignment data by characterizing the distribution of genetic profiles for each candidate source population. This method enhances the assignment method of Rannala and Mountain (1997) by calculating appropriate graph positions for individuals for which some genetic data are missing.

  10. PDF Genetic assignment methods for the direct, real-time estimation of

    Genetic assignment methods use genotype likelihoods to draw inference about where individuals were or were not born, potentially allowing direct, real-time estimates of dispersal.

  11. Genetic assignment methods for the direct, real-time estimation of

    Abstract. Genetic assignment methods use genotype likelihoods to draw inference about where individuals were or were not born, potentially allowing direct, real-time estimates of dispersal.

  12. Parentage assignment with genotyping‐by‐sequencing data

    To assign a parent, we calculated an assignment score for each putative parent: The score will be close to 0 if the individual is unlikely to be the father, and tends towards positive infinity with increasing evidence that the individual is the father.

  13. Genetic Assignment Methods for Gaining Insight into the ...

    Abstract. For many pathogens with environmental stages, or those carried by vectors or intermediate hosts, disease transmission is strongly influenced by pathogen, host, and vector movements across complex landscapes, and thus quantitative measures of movement rate and direction can reveal new opportunities for disease management and intervention.

  14. Genetic assignment of individuals to source populations using network

    work tools for genetic identification of individuals' source populations. 4. BONE is aimed at any researcher performing genetic assignment and trying to infer the genetic population structure. Compared to other methods, our approach also identifies outlying mixture individuals that could originate outside of the baseline populations.

  15. A comparison of four methods for detecting weak genetic structure from

    There are a number of techniques available to detect and quantify genetic structure but here we concentrate on four methods: FST, population assignment, relatedness, and sibship assignment. Under the simple mating system simulated here, the four methods produce qualitatively similar results. However, the assignment method performed relatively ...

  16. Using genomic relationship likelihood for parentage assignment

    Background. Parentage assignment is usually based on a limited number of unlinked, independent genomic markers (microsatellites, low-density single nucleotide polymorphisms (SNPs), etc.). Classical methods for parentage assignment are exclusion-based (i.e. based on loci that violate Mendelian inheritance) or likelihood-based, assuming independent inheritance of loci. For true parent-offspring ...

  17. Maximum likelihood parentage assignment using quantitative ...

    Parentage assignment is also widely used in molecular ecology, including the study of conservation biology, dispersal and recruitment patterns, quantitative genetics and sexual selection (Flanagan ...

  18. Using genomic relationship likelihood for parentage assignment

    Our aim was to develop a fast and accurate trio parentage assignment method for dense SNP data without prior genotyping error- or call rate knowledge among loci and individuals. This genomic relationship likelihood (GRL) method infers parentage by using genomic relationships, which are typically used in genomic prediction models. Results: Using ...

  19. A nearest neighbour approach by genetic distance to the assignment of

    1. Introduction. In recent years, the application of forensic methods based on genetic markers to assign individual plants and animals to their geographic origin [1], [2] has gained importance for the control of trade regulations and consumer protection.

  20. Gene Assignment

    Genetics and Synthesis of Components of the Complement System. Harvey R. Colten, in Immunobiology of the Complement System, 1986. IIIE. Chromosomal Assignment of the Complement Genes. Once cDNA probes are available, chromosome assignment of other complement genes can be accomplished by one of two methods.

  21. A Multi-stage Target Assignment Method Based on Improved Genetic

    In this paper, a mathematical model of multi-stage target assignment is established for the multi-stage target assignment problem of the attacker. According to the characteristics of the problem, the general genetic algorithm is improved and the example simulation is carried out. The simulation results show that the improved genetic algorithm ...

  22. Genome-wide association studies in ancestrally diverse populations

    This assignment is necessary for the stratified meta-analysis approach to GWAS of diverse cohorts, and is intended to reduce the risk of false positive genetic signals due to inflated test statistics from population stratification. Assigning samples to more homogeneous groups for analysis reduces stratification by limiting the degree of ...
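
The principle that recurs in the entries above, using multilocus genotypes and genotype likelihoods to assign an individual to one of several reference populations, can be illustrated with a minimal Python sketch. It is not taken from any of the listed papers. It assumes Hardy-Weinberg proportions within populations and independent loci, and it substitutes a small fixed frequency (`floor`, an arbitrary illustrative value) for alleles not seen in a reference sample. The population names, genotypes, and function names are hypothetical toy examples.

```python
import math


def allele_freqs(reference, n_loci):
    """Estimate per-locus allele frequencies from reference genotypes.

    `reference` is a list of individuals; each individual is a sequence of
    loci, and each locus is a pair of alleles, e.g. ("A", "G").
    """
    freqs = []
    for locus in range(n_loci):
        counts = {}
        for individual in reference:
            for allele in individual[locus]:
                counts[allele] = counts.get(allele, 0) + 1
        total = sum(counts.values())
        freqs.append({a: c / total for a, c in counts.items()})
    return freqs


def log_likelihood(genotype, freqs, floor=0.01):
    """Log-probability of a multilocus genotype in one population.

    Assumes Hardy-Weinberg proportions and independent loci. Alleles absent
    from the reference sample get frequency `floor` (illustrative assumption).
    """
    total = 0.0
    for (a, b), locus_freqs in zip(genotype, freqs):
        p = locus_freqs.get(a, floor)
        q = locus_freqs.get(b, floor)
        prob = p * p if a == b else 2 * p * q
        total += math.log(prob)
    return total


def assign(genotype, references):
    """Return the best-supported population and all log-likelihood scores."""
    scores = {
        name: log_likelihood(genotype, allele_freqs(ref, len(genotype)))
        for name, ref in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: two reference populations, two loci each.
references = {
    "north": [(("A", "A"), ("C", "C")),
              (("A", "A"), ("C", "T")),
              (("A", "G"), ("C", "C"))],
    "south": [(("G", "G"), ("T", "T")),
              (("G", "G"), ("C", "T")),
              (("A", "G"), ("T", "T"))],
}
unknown = (("A", "A"), ("C", "C"))

best, scores = assign(unknown, references)
print(best)  # north
```

In practice, published methods differ in how they handle unseen alleles and in whether they test exclusion rather than simply picking the highest likelihood, so the fixed floor used here should be read as a placeholder rather than a recommendation.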