A review on genetic algorithm: past, present, and future

  • Published: 31 October 2020
  • Volume 80 , pages 8091–8126, ( 2021 )

Sourabh Katoch, Sumit Singh Chauhan & Vijay Kumar (ORCID: orcid.org/0000-0002-3460-6989)

This paper reviews recent advances in genetic algorithms. The genetic algorithms of greatest interest to the research community are selected for analysis. The review is intended to give new researchers a broader view of genetic algorithms. Well-known algorithms and their implementations are presented with their pros and cons. The genetic operators and their usage are discussed with the aim of assisting new researchers. The different research domains involving genetic algorithms are covered. Future research directions in the areas of genetic operators, fitness functions, and hybrid algorithms are discussed. This structured review will be helpful for research and graduate teaching.

1 Introduction

In recent years, metaheuristic algorithms have been used to solve complex real-life problems arising in fields such as economics, engineering, politics, and management [ 113 ]. Intensification and diversification are the key elements of a metaheuristic algorithm, and a proper balance between them is required to solve real-life problems effectively. Most metaheuristic algorithms are inspired by biological evolution, swarm behavior, or the laws of physics [ 17 ]. These algorithms are broadly classified into two categories, namely single-solution based and population-based metaheuristic algorithms (Fig. 1). Single-solution based metaheuristic algorithms work with a single candidate solution and improve it using local search. However, the solution obtained from single-solution based metaheuristics may get stuck in local optima [ 112 ]. Well-known single-solution based metaheuristics are simulated annealing, tabu search (TS), microcanonical annealing (MA), and guided local search (GLS). Population-based metaheuristics use multiple candidate solutions during the search process. They maintain diversity in the population and help prevent solutions from getting stuck in local optima. Some well-known population-based metaheuristic algorithms are genetic algorithm (GA) [ 135 ], particle swarm optimization (PSO) [ 101 ], ant colony optimization (ACO) [ 47 ], spotted hyena optimizer (SHO) [ 41 ], emperor penguin optimizer (EPO) [ 42 ], and seagull optimization (SOA) [ 43 ].

Fig. 1 Classification of metaheuristic algorithms

Among the metaheuristic algorithms, the genetic algorithm (GA) is a well-known algorithm inspired by the biological evolution process [ 136 ]. GA mimics the Darwinian theory of survival of the fittest in nature. GA was proposed by J. H. Holland in 1975; a widely cited edition of his book appeared in 1992. The basic elements of GA are chromosome representation, fitness selection, and biologically inspired operators. Holland also introduced a further element, inversion, that is generally used in implementations of GA [ 77 ]. Typically, the chromosomes take the form of binary strings. In a chromosome, each locus (a specific position on the chromosome) has two possible alleles (variant forms of a gene): 0 and 1. Chromosomes are considered as points in the solution space. The population of chromosomes is processed with genetic operators and replaced iteratively. The fitness function assigns a value to every chromosome in the population [ 136 ]. The biologically inspired operators are selection, mutation, and crossover. In selection, chromosomes are chosen for further processing on the basis of their fitness values. In crossover, a random locus is chosen and the subsequences on either side of it are exchanged between chromosomes to create offspring. In mutation, some bits of the chromosomes are randomly flipped with a given probability [ 77 , 135 , 136 ]. Further development of GA based on operators, representation, and fitness has slowed. This paper therefore focuses on these elements of GA.

The main contributions of this paper are as follows:

  • The general framework of GA and hybrid GA is elaborated with mathematical formulation.
  • The various types of genetic operators are discussed with their pros and cons.
  • The variants of GA are discussed with their pros and cons.
  • The applicability of GA in multimedia fields is discussed.

The aim of this paper is twofold. First, it presents the variants of GA and their applicability in various fields. Second, it broadens the range of potential users in various fields. The various types of crossover, mutation, selection, and encoding techniques are discussed. Single-objective, multi-objective, parallel, and hybrid GAs are discussed with their advantages and disadvantages. The multimedia applications of GAs are elaborated.

The remainder of this paper is organized as follows: Section 2 presents the methodology used to carry out the research. The classical genetic algorithm and genetic operators are discussed in Section 3 . The variants of genetic algorithm with pros and cons are presented in Section 4 . Section 5 describes the applications of genetic algorithm. Section 6 presents the challenges and future research directions. The concluding remarks are drawn in Section 7 .

2 Research methodology

The PRISMA guidelines were used to conduct this review of GA [ 138 ]. A detailed search was carried out on Google Scholar and PubMed to identify research papers related to GA. Important research works found through manual search were also included. The search used keywords such as “Genetic Algorithm”, “Application of GA”, “operators of GA”, “representation of GA”, and “variants of GA”. The selection and rejection of the retrieved papers were based on the principles listed in Table 1 .

In total, 2,764,792 research papers were retrieved from Google Scholar, PubMed, and manual search. Research work on genetic algorithms for multimedia applications was also included. During screening, all duplicate papers and papers published before 2007 were discarded, which left 4340 research papers. Thereafter, 4050 papers were eliminated based on their titles, and 220 more were eliminated after reading the abstracts. This left 70 papers after the third round of screening. A further 40 papers were discarded after reading the full text and assessing the facts they reported. After this fourth round of screening, the final 30 research papers were selected for review.

Based on the relevance and quality of research, 30 papers were selected for evaluation. The relevance of the research was decided using the criteria listed in Table 1 . The selected papers cover genetic algorithms for multimedia applications, advances in their genetic operators, and hybridization of genetic algorithms with other well-established metaheuristic algorithms. The pros and cons of genetic operators are presented in the following section.

3 Background

In this section, the basic structure of GA and its genetic operators are discussed with pros and cons.

3.1 Classical GA

The genetic algorithm (GA) is an optimization algorithm inspired by natural selection. It is a population-based search algorithm that uses the concept of survival of the fittest [ 135 ]. New populations are produced by the iterative application of genetic operators to the individuals in the population. Chromosome representation, selection, crossover, mutation, and fitness function computation are the key elements of GA. The procedure of GA is as follows. A population ( Y ) of n chromosomes is initialized randomly. The fitness of each chromosome in Y is computed. Two chromosomes, say C1 and C2 , are selected from the population Y according to their fitness values. The single-point crossover operator is applied to C1 and C2 with crossover probability ( C_p ) to produce an offspring, say O . Thereafter, the uniform mutation operator is applied to the offspring ( O ) with mutation probability ( M_p ) to generate O′ . The new offspring O′ is placed in the new population. The selection, crossover, and mutation operations are repeated on the current population until the new population is complete. The mathematical analysis of GA is as follows [ 126 ]:

GA dynamically changes the search process through the probabilities of crossover and mutation and thereby reaches the optimal solution. GA can modify the encoded genes. GA can evaluate multiple individuals and produce multiple optimal solutions. Hence, GA has better global search capability. Offspring produced by crossover of the parent chromosomes are liable to destroy the good genetic schemas of the parents, and the crossover formula is defined as [ 126 ]:

where g is the number of generations, and G is the total number of evolutionary generations set for the population. It is observed from Eq. (1) that R changes dynamically and increases as the number of evolutionary generations increases. In the initial stage of GA, the similarity between individuals is very low. The value of R should be low to ensure that the new population does not destroy the excellent genetic schemas of individuals. At the end of evolution, the similarity between individuals is very high, and the value of R should also be high.

According to the schema theorem, the original schema has to be replaced with a modified schema. To maintain diversity in the population, the new schema preserves the initial population during the early stage of evolution. At the end of evolution, the appropriate schema is produced so that excellent genetic schemas are not distorted [ 65 , 75 ]. Algorithm 1 shows the pseudocode of the classical genetic algorithm.

Algorithm 1: Classical Genetic Algorithm (GA)
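The procedure described in Section 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function name `classical_ga`, the default parameter values, the use of fitness-proportionate selection, and the tracking of the best chromosome seen so far are all choices made here for illustration. Fitness values are assumed to be non-negative.

```python
import random

def classical_ga(fitness, n_bits=16, pop_size=20, generations=50,
                 cp=0.9, mp=0.01, rng=None):
    """Sketch of a classical GA on binary chromosomes (hypothetical helper)."""
    rng = rng or random.Random()
    # Randomly initialise the population Y of pop_size chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Small offset keeps the weights valid when every fitness is zero.
        weights = [fitness(c) + 1e-9 for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # Select C1 and C2 according to fitness.
            c1, c2 = rng.choices(pop, weights=weights, k=2)
            child = c1[:]
            # Single-point crossover with probability cp.
            if rng.random() < cp:
                point = rng.randint(1, n_bits - 1)
                child = c1[:point] + c2[point:]
            # Flip each bit independently with probability mp.
            child = [1 - b if rng.random() < mp else b for b in child]
            new_pop.append(child)
        pop = new_pop
        # Assumption: remember the best chromosome found in any generation.
        best = max(pop + [best], key=fitness)
    return best
```

For example, `classical_ga(sum)` searches for the bit string with the most ones, a toy problem often called OneMax.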

3.2 Genetic operators

GAs use a variety of operators during the search process. These operators are encoding schemes, crossover, mutation, and selection. Figure 2 depicts the operators used in GAs.

Fig. 2 Operators used in GA

3.2.1 Encoding schemes

For most computational problems, the encoding scheme (i.e., the conversion of information into a particular form) plays an important role. The given information has to be encoded in a particular bit string [ 121 , 183 ]. Encoding schemes differ according to the problem domain. The well-known encoding schemes are binary, octal, hexadecimal, permutation, value-based, and tree.

Binary encoding is the most commonly used encoding scheme. Each gene or chromosome is represented as a string of 1s and 0s [ 187 ]. In binary encoding, each bit represents a characteristic of the solution. It allows faster implementation of the crossover and mutation operators. However, it requires extra effort to convert the problem into binary form, and the accuracy of the algorithm depends on the binary conversion. The bit string is modified according to the problem. The binary encoding scheme is not appropriate for some engineering design problems because of epistasis and because binary is not a natural representation for them.

In the octal encoding scheme, the gene or chromosome is represented in the form of octal numbers (0–7). In the hexadecimal encoding scheme, the gene or chromosome is represented in the form of hexadecimal numbers (0–9, A–F) [ 111 , 125 , 187 ]. The permutation encoding scheme is generally used in ordering problems. In this scheme, the gene or chromosome is represented by a string of numbers that represent positions in a sequence. In the value encoding scheme, the gene or chromosome is represented by a string of values. These values can be real numbers, integers, or characters [ 57 ]. This scheme is helpful for problems that involve more complicated values, where binary encoding may fail. It is mainly used in neural networks for finding the optimal weights.

In tree encoding, the gene or chromosome is represented by a tree of functions or commands. These functions and commands can be related to any programming language. This is very similar to the representation of expressions in tree form [ 88 ]. This type of encoding is generally used for evolving programs or expressions. Table 2 shows the comparison of different encoding schemes of GA.
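To make the encoding schemes above concrete, the following sketch encodes a real-valued design variable as a fixed-length bit string, decodes it again, and shows the same integer in octal and hexadecimal form. The helper names (`encode_binary`, `decode_binary`, `as_octal_hex`) and the linear quantization of the interval are assumptions made for illustration; they are not taken from the reviewed papers.

```python
def encode_binary(value, lo, hi, n_bits):
    """Quantise a real value in [lo, hi] to a list of n_bits bits."""
    levels = (1 << n_bits) - 1
    index = round((value - lo) / (hi - lo) * levels)
    return [(index >> i) & 1 for i in reversed(range(n_bits))]

def decode_binary(bits, lo, hi):
    """Map a bit list back to a real value in [lo, hi]."""
    levels = (1 << len(bits)) - 1
    index = int("".join(map(str, bits)), 2)
    return lo + index / levels * (hi - lo)

def as_octal_hex(bits):
    """Octal and hexadecimal strings for the integer held in the bit list."""
    index = int("".join(map(str, bits)), 2)
    return format(index, "o"), format(index, "X")

# Value encoding: the genes are the real numbers themselves.
value_chromosome = [0.25, -1.3, 4.0]
# Permutation encoding: a visiting order for five cities.
permutation_chromosome = [2, 0, 4, 1, 3]
```

The round trip through `encode_binary` and `decode_binary` loses at most half a quantization step, which illustrates why the accuracy of a binary-coded GA depends on the number of bits chosen.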

3.2.2 Selection techniques

Selection is an important step in genetic algorithms that determines whether a particular string will participate in the reproduction process. The selection step is sometimes also known as the reproduction operator [ 57 , 88 ]. The convergence rate of GA depends on the selection pressure. The well-known selection techniques are roulette wheel, rank, tournament, Boltzmann, and stochastic universal sampling.

Roulette wheel selection maps all candidate strings onto a wheel, with each string allocated a portion of the wheel in proportion to its fitness value. The wheel is then spun at random to select the solutions that will take part in forming the next generation [ 88 ]. However, it suffers from several problems, such as errors introduced by its stochastic nature. De Jong and Brindle modified the roulette wheel selection method to remove these errors by introducing determinism into the selection procedure. Rank selection is a modified form of roulette wheel selection that uses ranks instead of fitness values. Individuals are ranked according to their fitness, and each individual's chance of being selected depends on its rank. Rank selection reduces the chance of premature convergence to a local optimum [ 88 ].
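Roulette wheel and rank selection as described above can be sketched as follows. The function names and the linear rank assignment (worst individual gets rank 1, best gets rank n) are assumptions made for illustration, and fitness values are assumed to be non-negative with a positive sum.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    spin = rng.random() * sum(fitnesses)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return individual
    return population[-1]  # guard against floating-point round-off

def rank_select(population, fitnesses, rng):
    """Roulette wheel over ranks (worst = 1, best = n) instead of fitness."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return roulette_wheel_select(population, ranks, rng)
```

With fitness values such as 1, 2, and 100, the roulette wheel picks the fittest individual almost every time, whereas rank selection picks it only about half the time. This is one way to see how ranking lowers the selection pressure and the risk of premature convergence.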

Tournament selection was first proposed by Brindle in 1983. Individuals are selected in pairs from a stochastic roulette wheel according to their fitness values. After selection, the individuals with the higher fitness value are added to the pool for the next generation [ 88 ]. In this method, an individual that reaches the final population of solutions has been compared with all n-1 other individuals [ 88 ]. Stochastic universal sampling (SUS) is an extension of the roulette wheel selection method. It uses a random starting point in the list of individuals from a generation and selects new individuals at evenly spaced intervals [ 3 ]. Because the pointers are evenly spaced, every individual gets a fair chance of being selected to participate in crossover for the next generation. SUS performs well on the Travelling Salesman Problem, but as the problem size increases, traditional roulette wheel selection performs relatively well [ 180 ].
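The two techniques in this paragraph can be sketched as below. The tournament shown is the common k-way formulation (draw k contenders at random, keep the fittest), which is an assumption here and may differ in detail from the pairwise roulette-based variant described in the text. The function names and the parameter `k` are likewise illustrative.

```python
import random

def tournament_select(population, fitnesses, rng, k=2):
    """Draw k distinct contenders at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]

def sus_select(population, fitnesses, n_select, rng):
    """Stochastic universal sampling: one random start, evenly spaced pointers."""
    step = sum(fitnesses) / n_select
    start = rng.random() * step
    chosen, idx, running = [], 0, fitnesses[0]
    for j in range(n_select):
        pointer = start + j * step
        # Advance along the wheel until the pointer falls inside a segment.
        while running <= pointer and idx < len(population) - 1:
            idx += 1
            running += fitnesses[idx]
        chosen.append(population[idx])
    return chosen
```

When all individuals have equal fitness and `n_select` equals the population size, `sus_select` returns each individual exactly once, which shows the low-variance behaviour that distinguishes SUS from repeated roulette wheel spins.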

Boltzmann selection is based on entropy and sampling methods, which are used in Monte Carlo simulation. It helps to solve the problem of premature convergence [ 118 ]. The probability of selecting the best string is very high, and the method executes in very little time. However, there is a possibility of information loss, which can be managed through elitism [ 175 ]. Elitism selection was proposed by K. A. De Jong (1975) to improve the performance of roulette wheel selection. It ensures that the elitist individual in a generation is always propagated to the next generation. If the individual with the highest fitness value is not present in the next generation after the normal selection procedure, the elitist individual is included in the next generation automatically [ 88 ]. The comparison of the above-mentioned selection techniques is depicted in Table 3 .
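A hedged sketch of these two ideas follows. For Boltzmann selection it uses the common formulation in which the selection probability is proportional to exp(fitness / temperature); the temperature parameter and both function names are assumptions for illustration, not details given in the text. The elitism helper replaces the worst member of the new population when the previous best individual has been lost.

```python
import math

def boltzmann_probabilities(fitnesses, temperature):
    """Selection probabilities proportional to exp(fitness / temperature)."""
    peak = max(fitnesses)  # subtracting the maximum avoids overflow
    weights = [math.exp((f - peak) / temperature) for f in fitnesses]
    total = sum(weights)
    return [w / total for w in weights]

def apply_elitism(old_pop, new_pop, fitness):
    """Re-insert the best old individual if the new population lost it."""
    elite = max(old_pop, key=fitness)
    if elite in new_pop:
        return list(new_pop)
    worst = min(range(len(new_pop)), key=lambda i: fitness(new_pop[i]))
    return new_pop[:worst] + [elite] + new_pop[worst + 1:]
```

A high temperature makes the probabilities nearly uniform, which favours exploration, while a low temperature concentrates almost all probability on the best string. Lowering the temperature over the generations is one common way to move from exploration to exploitation.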

3.2.3 Crossover operators

Crossover operators are used to generate the offspring by combining the genetic information of two or more parents. The well-known crossover operators are single-point, two-point, k-point, uniform, partially matched, order, precedence preserving crossover, shuffle, reduced surrogate and cycle.

In single-point crossover, a random crossover point is selected. The genetic information of the two parents beyond that point is swapped [ 190 ]. Figure 3 shows the genetic information after swapping. The tail bits of both parents are exchanged to produce the new offspring.

Fig. 3 Swapping genetic information after a crossover point
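Single-point crossover can be sketched in a few lines. The function name and the choice to draw the cut point so that both offspring receive material from both parents are assumptions made for illustration.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two equal-length parents after a random cut point."""
    point = rng.randint(1, len(p1) - 1)  # never cut at either end
    child1 = p1[:point] + p2[point:]
    child2 = p2[:point] + p1[point:]
    return child1, child2
```

For instance, crossing `[0, 0, 0, 0]` with `[1, 1, 1, 1]` at cut point 1 yields `[0, 1, 1, 1]` and `[1, 0, 0, 0]`.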

In two-point and k-point crossover, two or more random crossover points are selected and the genetic information of the parents is swapped according to the segments created [ 190 ]. Figure 4 shows the swapping of genetic information between crossover points. The middle segment of the parents is exchanged to generate the new offspring.

Fig. 4 Swapping genetic information between crossover points
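The two-point and k-point cases can be handled by one routine that swaps every second segment. This is a minimal sketch; the function name and the rule of keeping the first segment unswapped are assumptions made for illustration.

```python
import random

def k_point_crossover(p1, p2, k, rng):
    """Swap alternating segments delimited by k distinct random cut points."""
    points = sorted(rng.sample(range(1, len(p1)), k))
    c1, c2 = list(p1), list(p2)
    swap, prev = False, 0
    for cut in points + [len(p1)]:
        if swap:
            # Exchange this segment between the two offspring.
            c1[prev:cut], c2[prev:cut] = p2[prev:cut], p1[prev:cut]
        swap, prev = not swap, cut
    return c1, c2
```

With `k=2` only the middle segment is exchanged, which corresponds to the two-point case described above.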

In uniform crossover, the parents are not decomposed into segments. Instead, each gene of a parent is treated separately: for each position, a random decision is made whether to swap the gene with the gene at the same location in the other chromosome [ 190 ]. Figure 5 depicts the swapping of individual genes under the uniform crossover operation.

Fig. 5 Swapping individual genes
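Uniform crossover reduces to an independent coin flip per gene. In this sketch the function name and the `swap_prob` parameter (0.5 by default) are assumptions made for illustration.

```python
import random

def uniform_crossover(p1, p2, rng, swap_prob=0.5):
    """Swap each gene position independently with probability swap_prob."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if rng.random() < swap_prob:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2
```

Setting `swap_prob` to 0 returns copies of the parents unchanged, and setting it to 1 exchanges them completely; values in between mix the parents gene by gene.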

Partially matched crossover (PMX) is the most frequently used crossover operator, and it performs better than most of the other crossover operators. The partially matched (mapped) crossover was proposed by D. Goldberg and R. Lingle [ 66 ]. Two parents are chosen for mating. One parent donates a part of its genetic material, and the corresponding part of the other parent participates in the child. Once this process is completed, the left-out alleles are copied from the second parent [ 83 ]. Figure 6 depicts an example of PMX.

figure 6

Partially matched crossover (PMX) [ 117 ]
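One common way to implement PMX for permutation encodings is sketched below. It is an illustration under stated assumptions, not a reproduction of Figure 6: the cut points are passed explicitly, only one child is built, and the example permutations are arbitrary values chosen here.

```python
def pmx(parent_a, parent_b, cut1, cut2):
    """Build one PMX child: segment from parent_a, the rest from parent_b with repair."""
    n = len(parent_a)
    child = [None] * n
    child[cut1:cut2] = parent_a[cut1:cut2]
    # Each gene copied from parent_a maps to the gene parent_b holds at that position.
    mapping = {parent_a[i]: parent_b[i] for i in range(cut1, cut2)}
    for i in list(range(cut1)) + list(range(cut2, n)):
        gene = parent_b[i]
        # A gene already present in the copied segment is replaced via the mapping,
        # repeatedly, until a gene outside the segment is reached.
        while gene in mapping:
            gene = mapping[gene]
        child[i] = gene
    return child

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
child = pmx(p1, p2, 3, 7)
```

The mapping-based repair is what keeps the child a valid permutation: positions outside the copied segment take the second parent's gene unless that would create a duplicate.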

Order crossover (OX) was proposed by Davis in 1985. OX copies one (or more) parts of a parent to the offspring from the selected cut-points and fills the remaining space with values other than the ones included in the copied section. Variants of OX have been proposed by different researchers for different types of problems. OX is useful for ordering problems [ 166 ]. However, it is found that OX is less efficient in the case of the Travelling Salesman Problem [ 140 ]. Precedence preserving crossover (PPX) preserves the ordering of individual solutions as present in the parents of the offspring before the application of crossover. A string of random 1’s and 0’s is generated that decides, position by position, which of the two parents the next element is taken from. In [ 169 ], the authors proposed a modified version of PPX for multi-objective scheduling problems.
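The order crossover described above can be sketched as follows. The sketch assumes the common convention of filling free positions from the second cut point onward with wrap-around; the function name, explicit cut points, and example permutations are illustrative.

```python
def order_crossover(parent_a, parent_b, cut1, cut2):
    """Build one OX child: keep parent_a[cut1:cut2], fill the rest in parent_b's order."""
    n = len(parent_a)
    child = [None] * n
    child[cut1:cut2] = parent_a[cut1:cut2]
    copied = set(parent_a[cut1:cut2])
    # Read parent_b starting at the second cut, wrapping around, skipping copied genes.
    fill = [parent_b[(cut2 + i) % n] for i in range(n)]
    fill = [gene for gene in fill if gene not in copied]
    # Free positions are visited in the same wrap-around order.
    positions = [(cut2 + i) % n for i in range(n - (cut2 - cut1))]
    for pos, gene in zip(positions, fill):
        child[pos] = gene
    return child

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
child = order_crossover(p1, p2, 3, 7)
```

Unlike PMX, OX preserves the relative order of the second parent's genes rather than their absolute positions.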

Shuffle crossover was proposed by Eshelman et al. [ 20 ] to reduce the bias introduced by other crossover techniques. It shuffles the values of an individual solution before the crossover and unshuffles them after the crossover operation is performed, so that the crossover point does not introduce any bias. However, the utilization of this crossover has been very limited in recent years. Reduced surrogate crossover (RCX) reduces unnecessary crossovers when the parents have the same gene sequence for solution representations [ 20 , 139 ]. RCX is based on the assumption that GA produces better individuals if the parents are sufficiently diverse in their genetic composition. However, RCX cannot produce better individuals for parents that have the same composition. Cycle crossover was proposed by Oliver [ 140 ]. It attempts to generate an offspring in which each element occupies the position it holds in one of the parents [ 140 ]. In the first cycle, it takes some elements from the first parent. In the second cycle, it takes the remaining elements from the second parent, as shown in Fig. 7 .

figure 7

Cycle Crossover (CX) [ 140 ]
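A minimal sketch of the two-step cycle crossover described above: the cycle that starts at the first position is taken from the first parent, and every remaining position is taken from the second parent. The function name and the example permutations are assumptions of this sketch, and it does not reproduce the example in Fig. 7.

```python
def cycle_crossover(parent_a, parent_b):
    """Take the first cycle from parent_a and all remaining genes from parent_b."""
    index_in_a = {gene: i for i, gene in enumerate(parent_a)}
    child = list(parent_b)
    i = 0
    while True:
        child[i] = parent_a[i]
        # Jump to the position where parent_a holds the gene parent_b has here.
        i = index_in_a[parent_b[i]]
        if i == 0:
            break
    return child

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
child = cycle_crossover(p1, p2)
```

Every gene in the child sits at a position it occupied in one of the two parents, which is the defining property of this operator.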

Table 4 shows the comparison of crossover techniques. It is observed from Table 4 that single and k-point crossover techniques are easy to implement. Uniform crossover is suitable for large subsets. Order and cycle crossovers provide better exploration than the other crossover techniques. Partially matched crossover provides better exploration. The performance of partially matched crossover is better than the other crossover techniques. Reduced surrogate and cycle crossovers suffer from premature convergence.

3.2.4 Mutation operators

Mutation is an operator that maintains genetic diversity from one population to the next. The well-known mutation operators are displacement, simple inversion, and scramble mutation. The displacement mutation (DM) operator displaces a substring of a given individual solution within itself. The new position of the substring is chosen randomly, such that the resulting solution remains valid. Two variants of DM are exchange mutation and insertion mutation. In the exchange mutation and insertion mutation operators, a part of an individual solution is either exchanged with another part or inserted in another location, respectively [ 88 ].

The simple inversion mutation operator (SIM) reverses the substring between any two specified locations in an individual solution. A related inversion operator reverses a randomly selected substring and places it at a random location [ 88 ]. The scramble mutation (SM) operator places the elements in a specified range of the individual solution in a random order and checks whether the fitness value of the newly generated solution has improved [ 88 ]. Table 5 shows the comparison of different mutation techniques.
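The three mutation operators named above can be sketched on a permutation-encoded individual. The function names, the explicit index arguments, and the `rng` parameter are assumptions made here for reproducibility; in a GA the indices would be drawn at random. The fitness check mentioned for scramble mutation is left out of this sketch.

```python
import random

def displacement_mutation(individual, start, end, insert_at):
    """Cut individual[start:end] and reinsert it at index insert_at of the remainder."""
    segment = individual[start:end]
    rest = individual[:start] + individual[end:]
    return rest[:insert_at] + segment + rest[insert_at:]

def simple_inversion_mutation(individual, start, end):
    """Reverse individual[start:end] in place of the original substring."""
    return individual[:start] + individual[start:end][::-1] + individual[end:]

def scramble_mutation(individual, start, end, rng=random):
    """Randomly reorder the genes in individual[start:end]."""
    segment = individual[start:end]
    rng.shuffle(segment)
    return individual[:start] + segment + individual[end:]

tour = [1, 2, 3, 4, 5, 6, 7, 8]
displaced = displacement_mutation(tour, 2, 5, 0)
inverted = simple_inversion_mutation(tour, 2, 5)
scrambled = scramble_mutation(tour, 2, 5, rng=random.Random(1))
```

All three operators only rearrange genes, so a valid permutation stays valid after mutation.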

Table 6 shows the best combination of encoding scheme, mutation, and crossover techniques. It is observed from Table 6 that uniform and single-point crossovers can be used with most of the encoding and mutation operators. Partially matched crossover, used with inversion mutation and the permutation encoding scheme, provides the optimal solution.

4 Variants of GA

Several variants of GAs have been proposed by researchers. The variants of GA are broadly classified into five main categories, namely real and binary coded, multiobjective, parallel, chaotic, and hybrid GAs. The pros and cons of these algorithms, along with their applications, are discussed in the following subsections.

4.1 Real and binary coded GAs

Based on the representation of chromosomes, GAs are categorized in two classes, namely binary and real coded GAs.

4.1.1 Binary coded GAs

A GA that uses the binary representation for encoding is known as a binary GA. The genetic operators were also modified to carry out the search process. Payne and Glen [ 153 ] developed a binary GA to identify the similarity among molecules. They used a binary representation for the positions of molecules and their conformations. However, this method has high computational complexity. Longyan et al. [ 203 ] investigated three different methods for wind farm design using binary GA (BGA). Their method produced better fitness values and farm efficiency. Shukla et al. [ 185 ] utilized BGA for feature subset selection. They used the mutual information maximization concept for selecting the significant features. BGAs suffer from Hamming cliffs, uneven schema, and difficulty in achieving precision [ 116 , 199 ].

4.1.2 Real-coded GAs

Real-coded GAs (RGAs) have been widely used in various real-life applications. The representation of chromosomes is closely associated with real-life problems. The main advantages of RGAs are robustness, efficiency, and accuracy. However, RGAs suffer from premature convergence. Researchers are working on RGAs to improve their performance. Most RGAs are developed by modifying the crossover, mutation, and selection operators.

Crossover operators

The searching capability of crossover operators is not satisfactory for continuous search spaces. Crossover operators have therefore been developed further to enhance their performance in real-valued environments. Wright [ 210 ] presented a heuristic crossover that was applied on parents to produce offspring. Michalewicz [ 135 ] proposed arithmetical crossover operators for RGAs. Deb and Agrawal [ 34 ] developed a real-coded crossover operator, which is based on the characteristics of single-point crossover in BGA. The developed crossover operator is named simulated binary crossover (SBX). SBX is able to overcome the Hamming cliff, precision, and fixed mapping problems. However, the performance of SBX is not satisfactory on the two-variable blocked function. Eshelman et al. [ 53 ] utilized the schemata concept to design the blend crossover for RGAs. The unimodal normal distribution crossover operator (UNDX) was developed by Ono et al. [ 144 ]. They used an ellipsoidal probability distribution to generate the offspring. Kita et al. [ 106 ] presented a multi-parent UNDX (MP-UNDX), which is an extension of [ 144 ]. However, the performance of RGA with MP-UNDX is very similar to that with UNDX. Deep and Thakur [ 39 ] presented a Laplace crossover for RGAs, which is based on the Laplacian distribution. Chuang et al. [ 27 ] developed a direction-based crossover to further explore all possible search directions. However, the search directions are limited. The heuristic normal distribution crossover operator was developed by Wang et al. [ 207 ]. It generates cross-generated offspring for a better search operation. However, the better individuals are not considered in this approach. Subbaraj et al. [ 192 ] proposed a Taguchi self-adaptive RCGA. They used the Taguchi method and simulated binary crossover to exploit the capable offspring.
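As an illustration of a real-coded operator, the sketch below applies the widely used form of simulated binary crossover to a single pair of real-valued genes. It is a simplified sketch, not the formulation in Table 7: bounds handling and the per-variable crossover probability are omitted, and the names `sbx_pair`, `eta` (distribution index), and `u` (the uniform random number, exposed for reproducibility) are assumptions made here.

```python
import random

def sbx_pair(x1, x2, eta=2.0, u=None):
    """Simulated binary crossover for one pair of real-valued genes."""
    if u is None:
        u = random.random()  # uniform in [0, 1)
    # Spread factor: values below 1 contract the children, values above 1 expand them.
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2)
    c2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2)
    return c1, c2

c1, c2 = sbx_pair(1.0, 3.0, eta=2.0, u=0.25)
```

The two children are always symmetric about the parents' midpoint, and a larger `eta` keeps them closer to the parents.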

Mutation operators

Mutation operators generate diversity in the population. Two main challenges have to be tackled during the application of mutation: first, the probability with which the mutation operator is applied to the population; second, the outliers produced in a chromosome after the mutation process. Michalewicz [ 135 ] presented uniform and non-uniform mutation operators for RGAs. Michalewicz and Schoenauer [ 136 ] developed boundary mutation, a special case of uniform mutation. Deep and Thakur [ 38 ] presented a novel mutation operator based on the power law, named power mutation. Das and Pratihar [ 30 ] presented a direction-based exponential mutation operator that uses the direction information of variables. Tang and Tseng [ 196 ] presented a novel mutation operator for enhancing the performance of RCGA. Their approach was fast and reliable. However, it gets stuck in local optima for some applications. Deb et al. [ 35 ] developed polynomial mutation for use in RCGA. It provides better exploration. However, its convergence speed is slow and it gets stuck in local optima. Lucasius et al. [ 129 ] proposed a real-coded genetic algorithm (RCGA). It is simple and easy to implement. However, it suffers from the local optima problem. Wang et al. [ 205 ] developed a multi-offspring GA and investigated its performance over single-point crossover. Wang et al. [ 206 ] stated the theoretical basis of multi-offspring GA. The performance of this method is better than that of non-multi-offspring GA. Pattanaik et al. [ 152 ] presented an improvement in the RCGA. Their method has better convergence speed and solution quality. Wang et al. [ 208 ] proposed a multi-offspring RCGA with direction-based crossover for solving constrained problems.

Table 7 shows the mathematical formulation of genetic operators in RGAs.

4.2 Multiobjective GAs

Multiobjective GA (MOGA) is a modified version of the simple GA. MOGA differs from GA in terms of fitness function assignment. The remaining steps are similar to GA. The main motive of multiobjective GA is to generate the optimal Pareto front in the objective space, such that no fitness function can be improved further without degrading the other fitness functions [ 123 ]. Convergence, diversity, and coverage are the main goals of multiobjective GAs. The multiobjective GAs are broadly categorized into two categories, namely Pareto-based and decomposition-based multiobjective GAs [ 52 ]. These techniques are discussed in the following subsections.

4.2.1 Pareto-based multi-objective GA

The concept of Pareto dominance was introduced in multiobjective GAs. Fonseca and Fleming [ 56 ] developed the first multiobjective GA (MOGA). The niche and decision maker concepts were proposed to tackle multimodal problems. However, MOGA suffers from the parameter tuning problem and the degree of selection pressure. Horn et al. [ 80 ] proposed a niched Pareto genetic algorithm (NPGA) that utilized the concepts of tournament selection and Pareto dominance. Srinivas and Deb [ 191 ] developed a non-dominated sorting genetic algorithm (NSGA). However, it suffers from a lack of elitism, the need for a sharing parameter, and high computational complexity. To alleviate these problems, Deb et al. [ 36 ] developed a fast elitist non-dominated sorting genetic algorithm (NSGA-II). The performance of NSGA-II may deteriorate for many-objective problems, and NSGA-II was unable to maintain diversity in the Pareto front. To alleviate this problem, Luo et al. [ 130 ] introduced a dynamic crowding distance in NSGA-II. Coello and Pulido [ 28 ] developed a multiobjective micro GA. They used an archive for storing the non-dominated solutions. The performance of Pareto-based approaches may deteriorate in many-objective problems [ 52 ].
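Pareto dominance, on which the algorithms above rely, can be stated compactly in code. The sketch assumes that all objectives are minimized and uses a plain quadratic-time filter to extract the first non-dominated front; it is not the fast sorting procedure of NSGA-II, and the objective vectors are made-up values.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Return the points that no other point dominates (the first Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

objectives = [(1, 5), (2, 2), (3, 1), (3, 3), (4, 4)]
front = non_dominated(objectives)
```

Here `(3, 3)` and `(4, 4)` are both dominated by `(2, 2)`, while the remaining three points trade one objective against the other and form the front.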

4.2.2 Decomposition-based multiobjective GA

Decomposition-based MOGAs decompose the given problem into multiple subproblems. These subproblems are solved simultaneously, and solutions are exchanged among neighboring subproblems [ 52 ]. Ishibuchi and Murata [ 84 ] developed a multiobjective genetic local search (MOGLS). In MOGLS, random weights were used to select the parents and to perform local search on their offspring. They used generation replacement and the roulette wheel selection method. Jaszkiewicz [ 86 ] modified MOGLS by utilizing different selection mechanisms for parents. Murata and Gen [ 141 ] proposed a cellular genetic algorithm for multiobjective optimization (C-MOGA) that was an extension of MOGA. They added a cellular structure to MOGA. In C-MOGA, the selection operator was performed on the neighborhood of each cell. C-MOGA was further extended by introducing an immigration procedure, and is known as CI-MOGA. Alves and Almeida [ 11 ] developed a multiobjective Tchebycheff-based genetic algorithm (MOTGA) that ensures convergence and diversity. The Tchebycheff scalar function was used to generate the non-dominated solution set. Patel et al. [ 151 ] proposed a decomposition-based MOGA (D-MOGA). They integrated opposition-based learning in D-MOGA for weight vector generation. D-MOGA is able to maintain the balance between diversity of solutions and exploration of the search space.
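To make the decomposition idea concrete, the sketch below uses the standard weighted Tchebycheff scalarization, in which each weight vector defines one single-objective subproblem. It assumes minimization; the candidate objective vectors, the ideal point, and the weights are made-up values, and the code is not taken from MOTGA or D-MOGA.

```python
def tchebycheff(objectives, weights, ideal):
    """Weighted Tchebycheff value: largest weighted distance to the ideal point."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 1.0)]
ideal = (1.0, 1.0)  # best value seen so far for each objective

# Each weight vector selects a different trade-off among the same candidates.
balanced = min(candidates, key=lambda f: tchebycheff(f, (0.5, 0.5), ideal))
first_heavy = min(candidates, key=lambda f: tchebycheff(f, (0.9, 0.1), ideal))
```

Varying the weight vector moves the preferred solution along the front, which is how a set of subproblems can cover the whole Pareto front.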

4.3 Parallel GAs

The motivation behind parallel GAs is to improve the computational time and the quality of solutions through distributed individuals. Parallel GAs are categorized into three broad categories: master-slave parallel GAs, fine grained parallel GAs, and multi-population coarse grained parallel GAs [ 70 ]. In master-slave parallel GA, the computation of fitness functions is distributed over several processors. In fine grained GA, parallel computers are used to solve real-life problems. The genetic operators are bound to their neighborhood. However, interaction is allowed among the individuals. In coarse grained GA, individuals are exchanged among sub-populations. The control parameters are also transferred during migration. The main challenges in parallel GAs are to maximize memory bandwidth and to arrange threads for utilizing the power of GPUs [ 23 ]. Table 8 shows the comparative analysis of parallel GAs in terms of hardware and software. The well-known parallel GAs are studied in the following subsections.

4.3.1 Master slave parallel GA

A larger number of processors is utilized in master-slave parallel GA (MS-PGA) than in other approaches. The speed of fitness function computation may be increased by increasing the number of processors. Hong et al. [ 79 ] used MS-PGA for solving data mining problems. Fuzzy rules are used with parallel GA. The evaluation of the fitness function was performed on slave machines. However, it suffers from high computational time. Sahingoz [ 174 ] implemented MS-PGA for the UAV path finding problem. The genetic operators were executed on processors. They used a multicore CPU with four cores. Selection and fitness evaluation were done on slave machines. MS-PGA was applied to the traffic assignment problem in [ 127 ]. They used thirty processors to solve this problem at the National University of Singapore. Yang et al. [ 213 ] developed a web-based parallel GA. They implemented the master-slave version of NSGA-II in a distributed environment. However, the system is complex in nature.
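The master-slave scheme can be sketched with Python's standard `concurrent.futures` module: the master keeps the population and farms out fitness evaluations to a pool of workers. This is only an illustration of the control flow. A thread pool stands in for the slave processors here, the OneMax `fitness` function is a placeholder, and a real deployment would more likely use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    # Placeholder objective (OneMax): the number of 1-bits in the chromosome.
    return sum(individual)

def evaluate_population(population, workers=4):
    """Master side: hand individuals to workers, collect fitness values in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

population = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
scores = evaluate_population(population)
```

Selection, crossover, and mutation would then run on the master using `scores`, so only the fitness evaluation is parallelized.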

4.3.2 Fine grained parallel GA

In the last few decades, researchers have been working on migration policies of fine grained parallel GA (FG-PGA). Porta et al. [ 161 ] utilized clock time for the migration frequency, which is independent of generations. They used a non-uniform structure and static configuration. The best solution was selected for migration, and the worst solution was replaced with the migrant solution. Kurdi [ 115 ] used an adaptive migration frequency. The migration procedure starts when there is no change in the obtained solutions after ten successive generations. A non-uniform and dynamic structure was used. In [ 209 ], local best solutions were synchronized to form the global best solutions. The global best solutions were transferred to all processors for further execution. The migration frequency depends upon the number of generations. They used a uniform structure with fixed configuration. Zhang et al. [ 220 ] used parallel GA to solve the set cover problem of wireless networks. They used a divide-and-conquer strategy to decompose the population into sub-populations. Thereafter, the genetic operators were applied on local solutions, and the Kuhn-Munkres algorithm was used to merge the local solutions.

4.3.3 Coarse grained parallel GA

Pinel et al. [ 158 ] proposed GraphCell. The population was initialized with random values, and one solution was initialized with the Min-min heuristic technique. 448 processors were used to implement the proposed approach. However, coarse grained parallel GAs are less used due to their complex nature. Hybrid parallel GAs are widely used in various applications. Shayeghi et al. [ 182 ] proposed a pool-based Birmingham cluster GA. The master node was responsible for managing the global population. The slave nodes selected solutions from the global population and executed them. 240 processors were used for computation. Roberge et al. [ 170 ] used a hybrid approach to optimize the switching angles of inverters. They used four different strategies for fitness function computation. Nowadays, GPU, cloud, and grid are the most popular hardware for parallel GAs [ 198 ].

4.4 Chaotic GAs

The main drawback of GAs is premature convergence. Chaotic systems are incorporated into GAs to alleviate this problem. The diversity of the chaos genetic algorithm removes premature convergence. Crossover and mutation operators can be replaced with chaotic maps. Tiong et al. [ 197 ] integrated chaotic maps into GA for further improvement in accuracy. They used six different chaotic maps. The Logistic, Henon, and Ikeda chaotic GAs performed better than the classical GA. However, these techniques suffer from high computational complexity. Ebrahimzadeh and Jampour [ 48 ] used the Lorenz chaotic system for the genetic operators of GA to eliminate the local optima problem. However, the proposed approach was unable to find a relationship between entropy and the chaotic map. Javidi and Hosseinpourfard [ 87 ] utilized two chaotic maps, namely the logistic map and the tent map, for generating chaotic values instead of random selection of the initial population. The proposed chaotic GA performs better than the GA. However, this method suffers from high computational complexity. Fuertes et al. [ 60 ] integrated entropy into chaotic GA. The control parameters are modified through chaotic maps. They investigated the relationship between entropy and performance optimization.
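The idea of replacing random initialization with a chaotic sequence can be sketched with the logistic map x_{n+1} = r x_n (1 - x_n). The sketch is an assumption-laden illustration rather than the procedure of [ 87 ]: the parameter r = 4, the seed `x0 = 0.37`, the function names, and the linear scaling to the variable bounds are all choices made here.

```python
def logistic_map_sequence(x0, length, r=4.0):
    """Iterate the logistic map and return `length` values in [0, 1]."""
    values, x = [], x0
    for _ in range(length):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_population(pop_size, n_genes, lower, upper, x0=0.37):
    """Build a real-coded population from consecutive logistic-map values."""
    seq = logistic_map_sequence(x0, pop_size * n_genes)
    return [
        [lower + (upper - lower) * seq[i * n_genes + j] for j in range(n_genes)]
        for i in range(pop_size)
    ]

population = chaotic_population(pop_size=5, n_genes=3, lower=-10.0, upper=10.0)
```

The same generator could also supply the numbers that drive crossover or mutation decisions, which is the other use of chaotic maps mentioned above.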

Chaotic systems have also been used in multiobjective and hybrid GAs. Abo-Elnaga and Nasr [ 5 ] integrated a chaotic system into a modified GA for solving bi-level programming problems. Chaos helps the proposed algorithm to escape local optima and enhance convergence. Tahir et al. [ 193 ] presented a binary chaotic GA for feature selection in healthcare. The chaotic maps were used to initialize the population, and modified reproduction operators were applied to the population. Xu et al. [ 115 ] proposed a chaotic hybrid immune GA for spectrum allocation. The proposed approach utilizes the advantages of both chaotic and immune operators. However, this method suffers from the parameter initialization problem.

4.5 Hybrid GAs

Genetic Algorithms can be easily hybridized with other optimization methods, such as image denoising methods, chemical reaction optimization, and many more, for improving their performance. The main advantages of hybridizing GA with other methods are better solution quality, better efficiency, a guarantee of feasible solutions, and optimized control parameters [ 51 ]. It is observed from the literature that the sampling capability of GAs is greatly affected by the population size. To resolve this problem, local search algorithms such as memetic algorithm, Baldwinian, Lamarckian, and local search have been integrated with GAs. This integration provides a proper balance between intensification and diversification. Another problem in GA is parameter setting. Finding appropriate control parameters is a tedious task. Other metaheuristic techniques can be used with GA to resolve this problem. Hybrid GAs have been used to solve the issues mentioned in the following subsections [ 29 , 137 , 186 ].

4.5.1 Enhance search capability

GAs have been integrated with local search algorithms to reduce the genetic drift. The explicit refinement operator was introduced in local search for producing better solutions. El-Mihoub et al. [ 54 ] established the effect of probability of local search on the population size of GA. Espinoza et al. [ 50 ] investigated the effect of local search for reducing the population size of GA. Different search algorithms have been integrated with GAs for solving real-life applications.

4.5.2 Generate feasible solutions

In complex and high-dimensional problems, the genetic operators of GA generate infeasible solutions. PMX crossover generates the infeasible solutions for order-based problems. The distance preserving crossover operator was developed to generate feasible solutions for travelling salesman problem [ 58 ]. The gene pooling operator instead of crossover was used to generate feasible solution for data clustering [ 19 ]. Konak and Smith [ 108 ] integrated a cut-saturation algorithm with GA for designing the communication networks. They used uniform crossover to produce feasible solutions.

4.5.3 Replacement of genetic operators

There is a possibility to replace the genetic operators which are mentioned in Section 3.2 with other search techniques. Leng [ 122 ] developed a guided GA that utilizes the penalties from guided local search. These penalties were used in fitness function to improve the performance of GA. Headar and Fukushima [ 74 ] used simplex crossover instead of standard crossover. The standard mutation operator was replaced with simulated annealing in [ 195 ]. The basic concepts of quantum computing are used to improve the performance of GAs. The heuristic crossover and hill-climbing operators can be integrated into GA for solving three-matching problem.

4.5.4 Optimize control parameters

The control parameters of GA play a crucial role in maintaining the balance between intensification and diversification. Fuzzy logic has the ability to estimate the appropriate control parameters of GA [ 167 ]. Besides this, GA can be used to optimize the control parameters of other techniques. GAs have been used to optimize the learning rate, weights, and topology of neural networks [ 21 ]. GAs can be used to estimate the optimal value of fuzzy membership in a controller. They were also used to optimize the control parameters of ACO, PSO, and other metaheuristic techniques [ 156 ]. The comparative analysis of well-known GAs is presented in Table 9 .

5 Applications

Genetic Algorithms have been applied in various NP-hard problems with high accuracy rates. There are a few application areas in which GAs have been successfully applied.

5.1 Operation management

GA is an efficient metaheuristic for solving operation management (OM) problems such as facility layout problem (FLP), supply network design, scheduling, forecasting, and inventory control.

5.1.1 Facility layout

Datta et al. [ 32 ] utilized GA for solving the single row facility layout problem (SRFLP). For SRFLP, the modified crossover and mutation operators of GA produce valid solutions. They applied GA to large-sized problems that consist of 60–80 instances. However, it suffers from the parameter dependency problem. Sadrzadeh [ 173 ] proposed GA for multi-line FLP with multiple products. The facilities were clustered using mutation and heuristic operators. The total cost obtained from the proposed GA was decreased by 7.2% as compared to the other algorithms. Wu et al. [ 211 ] implemented a hierarchical GA to find the layout of a cellular manufacturing system. However, the performance of GA is greatly affected by the genetic operators. Aiello et al. [ 7 ] proposed MOGA for FLP. They used MOGA on the layout of twenty different departments. Palomo-Romero et al. [ 148 ] proposed an island model GA to solve the FLP. The proposed technique maintains the population diversity and generates better solutions than the existing techniques. However, this technique suffers from an improper migration strategy for improving the population. GA and its variants have been successfully applied to FLP [ 103 , 119 , 133 , 201 ].

5.1.2 Scheduling

GA shows superior performance in solving scheduling problems such as job-shop scheduling (JSS), integrated process planning and scheduling (IPPS), etc. [ 119 ]. To improve the performance in the above-mentioned areas of scheduling, researchers developed various genetic representations [ 12 , 159 , 215 ] and genetic operators, and hybridized GA with other methods [ 2 , 67 , 147 , 219 ].

5.1.3 Inventory control

Besides the scheduling, inventory control plays an important role in OM. Backordering and lost sales are two main approaches for inventory control [ 119 ]. Hiassat et al. [ 76 ] utilized the location-inventory model to find out the number and location of warehouses. Various design constraints have been added in the objective functions of GA and its variants for solving inventory control problem [].

5.1.4 Forecasting and network design

Forecasting is an important component of OM. Researchers are working on forecasting of financial trading, logistics demand, and tourist arrivals. GA has been hybridized with support vector regression, fuzzy sets, and neural networks (NN) to improve their forecasting capability [ 22 , 78 , 89 , 178 , 214 ]. Supply network design greatly affects operations planning and scheduling. Most of the research articles are focused on capacity constraints of facilities [ 45 , 184 ]. Multi-product multi-period problems increase the complexity of supply networks. To resolve the above-mentioned problem, GA has been hybridized with other techniques [ 6 , 45 , 55 , 188 , 189 ]. Multi-objective GAs are also used to optimize the cost, profit, carbon emissions, etc. [ 184 , 189 ].

5.2 Multimedia

GAs have been applied in various fields of multimedia. Some of the well-known multimedia fields are encryption, image processing, video processing, medical imaging, and gaming.

5.2.1 Information security

Due to developments in multimedia applications, images, videos, and audio are transferred from one place to another over the Internet. It has been found in the literature that images are more error-prone during transmission. Therefore, image protection techniques such as encryption, watermarking, and cryptography are required. The classical image encryption techniques require input parameters for encryption. A wrong selection of input parameters will generate inadequate encryption results. GA and its variants have been used to select the appropriate control parameters. Kaur and Kumar [ 96 ] developed a multi-objective genetic algorithm to optimize the control parameters of a chaotic map. The secret key was generated using the beta chaotic map. The generated key was used to encrypt the image. Parallel GAs were also used to encrypt the image [ 97 ].

5.2.2 Image processing

The main image processing tasks are preprocessing, segmentation, object detection, denoising, and recognition. Image segmentation is an important step in solving image processing problems. Decomposing/partitioning an image requires high computational time. To resolve this problem, GA is used due to its better search capability [ 26 , 102 ]. Enhancement is a technique to improve the quality and contrast of an image. Better image quality is required to analyze the given image. GAs have been used to enhance natural contrast and magnify images [ 40 , 64 , 99 ]. Some researchers are working on the hybridization of rough sets with adaptive genetic algorithms to merge the noise and color attributes. GAs have been used to remove noise from the given image. GA can be hybridized with fuzzy logic to denoise a noisy image. GA-based restoration techniques can be used to remove haze, fog, and smog from the given image [ 8 , 110 , 146 , 200 ]. Object detection and recognition are challenging issues in real-world problems. The Gaussian mixture model provides better performance during the detection and recognition process. Its control parameters are optimized through GA [ 93 ].

5.2.3 Video processing

Video segmentation has been widely used in pattern recognition and computer vision. There are some critical issues associated with video segmentation, namely distinguishing objects from the background and determining accurate boundaries. GA can be used to resolve these issues [ 9 , 105 ]. GAs have been implemented successfully for gesture recognition: Chao et al. [ 81 ] applied GAs and found an accuracy of 95% in robot vision. Kaluri and Reddy [ 91 ] proposed an adaptive genetic algorithm based method along with fuzzy classifiers for sign gesture recognition. They reported an improved recognition rate of 85%, as compared to the existing method that provides 79% accuracy. Besides gesture recognition, face recognition plays an important role in criminal identification, unmanned vehicles, surveillance, and robots. GA is able to tackle occlusion, orientations, expressions, pose, and lighting conditions [ 69 , 95 , 109 ].

5.2.4 Medical imaging

Genetic algorithms have been applied in medical imaging, such as edge detection in MRI and pulmonary nodule detection in CT scan images [ 100 , 179 ]. In [ 120 ], the authors used a template matching technique with GA for detecting nodules in CT images. Kavitha and Chellamuthu [ 179 ] used a GA-based region growing method for detecting brain tumors. GAs have been applied to medical prediction problems captured from pathological subjects. Sari and Tuna [ 176 ] used GA to solve issues arising in biomechanics, where it is used to predict pathologies during examination. Ghosh and Bhattachrya [ 62 ] implemented sequential GA with cellular automata for modelling the coronavirus disease 19 (COVID-19) data. GAs can be applied in parallel mode to find rules in biological datasets [ 31 ]. The authors proposed a parallel GA that runs by dividing the process into small sub-generations and evaluating the fitness of each individual solution in parallel. Genetic algorithms are used in medicine and other related fields. Koh et al. [ 61 ] proposed a genetic algorithm based method for the evaluation of adverse effects of a given drug.

5.2.5 Precision agriculture

GAs have been applied to various problems that are related to precision agriculture. The main issues are crop yield, weed detection, and improvement in farming equipment. Pachepsky and Acock [ 145 ] implemented GA to analyze the water capacity in soil using remote sensing images. The crop yield can be predicted through the capacity of water present in the soil. Weed identification was done through GA in [ 142 ]. They used aerial images for the classification of plants. In [ 124 ], color image segmentation was used to discriminate between weeds and plants. Peerlink et al. [ 154 ] determined the appropriate rate of fertilizer for various portions of an agricultural field. They used GA for determining the nitrogen in a wheat field. The energy requirements in water irrigation systems can be optimized by viewing it as a multi-objective optimization problem. The amount of irrigation required, and thus the power requirements, change continuously in a SMART farm. Therefore, GA can be applied in irrigation systems to reduce the power requirements [ 33 ].

5.2.6 Gaming

GAs have been successfully used in games such as gomoku. In [ 202 ], the authors showed that the GA-based approach finds solutions with higher fitness than conventional tree-based methods. However, in real-time strategy games, GA-based solutions become less practical to implement [ 82 ]. GAs have also been implemented for path planning problems, which must respect environment constraints and avoid obstacles on the way to a given destination. Burchardt and Salomon [ 18 ] described an implementation of path planning for soccer games. GA can encode a path planning problem through the coordinate points of a two-dimensional playing field, which results in a variable-length solution. The fitness function in path planning considers the length of the path as well as collision-avoidance terms for the soccer players.
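The path encoding and fitness terms described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [ 18 ]: modelling obstacles as circles, the penalty weight `COLLISION_WEIGHT`, and all function names are hypothetical choices made for the example.

```python
import math

# Assumption: obstacles (e.g. other players) are circles given as (x, y, radius).
COLLISION_WEIGHT = 100.0  # assumed penalty added for each colliding segment


def segment_hits_circle(p, q, circle):
    """Return True if the segment p-q passes strictly inside the circle."""
    (px, py), (qx, qy), (cx, cy, r) = p, q, circle
    dx, dy = qx - px, qy - py
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0
    else:
        # Project the circle centre onto the segment and clamp to [0, 1].
        t = max(0.0, min(1.0, ((cx - px) * dx + (cy - py) * dy) / seg_len_sq))
    nearest = (px + t * dx, py + t * dy)
    return math.dist(nearest, (cx, cy)) < r


def path_cost(waypoints, start, goal, obstacles):
    """Cost to minimise: path length plus a penalty per colliding segment.

    `waypoints` is the chromosome: a variable-length list of (x, y) points.
    """
    points = [start, *waypoints, goal]
    segments = list(zip(points, points[1:]))
    length = sum(math.dist(a, b) for a, b in segments)
    collisions = sum(
        segment_hits_circle(a, b, obstacle)
        for a, b in segments
        for obstacle in obstacles
    )
    return length + COLLISION_WEIGHT * collisions


obstacles = [(5, 0, 1)]
direct = path_cost([], (0, 0), (10, 0), obstacles)        # runs through the obstacle
detour = path_cost([(5, 3)], (0, 0), (10, 0), obstacles)  # one extra waypoint
```

In this toy setting the detour is longer but cheaper, because the direct path pays the collision penalty. A GA that minimises `path_cost` would therefore prefer chromosomes with enough waypoints to route around obstacles.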

5.3 Wireless networking

Because GAs are adaptive, scalable, and easy to implement, they have been used to solve various problems in wireless networking. The main issues in wireless networking are routing, quality of service, load balancing, localization, bandwidth allocation, and channel assignment [ 128 , 134 ]. GA has been hybridized with other metaheuristics to solve routing problems. Hybrid GAs not only produce efficient routes between pairs of nodes, but are also used for load balancing [ 24 , 212 ].

5.3.1 Load balancing

Nowadays, multimedia applications impose Quality-of-Service (QoS) demands on delay and bandwidth. Various researchers are working on GAs for QoS-based solutions. GA can produce optimal solutions for complex networks [ 49 ]. Roy et al. [ 172 ] proposed a multi-objective GA for the multicast QoS routing problem. GA was used with ACO and other search algorithms to find optimal routes with the desired QoS metrics. Load balancing is another issue in wireless networks. Scully and Brown [ 177 ] used MicroGAs and MacroGAs to distribute the load among the various components of a network. He et al. [ 73 ] implemented GA to balance the load in wireless sensor networks. Cheng et al. [ 25 ] utilized a distributed GA with a multi-population scheme for load balancing. They used a load balancing metric as the fitness function of the GA.
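To make the idea of a load balancing metric used as a fitness function concrete, the following minimal sketch scores an assignment of traffic demands to nodes by the spread of the resulting loads. It is an illustration only, not the metric of [ 25 ]: the integer-list chromosome, the use of the population standard deviation, and the names `load_imbalance` and `fitness` are all assumptions.

```python
import statistics


def load_imbalance(assignment, demands, n_nodes):
    """Population standard deviation of per-node load (0.0 = perfectly balanced).

    assignment[i] is the index of the node that serves demand i (the chromosome).
    """
    loads = [0.0] * n_nodes
    for demand, node in zip(demands, assignment):
        loads[node] += demand
    return statistics.pstdev(loads)


def fitness(assignment, demands, n_nodes):
    """Higher is better, so the imbalance is negated."""
    return -load_imbalance(assignment, demands, n_nodes)


demands = [4, 4, 2, 2]
balanced = [0, 1, 0, 1]    # loads: [6, 6]
unbalanced = [0, 0, 0, 0]  # loads: [12, 0]
```

Under this assumed metric the balanced assignment receives the higher fitness, so selection would favour chromosomes that spread the demands evenly.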

5.3.2 Localization

The process of determining the location of wireless nodes is called localization. It plays an important role in disaster management and military services. Yun et al. [ 216 ] used GA with fuzzy logic to find the weights, which are assigned according to signal strength. Zhang et al. [ 218 ] hybridized GA with simulated annealing (SA) to determine the position of wireless nodes. SA is used as a local search to avoid premature convergence.

5.3.3 Bandwidth and channel allocation

Appropriate bandwidth allocation is a complex task. GAs and their variants have been developed to solve the bandwidth allocation problem [ 92 , 94 , 107 ]. GAs were used to investigate the allocation of bandwidth under QoS constraints. The fitness function of such GAs may consist of resource utilization, bandwidth distribution, and computation time [ 168 ]. Channel allocation is another important issue in wireless networks. Its main objective is to simultaneously optimize the number of channels and the reuse of allocated frequencies. Friend et al. [ 59 ] used a distributed island GA to solve the channel allocation problem in cognitive radio networks. Zhenhua et al. [ 221 ] implemented a modified immune GA for channel assignment, with a different encoding scheme and immune operators. Pinagapany and Kulkarni [ 157 ] developed a parallel GA with a decimal encoding scheme to solve both static and dynamic channel allocation problems. Table 10 summarizes the applications of GA and its variants.
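A fitness function that combines several of the terms named above for bandwidth allocation could look like the following sketch. It is hypothetical rather than taken from [ 168 ]: it uses Jain's fairness index as one possible measure of bandwidth distribution, omits the computation-time term, and the weights `w_util` and `w_fair` are arbitrary assumptions.

```python
def jain_fairness(allocations):
    """Jain's fairness index: 1.0 for equal shares, 1/n if one user gets everything."""
    total = sum(allocations)
    squares = sum(a * a for a in allocations)
    if squares == 0:
        return 0.0  # nothing allocated; treated as the worst case here (assumption)
    return total * total / (len(allocations) * squares)


def bandwidth_fitness(allocations, capacity, w_util=0.5, w_fair=0.5):
    """Weighted sum of link utilization and fairness; infeasible allocations score 0."""
    if any(a < 0 for a in allocations) or sum(allocations) > capacity:
        return 0.0
    utilization = sum(allocations) / capacity
    return w_util * utilization + w_fair * jain_fairness(allocations)


even = bandwidth_fitness([5, 5], capacity=10)     # full utilization, equal shares
skewed = bandwidth_fitness([10, 0], capacity=10)  # full utilization, one user starved
over = bandwidth_fitness([8, 8], capacity=10)     # exceeds capacity
```

Scoring infeasible allocations as zero is the simplest way to handle the capacity constraint. A graded penalty would give the search more guidance but requires an additional tuning parameter.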

6 Challenges and future possibilities

In this section, the main challenges faced during the implementation of GAs are discussed followed by the possible research directions.

6.1 Challenges

Despite their several advantages, some challenges need to be resolved for the future advancement and further evolution of genetic algorithms. The major challenges are given below:

6.1.1 Selection of initial population

The initial population is always considered an important factor in the performance of genetic algorithms. The size of the population also affects the quality of the solution [ 160 ]. Researchers argue that a large population increases the computation time of the algorithm, whereas a small population may lead to a poor solution [ 155 ]. Therefore, finding the appropriate population size is always a challenging issue. Harik and Lobo [ 71 ] investigated the population size using a self-adaptation method. They used two approaches: (1) self-adaptation prior to execution of the algorithm, in which the population size then remains the same, and (2) self-adaptation during execution of the algorithm, in which the population size is affected by the fitness function.

6.1.2 Premature convergence

Premature convergence is a common issue for GA. It can lead to the loss of alleles, which makes it difficult to identify a gene [ 15 ]. Premature convergence means that the result will be suboptimal because the population converges too early. To avoid this issue, some researchers suggested that diversity should be maintained, and that the selection pressure should be adjusted to do so. Selection pressure is the degree to which better individuals are favored in the population of a GA. If a selection pressure SP1 is greater than another selection pressure SP2, then the population using SP1 should be larger than the population using SP2. A higher selection pressure can decrease the population diversity, which may lead to premature convergence [ 71 ].
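The link between selection pressure and diversity can be illustrated with tournament selection, where the tournament size `k` acts as the pressure setting. The sketch below is an assumed toy setup (integer individuals, fitness equal to the value, tournaments drawn without replacement) and is not drawn from the cited works.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Draw k distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)


def mating_pool_diversity(population, fitness, k, rng):
    """Number of distinct individuals in a mating pool as large as the population."""
    pool = [tournament_select(population, fitness, k, rng) for _ in population]
    return len(set(pool))


def identity(x):
    return x


rng = random.Random(0)
population = list(range(20))  # toy individuals; fitness is the value itself

# k = 1 is uniform random selection (no pressure); k = population size always
# returns the single best individual (maximum pressure).
low_pressure = mating_pool_diversity(population, identity, 1, rng)
high_pressure = mating_pool_diversity(population, identity, len(population), rng)
```

At maximum pressure the mating pool collapses to copies of one individual, which is the loss of diversity that leads to premature convergence. Intermediate values of `k` trade convergence speed against diversity.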

The convergence property has to be handled properly so that the algorithm finds the global optimal solution instead of a local optimal solution (see Fig. 8 ). If the optimal solution lies in the vicinity of an infeasible solution, then the global nature of GA can be combined with the local nature of other algorithms such as Tabu search and local search. The global nature of genetic algorithms and the local nature of Tabu search provide a proper balance between intensification and diversification.

Fig. 8 Local and global optima [ 149 ]
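The combination of global search and local refinement described above can be sketched as a hybrid generation step. The example uses plain first-improvement hill climbing as the local search, not Tabu search, and a toy OneMax fitness. The operator choices, the parameter values, and the function names are all assumptions made for illustration.

```python
import random


def onemax(bits):
    """Toy fitness: number of ones in the chromosome."""
    return sum(bits)


def hill_climb(bits, fitness, max_steps=50):
    """First-improvement local search over single-bit flips."""
    current = list(bits)
    for _ in range(max_steps):
        improved = False
        for i in range(len(current)):
            neighbour = current.copy()
            neighbour[i] ^= 1
            if fitness(neighbour) > fitness(current):
                current, improved = neighbour, True
                break
        if not improved:
            break  # no single flip improves the fitness: local optimum
    return current


def hybrid_generation(population, fitness, rng, mutation_rate=0.05, local_steps=2):
    """Binary tournament, one-point crossover, bit-flip mutation, then local search."""
    def pick():
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    children = []
    while len(children) < len(population):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
        children.append(hill_climb(child, fitness, max_steps=local_steps))
    return children


population = [[0, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1]]
next_population = hybrid_generation(population, onemax, random.Random(0))
```

Limiting the local search to a few steps per child (`local_steps`) keeps the genetic operators responsible for exploration while the local search only polishes each offspring.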

6.1.3 Selection of efficient fitness functions

The fitness function is the driving force that selects the fittest individuals in every iteration of the algorithm. If the number of iterations is small, a costly fitness function can be accommodated. As the number of iterations increases, however, the computational cost may increase as well. The selection of a fitness function therefore depends on its computational cost as well as its suitability. In [ 46 ], the authors used the Davies-Bouldin index for the classification of documents.
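One common way to limit the cost of an expensive fitness function, offered here as an illustration rather than a technique discussed in the cited works, is to cache evaluations so that duplicate chromosomes are scored only once. The sketch assumes chromosomes are hashable tuples and uses a trivial stand-in for the expensive computation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fitness(chromosome):
    """Stand-in for an expensive evaluation; chromosome must be hashable (a tuple)."""
    return sum(gene * gene for gene in chromosome)


population = [(1, 2), (3, 4), (1, 2), (1, 2)]
scores = [fitness(individual) for individual in population]
info = fitness.cache_info()  # info.misses counts the real evaluations
```

Here four individuals require only two real evaluations. The saving grows as the population converges and duplicates become more frequent, at the price of the memory used by the cache.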

6.1.4 Degree of mutation and crossover

Crossover and mutation operators are an integral part of GAs. If mutation is not used during evolution, no new information becomes available for evolution. If crossover is not used during evolution, the algorithm can become stuck in local optima. The degree of these operators greatly affects the performance of GAs [ 72 ]. A proper balance between these operators is required to reach the global optimum. Because the operators are probabilistic, the exact degree needed for an effective and optimal solution cannot be determined in advance.
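The claim that crossover alone introduces no new information can be checked with a small experiment. The sketch below applies one-point crossover and bit-flip mutation with configurable rates and deliberately omits selection to isolate the operators. The function names and the random-mating scheme are assumptions for illustration.

```python
import random


def one_point_crossover(p1, p2, rng):
    """Child takes the head of p1 and the tail of p2."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]


def evolve(population, generations, crossover_rate, mutation_rate, rng):
    """Random mating without selection, to isolate the effect of the operators."""
    for _ in range(generations):
        next_population = []
        for _ in population:
            p1, p2 = rng.sample(population, 2)
            if rng.random() < crossover_rate:
                child = one_point_crossover(p1, p2, rng)
            else:
                child = list(p1)
            next_population.append(bit_flip_mutation(child, mutation_rate, rng))
        population = next_population
    return population


# Every individual has a 0 at locus 0, so the allele 1 is missing at that locus.
start = [[0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
crossover_only = evolve(start, 10, 0.9, 0.0, random.Random(0))
with_mutation = evolve(start, 1, 0.0, 1.0, random.Random(0))
```

With a mutation rate of zero, locus 0 stays at 0 in every individual no matter how many generations are run, because crossover only rearranges alleles that already exist. Mutation is the only operator that can reintroduce the missing allele.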

6.1.5 Selection of encoding schemes

GAs require a particular encoding scheme for a specific problem. There is no general methodology for deciding whether a particular encoding scheme is suitable for a given real-life problem. Two different problems generally require two different encoding schemes. Ronald [ 171 ] suggested that encoding schemes should be designed to overcome redundant forms, and that the genetic operators should be implemented so that they are not biased towards redundant forms.
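A small example shows why the encoding and the operators have to be chosen together. The sketch below contrasts a binary encoding of a real parameter with a permutation encoding of a tour. The helper names are hypothetical, and the crossover takes a fixed cut point to keep the example deterministic.

```python
def decode_binary(bits, low, high):
    """Map a bit list (most significant bit first) to a real value in [low, high]."""
    value = int("".join(str(b) for b in bits), 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)


def is_valid_tour(tour):
    """A permutation-encoded tour visits every city 0..n-1 exactly once."""
    return sorted(tour) == list(range(len(tour)))


def one_point_crossover(p1, p2, cut):
    """Child takes p1 up to the cut point and p2 after it."""
    return p1[:cut] + p2[cut:]


# With the binary encoding, the child is always a valid chromosome.
child_bits = one_point_crossover([1, 0, 0, 0], [0, 1, 1, 1], 2)  # [1, 0, 1, 1]
child_value = decode_binary(child_bits, 0.0, 15.0)

# With the permutation encoding, the same operator can break the tour.
child_tour = one_point_crossover([0, 1, 2, 3], [3, 2, 1, 0], 2)  # [0, 1, 1, 0]
```

The child tour visits cities 0 and 1 twice and never reaches cities 2 and 3. Permutation problems therefore need either order-preserving operators or a repair step.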

6.2 Future research directions

GAs have been applied in different fields by modifying the basic structure of GA. The optimality of a solution obtained from GA can be made better by overcoming the current challenges. Some future possibilities for GA are as follows:

There should be some way to choose the appropriate degree of the crossover and mutation operators. For example, a Self-Organizing GA adapts the crossover and mutation operators to the given problem. This can save computation time and make the algorithm faster.

Future work can also address the premature convergence problem. Some researchers are working in this direction. However, new crossover and mutation techniques are still required to tackle premature convergence.

Genetic algorithms mimic the natural evolution process. There is scope for using them to simulate natural evolutionary processes such as the responses of the human immune system and mutations in viruses.

In real-life problems, the mapping from genotype to phenotype is complex. In this situation, the problem has no obvious building blocks, or the building blocks are not adjacent groups of genes. Hence, there is scope to develop novel encoding schemes for different problems that do not exhibit the same degree of difficulty.
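As an illustration of the first direction listed above, the degree of mutation can be tied to a measure of population diversity. The rule below is a hypothetical sketch and not the Self-Organizing GA itself: the thresholds `low` and `high`, the scaling `factor`, and the rate bounds are arbitrary assumptions.

```python
def distinct_fraction(population):
    """Fraction of distinct chromosomes in the population (1.0 = all different)."""
    return len({tuple(ind) for ind in population}) / len(population)


def adapt_mutation_rate(rate, population, low=0.5, high=0.9,
                        factor=2.0, min_rate=0.001, max_rate=0.5):
    """Raise the rate when diversity is low, lower it when diversity is high."""
    diversity = distinct_fraction(population)
    if diversity < low:
        return min(max_rate, rate * factor)
    if diversity > high:
        return max(min_rate, rate / factor)
    return rate


converged = [[1, 1, 0]] * 4                          # diversity 0.25
diverse = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]  # diversity 1.0
mixed = [[0, 0], [0, 0], [0, 1], [1, 0]]             # diversity 0.75

raised = adapt_mutation_rate(0.01, converged)
lowered = adapt_mutation_rate(0.01, diverse)
unchanged = adapt_mutation_rate(0.01, mixed)
```

Calling such a rule once per generation lets the mutation rate rise when the population starts to converge and fall again once diversity is restored, so that no single fixed rate has to be chosen in advance.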

7 Conclusions

This paper presents a structured and explanatory view of genetic algorithms. GA and its variants have been discussed along with their applications. Application-specific genetic operators have been discussed. Some genetic operators are designed for a particular representation and are therefore not applicable to every research domain. The role of genetic operators such as crossover, mutation, and selection in alleviating premature convergence has been studied extensively. The applicability of GA and its variants in various research domains has been discussed. Multimedia and wireless network applications were the main focus of this paper. The challenges and issues mentioned in this paper will help practitioners to carry out their research. There are many advantages of using GAs in other research domains and with other metaheuristic algorithms.

The intention of this paper is not only to provide a source of recent research in GAs, but also to provide information about each component of GA. It will encourage researchers to understand the fundamentals of GA and to use this knowledge in their research problems.

References

Abbasi M, Rafiee M, Khosravi MR, Jolfaei A, Menon VG, Koushyar JM (2020) An efficient parallel genetic algorithm solution for vehicle routing problem in cloud implementation of the intelligent transportation systems. Journal of Cloud Computing 9(6)

Abdelghany A, Abdelghany K, Azadian F (2017) Airline flight schedule planning under competition. Comput Oper Res 87:20–39

Abdulal W, Ramachandram S (2011) Reliability-aware genetic scheduling algorithm in grid environment. International Conference on Communication Systems and Network Technologies, Katra, Jammu, pp 673–677

Abdullah J (2010) Multiobjectives ga-based QoS routing protocol for mobile ad hoc network. Int J Grid Distrib Comput 3(4):57–68

Abo-Elnaga Y, Nasr S (2020) Modified evolutionary algorithm and chaotic search for Bilevel programming problems. Symmetry 12:767

Afrouzy ZA, Nasseri SH, Mahdavi I (2016) A genetic algorithm for supply chain configuration with new product development. Comput Ind Eng 101:440–454

Aiello G, La Scalia G, Enea M (2012) A multi objective genetic algorithm for the facility layout problem based upon slicing structure encoding. Expert Syst Appl 39(12):10352–10358

Alaoui A, Adamou-Mitiche ABH, Mitiche L (2020) Effective hybrid genetic algorithm for removing salt and pepper noise. IET Image Process 14(2):289–296

Alkhafaji BJ, Salih MA, Nabat ZM, Shnain SA (2020) Segmenting video frame images using genetic algorithms. Periodicals of Engineering and Natural Sciences 8(2):1106–1114

Al-Oqaily AT, Shakah G (2018) Solving non-linear optimization problems using parallel genetic algorithm. International Conference on Computer Science and Information Technology (CSIT), Amman, pp. 103–106

Alves MJ, Almeida M (2007) MOTGA: A multiobjective Tchebycheff based genetic algorithm for the multidimensional knapsack problem. Comput Oper Res 34:3458–3470

Arakaki RK, Usberti FL (2018) Hybrid genetic algorithm for the open capacitated arc routing problem. Comput Oper Res 90:221–231

Arkhipov DI, Wu D, Wu T, Regan AC (2020) A parallel genetic algorithm framework for transportation planning and logistics management. IEEE Access 8:106506–106515

Azadeh A, Elahi S, Farahani MH, Nasirian B (2017) A genetic algorithm-Taguchi based approach to inventory routing problem of a single perishable product with transshipment. Comput Ind Eng 104:124–133

Baker JE, Grefenstette J (2014) Proceedings of the first international conference on genetic algorithms and their applications. Taylor and Francis, Hoboken, pp 101–105

Bolboacă SD, Jäntschi L, Balan MC, Diudea MV, Sestras RE (2010) State of art in genetic algorithms for agricultural systems. Not Bot Hort Agrobot Cluj 38(3):51–63

Bonabeau E, Dorigo M, Theraulaz G (1999) Swarm intelligence: from natural to artificial systems. Oxford University Press, Inc

Burchardt H, Salomon R (2006) Implementation of path planning using genetic algorithms on Mobile robots. IEEE International Conference on Evolutionary Computation, Vancouver, BC, pp 1831–1836

Burdsall B, Giraud-Carrier C (1997) Evolving fuzzy prototypes for efficient data clustering. In: Second international ICSC symposium on fuzzy logic and applications. Zurich, Switzerland, pp 217–223

Burkowski FJ (1999) Shuffle crossover and mutual information. Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406), Washington, DC, USA, 1999, pp. 1574–1580

Chaiyaratana N, Zalzala AM (2000) Hybridisation of neural networks and a genetic algorithm for friction compensation. In: The 2000 congress on evolutionary computation, vol 1. San Diego, USA, pp 22–29

Chen R, Liang C-Y, Hong W-C, Gu D-X (2015) Forecasting holiday daily tourist flow based on seasonal support vector regression with adaptive genetic algorithm. Appl Soft Comput 26:434–443

Cheng JR, Gen M (2020) Parallel genetic algorithms with GPU computing. Impact on Intelligent Logistics and Manufacturing

Cheng H, Yang S (2010) Multi-population genetic algorithms with immigrants scheme for dynamic shortest path routing problems in mobile ad hoc networks. In: Applications of evolutionary computation. Springer, pp 562–571

Cheng H, Yang S, Cao J (2013) Dynamic genetic algorithms for the dynamic load balanced clustering problem in mobile ad hoc networks. Expert Syst Appl 40(4):1381–1392

Chouhan SS, Kaul A, Singh UP (2018) Soft computing approaches for image segmentation: a survey. Multimed Tools Appl 77(21):28483–28537

Chuang YC, Chen CT, Hwang C (2016) A simple and efficient real-coded genetic algorithm for constrained optimization. Appl Soft Comput 38:87–105

Coello CAC, Pulido GT (2001) A micro-genetic algorithm for multiobjective optimization. In: EMO, volume 1993 of lecture notes in computer science, pp 126–140. Springer

Das KN (2014) Hybrid genetic algorithm: an optimization tool. In: Global trends in intelligent computing research and development. IGI Global, pp 268–305

Das AK, Pratihar DK (2018) A direction-based exponential mutation operator for real-coded genetic algorithm. IEEE International Conference on Emerging Applications of Information Technology.

Dash SR, Dehuri S, Rayaguru S (2013) Discovering interesting rules from biological data using parallel genetic algorithm. 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, pp 631–636

Datta D, Amaral ARS, Figueira JR (2011) Single row facility layout problem using a permutation-based genetic algorithm. European J Oper Res 213(2):388–394

de Ocampo ALP, Dadios EP (2017) Energy cost optimization in irrigation system of smart farm by using genetic algorithm. 2017 IEEE 9th international conference on humanoid, nanotechnology, information technology, communication and control, environment and management (HNICEM), Manila, pp 1–7

Deb K, Agrawal RB (1995) Simulated binary crossover for continuous search space. Complex Systems 9:115–148

Deb K, Deb D (2014) Analysing mutation schemes for real-parameter genetic algorithms. International Journal of Artificial Intelligence and Soft Computing 4(1):1–28

Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197

Deep K, Das KN (2008) Quadratic approximation based hybrid genetic algorithm for function optimization. Appl Math Comput 203(1):86–98

Deep K, Thakur M (2007) A new mutation operator for real coded genetic algorithms. Appl Math Comput 193:211–230

Deep K, Thakur M (2007) A new crossover operator for real coded genetic algorithms. Appl Math Comput 188:895–911

Dhal KP, Ray S, Das A, Das S (2018) A survey on nature-inspired optimization algorithms and their application in image enhancement domain. Archives of Computational Methods in Engineering 5:1607–1638

Dhiman G, Kumar V (2017) Spotted hyena optimizer: A novel bio-inspired based metaheuristic technique for engineering applications. Adv Eng Softw 114:48–70

Dhiman G, Kumar V (2018) Emperor penguin optimizer: A bio-inspired algorithm for engineering problems. Knowl-Based Syst 159:20–50

Dhiman G, Kumar V (2019) Seagull optimization algorithm: theory and its applications for large-scale industrial engineering problems. Knowl-Based Syst 165:169–196

Di Fatta G, Hoffmann F, Lo Re G, Urso A (2003) A genetic algorithm for the design of a fuzzy controller for active queue management. IEEE Trans Syst Man Cybern Part C Appl Rev 33(3):313–324

Diabat A, Deskoores R (2016) A hybrid genetic algorithm based heuristic for an integrated supply chain problem. J Manuf Syst 38:172–180

Diaz-Manríquez A, Ríos-Alvarado AB, Barrón-Zambrano JH, Guerrero-Melendez TY, Elizondo-Leal JC (2018) An automatic document classifier system based on genetic algorithm and taxonomy. IEEE Access 6:21552–21559. https://doi.org/10.1109/ACCESS.2018.2815992

Dorigo M, Birattari M, Stutzle T (2006) Ant colony optimization - artificial ants as a computational intelligence technique. IEEE Comput Intell Mag 1(2006):28–39

Ebrahimzadeh R, Jampour M (2013) Chaotic genetic algorithm based on Lorenz chaotic system for optimization problems. I.J. Intelligent Systems and Applications 05(05):19–24

EkbataniFard GH, Monsefi R, Akbarzadeh-T M-R, Yaghmaee M et al. (2010) A multi-objective genetic algorithm based approach for energy efficient qos-routing in two-tiered wireless sensor networks. In: Wireless pervasive computing (ISWPC), 2010 5th IEEE international symposium on. IEEE, pp 80–85

El-Mihoub T, Hopgood A, Nolle L, Battersby A (2004) Performance of hybrid genetic algorithms incorporating local search. In: Horton G (ed) 18th European simulation multi-conference (ESM2004). Germany, Magdeburg, pp 154–160

El-Mihoub TA, Hopgood AA, Lars N, Battersby A (2006) Hybrid genetic algorithms: A review. Eng Lett 13:2

Emmerich MTM, Deutz AH (2018) A tutorial on multiobjective optimization: fundamentals and evolutionary methods. Nat Comput 17(3):585–609

Eshelman LJ, Caruana RA, Schaffer JD (1997) Biases in the crossover landscape.

Espinoza FB, Minsker B, Goldberg D (2003) Performance evaluation and population size reduction for self adaptive hybrid genetic algorithm (SAHGA), in the Genetic and Evolutionary Computation Conference, vol. 2723, Lecture Notes in Computer Science San Francisco, USA: Springer, pp. 922–933.

Farahani RZ, Elahipanah M (2008) A genetic algorithm to optimize the total cost and service level for just-in-time distribution in a supply chain. Int J Prod Econ 111(2):229–243

Fonseca CM, Fleming PJ (1993) Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. In: ICGA, pp 416–423. Morgan Kaufmann

Fox B, McMahon M (1991) Genetic operators for sequencing problems. In: Rawlins G (ed) Foundations of genetic algorithms. Morgan Kaufmann Publishers, San Mateo, CA, pp 284–300

Freisleben B, Merz P (1996) New genetic local search operators for the traveling salesman problem. In: Voigt H-M, Ebeling W, Rechenberg I, Schwefel H-P (eds) The fourth conference on parallel problem solving from nature. Lecture notes in computer science, vol 1141. Springer-Verlag, Berlin, Germany, pp 890–899

Friend DH, El Nainay M, Shi Y, MacKenzie AB (2008) Architecture and performance of an island genetic algorithm-based cognitive network. In: Consumer communications and networking conference, 2008. CCNC 2008. 5th IEEE. IEEE, pp 993–997

Fuertes G, Vargas M, Alfaro M, Soto-Garrido R, Sabattin J, Peralta M-A (2019) Chaotic genetic algorithm and the effects of entropy in performance optimization.

Ghaheri A, Shoar S, Naderan M, Hoseini SS (2015) The applications of genetic algorithms in medicine. CJ 30:406–416

Ghosh S, Bhattachrya S (2020) A data-driven understanding of COVID-19 dynamics using sequential genetic algorithm based probabilistic cellular automata. Applied Soft Computing. 96

Ghoshal AK, Das N, Bhattacharjee S, Chakraborty G (2019) A fast parallel genetic algorithm based approach for community detection in large networks. International Conference on Communication Systems & Networks (COMSNETS), Bengaluru, India, pp. 95–101.

Gogna A, Tayal A (2012) Comparative analysis of evolutionary algorithms for image enhancement. Int J Met 2(1)

Goldberg D (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA

Goldberg D, Lingle R (1985) Alleles, loci and the traveling salesman problem. In: Proceedings of the 1st international conference on genetic algorithms and their applications, vol. 1985. Los Angeles, USA, pp 154–159

Guido R, Conforti D (2017) A hybrid genetic approach for solving an integrated multi-objective operating room planning and scheduling problem. Comput Oper Res 87:270–282

Ha QM, Deville Y, Pham QD, Ha MH (2020) A hybrid genetic algorithm for the traveling salesman problem with drone. J Heuristics 26:219–247

HajiRassouliha A, Gamage TPB, Parker MD, Nash MP, Taberner AJ, Nielsen PM (2013) FPGA implementation of 2D cross-correlation for real-time 3D tracking of deformable surfaces. In: Proceedings of the 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013), Wellington, New Zealand, 27–29 November 2013. IEEE, Piscataway, NJ, USA, pp 352–357

Harada T, Alba E (2020) Parallel genetic algorithms: a useful survey. ACM Computing Surveys 53(4):1–39

Harik GR, Lobo FG (1999) A parameter-less genetic algorithm, in Proceedings of the Genetic and Evolutionary Computation Conference, pp. 258–265.

Hassanat A, Almohammadi K, Alkafaween E, Abunawas E, Hammouri A, Prasath VBS (December 2019) Choosing mutation and crossover ratios for genetic algorithms—A review with a new dynamic approach. Information 10:390. https://doi.org/10.3390/info10120390

He J, Ji S, Yan M, Pan Y, Li Y (2012) Load-balanced CDS construction in wireless sensor networks via genetic algorithm. Int J Sens Netw 11(3):166–178

Hedar A, Fukushima M (2003) Simplex coding genetic algorithm for the global optimization of nonlinear functions, in Multi-Objective Programming and Goal Programming, Advances in Soft Computing, T. Tanino, T. Tanaka, and M. Inuiguchi, Eds.: Springer-Verlag, pp. 135–140.

Helal MHS, Fan C, Liu D, Yuan S (2017) Peer-to-peer based parallel genetic algorithm. International Conference on Information, Communication and Engineering (ICICE), Xiamen, pp 535–538

Hiassat A, Diabat A, Rahwan I (2017) A genetic algorithm approach for location-inventory-routing problem with perishable products. J Manuf Syst 42:93–103

Holland JH (1975) Adaptation in natural and artificial systems. The U. of Michigan Press

Hong W-C, Dong Y, Chen L-Y, Wei S-Y (2011) SVR with hybrid chaotic genetic algorithms for tourism demand forecasting. Appl Soft Comput 11(2):1881–1890

Hong T-P, Lee Y-C, Min-Thai W (2014) An effective parallel approach for genetic-fuzzy data mining. Exp Syst Applic 41(2):655–662

Horn J, Nafpliotis N, Goldberg DE. (1994) A niched Pareto genetic algorithm for multiobjective optimization. Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, vol. 1, Piscataway, NJ: IEEE Service Center, p. 67–72.

Hu C, Wang X, Mandal MK, Meng M, Li D (2003) Efficient face and gesture recognition techniques for robot control. CCECE 2003 - CCGEI 2003, Montreal, May 2003. IEEE, pp 1757–1762

Huo P, Shiu SCK, Wang H, Niu B (2009) Application and comparison of particle swarm optimization and genetic algorithm in strategy defense game. Fifth International Conference on Natural Computation, pp 387–392

Hussain A, Muhammad YS, Nauman Sajid M, Hussain I, Mohamd Shoukry A, Gani S (2017) Genetic algorithm for traveling salesman problem with modified cycle crossover operator. Computational intelligence and neuroscience 2017:1–7

Ishibuchi H, Murata T (1998) A multi-objective genetic local search algorithm and its application to flowshop scheduling. IEEE Trans Syst Man Cybern Part C Appl Rev 28(3):392–403

Jafari A, Khalili T, Babaei E, Bidram A (2020) Hybrid optimization technique using exchange market and GA. IEEE Access 8:2417–2427

Jaszkiewicz A (February 2002) Genetic local search for multi-objective combinatorial optimization. Eur J Oper Res 137(1):50–71

Javidi M, Hosseinpourfard R (2015) Chaos genetic algorithm instead genetic algorithm. Int J Inf Tech 12(2):163–168

Jebari K (2013) Selection methods for genetic algorithms. Abdelmalek Essaâdi University. International Journal of Emerging Sciences 3(4):333–344

Jiang S, Chin K-S, Wang L, Qu G, Tsui KL (2017) Modified genetic algorithm-based feature selection combined with pre-trained deep neural network for demand forecasting in outpatient department. Expert Syst Appl 82:216–230

Jiang M, Fan X, Pei Z, Zhang Z (2018) Research on text feature clustering based on improved parallel genetic algorithm. Tenth International Conference on Advanced Computational Intelligence (ICACI), Xiamen, pp. 235–238

Kaluri R, Reddy P (2016) Sign gesture recognition using modified region growing algorithm and adaptive genetic fuzzy classifier. International Journal of Intelligent Engineering and Systems 9(4):225–233

Kandavanam G, Botvich D, Balasubramaniam S, Jennings B (2010) A hybrid genetic algorithm/variable neighborhood search approach to maximizing residual bandwidth of links for route planning. In: Artificial evolution. Springer, pp 49–60

Kannan S (2020) Intelligent object recognition in underwater images using evolutionary-based Gaussian mixture model and shape matching. SIViP 14:877–885

Karabudak D, Hung C-C, Bing B (2004) A call admission control scheme using genetic algorithms. In: Proceedings of the 2004 ACM symposium on applied computing. ACM, pp 1151–1158

Katz P, Aron M, Alfalou A (2001) A face-tracking system to detect falls in the elderly; SPIE newsroom. SPIE, Bellingham, WA, USA, p 201

Kaur M, Kumar V (2018) Beta chaotic map based image encryption using genetic algorithm. Int J Bifurcation Chaos 28(11):1850132

Kaur M, Kumar V (2018) Parallel non-dominated sorting genetic algorithm-II-based image encryption technique. The Imaging Science Journal. 66(8):453–462

Kaur M, Kumar V (2018) Fourier–Mellin moment-based intertwining map for image encryption. Modern Physics Letters B 32(9):1850115

Kaur G, Bhardwaj N, Singh PK (2018) An analytic review on image enhancement techniques based on soft computing approach. Sensors and Image Processing, Advances in Intelligent Systems and Computing 651:255–266

Kavitha AR, Chellamuthu C (2016) Brain tumour segmentation from MRI image using genetic algorithm with fuzzy initialisation and seeded modified region growing (GFSMRG) method. The Imaging Science Journal 64(5):285–297

Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of IEEE international conference on neural networks (1995), pp 1942–1948

Khan A, ur Rehman Z, Jaffar MA, Ullah J, Din A, Ali A, Ullah N (2019) Color image segmentation using genetic algorithm with aggregation-based clustering validity index (CVI). SIViP 13(5):833–841

Kia R, Khaksar-Haghani F, Javadian N, Tavakkoli-Moghaddam R (2014) Solving a multi-floor layout design model of a dynamic cellular manufacturing system by an efficient genetic algorithm. J Manuf Syst 33(1):218–232

Kim EY, Jung K (2006) Genetic algorithms for video segmentation. Pattern Recogn 38(1):59–73

Kim EY, Park SH (2006) Automatic video segmentation using genetic algorithms. Pattern Recogn Lett 27(11):1252–1265

Kita H, Ono I, Kobayashi S (1999) The multi-parent unimodal normal distribution crossover for real-coded genetic algorithms. Proceedings of the 1999 Congress on Evolutionary Computation, vol 2. IEEE, pp 1588–1595

Kobayashi H, Munetomo M, Akama K, Sato Y (2004) Designing a distributed algorithm for bandwidth allocation with a genetic algorithm. Syst Comput Jpn 35(3):37–45

Konak A, Smith AE (1999) A hybrid genetic algorithm approach for backbone design of communication networks. In: The 1999 Congress on Evolutionary Computation. IEEE, Washington, DC, USA, pp 1817–1823

Kortil Y, Jridi M, Falou AA, Atri M (2020) Face recognition systems: A survey. Sensors. 20:1–34

Krishnan N, Muthukumar S, Ravi S, Shashikala D, Pasupathi P (2013) Image restoration by using evolutionary technique to Denoise Gaussian and impulse noise. In: Prasath R., Kathirvalavakumar T. (eds) mining intelligence and knowledge exploration. Lecture notes in computer science, vol 8284. Springer, Cham.

Kumar A (2013) Encoding schemes in genetic algorithm. Int J Adv Res IT Eng 2(3):1–7

Kumar V, Kumar D (2017) An astrophysics-inspired grey wolf algorithm for numerical optimization and its application to engineering design problems. Adv Eng Softw 112:231–254

Kumar V, Chhabra JK, Kumar D (2014) Parameter adaptive harmony search algorithm for unimodal and multimodal optimization problems. J Comput Sci 5(2):144–155

Kumar C, Singh AK, Kumar P (2017) A recent survey on image watermarking techniques and its application in e-governance. MultiMed Tools Appl.

Kurdi M (2016) An effective new island model genetic algorithm for job shop scheduling problem. Comput Oper Res 67(2016):132–142

Larranaga P, Kuijpers CMH, Murga RH, Yurramendi Y (July 1996) Learning Bayesian network structures by searching for the best ordering with genetic algorithms. in IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 26(4):487–493

Larranaga P, Kuijpers C, Murga R, Inza I, Dizdarevic S (1999) Genetic algorithms for the travelling salesman problem: a review of representations and operators. Artificial Intelligence Review 13:129–170

Lee C-Y (2003) Entropy-Boltzmann selection in the genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 33(1):138–149

Lee CKH (2018) A review of applications of genetic algorithms in operations management. Eng Appl Artif Intell 76:1–12

Lee Y, Hara T, Fujita H, Itoh S, Ishigaki T (July 2001) Automated detection of pulmonary nodules in helical CT images based on an improved template-matching technique. in IEEE Transactions on Medical Imaging 20(7):595–604

Lee J-Y, Kim M-S, Kim C-T, Lee J-J (2007) Study on encoding schemes in compact genetic algorithm for the continuous numerical problems. SICE Annual Conference 2007, Takamatsu, pp 2694–2699

Leng LT (1999) Guided genetic algorithm. University of Essex, Doctoral Dissertation

Li B, Li J, Tang K, Yao X (2015) Many-objective evolutionary algorithms: A survey. ACM Computing surveys

Lie Tang L (2000) Tian and Brian L steward, "color image segmentation with genetic algorithm for in-field weed sensing". Transactions of the ASAE 43(4):1019–1027

Lima S.J.A., de Araújo S.A. (2018) A new binary encoding scheme in genetic algorithm for solving the capacitated vehicle routing problem. In: Korošec P., Melab N., Talbi EG. (eds) Bioinspired Optimization Methods and Their Applications. BIOMA 2018. Lecture notes in computer science, vol 10835. Springer, Cham

Liu D (2019) Mathematical modeling analysis of genetic algorithms under schema theorem. Journal of Computational Methods in Sciences and Engineering 19:S131–S137

Liu Z, Meng Q, Wang S (2013) Speed-based toll design for cordon-based congestion pricing scheme. Transport Res Part C: Emerg Technol 31(2013):83–98

Lorenzo B, Glisic S (2013) Optimal routing and traffic scheduling for multihop cellular networks using genetic algorithm. IEEE Trans Mob Comput 12(11):2274–2288

Lucasius CB, Kateman G (1989) Applications of genetic algorithms in chemometrics. In: Proceedings of the 3rd international conference on genetic algorithms. Morgan Kaufmann, Los Altos, CA, USA, pp 170–176

Luo B, Jinhua Zheng, Jiongliang Xie, Jun Wu. Dynamic crowding distance – a new diversity maintenance strategy for MOEAs. ICNC ‘08, Fourth Int. Conf. on Natural Comp., vol. 1 (2008), pp. 580–585

Maghawry A, Kholief M, Omar Y, Hodhod R (2020) An approach for evolving transformation sequences using hybrid genetic algorithms. Int J Intell Syst 13(1):223–233

Manzoni L, Mariot L, Tuba E (2020) Balanced crossover operators in genetic algorithms. Swarm and Evolutionary Computation 54:100646

Mazinani M, Abedzadeh M, Mohebali N (2013) Dynamic facility layout problem based on flexible bay structure and solving by genetic algorithm. Int J Adv Manuf Technol 65(5–8):929–943

Mehboob U, Qadir J, Ali S, Vasilakos A (2016) Genetic algorithms in wireless networking: techniques, applications, and issues. Soft Comput 20:2467–2501

Michalewicz Z (1992) Genetic algorithms + data structures = evolution programs. Springer-Verlag, New York

Michalewicz Z, Schoenauer M (1996) Evolutionary algorithms for constrained parameter optimization problems. Evol Comput 4(1):1–32

Mishra R, Das KN (2017). A novel hybrid genetic algorithm for unconstrained and constrained function optimization. In bio-inspired computing for information retrieval applications (pp. 230-268). IGI global

Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009) Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 6(7):e1000097

Mooi S, Lim S, Sultan M, Bakar A, Sulaiman M, Mustapha A, Leong KY (2017) Crossover and mutation operators of genetic algorithms. International Journal of Machine Learning and Computing 7:9–12

Mudaliar DN, Modi NK (2013) Unraveling travelling salesman problem by genetic algorithm using m-crossover operator. International Conference on Signal Processing, Image Processing & Pattern Recognition, Coimbatore, pp 127–130

T. Murata and M. Gen (2000) Cellular genetic algorithm for multi-objective optimization, in Proceedings of the Fourth Asian Fuzzy System Symposium, pp. 538–542

Neto JC, Meyer GE, Jones DD (2006) Individual leaf extractions from young canopy images using gustafsonkessel clustering and a genetic algorithm. Comput Electron Agric 51(1):66–85

NKFC, Viswanatha SDK (2009) Routing algorithm using mobile agents and genetic algorithm. Int J Comput Electr Eng, vol 1, no 3

Ono I, Kobayashi S (1997) A real-coded genetic algorithm for functional optimization using unimodal normal distribution crossover. In: Back T (ed) Proceedings of the 7th international conference on genetic algorithms, ICGA-7. Morgan Kaufmann, East Lansing, MI, USA, pp 246–253

Pachepsky Y, Acock B (1998) Stochastic imaging of soil parameters to assess variability and uncertainty of crop yield estimates. Geoderma 85(2):213–229

Paiva JPD, Toledo CFM, Pedrini H (2016) An approach based on hybrid genetic algorithm applied to image denoising problem. Appl Soft Comput 46:778–791

Palencia AER, Delgadillo GEM (2012) A computer application for a bus body assembly line using genetic algorithms. Int J Prod Econ 140(1):431–438

Palomo-Romero JM, Salas-Morera L, García-Hernández L (2017) An island model genetic algorithm for unequal area facility layout problems. Expert Syst Appl 68:151–162

Pandian S, Modrák V (December 2009) "possibilities, obstacles and challenges of genetic algorithm in manufacturing cell formation," advanced logistic systems, University of Miskolc. Department of Material Handling and Logistics 3(1):63–70

Park Y-B, Yoo J-S, Park H-S (2016) A genetic algorithm for the vendor-managed inventory routing problem with lost sales. Expert Syst Appl 53:149–159

Patel R, Raghuwanshi MM, Malik LG (2012) Decomposition based multi-objective genetic algorithm (DMOGA) with opposition based learning

Pattanaik JK, Basu M, Dash DP (2018) Improved real coded genetic algorithm for dynamic economic dispatch. Journal of electrical systems and information technology. Vol. 5(3):349–362

Payne AW, Glen RC (1993) Molecular recognition using a binary genetic system. J Mol Graph 11(2):74–91

Peerlinck A, Sheppard J, Pastorino J, Maxwell B (2019) Optimal Design of Experiments for precision agriculture using a genetic algorithm. IEEE Congress on Evolutionary Computation.

Pelikan M, Goldberg DE, Cantu-Paz E (2000) Bayesian optimization algorithm, population sizing, and time to convergence, Illinois Genetic Algorithms Laboratory, University of Illinois, Tech. Rep

Pilat ML, White T (2002) Using genetic algorithms to optimize ACS-TSP, in the Third International Workshop on Ant Algorithms, vol. Lecture Notes In Computer Science 2463. Berlin, Germany: Springer-Verlag, pp. 282–287.

Pinagapany S, Kulkarni A (2008) Solving channel allocation problem in cellular radio networks using genetic algorithm. In: Communication Systems software and middleware and workshops, 2008.COMSWARE 2008. 3rd International Conference on. IEEE, pp239–244

Pinel F, Dorronsoro B, Bouvry P (2013) Solving very large instances of the scheduling of independent tasks problem on the GPU. J Parallel Distrib. Comput 73(1):101–110

Pinto G, Ainbinder I, Rabinowitz G (2009) A genetic algorithm-based approach for solving the resource-sharing and scheduling problem. Comput Ind Eng 57(3):1131–1143

Piszcz A, Soule T (2006) Genetic programming: optimal population sizes for varying complexity problems, in Proceedings of the Genetic and Evolutionary Computation Conference, pp. 953–954.

Porta J, Parapar R, Doallo F, Rivera F, Santé I, Crecente R (2013) High performance genetic algorithm for land use planning. Comput Environ Urb Syst 37(2013):45–58

Rafsanjani MK, Riyahi M (2020) A new hybrid genetic algorithm for job shop scheduling problem. International Journal of Advanced Intelligence Paradigms 16(2):157–171

Rathi R, Acharjya DP (2018) A framework for prediction using rough set and real coded genetic algorithm. Arab J Sci Eng 43(8):4215–4227

Rathi R, Acharjya DP (2018) A rule based classification for vegetable production using rough set and genetic algorithm. International Journal of Fuzzy System Applications (IJFSA) 7(1):74–100

Rathi R, Acharjya DP (2020) A comparative study of genetic algorithm and neural network computing techniques over feature selection, In advances in distributed computing and machine learning (pp. 491–500). Springer, Singapore

Ray SS, Bandyopadhyay S, Pal SK (2004) New operators of genetic algorithms for traveling salesman problem," Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Cambridge pp 497-500

Richter JN, Peak D (2002) Fuzzy evolutionary cellular automata, in international conference on artificial neural networks in engineering, vol 12. USA, Saint Louis pp. 185-191

Riedl A (2002) A hybrid genetic algorithm for routing optimization in ip networks utilizing bandwidth and delay metrics. In: IP operations and management, 2002 IEEE Workshop on. IEEE, pp 166–170

Ripon KSN, Siddique N, Torresen J (2011) Improved precedence preservation crossover for multi-objective job shop scheduling problem. Evolving Systems 2:119–129

Roberge V, Tarbouchi M, Okou F (2014) Strategies to accelerate harmonic minimization in multilevel inverters using a parallel genetic algorithm on graphical processing unit. IEEE Trans Power Electron 29(10):5087–5090

Ronald S (1997) Robust encoding in genetic algorithms: a survey of encoding issues. IEEE international conference on evolutionary computation, pp. 43-48

Roy A, Banerjee N, Das SK (2002) An efficient multi-objective qos-routing algorithm for wireless multicasting. In:Vehicular technology conference, 2002. VTC Spring 2002. IEEE 55th, vol 3., pp 1160–1164

Sadrzadeh A (2012) A genetic algorithm with the heuristic procedure to solve the multi-line layout problem. Comput Ind Eng 62(4):1055–1064

Sahingoz OK (2014) Generation of Bezier curve-based flyable trajectories for multi-UAV systems with parallel genetic algorithm. J Intell Robot Syst 74(1):499–511

Saini N (2017) Review of selection methods in genetic algorithms. International Journal of Engineering and Computer Science 6(12):22261–22263

Sari M, Can T (2018) Prediction of pathological subjects using genetic algorithms. Computational and Mathematical Methods in Medicine 2018:1–9

Scully T, Brown KN (2009) Wireless LAN load balancing with genetic algorithms. Knowl Based Syst 22(7):529–534

Sermpinis G, Stasinakis C, Theofilatos K, Karathanasopoulos A (2015) Modeling, forecasting and trading the EUR exchange rates with hybrid rolling genetic algorithms–support vector regression forecast combinations. European J. Oper. Res. 247(3):831–846

Shabankareh SG, Shabankareh SG (2019) Improvement of edge-tracking methods using genetic algorithm and neural network, 2019 5th Iranian conference on signal processing and intelligent systems (ICSPIS). Shahrood, Iran, pp 1–7. https://doi.org/10.1109/ICSPIS48872.2019.9066026

Book   Google Scholar  

Sharma S, Gupta K (2011) Solving the traveling salesman problem through genetic algorithm with new variation order crossover. International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), Udaipur, pp. 274–276

Sharma N, Kaushik I, Rathi, R, Kumar S (2020) Evaluation of accidental death records using hybrid genetic algorithm. Available at SSRN: https://ssrn.com/abstract=3563084 or https://doi.org/10.2139/ssrn.3563084

Shayeghi A, Gotz D, Davis JBA, Schafer R, Johnston RL (2015) Pool-BCGA: A parallelised generation-free genetic algorithm for the ab initio global optimisation of nano alloy clusters. Phys Chem Chem Phys 17(3):2104–2112

Guoyong Shi, H. Iima and N. Sannomiya (1996) A new encoding scheme for solving job shop problems by genetic algorithm, Proceedings of 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996, pp. 4395–4400 vol.4.

Shi J, Liu Z, Tang L, Xiong J (2017) Multi-objective optimization for a closed-loop network design problem using an improved genetic algorithm. Appl Math Model 45:14–30

Shukla AK, Singh P, Vardhan M (2019) A new hybrid feature subset selection framework based on binary genetic algorithm and information theory. International Journal of Computational Intelligence and Applications 18(3):1950020(1–10)

Singh A, Deep K (2015) Real coded genetic algorithm operators embedded in gravitational search algorithm for continuous optimization. Int J Intell Syst Appl 7(12):1

Sivanandam SN, Deepa SN (2008) Introduction to genetic algorithm, 1st edn. Springer-Verlag, Berlin Heidelberg

Soleimani H, Kannan G (2015) A hybrid particle swarm optimization and genetic algorithm for closed-loop supply chain network design in large-scale networks. Appl Math Model 39(14):3990–4012

Soleimani H, Govindan K, Saghafi H, Jafari H (2017) Fuzzy multi-objective sustainable and green closed-loop supply chain network design. Comput Ind Eng 109:191–203

Soon GK, Guan TT, On CK, Alfred R, Anthony P (2013) "A comparison on the performance of crossover techniques in video game," 2013 IEEE international conference on control system. Computing and Engineering, Mindeb, pp 493–498

Srinivas N, Deb K (1995) Multi-objective function optimization using non-dominated sorting genetic algorithms. Evol Comput 2(3):221–248

Subbaraj P, Rengaraj R, Salivahanan S (2011) Enhancement of self-adaptive real-coded genetic algorithm using Taguchi method for economic dispatch problem. Appl Soft Comput 11(1):83–92

Tahir M, Tubaishat A, Al-Obeidat F, et al. (2020) A novel binary chaotic genetic algorithm for feature selection and its utility in affective computing and healthcare. Neural Comput & Appl

Tam V, Cheng K-Y, Lui K-S (2006) Using micro-genetic algorithms to improve localization in wireless sensor networks. J Commun 1(4):1–10

Tan KC, Li Y, Murray-Smith DJ, Sharman KC (1995) System identification and linearisation using genetic algorithms with simulated annealing, in First IEE/IEEE Int. Conf. on GA in Eng. Syst.: Innovations and Appl. Sheffield, UK, pp. 164–169.

Tang PH, Tseng MH (2013) Adaptive directed mutation for real-coded genetic algorithms. Appl Soft Comput 13(1):600–614

Tiong SK, Yap DFW, Koh SP (2012) A comparative analysis of various chaotic genetic algorithms for multimodal function optimization. Trends in Applied Sciences Research 7:785–791

Toutouh J, Alba E (2017) Parallel multi-objective metaheuristics for smart communications in vehicular networks. Soft Comput 21(8):1949–1961

Umbarkar A, Sheth P (2015) Crossover operators in genetic algorithms: a review. Journal on Soft Computing 6(1)

Verma D, Vishwakarma VP, Dalal S (2020) A hybrid self-constrained genetic algorithm (HSGA) for digital image Denoising based on PSNR improvement. Advances in Bioinformatics, Multimedia, and Electronics Circuits and Signals, In, pp 135–153

Vitayasak S, Pongcharoen P, Hicks C (2016) A tool for solving stochastic dynamic facility layout problems with stochastic demand using either a genetic algorithm or modified backtracking search algorithm. Int J Prod Econ

Junru Wang and Lan Huang (2014) Evolving gomoku Solver by Genetic Algorithm. IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA) pp 1064–1067.

Wang L, Kan MS, Shahriar Md R, Tan ACC (2014) Different approaches of applying single-objective binary genetic algorithm on the wind farm design. In World Congress on Engineering Asset Management.

Wang N, Li Q, Abd El-Latif AA, Zhang T, Niu X (2014) Toward accurate localization and high recognition performance for noisy iris images. Multimed Tools Appl 71(3):1411–1430

Wang JQ, Ersoy OK, He MY et al (2016) Multi-offspring genetic algorithm and its application to the traveling salesman problem. Appl Soft Comput 43:415–423

Wang FL, Fu XM, Zhu HX et al (2016) Multi-child genetic algorithm based on two-point crossover. J Northeast Agric Univ 47(3):72–79

Wang JQ, Cheng ZW, Ersoy OK et al (2018) Improvement analysis and application of real-coded genetic algorithm for solving constrained optimization problems. Math Probl Eng 2018:1–16

Wang J, Zhang M, Ersoy OK, Sun K, Bi Y (2019) An improved real-coded genetic algorithm using the Heuristical Normal distribution and direction-based crossover. Computational Intelligence and Neuroscience 2019:1–17

Wen Z, Yang R, Garraghan P, Lin T, Xu J, Rovatsos M (2017) Fog orchestration for internet of things services. IEEE Internet Comput 21(2) (Mar. 2017):16–24

Wright AH (1991) Genetic algorithms for real parameter optimization. In Foundations of genetic algorithms I,G. J. E. Rawlins, Ed., Morgan Kaufmann, San Mateo, CA,USA

Wu X, Chu C-H, Wang Y, Yan W (2007) A genetic algorithm for cellular manufacturing design and layout. European J Oper Res 181(1):156–167

Yang S, Cheng H, Wang F (2010) Genetic algorithms with immigrants and memory schemes for dynamic shortest path routing problems in mobile ad hoc networks. IEEE Trans Syst Man Cybern Part C Appl Rev 40(1):52–63

Yang C, Li H, Rezgui Y, Petri I, Yuce B, Chen B, Jayan B (2014) High throughput computing based distributed genetic algorithm for building energy consumption optimization. Energy Build 76(2014):92–101

Yu F, Xu X (2014) A short-term load forecasting model of natural gas based on optimized genetic algorithm and improve BR neural network. Appl Energy 134:102–113

Yuce B, Fruggiero F, Packianather MS, Pham DT, Mastrocinque E, Lambiase A, Fera M (2017) Hybrid genetic bees algorithm applied to single machine scheduling with earliness and tardiness penalties. Comput Ind Eng 113:842–858

Yun S, Lee J, Chung W, Kim E, Kim S (2009) A soft computing approach to localization in wireless sensor networks. Expert Syst Appl 36(4):7552–7561

Zhai R (2020) Solving the optimization of physical distribution routing problem with hybrid genetic algorithm. J Phys Conf Ser 1550:1–6

Zhang Q, Wang J, Jin C, Zeng Q (2008) Localization algorithm for wireless sensor network based on genetic simulated annealing algorithm. In: 4th IEEE International Conference on Wireless communications, networking and mobile computing. Pp 1–5

Zhang R, Ong SK, Nee AYC (2015) A simulation-based genetic algorithm approach for remanufacturing process planning and scheduling. Appl Soft Comput 37:521–532

Zhang X-Y, Zhang J, Gong Y-J, Zhan Z-H, Chen W-N, Li Y (2016) Kuhn-Munkres parallel genetic algorithm for the set cover problem and its application to large-scale wireless sensor networks. IEEETrans Evol Comput 20(5):695–710

Zhenhua Y, Guangwen Y, Shanwei L, Qishan Z (2010) A modified immune genetic algorithm for channel assignment problems in cellular radio networks. In: Intelligent system design and engineering application (ISDEA), 2010 International Conference on, vol 2. , pp 823–826

Download references

Author information

Authors and affiliations.

Computer Science and Engineering Department, National Institute of Technology, Hamirpur, India

Sourabh Katoch, Sumit Singh Chauhan & Vijay Kumar

Corresponding author

Correspondence to Vijay Kumar.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Katoch, S., Chauhan, S.S. & Kumar, V. A review on genetic algorithm: past, present, and future. Multimed Tools Appl 80, 8091–8126 (2021). https://doi.org/10.1007/s11042-020-10139-6

Received: 27 July 2020

Revised: 12 October 2020

Accepted: 23 October 2020

Published: 31 October 2020

Issue Date: February 2021

DOI: https://doi.org/10.1007/s11042-020-10139-6

  • Optimization
  • Metaheuristic
  • Genetic algorithm

275 million new genetic variants identified in NIH precision medicine data

Study details the unprecedented scale, diversity, and power of the All of Us Research Program.

For Immediate Release: Monday, February 19, 2024

Researchers have discovered more than 275 million previously unreported genetic variants, identified from data shared by nearly 250,000 participants of the National Institutes of Health’s All of Us Research Program. Half of the genomic data are from participants of non-European genetic ancestry. The unexplored cache of variants provides researchers new pathways to better understand the genetic influences on health and disease, especially in communities that have been left out of research in the past. The findings are detailed in Nature, alongside three other articles in Nature journals.

Nearly 4 million of the newly identified variants are in areas that may be tied to disease risk. The genomic data detailed in the study are available to registered researchers in the Researcher Workbench, the program’s platform for data analysis.

“As a physician, I’ve seen the impact the lack of diversity in genomic research has had in deepening health disparities and limiting care for patients,” said Josh Denny, M.D., M.S., chief executive officer of the All of Us Research Program and an author of the study. “The All of Us dataset has already led researchers to startling findings that challenge what we know about health. It is setting a course for a future where scientific discovery is more inclusive, with broader benefits for all.”

To date, more than 90% of participants in large genomics studies have been of European genetic ancestry. NIH Institute and Center directors noted in an accompanying commentary article in Nature Medicine that this has led to a narrow understanding of the biology of diseases and impeded the development of new treatments and prevention strategies for all populations. They emphasize that many researchers are now utilizing the All of Us dataset to advance precision medicine for all.

For example, in a companion study published in Communications Biology, a research team led by Baylor College of Medicine, Houston, reviewed the frequency of genes and variants recommended by the American College of Medical Genetics and Genomics across different genetic ancestry groups. These genes and variants mirror those in the program’s Hereditary Disease Risk research results offered to participants. The authors found significant variability in the frequency of variants associated with disease risk between different genetic ancestry groups and compared with other large genomic datasets.

While more research is needed before these findings can be used to tailor genetic testing recommendations for specific populations, researchers believe the difference in the number of disease-causing variants may be influenced by past studies’ limited diversity and their disease-focused approach to participant enrollment, rather than a difference in the prevalence of disease-causing variants.

In a separate study, investigators tapped the All of Us dataset to calibrate and implement 10 polygenic risk scores for common diseases across diverse genetic ancestry groups (Nature Medicine, Lennon). These scores calculate an individual’s risk of disease by taking into account genetic and family history factors. Without accounting for diversity, polygenic risk scores could cause false results that misrepresent a person’s risk for disease and create inequitable genetic tools. Without the diversity of the All of Us data, these polygenic risk scores would have only been applicable to some of the population.

“All of Us values intentional community engagement to ensure that populations historically underrepresented in biomedical research can also benefit from future scientific discoveries,” said Karriem Watson, D.H.Sc., M.S., M.P.H., chief engagement officer of the All of Us Research Program. “This starts with building awareness and improving access to medical research so that everyone has the opportunity to participate.”

More than 750,000 people have enrolled in All of Us to date. Ultimately, the program plans to engage at least one million people who reflect the diversity of the United States and contribute data from DNA, electronic health records, wearable devices, surveys, and more over time. The program regularly expands and refreshes the dataset as more participants share information.

To learn more about All of Us’ scientific resources, visit researchallofus.org.

All of Us is a registered service mark of the U.S. Department of Health & Human Services (HHS).

About the All of Us Research Program: The mission of the All of Us Research Program is to accelerate health research and medical breakthroughs, enabling individualized prevention, treatment, and care for all of us. The program will partner with one million or more people across the United States to build the most diverse biomedical data resource of its kind, to help researchers gain better insights into the biological, environmental, and behavioral factors that influence health. For more information, visit www.ResearchAllofUs.org, www.JoinAllofUs.org, and https://www.AllofUs.nih.gov/.

About the National Center for Complementary and Integrative Health (NCCIH): NCCIH’s mission is to define, through rigorous scientific investigation, the usefulness and safety of complementary and integrative health approaches and their roles in improving health and health care. For additional information, call NCCIH’s Clearinghouse toll free at 1-888-644-6226. Follow us on Twitter, Facebook, and YouTube.

About the National Institutes of Health (NIH): NIH, the nation’s medical research agency, includes 27 Institutes and Centers and is a component of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, and is investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit www.nih.gov.

All of Us Press Office: 301-827-6877

News Release

Tuesday, February 20, 2024

Researchers optimize genetic tests for diverse populations to tackle health disparities

Improved genetic tests more accurately assess disease risk regardless of genetic ancestry.

To prevent an emerging genomic technology from contributing to health disparities, a scientific team funded by the National Institutes of Health has devised new ways to improve a genetic testing method called a polygenic risk score. Since polygenic risk scores have not been effective for all populations, the researchers recalibrated these genetic tests using ancestrally diverse genomic data. As reported in Nature Medicine, the optimized tests provide a more accurate assessment of disease risk across diverse populations.

Genetic tests look at the small differences between individuals’ genomes, known as genomic variants, and polygenic risk scores are tools for assessing many genomic variants across the genome to determine a person’s risk for disease. As the use of polygenic risk scores grows, one major concern is that the genomic datasets used to calculate the scores often heavily overrepresent people of European ancestry.
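As a rough illustration of what "assessing many genomic variants" can mean in practice, a polygenic risk score is commonly computed as a weighted sum of a person's effect-allele counts. The sketch below assumes that simple additive form; the variant identifiers, weights, and the function name `polygenic_risk_score` are hypothetical and are not taken from the study.

```python
from typing import Mapping


def polygenic_risk_score(dosages: Mapping[str, float],
                         weights: Mapping[str, float]) -> float:
    """Return the weighted sum of effect-allele dosages.

    Only variants that appear in both the weight table and the person's
    genotype data contribute to the score.
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical variant IDs and effect weights, for illustration only.
weights = {"var_a": 0.5, "var_b": -0.25, "var_c": 1.0}

# Copies of the effect allele carried by one person (0, 1, or 2).
dosages = {"var_a": 2, "var_b": 1, "var_c": 0}

score = polygenic_risk_score(dosages, weights)
print(score)  # 0.5*2 + (-0.25)*1 + 1.0*0 = 0.75
```

In this toy form, any variant that is missing from the weight table simply adds nothing to the sum, so a score can only reflect the variants that its source datasets happened to capture.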

“Recently, more and more studies incorporate multi-ancestry genomic data into the development of polygenic risk scores,” said Niall Lennon, Ph.D., a scientist at the Broad Institute in Cambridge, Massachusetts, and first author of the publication. “However, there are still gaps in genetic ancestral representation in many scores that have been developed to date.”

These “gaps” or missing data can cause false results, where a person could be at high risk for a disease but not receive a high-risk score because their genomic variants are not represented. Although health disparities often stem from systemic discrimination, not genetics, these false results are a way that inequitable genetic tools can exacerbate existing health disparities.

In the new study, the researchers improved existing polygenic risk scores using health records and ancestrally diverse genomic data from the All of Us Research Program, an NIH-funded initiative to collect health data from over a million people from diverse backgrounds.

The All of Us dataset represented about three times as many individuals of non-European ancestry compared to other major datasets previously used for calculating polygenic risk scores. It also included eight times as many individuals with ancestry spanning two or more global populations. Strong representation of these individuals is key as they are more likely than other groups to receive misleading results from polygenic risk scores.

The researchers selected polygenic risk scores for 10 common health conditions, including breast cancer, prostate cancer, chronic kidney disease, coronary heart disease, asthma and diabetes. Polygenic risk scores are particularly useful for assessing risk for conditions that result from a combination of several genetic factors, as is the case for the 10 conditions selected. Many of these health conditions are also associated with health disparities.

The researchers assembled ancestrally diverse cohorts from the All of Us data, including individuals with and without each disease. The genomic variants represented in these cohorts allowed the researchers to recalibrate the polygenic risk scores for individuals of non-European ancestry.
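The release does not describe the recalibration procedure itself. One simple way to picture the idea is to express each raw score relative to a reference distribution for people of similar ancestry, so that "high risk" means the upper end of a comparable distribution. The sketch below assumes discrete, hypothetical group labels and made-up scores; it is a toy simplification and should not be read as the method used in the study.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def standardize_by_group(raw_scores: Sequence[float],
                         groups: Sequence[str]) -> List[float]:
    """Convert raw scores to z-scores within each reference group."""
    by_group: Dict[str, List[float]] = {}
    for score, group in zip(raw_scores, groups):
        by_group.setdefault(group, []).append(score)

    # Mean and population standard deviation per group.
    stats: Dict[str, Tuple[float, float]] = {
        g: (mean(v), pstdev(v)) for g, v in by_group.items()
    }

    z_scores = []
    for score, group in zip(raw_scores, groups):
        mu, sigma = stats[group]
        # A group with no spread carries no ranking information.
        z_scores.append(0.0 if sigma == 0 else (score - mu) / sigma)
    return z_scores


# Made-up raw scores: group "g2" sits on a shifted scale relative to "g1".
raw = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
labels = ["g1", "g1", "g1", "g2", "g2", "g2"]

z = standardize_by_group(raw, labels)

# A single raw cutoff flags everyone in g2 and no one in g1 ...
flag_raw = [s > 10.0 for s in raw]
# ... while a cutoff on standardized scores flags the top of each group.
flag_z = [v > 1.0 for v in z]
print(flag_raw, flag_z)
```

The contrast between the two flag lists is the point of the example: when score distributions differ between groups, a threshold set on one group's scale can misclassify people in another.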

With the optimized scores, the researchers analyzed disease risk for an ancestrally diverse group of 2,500 individuals. About 1 in 5 participants were found to be at high risk for at least one of the 10 diseases.

Most importantly, these participants ranged widely in their ancestral backgrounds, showing that the recalibrated polygenic risk scores are not skewed towards people of European ancestry and are effective for all populations.

“Our model strongly increases the likelihood that a person in the high-risk end of the distribution should receive a high-risk result regardless of their genetic ancestry,” said Dr. Lennon. “The diversity of the All of Us dataset was critical for our ability to do this.”

However, these optimized scores cannot address health disparities alone. “Polygenic risk score results are only useful to patients who can take action to prevent disease or catch it early, and people with less access to healthcare will also struggle to get the recommended follow-up activities, such as more frequent screenings,” said Dr. Lennon.

Still, this work is an important step towards routine use of polygenic risk scores in the clinic to benefit all people. The 2,500 participants in this study represent just an initial look at the improved polygenic risk scores. NIH’s Electronic Medical Records and Genomics (eMERGE) Network will continue this research by enrolling a total of 25,000 participants from ancestrally diverse populations in the study’s next phase.

Researchers optimize genetic tests for diverse populations to tackle health disparities


Improved genetic tests more accurately assess disease risk regardless of genetic ancestry.

To prevent an emerging genomic technology from contributing to health disparities, a scientific team funded by the National Institutes of Health has devised new ways to improve a genetic testing method called a polygenic risk score. Since polygenic risk scores have not been effective for all populations, the researchers recalibrated these genetic tests using ancestrally diverse genomic data. As reported in Nature Medicine, the optimized tests provide a more accurate assessment of disease risk across diverse populations.

Genetic tests look at the small differences between individuals’ genomes, known as genomic variants, and polygenic risk scores are tools for assessing many genomic variants across the genome to determine a person’s risk for disease. As the use of polygenic risk scores grows, one major concern is that the genomic datasets used to calculate the scores often heavily overrepresent people of European ancestry.
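To make the idea concrete, here is a minimal sketch of how a polygenic risk score is commonly computed: a weighted sum of a person's allele counts across many variants. The variant names, weights and the `coverage` measure are hypothetical and chosen only for illustration; this is not the method used in the study.

```python
from typing import Mapping, Tuple


def polygenic_risk_score(
    dosages: Mapping[str, float],
    weights: Mapping[str, float],
) -> Tuple[float, float]:
    """Return (score, coverage) for one person.

    dosages: copies of the risk allele a person carries per variant (0, 1 or 2).
    weights: per-variant effect sizes (hypothetical values in this sketch).
    coverage: share of the total absolute weight that could be scored,
              an illustrative way to see how much of the score is missing.
    """
    total_weight = sum(abs(w) for w in weights.values())
    score = 0.0
    covered_weight = 0.0
    for variant, weight in weights.items():
        if variant in dosages:  # variants absent from the data add nothing
            score += weight * dosages[variant]
            covered_weight += abs(weight)
    coverage = covered_weight / total_weight if total_weight else 0.0
    return score, coverage


# Hypothetical example: three scored variants, one missing from the person's data.
example_weights = {"variant_a": 0.5, "variant_b": 0.25, "variant_c": -0.25}
example_person = {"variant_a": 2, "variant_b": 1}
example_score, example_coverage = polygenic_risk_score(example_person, example_weights)
print(example_score, example_coverage)
```

In this toy case one of the three variants is missing from the person's data, so a quarter of the score's total weight is never assessed.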

“Recently, more and more studies incorporate multi-ancestry genomic data into the development of polygenic risk scores,” said Niall Lennon, Ph.D., a scientist at the Broad Institute in Cambridge, Massachusetts and first author of the publication. “However, there are still gaps in genetic ancestral representation in many scores that have been developed to date.”

These “gaps” or missing data can cause false results, where a person could be at high risk for a disease but not receive a high-risk score because their genomic variants are not represented. Although health disparities often stem from systemic discrimination, not genetics, these false results are a way that inequitable genetic tools can exacerbate existing health disparities.


In the new study, the researchers improved existing polygenic risk scores using health records and ancestrally diverse genomic data from the All of Us Research Program, an NIH-funded initiative to collect health data from over a million people from diverse backgrounds.

The All of Us dataset represented about three times as many individuals of non-European ancestry compared to other major datasets previously used for calculating polygenic risk scores. It also included eight times as many individuals with ancestry spanning two or more global populations. Strong representation of these individuals is key as they are more likely than other groups to receive misleading results from polygenic risk scores.

The researchers selected polygenic risk scores for 10 common health conditions, including breast cancer, prostate cancer, chronic kidney disease, coronary heart disease, asthma and diabetes. Polygenic risk scores are particularly useful for assessing risk for conditions that result from a combination of several genetic factors, as is the case for the 10 conditions selected. Many of these health conditions are also associated with health disparities.

The researchers assembled ancestrally diverse cohorts from the All of Us data, including individuals with and without each disease. The genomic variants represented in these cohorts allowed the researchers to recalibrate the polygenic risk scores for individuals of non-European ancestry.

With the optimized scores, the researchers analyzed disease risk for an ancestrally diverse group of 2,500 individuals. About 1 in 5 participants were found to be at high risk for at least one of the 10 diseases.

Most importantly, these participants ranged widely in their ancestral backgrounds, showing that the recalibrated polygenic risk scores are not skewed towards people of European ancestry and are effective for all populations.

“Our model strongly increases the likelihood that a person in the high-risk end of the distribution should receive a high-risk result regardless of their genetic ancestry,” said Dr. Lennon. “The diversity of the All of Us dataset was critical for our ability to do this.”

However, these optimized scores cannot address health disparities alone. “Polygenic risk score results are only useful to patients who can take action to prevent disease or catch it early, and people with less access to healthcare will also struggle to get the recommended follow-up activities, such as more frequent screenings,” said Dr. Lennon.

Still, this work is an important step towards routine use of polygenic risk scores in the clinic to benefit all people. The 2,500 participants in this study represent just an initial look at the improved polygenic risk scores. NIH’s Electronic Medical Health Records and Genomics (eMERGE) Network will continue this research by enrolling a total of 25,000 participants from ancestrally diverse populations in the study’s next phase.

About NHGRI and NIH

About the National Human Genome Research Institute (NHGRI): At NHGRI, we are focused on advances in genomics research. Building on our leadership role in the initial sequencing of the human genome, we collaborate with the world's scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives. By empowering and expanding the field of genomics, we can benefit all of humankind. For more information about NHGRI and its programs, visit www.genome.gov.


Last updated: February 19, 2024

StatAnalytica

251+ Life Science Research Topics [Updated]


Life science research is like peering into the intricate workings of the universe, but instead of stars and galaxies, it delves into the mysteries of life itself. From unraveling the secrets of our genetic code to understanding ecosystems and everything in between, life science research encompasses a vast array of fascinating topics. In this blog post, we’ll journey through some of the most captivating topics in life science research.

What is research in life science?


Research in life science involves the systematic investigation and study of living organisms, their interactions, and their environments. It encompasses a wide range of disciplines, including biology, genetics, ecology, microbiology, neuroscience, and more.

Life science research aims to expand our understanding of the fundamental principles governing life processes, uncover new insights into biological systems, develop innovative technologies and therapies, and address pressing challenges in areas such as healthcare, agriculture, and conservation.

251+ Life Science Research Topics: Category Wise

Genetics and Genomics

  • Genetic basis of inherited diseases
  • Genome-wide association studies
  • Epigenetics and gene regulation
  • Evolutionary genomics
  • CRISPR/Cas9 gene editing technology
  • Pharmacogenomics and personalized medicine
  • Population genetics
  • Functional genomics
  • Comparative genomics across species
  • Genetic diversity and conservation

Biotechnology and Bioengineering

  • Biopharmaceutical production
  • Metabolic engineering for biofuel production
  • Synthetic biology applications
  • Bioremediation techniques
  • Nanotechnology in drug delivery
  • Tissue engineering and regenerative medicine
  • Biosensors for environmental monitoring
  • Bioprocessing optimization
  • Biodegradable plastics and sustainable materials
  • Agricultural biotechnology for crop improvement

Ecology and Environmental Biology

  • Biodiversity hotspots and conservation strategies
  • Ecosystem services and human well-being
  • Climate change impacts on ecosystems
  • Restoration ecology techniques
  • Urban ecology and biodiversity
  • Marine biology and coral reef conservation
  • Habitat fragmentation and species extinction
  • Ecological modeling and forecasting
  • Wildlife conservation genetics
  • Microbial ecology in natural environments

Neuroscience and Cognitive Science

  • Brain mapping techniques (fMRI, EEG, etc.)
  • Neuroplasticity and learning
  • Neural circuitry underlying behavior
  • Neurodegenerative diseases (Alzheimer’s, Parkinson’s, etc.)
  • Neural engineering for prosthetics
  • Consciousness and the mind-body problem
  • Psychiatric genetics and mental health disorders
  • Neuroimaging in psychiatric research
  • Developmental cognitive neuroscience
  • Neural correlates of consciousness

Evolutionary Biology

  • Mechanisms of speciation
  • Molecular evolution and phylogenetics
  • Coevolutionary dynamics
  • Evolution of antibiotic resistance
  • Cultural evolution and human behavior
  • Evolutionary consequences of climate change
  • Evolutionary game theory
  • Evolutionary medicine and infectious diseases
  • Evolutionary psychology and human cognition
  • Paleogenomics and ancient DNA analysis

Cell Biology and Physiology

  • Cell cycle regulation and cancer biology
  • Stem cell biology and regenerative medicine
  • Organelle dynamics and intracellular transport
  • Cellular senescence and aging
  • Ion channels and neuronal excitability
  • Metabolic pathways and cellular energetics
  • Cell signaling pathways in development and disease
  • Autophagy and cellular homeostasis
  • Mitochondrial function and disease
  • Cell adhesion and migration in development and cancer

Microbiology and Immunology

  • Microbiome composition and function
  • Antibiotic resistance mechanisms
  • Host-microbe interactions in health and disease
  • Viral pathogenesis and vaccine development
  • Microbial biotechnology for waste treatment
  • Immunotherapy approaches for cancer treatment
  • Microbial diversity in extreme environments
  • Antimicrobial peptides and drug discovery
  • Microbial biofilms and chronic infections
  • Host immune responses to viral infections

Biomedical Research and Clinical Trials

  • Translational research in oncology
  • Precision medicine approaches
  • Clinical trials for gene therapies
  • Biomarker discovery for disease diagnosis
  • Stem cell-based therapies for regenerative medicine
  • Pharmacokinetics and drug metabolism studies
  • Clinical trials for neurodegenerative diseases
  • Vaccine efficacy trials
  • Patient-reported outcomes in clinical research
  • Health disparities and clinical trial participation

Emerging Technologies and Innovations

  • Single-cell omics technologies
  • 3D bioprinting for tissue engineering
  • CRISPR-based diagnostics
  • Artificial intelligence applications in life sciences
  • Organs-on-chip for drug screening
  • Wearable biosensors for health monitoring
  • Nanomedicine for targeted drug delivery
  • Optogenetics for neuronal manipulation
  • Quantum biology and biological systems
  • Augmented reality in medical education

Ethical, Legal, and Social Implications (ELSI) in Life Sciences

  • Privacy concerns in genomic research
  • Ethical considerations in gene editing technologies
  • Access to healthcare and genetic testing
  • Intellectual property rights in biotechnology
  • Informed consent in clinical trials
  • Animal welfare in research
  • Equity in environmental decision-making
  • Data sharing and reproducibility in science
  • Dual-use research and biosecurity
  • Cultural perspectives on biomedicine and genetics

Public Health and Epidemiology

  • Disease surveillance and outbreak investigation
  • Global health disparities and access to healthcare
  • Environmental factors in disease transmission
  • Health impacts of climate change
  • Social determinants of health
  • Infectious disease modeling and forecasting
  • Vaccination strategies and herd immunity
  • Epidemiology of chronic diseases
  • Mental health epidemiology
  • Occupational health and safety

Plant Biology and Agriculture

  • Crop domestication and evolution
  • Plant-microbe interactions in agriculture
  • Genetic engineering for crop improvement
  • Plant hormone signaling pathways
  • Abiotic stress tolerance mechanisms in plants
  • Soil microbiology and nutrient cycling
  • Agroecology and sustainable farming practices
  • Plant secondary metabolites and natural products
  • Plant developmental biology
  • Plant epigenetics and environmental adaptation

Bioinformatics and Computational Biology

  • Genome assembly and annotation algorithms
  • Phylogenetic tree reconstruction methods
  • Metagenomic data analysis pipelines
  • Machine learning approaches for biomarker discovery
  • Structural bioinformatics and protein modeling
  • Systems biology and network analysis
  • Transcriptomic data analysis tools
  • Population genetics simulation software
  • Evolutionary algorithms in bioinformatics
  • Cloud computing in life sciences research

Toxicology and Environmental Health

  • Mechanisms of chemical toxicity
  • Risk assessment methodologies
  • Environmental fate and transport of pollutants
  • Endocrine disruptors and reproductive health
  • Nanotoxicology and nanomaterial safety
  • Biomonitoring of environmental contaminants
  • Ecotoxicology and wildlife health
  • Air pollution exposure and respiratory health
  • Water quality and aquatic ecosystems
  • Environmental justice and health disparities

Aquatic Biology and Oceanography

  • Marine biodiversity conservation strategies
  • Ocean acidification impacts on marine life
  • Coral reef resilience and restoration
  • Fisheries management and sustainable harvesting
  • Deep-sea biodiversity and exploration
  • Harmful algal blooms and ecosystem health
  • Marine mammal conservation efforts
  • Microplastics pollution in aquatic environments
  • Ocean circulation and climate regulation
  • Aquaculture and mariculture technologies

Social and Behavioral Sciences in Health

  • Health behavior change interventions
  • Social determinants of health disparities
  • Health communication strategies
  • Community-based participatory research
  • Patient-centered care approaches
  • Cultural competence in healthcare delivery
  • Health literacy interventions
  • Stigma reduction efforts in public health
  • Health policy analysis and advocacy
  • Digital health technologies for behavior monitoring

Bioethics and Biomedical Ethics

  • Ethical considerations in human subjects research
  • Research ethics in vulnerable populations
  • Privacy and data protection in healthcare
  • Professional integrity and scientific misconduct
  • Ethical implications of genetic testing
  • Access to healthcare and health equity
  • End-of-life care and euthanasia debates
  • Reproductive ethics and assisted reproduction
  • Ethical challenges in emerging biotechnologies

Forensic Science and Criminalistics

  • DNA fingerprinting techniques
  • Forensic entomology and time of death estimation
  • Trace evidence analysis methods
  • Digital forensics in criminal investigations
  • Ballistics and firearm identification
  • Forensic anthropology and human identification
  • Bloodstain pattern analysis
  • Arson investigation techniques
  • Forensic toxicology and drug analysis
  • Forensic psychology and criminal profiling

Nutrition and Dietary Science

  • Nutritional epidemiology studies
  • Diet and chronic disease risk
  • Functional foods and nutraceuticals
  • Macronutrient metabolism pathways
  • Micronutrient deficiencies and supplementation
  • Gut microbiota and metabolic health
  • Dietary interventions for weight management
  • Food safety and risk assessment
  • Sustainable diets and environmental impact
  • Cultural influences on dietary habits

Entomology and Insect Biology

  • Insect behavior and communication
  • Insecticide resistance mechanisms
  • Pollinator decline and conservation efforts
  • Medical entomology and vector-borne diseases
  • Invasive species management strategies
  • Insect biodiversity in urban environments
  • Agricultural pest management techniques
  • Insect physiology and biochemistry
  • Social insects and eusociality
  • Insect symbiosis and microbial interactions

Zoology and Animal Biology

  • Animal behavior and cognition
  • Conservation genetics of endangered species
  • Reproductive biology and breeding programs
  • Wildlife forensics and illegal wildlife trade
  • Comparative anatomy and evolutionary biology
  • Animal welfare and ethics in research
  • Physiological adaptations to extreme environments
  • Zoological taxonomy and species discovery
  • Animal communication and signaling
  • Human-wildlife conflict mitigation strategies

Biochemistry and Molecular Biology

  • Protein folding and misfolding diseases
  • Enzyme kinetics and catalytic mechanisms
  • Metabolic regulation in health and disease
  • Signal transduction pathways
  • DNA repair mechanisms and genome stability
  • RNA biology and post-transcriptional regulation
  • Lipid metabolism and membrane biophysics
  • Molecular interactions in drug design
  • Bioenergetics and cellular respiration
  • Structural biology and X-ray crystallography

Cancer Biology and Oncology

  • Tumor microenvironment and metastasis
  • Cancer stem cells and therapy resistance
  • Angiogenesis and tumor vasculature
  • Immune checkpoint inhibitors in cancer therapy
  • Liquid biopsy techniques for cancer detection
  • Oncogenic signaling pathways
  • Personalized medicine approaches in oncology
  • Radiation therapy and tumor targeting strategies
  • Cancer genomics and precision oncology
  • Cancer prevention and lifestyle interventions

Developmental Biology and Embryology

  • Embryonic stem cell differentiation
  • Morphogen gradients and tissue patterning
  • Developmental genetics and model organisms
  • Regenerative potential in vertebrates and invertebrates
  • Developmental plasticity and environmental cues
  • Embryo implantation and pregnancy disorders
  • Germ cell development and fertility preservation
  • Cell fate determination in development
  • Evolutionary developmental biology (evo-devo)
  • Organogenesis and tissue morphogenesis

Pharmacology and Drug Discovery

  • Drug-target interactions and pharmacokinetics
  • High-throughput screening techniques
  • Structure-activity relationship studies
  • Drug repurposing strategies
  • Natural product drug discovery
  • Drug delivery systems and nanomedicine
  • Pharmacovigilance and drug safety monitoring
  • Pharmacoeconomics and healthcare outcomes
  • Drug metabolism and drug-drug interactions

Stem Cell Research

  • Induced pluripotent stem cells (iPSCs) technology
  • Stem cell therapy applications in regenerative medicine
  • Stem cell niche and microenvironment
  • Stem cell banking and cryopreservation
  • Stem cell-based disease modeling

What Are 10 Examples of Life Science Research Paper Titles?

  • Investigating the Role of Gut Microbiota in Neurological Disorders: Implications for Therapeutic Interventions.
  • Genome-Wide Association Study Identifies Novel Genetic Markers for Cardiovascular Disease Risk.
  • Understanding the Molecular Mechanisms of Cancer Metastasis: Insights from Cellular Signaling Pathways.
  • The Impact of Climate Change on Plant-Pollinator Interactions: Implications for Biodiversity Conservation.
  • Exploring the Potential of CRISPR/Cas9 Gene Editing Technology in Treating Genetic Disorders.
  • Characterizing the Microbial Diversity of Extreme Environments: Insights from Deep-Sea Hydrothermal Vents.
  • Assessment of Novel Drug Delivery Systems for Targeted Cancer Therapy: A Preclinical Study.
  • Unraveling the Neurobiology of Addiction: Implications for Treatment Strategies.
  • Investigating the Role of Epigenetics in Age-Related Diseases: From Mechanisms to Therapeutic Targets.
  • Evaluating the Efficacy of Herbal Remedies in Traditional Medicine: A Systematic Review and Meta-Analysis.

Life science research is a journey of discovery, filled with wonder, excitement, and the occasional setback. Yet, through perseverance and ingenuity, researchers continue to push the boundaries of knowledge, unlocking the secrets of life itself. As we stand on the cusp of a new era of scientific discovery, one thing is clear: the future of life science research is brighter, and more promising, than ever before. I hope this list helps you find the research topic that suits you best.



a view of an alleged former detention centre, known as Yengisheher-2, in Shule County in Kashgar in China's northwestern Xinjiang region

Genetics journal retracts 18 papers from China due to human rights concerns

Researchers used samples from populations deemed by experts and campaigners to be vulnerable to exploitation, including Uyghurs and Tibetans

A genetics journal from a leading scientific publisher has retracted 18 papers from China, in what is thought to be the biggest mass retraction of academic research due to concerns about human rights.

The articles were published in Molecular Genetics & Genomic Medicine (MGGM), a genetics journal published by the US academic publishing company Wiley. The papers were retracted this week after an agreement between the journal’s editor in chief, Suzanne Hart, and the publishing company. In a review process that took over two years, investigators found “inconsistencies” between the research and the consent documentation provided by researchers.

The papers by different scientists are all based on research that draws on DNA samples collected from populations in China. In several cases, the researchers used samples from populations deemed by experts and human rights campaigners to be vulnerable to exploitation and oppression in China, leading to concerns that they would not be able to freely consent to such samples being taken.

Several of the researchers are associated with public security authorities in China, a fact that “voids any notion of free informed consent”, said Yves Moreau, a professor of engineering at the University of Leuven, in Belgium, who focuses on DNA analysis. Moreau first raised concerns about the papers with Hart, MGGM’s editor-in-chief, in March 2021.

One retracted paper studies the DNA of Tibetans in Lhasa, the capital of Tibet, using blood samples collected from 120 individuals. The article stated that “all individuals provided written informed consent” and that work was approved by the Fudan University ethics committee.

But the retraction notice published on Monday stated that an ethical review “uncovered inconsistencies between the consent documentation and the research reported; the documentation was not sufficiently detailed to resolve the concerns raised”.

Xie Jianhui, the corresponding author on the study, is from the department of forensic medicine at Fudan University in Shanghai. Xie did not respond to a request for comment, but the retraction notice states that Xie and his co-authors did not agree with the retraction.

Several of Xie’s co-authors are affiliated with the public security authorities in China, including the Tibetan public security authorities. Tibet is considered to be one of the most closely surveilled and tightly monitored regions in China. In Human Rights Watch’s most recent annual report, the campaign group said that the authorities “enforce severe restrictions on freedoms of religion, expression, movement, and assembly”.

Another of the retracted studies used blood samples from 340 Uyghur individuals in Kashgar, a city in Xinjiang, to study the genetic links between them and Uyghurs from other regions. The scientists said the data would be a resource for “forensic DNA and population genetics”.

The retracted papers were all published between 2019 and 2021. In 2021, after Moreau raised concerns about the papers in MGGM, eight of the journal’s 25 editors resigned. The journal’s editor in chief, Hart, has remained in her post. Hart and MGGM did not respond to a request for comment.

MGGM is considered by some to be a mid-ranking genetics publication. It has an impact factor of 2.473, which puts it roughly in the top 40% of journals. It is considered to be a relatively easy forum for publication, which may have been a draw for Chinese researchers looking to publish in English-language journals, said David Curtis, a professor of genetics at University College London. Curtis resigned from his position as editor-in-chief of Annals of Human Genetics, another Wiley journal, after the publisher vetoed a call to consider boycotting Chinese science because of ethical concerns, including those relating to DNA collection.

MGGM states that its scope is human, molecular and medical genetics. It primarily publishes studies on the medical applications of genetics, such as a recent paper on genetic disorders linked to hearing loss. The sudden pivot towards publishing forensic genetics research from China came as other forensic genetics journals started facing more scrutiny for publishing research based on DNA samples from vulnerable minorities in China, said Moreau. He argues that may have pushed more controversial research towards mid-ranking journals such as MGGM that do not specialise in forensic genetics.

On its information page, MGGM states that it “does not consider studies involving forensic genetic analysis”. That caveat was added in 2023, after an editorial review of the journal’s aims.

In recent years there has been growing scrutiny of research that uses DNA or other biometric data from individuals in China, particularly those from vulnerable populations. In 2023, Elsevier, a Dutch academic publisher, retracted an article based on blood and saliva samples from Uyghur and Kazakh people living in Xinjiang, a region in north-west China where there are also widespread reports of human rights abuses.

The Wiley retractions come days before a Chinese government deadline requiring universities to submit lists of all academic articles retracted in the past three years. According to an analysis by Nature, nearly 14,000 retraction notices were published last year, of which three-quarters involved a Chinese co-author.

A spokesperson for Wiley said: “We are continuing to learn from this case, and collaboration with international colleagues is valuable in developing our policies.

“Investigations that involve multiple papers, stakeholders and institutions require significant effort, and often involve lag time in coordinating and analysing information across all involved, as well as translation of materials. We recognise that this takes a significant amount of time but always aim to act as swiftly as possible.”

In recent years, China has outstripped the EU and the US in terms of total research output, and the impact of its research is also catching up with output from the US.


Principles of Genetic Engineering

Thomas M. Lanigan

1 Biomedical Research Core Facilities, Vector Core, University of Michigan, Ann Arbor, MI 48109, USA; lanigant@umich.edu (T.M.L.); chongh@umich.edu (H.C.K.)

2 Department of Internal Medicine, Division of Rheumatology, University of Michigan, Ann Arbor, MI 48109, USA

Huira C. Kopera

3 Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA

Thomas L. Saunders

4 Biomedical Research Core Facilities, Transgenic Animal Model Core, University of Michigan, Ann Arbor, MI 48109, USA

5 Department of Internal Medicine, Division of Genetic Medicine, University of Michigan, Ann Arbor, MI 48109, USA

Genetic engineering is the use of molecular biology technology to modify DNA sequence(s) in genomes, using a variety of approaches. For example, homologous recombination can be used to target specific sequences in mouse embryonic stem (ES) cell genomes or other cultured cells, but it is cumbersome, poorly efficient, and relies on drug positive/negative selection in cell culture for success. Other routinely applied methods include random integration of DNA after direct transfection (microinjection), transposon-mediated DNA insertion, or DNA insertion mediated by viral vectors for the production of transgenic mice and rats. Random integration of DNA occurs more frequently than homologous recombination, but has numerous drawbacks, despite its efficiency. The most elegant and effective method is technology based on guided endonucleases, because these can target specific DNA sequences. Since the advent of clustered regularly interspaced short palindromic repeats or CRISPR/Cas9 technology, endonuclease-mediated gene targeting has become the most widely applied method to engineer genomes, supplanting the use of zinc finger nucleases, transcription activator-like effector nucleases, and meganucleases. Future improvements in CRISPR/Cas9 gene editing may be achieved by increasing the efficiency of homology-directed repair. Here, we describe principles of genetic engineering and detail: (1) how common elements of current technologies include the need for a chromosome break to occur, (2) the use of specific and sensitive genotyping assays to detect altered genomes, and (3) delivery modalities that impact characterization of gene modifications. In summary, while some principles of genetic engineering remain steadfast, others change as technologies are ever-evolving and continue to revolutionize research in many fields.

1. Introduction

Since the identification of DNA as the unit of heredity and the basis for the central dogma of molecular biology [ 1 ] that DNA makes RNA and RNA makes proteins, scientists have pursued experiments and methods to understand how DNA controls heredity. With the discovery of molecular biology tools such as restriction enzymes, DNA sequencing, and DNA cloning, scientists quickly turned to experiments to change chromosomal DNA in cells and animals. In that regard, initial experiments that involved the co-incubation of viral DNA with cultured cell lines progressed to the use of selectable markers in plasmids. Delivery methods for random DNA integration have progressed from transfection by physical co-incubation of DNA with cultured cells, to electroporation and microinjection of cultured cells [ 2 , 3 , 4 ]. Moreover, the use of viruses to deliver DNA to cultured cells has progressed in tandem with physical methods of supplying DNA to cells [ 5 , 6 , 7 ]. Homologous recombination in animal cells [ 8 ] was rapidly exploited by the mouse genetics research community for the production of gene-modified mouse ES cells, and thus gene-modified whole animals [ 9 , 10 ].

This impetus to understand gene function in intact animals was ultimately manifested in the international knockout mouse project, the purpose of which was to knock out every gene in the mouse genome, such that researchers could choose to make knockout mouse models from a library of gene-targeted knockout ES cells [ 11 , 12 , 13 ]. Thousands of mouse models have resulted from that effort and have been used to better understand gene function and the bases of human genetic diseases [ 14 ]. This project required high-throughput pipelines for the construction of vectors, including bacterial artificial chromosome (BAC) recombineering technology [ 13 , 15 , 16 , 17 ]. BACs contain long segments of cloned genomic DNA. For example, the C57BL/6J mouse BAC library, RPCI-23, has an average insert size of 197 kb of genomic DNA per clone [ 18 ]. Because of their size, BACs often carry all of the genetic regulatory elements to faithfully recapitulate the expression of genes contained in them, and thus can be used to generate BAC transgenic mice [ 19 , 20 ]. Recombineering can be used to insert reporters in BACs that are then used to generate transgenic mice to accurately label cells and tissues according to the genes in the BACs [ 21 , 22 , 23 , 24 , 25 , 26 ]. A panoply of approaches to genetic engineering are available for researchers to manipulate the genome. ES cell and BAC transgene engineering have given way to directly editing genes in zygotes, consequently avoiding the need for ES cell or BAC intermediates on the way to an animal model.

Prior to the adaptation of Streptococcus pyogenes Cas9 protein to cause chromosome breaks, three other endonuclease systems were used: (1) rare-cutting meganucleases, (2) zinc finger nucleases (ZFNs), and (3) transcription activator-like effector (TALE) nucleases (TALENs) [ 27 ]. The I-CreI meganuclease recognizes a 22 bp DNA sequence [ 28 , 29 ]. Proof-of-concept experiments demonstrated that the engineered homing endonuclease I-CreI can be used to generate transgenic mice and transgenic rats [ 30 ]. I-CreI specificity can be adjusted to target specific sequences in DNA by protein engineering methodology, although the need for such engineering limits its widespread application to genetic engineering [ 31 ]. Subsequently, ZFN technology was developed to cause chromosome breaks [ 32 ]. A single zinc finger is made up of 30 amino acids that bind three base pairs. Thus, three zinc fingers can be combined to specifically recognize nine base pairs on one DNA strand, and a second triplet of zinc fingers is made to bind nine base pairs on the opposite strand. Each zinc finger triplet is fused to the DNA-cutting domain of the FokI restriction endonuclease. Because FokI domains only cut DNA when they are present as dimers, a ZFN monomer binding to a chromosome cannot induce a DNA break [ 32 ], instead requiring ZFN heterodimers for sequence-specific chromosome breaks. It is estimated that 1 in every 500 genomic base pairs can be cleaved by ZFNs [ 33 ]. Compared with meganucleases, ZFNs are easier to construct because of publicly available resources [ 34 ]. Additionally, the value of ZFNs in mouse and rat genome engineering was demonstrated in several studies that produced knockout, knockin, and floxed (described below) animal models [ 35 , 36 , 37 ]. The development of transcription activator-like effector nucleases (TALENs) followed ZFN technology [ 38 ]. TALENs are made up of tandem repeats of 34 amino acids.
The central amino acids at positions 12 and 13, named repeat variable di-residues (RVDs), determine the base to which the repeat will bind [ 38 ]. To achieve a specific chromosomal break, 15 TALE repeats assembled and fused to the FokI endonuclease domain (TALEN monomer) are required. Thus, one TALEN monomer binds to 15 base pairs on one DNA strand, and a second TALEN monomer binds to bases on the opposite strand [ 38 ]. When the FokI endonuclease domains are brought together, a double-stranded DNA break occurs. In this way, a TALEN heterodimer can be used to cause a sequence-specific chromosome break. It has been estimated that, within the entire genome, TALENs have potential target cleavage sites every 35 bp [ 39 ]. Compared with ZFNs, TALENs are easier to construct with publicly available resources [ 40 , 41 ], and TALENs have been adopted for use in mouse and rat genome engineering in several laboratories that have produced knockout and knockin animal models [ 42 , 43 , 44 , 45 , 46 ].
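The recognition lengths described above can be turned into a rough specificity estimate. The following is a minimal sketch, not a published method: it assumes that both monomers of a pair bind the same number of base pairs, ignores the spacer between half-sites, treats the genome as random sequence with equal base frequencies, and uses an approximate mouse genome size of 2.7 Gb. The function names are hypothetical.

```python
def paired_site_length(bp_per_monomer: int) -> int:
    """Total base pairs recognized by a FokI nuclease pair (two monomers)."""
    return 2 * bp_per_monomer


def expected_random_matches(site_length: int, genome_size_bp: float) -> float:
    """Expected chance occurrences of one specific site in a random genome.

    Assumes equal base frequencies and independent positions, which real
    genomes violate (repeats, GC content); both strands are counted.
    """
    return 2 * genome_size_bp * 0.25 ** site_length


ZFN_BP_PER_MONOMER = 3 * 3   # three fingers x three base pairs (from the text)
TALEN_BP_PER_MONOMER = 15    # 15 TALE repeats, one base each (from the text)
GENOME_SIZE_BP = 2.7e9       # assumption: approximate mouse genome size

zfn_site = paired_site_length(ZFN_BP_PER_MONOMER)
talen_site = paired_site_length(TALEN_BP_PER_MONOMER)

print(zfn_site, expected_random_matches(zfn_site, GENOME_SIZE_BP))
print(talen_site, expected_random_matches(talen_site, GENOME_SIZE_BP))
```

Under these simplifying assumptions, both paired sites are long enough that a chance second occurrence in the genome is expected less than once, which is the intuition behind requiring heterodimers for cleavage.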

The efficiencies of producing specific double-strand chromosome breaks, using prior technologies such as meganucleases, ZFNs, and TALENs [ 28 , 32 , 38 ], were surpassed when CRISPR/Cas9 technology was shown to be effective in mammalian cells [ 47 , 48 , 49 ]. The essential feature that all of these technologies have in common is the production of a chromosome break at a specific location to facilitate genetic modifications [ 50 ]. In particular, the discovery of bacterial CRISPR-mediated adaptive immunity, and its application to genetic modification of human and mouse cells in 2013 [ 47 , 48 , 49 ], was a watershed event for modern science. Moreover, the introduction of CRISPR/Cas9 methodology has revolutionized transgenic mouse generation. This paradigm shift can be seen in the changing demand for nucleic acid microinjections into zygotes, and ES cell microinjections into blastocysts, at the University of Michigan Transgenic Core ( Figure 1 ). While previously established principles of genetic engineering using mouse ES cell technology [ 51 , 52 , 53 ] remain applicable, CRISPR/Cas9 methodologies have made it much easier to produce genetically engineered models in mice, rats, and other species [ 54 , 55 ]. Herein, we discuss principles in genetic engineering for the design and characterization of targeted alleles in mouse and rat zygotes, or in cultured cell lines, for the production of animal and cell culture models for biomedical research.

Figure 1. Recent trends in nucleic acid microinjection in zygotes, and embryonic stem (ES) cell microinjections into blastocysts, for the production of genetically engineered mice at the University of Michigan Transgenic Core. As shown, prior to the introduction of CRISPR/Cas9, the majority of injections were of ES cells, to produce gene-targeted mice, and DNA transgenes, to produce transgenic mice. After CRISPR/Cas9 became available, adoption was slow until 2014, when it was enthusiastically embraced, and the new technology corresponded to a reduced demand for ES cell and DNA microinjections.

2. Principles of Genetic Engineering

2.1. Types of Genetic Modifications

There are many types of genetic modifications that can be made to the genome. The ability to specifically target locations in the genome has expanded our ability to make changes that include knockouts (DNA sequence deletions), knockins (DNA sequence insertions), and replacements (replacement of DNA sequences with exogenous sequences). Deletions in the genome can be used to knock out gene expression [ 56 , 57 ]. Short deletions in the genome can be used to remove regulatory elements and thereby knock out gene expression [ 58 ], activate gene expression [ 59 ], or change protein structure/function by changing coding sequences [ 60 ].

Insertion of new genomic information can be used to knock in a variety of genetic elements. Knockins are also powerful approaches for modifying genes. Just as genomic deletions can be used to change gene function, knockins can be used to block gene function by inserting fluorescent reporter genes such as eGFP or mCherry, in such a way as to knock out the gene at the insertion point [ 61 , 62 ]. It is also possible to knock in fluorescent protein reporter genes, without knocking out the targeted gene [ 63 , 64 ]. Just as fluorescent proteins can be used to label proteins and cells, short knockins of epitope tags in proteins can be used to label proteins for detection with antibodies [ 64 , 65 ].

Replacement of DNA sequences in the genome can be used to achieve two purposes at the same time, such as blocking gene function, while activating the function of a new gene such as the lacZ reporter [ 66 ]. Large-scale sequence replacements are possible with mouse ES cell technology, such as the replacement of the mouse immunoglobulin locus with the human immunoglobulin locus to produce a “humanized” mouse [ 67 ]. Furthermore, very small replacements of single nucleotides can be used to model point mutations that are suspected of causing human disease [ 68 , 69 , 70 ].

A special type of DNA sequence replacement is the conditional allele. Conditional alleles permit normal gene expression until the site-specific Cre recombinase removes a critical exon that is flanked by loxP sites (a “floxed” exon). Cre recombinase recognizes 34 bp loxP (locus of crossing over (x), P1) elements, and catalyzes recombination between the two loxP sites [ 71 , 72 ]. Deletion of the critical exon causes a premature termination codon to occur in the mRNA transcript, triggering its nonsense-mediated decay and failure to make a protein [ 13 , 73 ]. Engineering conditional alleles was the approach used by the international knockout mouse project [ 13 ]. Mice with cell- and tissue-specific Cre recombinase expression are an important resource for the research community [ 74 ].
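The outcome of Cre recombination on a floxed exon can be illustrated with a short string-based sketch. It models only the simplest case: two loxP sites in the same orientation on the given strand, with excision of the intervening DNA and retention of a single loxP site. Inverted or heterotypic sites, as used in FLEX and COIN designs, are not modeled. The function name and the toy flanking sequences are hypothetical; the 34 bp loxP sequence is the canonical site.

```python
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"  # canonical 34 bp loxP site


def cre_excise(sequence: str, site: str = LOXP) -> str:
    """Delete the DNA between the first two directly repeated sites.

    One copy of the site is left behind, as happens after Cre-mediated
    excision. Sequences with fewer than two sites are returned unchanged.
    """
    first = sequence.find(site)
    if first == -1:
        return sequence
    second = sequence.find(site, first + len(site))
    if second == -1:
        return sequence
    # Keep everything upstream of the first site, then resume at the second.
    return sequence[:first] + sequence[second:]


# Toy allele: upstream intron, floxed critical exon, downstream intron.
upstream, exon, downstream = "GGATCC", "ATGGCCAAG", "TTTAAA"
floxed_allele = upstream + LOXP + exon + LOXP + downstream
recombined = cre_excise(floxed_allele)
print(recombined == upstream + LOXP + downstream)
```

The recombined allele lacks the critical exon but retains one loxP site, which is why genotyping assays can distinguish wild-type, floxed, and deleted alleles by amplicon size.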

Other site-specific recombinases, such as FLP, Dre, and Vika, that work on the same principle have also been applied to mouse models [ 75 , 76 , 77 , 78 , 79 , 80 ]. Recombinase knockins can be designed to knock out the endogenous gene or preserve its function [ 81 , 82 ]. A variation on the conditional allele is the inducible allele, which is silent until its expression is activated by Cre recombinase [ 79 ]. For example, reporter models can activate the expression of a fluorescent protein [ 83 ], change fluorescent reporter protein colors from red to green [ 84 ], or use a combinatorial approach to produce up to 90 fluorescent colors [ 85 ]. Another type of inducible allele is the FLEX allele. FLEX genes are Cre-dependent gene switches based on the use of heterotypic loxP sites [ 86 ]. In one application that combined Cre and FLP recombinases, it was demonstrated that a gene inactivated in ES cells by a gene trap could be switched back on and then switched off again [ 87 ]. In another application of heterotypic loxP sites in mouse ES cells, it was demonstrated that genes could be made conditional by inversion (COIN) [ 88 ]. This application has been used to produce mice with conditional genes for point mutations [ 89 ] and has been applied to produce conditional alleles of single-exon genes, which by definition lack critical exons [ 90 ].

2.2. Genetic Engineering with CRISPR/Cas9

The central principle of gene targeting with CRISPR/Cas9, or other directed DNA endonucleases, is that a double-strand DNA break is generated in the cell of interest. Following a chromosomal break, the principal outcomes of interest are nonhomologous end joining (NHEJ) repair [ 91 ] or homology-directed repair (HDR) [ 92 ]. When the break is directed to a coding exon in a gene, the outcome of NHEJ is usually a small insertion or deletion of DNA sequence at the break (indel), causing frame shifts in mRNA transcripts that lead to premature termination codons, causing nonsense-mediated mRNA decay and loss of protein expression [ 73 ]. The HDR pathway copies a template during DNA repair, and thus permits the insertion of modified genetic sequences supplied in the form of a DNA donor. This DNA donor can introduce new information into the genome flanked by homology arms on either side of the chromosome break. Typical applications of HDR include the use of genetic engineering to abrogate gene expression (gene knockouts), to modify amino acid codons (i.e., point mutations), to replace genes with new genes (e.g., knockins of fluorescent reporters, Cre recombinase, cDNA coding sequences), to produce conditional genes (floxed genes that are normally expressed until they are inactivated by Cre recombinase), to produce Cre-inducible genes (genes that are only expressed after Cre recombinase activates them), and to delete DNA from chromosomes (e.g., delete regulatory elements that control gene expression, delete entire genes, or delete up to a megabase of chromosome segments). The simplest of these modifications is abrogation of gene expression. Multifunctional alleles, such as FLEX alleles, require the cloning or synthesis of multi-element plasmid DNA donors for HDR.
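The link between an NHEJ indel and a premature termination codon can be made concrete with a small sketch. The example below uses a made-up 24-nucleotide coding sequence (not a real gene) and deletes a single base to show how the reading frame shifts and an earlier stop codon appears. The function names are hypothetical, and the sketch ignores splicing and the position rules that govern nonsense-mediated decay.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon_index(cds: str):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def is_frameshift(wild_type: str, edited: str) -> bool:
    """An indel shifts the frame when its net length is not a multiple of 3."""
    return (len(edited) - len(wild_type)) % 3 != 0


# Hypothetical coding sequence: ATG GCC TTG ACC GGA TCC AAA TAA
wild_type_cds = "ATGGCCTTGACCGGATCCAAATAA"
# Simulated NHEJ outcome: loss of one base immediately after the start codon.
edited_cds = wild_type_cds[:3] + wild_type_cds[4:]

print(is_frameshift(wild_type_cds, edited_cds))
print(first_stop_codon_index(wild_type_cds), first_stop_codon_index(edited_cds))
```

In this toy case the one-base deletion moves the first stop codon from the end of the open reading frame to the third codon. An in-frame indel (a multiple of three bases) would not shift the frame, which is why sequencing each allele matters before a line is established.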

The processes of CRISPR/Cas9-mediated modifications of genes (gene editing) to produce a new cell line or animal model have in common a series of steps to achieve the final product. First, a gene of interest is identified and the final desired allele is specified. The next step is to identify single guide RNA(s) (gRNAs) that will be used to target a chromosomal break in one or more places. There are numerous online tools that can be used for this purpose [ 93 ]. One of the most up-to-date and versatile sites is CRISPOR ( http://crispor.tefor.net ) [ 94 ]. Interestingly, the authors provide evidence that the predictive powers of algorithms vary depending on whether they were based on the analysis of gRNAs delivered as RNA molecules, versus gRNAs delivered as U6-transcribed DNA molecules [ 94 ]. In any event, the selection of a gRNA target (20 nucleotides), adjacent to a protospacer-adjacent motif (PAM; NGG motif), should not be done without the aid of a computer algorithm that minimizes the possibility of off-target hits. After a gRNA target is identified, a decision is made on how to obtain gRNAs. While it is possible to produce in vitro-transcribed gRNAs, this may be inadvisable inasmuch as in vitro-transcribed RNAs can trigger innate immune responses and cause cytotoxicity in cells [ 95 ]. Chemically synthesized gRNAs using phosphorothioate modifications that improve gRNA stability may be preferable alternatives to in vitro-transcribed molecules [ 96 , 97 ]. With a gRNA in hand, a Cas9 protein is then selected. There are numerous forms of Cas9 that can be used for different purposes [ 98 ]. For practical purposes, we limit our discussion to Cas9 varieties that are on the market. A number of commercial entities sell wild-type Cas9 protein. When wild-type Cas9 is used to target the genome with nonspecific guides, the frequency of off-target genomic hits, besides the desired Cas9 target, is very likely to increase [ 94 , 99 ].
Alternatives to the wild-type protein include enhanced specificity Cas9 from Sigma-Aldrich [ 100 ], and high-fidelity Cas9 from Integrated DNA Technologies [ 101 ]. In addition, there are other versions such as HF1 Cas9 [ 102 ], hyperaccurate Cas9 [ 103 ], and evolved Cas9 [ 104 ], all available in plasmid format from Addgene.org. As may be inferred from the names of these engineered Cas9 versions, they are designed to be more specific than wild type Cas9. Once the gRNAs and Cas9 protein are on hand, then it is a “simple” matter to combine them and deliver them to the target cell to produce a chromosome break and achieve a gene knockout by introducing premature termination codons or DNA sequence deletion of regulatory regions or entire genes.
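To illustrate what a “gRNA target adjacent to a PAM” means in sequence terms, the following minimal sketch enumerates every 20-nucleotide candidate followed by an NGG motif on both strands of an input sequence. It only lists candidates; it does not score off-target risk or cutting efficiency, which is the job of dedicated tools such as CRISPOR. The function names and the example sequence are hypothetical.

```python
import re


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_spcas9_targets(seq: str):
    """List (strand, start, protospacer, pam) for each 20-nt target + NGG.

    For the "-" strand, start is given in reverse-complement coordinates.
    A lookahead is used so that overlapping candidates are all reported.
    """
    hits = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for m in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", s):
            hits.append((strand, m.start(), m.group(1), m.group(2)))
    return hits


example = "GATTACAGATTACAGATTACTGGAAAA"  # made-up sequence for illustration
targets = find_spcas9_targets(example)
for hit in targets:
    print(hit)
```

Each candidate returned this way would still need to be checked against the whole genome for near-identical sites before it is ordered or synthesized.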

2.3. Locus-Specific Genetic Engineering Vectors in Mouse and Rat Zygotes

The most challenging type of genetic engineering is the insertion (i.e., knockin) of a long coding sequence to express a fluorescent reporter protein, Cre recombinase, or conditional allele (floxed gene). In addition to these genetic modifications, numerous other types of specialized reporters can be introduced, each designed to achieve a different purpose. There is great interest in achieving rapid and efficient gene insertions of reporters in animal models with CRISPR/Cas9 technology. It is generally recognized that the longer the insertion, the less efficient it is to produce a knockin animal. Additional challenges are locus-specific differences that affect efficiency. For example, it is fairly efficient to produce knockins into the genomic ROSA26 locus in mice, while other loci are targeted less efficiently, and some appear refractory to knockins. This variable accessibility to CRISPR/Cas9 complexes mirrors observations in mouse ES cell gene targeting technology, in which it was reported that some genes are not as efficiently targeted as others [ 105 ].

When the purpose of the experiment is to specifically modify the DNA sequence by changing amino acid codons, or introducing new genetic information, then a DNA donor must be delivered to the cells with Cas9 reagents. After the selected gRNAs and Cas9 proteins are demonstrated to produce the desired chromosome break, the DNA donor is designed and procured. The donor should be designed to insert into the genome such that it will not be cleaved by Cas9, usually by mutating the PAM site. The DNA donor may take the form of short oligonucleotides (<200 nt) [ 106 , 107 ], long single-stranded DNA molecules (>200 nt) [ 108 ], or double-stranded linear or circular DNA molecules of varying lengths [ 109 , 110 ].

DNA donor design principles should include the following: (1) nucleotide changes that prevent CRISPR/Cas9 cleavage of the chromosome, after introduction of the DNA donor; (2) insertion of restriction enzyme sites unique to the donor, to simplify downstream genotyping; (3) insertions of reporters or coding sequences, up to 1.5 kb in length, that can be introduced as long single-stranded DNA templates with short 100 base pair arms of homology [ 111 ], or as circular double-stranded DNA plasmids with longer (1.5 or 2 kb) arms of homology [ 63 , 110 ]; and (4) insertions of longer coding sequences, such as Cas9, that use circular double-stranded DNA donors with longer arms of homology [ 63 , 112 ]. It is also possible to use linear DNA fragments as donors [ 63 , 110 , 113 ], although the frequency of random integration of linear DNA molecules is much higher than that of circular donors, thus requiring careful quality control.
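The first two design principles above lend themselves to a simple automated check before a donor is ordered. The sketch below is illustrative only: it tests whether a sequence still contains the intact 20-nucleotide target followed by an NGG PAM (so that Cas9 could re-cut it), and whether a chosen restriction site is present in the donor but absent from the wild-type sequence. The function names, the toy sequences, and the choice of an EcoRI site are assumptions made for the example.

```python
import re


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def still_cleavable(dna: str, protospacer: str) -> bool:
    """True if either strand keeps the intact target followed by NGG."""
    pattern = re.escape(protospacer.upper()) + "[ACGT]GG"
    strands = (dna.upper(), reverse_complement(dna))
    return any(re.search(pattern, strand) for strand in strands)


def site_unique_to_donor(wild_type: str, donor: str, site: str) -> bool:
    """True if the restriction site occurs in the donor but not the wild type."""
    def contains(dna: str) -> bool:
        strands = (dna.upper(), reverse_complement(dna))
        return any(site.upper() in strand for strand in strands)
    return contains(donor) and not contains(wild_type)


PROTOSPACER = "GATTACAGATTACAGATTAC"                # hypothetical gRNA target
WILD_TYPE = "CC" + PROTOSPACER + "TGG" + "AACC"      # target with TGG PAM
DONOR = "CC" + PROTOSPACER + "TGAATTC" + "AACC"      # PAM mutated, EcoRI added
ECORI = "GAATTC"

print(still_cleavable(WILD_TYPE, PROTOSPACER))       # wild type can be cut
print(still_cleavable(DONOR, PROTOSPACER))           # donor should resist
print(site_unique_to_donor(WILD_TYPE, DONOR, ECORI))
```

A real design would also confirm that the PAM or protospacer changes are silent at the protein level and that the new restriction site does not occur elsewhere in the genotyping amplicon.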

The establishment of genetically modified mouse and rat models can be divided into three phases, after potential founder animals are born from CRISPR/Cas9-treated zygotes. In the first phase, animals with genetic modifications are identified. The first phase requires a sensitive and specific genotyping assay to identify cells or animals harboring the desired knockin. Genotyping potential founder mice for knockins typically begins with a PCR assay using a primer that recognizes the exogenous DNA sequence and a primer in genomic DNA outside of the homology arm in the targeting vector. Accordingly, PCR assays are designed to specifically detect the upstream and downstream junctions of the inserted DNA in genomic DNA. Subsequent assays may be used to confirm that the entire exogenous sequence is intact. Conditional genes represent a special case of insertion, as PCR assays designed to detect correct insertion of loxP-flanked exons will also detect genomic DNA [ 108 ]. In the second phase, founders are mated and G1 pups are identified that inherited the desired mutation [ 114 ]. In the third phase, it is essential to sequence additional genomic regions upstream and downstream of the inserted targeting vector DNA, because Cas9 is very efficient at inducing chromosomal breaks, but has no repair function. Thus, it is not unusual to identify deletions/insertions that flank the immediate vicinity of the Cas9 cut site or inserted targeting vector DNA sequences [ 115 , 116 ]. If such deletions affect nearby exons, gene expression can be disrupted, and confounding phenotypes may arise.

For gene knockouts, PCR amplicons from primers that span the chromosome break site are analyzed by DNA sequencing. Any animals that are wild-type at the allele are not further characterized or used, so as to prevent any off-target hits from entering the animal colony or confounding phenotypes. Animals that show disrupted DNA sequences at the Cas9 cut site are mated with wild-type animals for the transmission of mutant alleles that produce premature termination codons, for gene knockout models [ 57 , 73 ]. As founders from Cas9-treated zygotes are genetic mosaics [ 55 , 115 ], it is essential to mate them to wild-type breeding partners, such that obligate heterozygotes are produced. In the heterozygotes, the wild-type sequence and the mutant sequence can be precisely identified by techniques such as TOPO TA cloning (Invitrogen, CA, USA) or next-generation sequencing (NGS) methods [ 117 , 118 , 119 , 120 ]. Animals carrying a defined indel, with the desired properties, are then used to establish lines for phenotyping. The identical approach is used when short DNA sequences are deleted by two guide RNAs [ 58 ]. Intercrossing mosaic founders will produce offspring carrying two different mutations with different effects on gene expression. These animals are not suitable for line establishment.

2.4. Gene Editing in Immortalized Cell Lines

CRISPR/Cas9 gene editing in immortalized cell lines presents a set of challenges distinct from those encountered in the generation of transgenic animals. Cell lines encompass a wide range of characteristics, resulting in each line being handled differently. Some of these characteristics include phenotype heterogeneity, aberrant chromosome ploidy, varying growth rates, DNA damage response efficiency, transfection efficiency, and clonability. While the principles of CRISPR/Cas9 experimental design, as stated above, remain the same, three major considerations must be taken into account when using cell lines: (1) copy number variation, or the number of alleles of the gene of interest; (2) transfection efficiency of the cell line; and (3) clonal isolation of the modified cell line. In cell lines, all alleles need to be modified in the generation of a null phenotype, or in the creation of a homozygous genotype. Unlike transgenic animals, where single allele gene edits can be bred to homozygosity, CRISPR/Cas9-edited cells must be screened for homozygous gene edits. Copy number variations within the cell line can decrease the efficiency and add labor and time (e.g., editing 3 or 4 copies versus editing 1 or 2). Furthermore, an aberrant number of chromosomes, deletions, duplications, pseudogenes, and repetitive regions complicate genetic backgrounds for PCR analysis of the CRISPR edits. To help with some of these issues, one common approach is to use NGS on all the clonal isolates for a complete understanding of copy number variations for each clonal cell line generated, and the exact sequence for each allele.
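The effect of copy number on screening effort can be approximated with simple probability. The sketch below assumes, purely for illustration, that each allele is edited independently with the same per-allele efficiency; real cell lines may deviate from this. It estimates the fraction of clones in which every copy is edited and the number of clones to screen for a chosen chance of recovering at least one such clone. The function names and the 50% efficiency are hypothetical.

```python
import math


def prob_all_alleles_edited(per_allele_efficiency: float, copies: int) -> float:
    """Probability that every copy is edited, assuming independent alleles."""
    return per_allele_efficiency ** copies


def clones_to_screen(per_allele_efficiency: float, copies: int,
                     confidence: float = 0.95) -> int:
    """Clones needed to find at least one fully edited clone with given confidence."""
    p = prob_all_alleles_edited(per_allele_efficiency, copies)
    if p <= 0:
        raise ValueError("editing efficiency must be greater than zero")
    if p >= 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


for copies in (1, 2, 3, 4):
    print(copies,
          prob_all_alleles_edited(0.5, copies),
          clones_to_screen(0.5, copies))
```

With an assumed 50% per-allele efficiency, a diploid locus calls for screening about 11 clones for a 95% chance of finding a fully edited one, while a four-copy locus calls for about 47, which illustrates why aneuploid lines add labor and time.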

As not all cell types are the same, different CRISPR/Cas9 delivery techniques may need to be tested to identify which method works best. One approach is to use viruses or transposons to deliver CRISPR/Cas9 reagents (detailed below). However, the viruses and transposons themselves will integrate into the genome and allow long-term expression of CRISPR/Cas9 in the cell. This prolonged expression of gRNAs and Cas9 protein may lead to off-target effects. Moreover, transfection and electroporation can have varying efficiencies, depending on the cell lines and the form of CRISPR/Cas9 reagents (e.g., DNA plasmids or ribonucleoprotein particles (RNPs)).

Following delivery, clonal isolation is required to identify the edited cell line, and at times, can result in the isolation of a cell phenotype different than that expected, arising from events apart from the desired gene edit. While flow cytometry can aid in isolating individual cells, specific flow conditions, such as pressure, may require adjustment to ensure cell viability. Furthermore, one clonal isolate from a cell line may possess a different number of alleles for the targeted gene than another clonal isolate. Additionally, not all cell lines will grow from a single cell, thus complicating isolation. Growth conditions and cell viability can also change when isolating single cells.

Despite these challenges, new advances in CRISPR technology can likely alleviate some of these difficulties when editing cell lines. For example, fluorescently tagged Cas9 and RNAs help to isolate only transfected cells, which helps to eliminate time wasted on screening untransfected cells. Cas9-variants that harbor mutations that only create single-strand nicks (Cas9-nickases) complexed with two different, but proximal gRNAs can increase HDR-mediated knockin [ 48 , 121 ]. Similarly, fusing Cas9 with base-editing enzymes can also increase the efficiency of editing, without causing double-strand breaks [ 121 ].

2.5. Viruses and Transposons as Genetic Engineering Vectors

Viral and transposon vectors have been engineered to be safe, efficient delivery systems of exogenous genetic material into cells. The natural lifecycle of some viruses and transposons includes stable integration into the host genome. In the field of genome engineering, these vectors can be used to modify the genome in a non-directed fashion, by inserting cassettes expressing any cDNA, shRNA, miRNA, or any non-coding RNA. The most widely used vectors capable of integrating ectopic genetic material into cells are retroviruses, lentiviruses, and adeno-associated virus (AAV). The genomes of these viruses are flanked by terminal repeats that mark the boundaries of the integration. In engineering these viruses into recombinant vector systems, all the viral genes are removed from between the flanking terminal repeats and supplied in trans for the recombinant virus to be packaged. These “gutted”, nonreplicable viral vectors allow for the packaging, delivery, integration, and expression of cDNAs of interest, shRNAs, and CRISPR/Cas9, without viral replication in various biological targets.

Similar to recombinant viruses, transposon vectors are also “gutted”, separating the transposase from the terminal repeat-flanked genetic material to be inserted into the genome. DNA transposons are mobile elements (“jumping genes”) that integrate into the host genome through a cut-and-paste mechanism [ 122 ]. Transposons, much like viral vectors, are flanked by repeats that mark the region to be transposed [ 123 ]. The enzyme transposase binds the flanking DNA repeats and mediates the excision and integration into the genome. Unlike viral vectors, transposons are not packaged into viral particles, but form a DNA-protein complex that stays in the host cell. Thus, the transgene to be integrated can be much larger than the packaging limits of some viruses.

Two transposons, Sleeping Beauty (SB) and piggyBac (PB), have been engineered and optimized for high activity for generating transgenic mammalian cell lines [ 124 , 125 , 126 ]. Sleeping Beauty is a transposable element resurrected from fish genomes. The SB system has been used to generate transgenic HeLa cell lines, T-cells expressing chimeric antigen receptors that recognize tumor-specific antigens, and transgenic primary human stem cells [ 127 , 128 , 129 ]. The insect-derived PB system also has been used to generate transgenic cell lines [ 126 , 130 , 131 ]. The PB system was used to generate induced pluripotent stem cells (iPSCs) from mouse embryonic fibroblasts, by linking four or five cDNAs of the reprogramming (Yamanaka) factors [ 132 ] with intervening peptide self-cleavage (P2A) sites, thus delivering all of the factors in one vector [ 130 ]. Furthermore, once reprogrammed, the transgene may be removed by another round of PB transposase activity, leaving no genetic trace of integration or excision (i.e., transgene-free iPSCs). Following PB transposase activity, epigenetic differences remaining at the endogenous promoters of the reprogramming factor genes result in sustained expression and pluripotency, despite transgene removal.

Aside from transgene insertion, Sleeping Beauty (SB) and piggyBac (PB) have both been engineered to deliver CRISPR/Cas9 reagents into cells [ 133 , 134 , 135 ]. Similar to lentivirus, the stable integration of CRISPR/Cas9 by transposons could increase the efficacy of targeting and modifying multiple alleles. SB and PB have been used to deliver multiple gRNAs to target multiple genes (instead of just one), aiding in high-throughput screening. Furthermore, owing to the nature of PB excision stated above, the integrated CRISPR/Cas9 can be removed once a clonal cell line is established, to limit off-target effects. However, engineered transposons must be transfected into cells. As stated above, efficiencies vary between different cell lines and transfection methods. One potential solution to overcome this challenge is to merge technologies. For example, instead of transfecting cells with a plasmid harboring a gRNA flanked by SB terminal repeats (SB-CRISPR), the SB-CRISPR may be flanked by recombinant AAV (rAAV) terminal repeats (AAV-SB-CRISPR), allowing for packaging into rAAV. To that end, rAAV-SB-CRISPR has been used to infect primary murine T-cells, and deliver the SB-CRISPR construct [ 136 ].

2.6. Genetic Engineering Using Retroviruses

Retroviruses are RNA viruses that replicate through a DNA intermediate [ 137 ]. They belong to a large family of viruses including both onco-retroviruses, such as the Moloney murine leukemia virus (MMLV) (simply referred to as retrovirus), and lentiviruses, including human immunodeficiency virus (HIV). In all retroviruses, the RNA genome is flanked on both sides by long terminal repeats (LTRs); packaged with viral reverse transcriptase, integrase, and protease, surrounded by a protein capsid; and then enveloped into a lipid-based particle [ 138 ]. Envelope proteins interact with specific host cell surface receptors to mediate entry into host cells through membrane fusion. Then, the RNA genome is reverse-transcribed by the associated viral reverse transcriptase. The proviral DNA is then transported into the nucleus, along with viral integrase, resulting in integration into the host cell genome [ 139 ]. By contrast, the retroviral MMLV pre-integration complex is incapable of crossing the nuclear membrane, thus requiring the cell to undergo mitosis to gain access to chromatin [ 139 ], while lentiviral pre-integration complexes can cross nuclear membrane pores, allowing genome integration in both dividing and non-dividing cells.

Large-scale assessments of genomic material composition have uncovered features associated with retroviral insertion into mammalian genomes [ 140 ]. Although determination of integration target sites remains ill-defined, it does depend on both cellular and viral factors. For retroviruses such as MMLV, integration is preferentially targeted to promoter and regulatory regions [ 140 , 141 , 142 ]. Such preferences can be genotoxic owing to insertional activation of proto-oncogenes in patients undergoing gene therapy treatments for X-linked severe combined immunodeficiency [ 143 , 144 ], Wiskott–Aldrich syndrome [ 143 ], and chronic granulomatous disease [ 145 ]. Likewise, retroviral integration can generate chimeric and read-through transcripts driven by strong retroviral LTR promoters, post-transcriptional deregulation of endogenous gene expression by introducing retroviral splice sites (leading to aberrant splicing), and retroviral polyadenylation signals that lead to premature termination of endogenous transcripts [ 142 , 146 , 147 ].

Unlike retroviruses, lentiviruses prefer to integrate into transcribed portions of expressed genes in gene-rich regions, distanced from promoters and regulatory elements [ 140 , 142 , 148 ]. The cellular protein LEDGF/p75 aids in the target site selection by binding directly to both the active gene and the viral integrase within the HIV pre-integration complex [ 149 ]. Although the propensity of lentivirus to integrate into the body of expressed genes should increase the incidence of post-transcriptional deregulation, deletion of promoter elements from the lentiviral LTR (self-inactivating (SIN) vectors) has been reported to decrease transcriptional termination, but increase the generation of chimeric transcripts [ 149 ]. Overall, it appears that lentiviral SIN vectors are less likely to cause tumors than retroviral vectors with an active LTR promoter [ 148 , 150 , 151 , 152 ].

The 7.5–10 kb packaging limit of lentiviruses can accommodate the packaging, delivery, and stable integration of Cas9 cDNA, gRNAs, or Cas9 and gRNAs (all-in-one) to cells [ 153 , 154 ]. Often, a selectable marker, such as drug resistance, can also be included to isolate transduced cells. The high transduction efficiency of lentivirus can result in an abundance of CRISPR/Cas9-expressing cells to screen, compared with more traditional transfection methods. Stable and prolonged expression of CRISPR/Cas9 can facilitate targeting of multiple alleles of the gene of interest, resulting in more cells harboring homozygous gene modifications. Conversely, stable integration of CRISPR/Cas9 increases potential off-target effects. Moreover, lentiviral integration itself is a factor that may confound cellular phenotypes and should be considered when characterizing CRISPR-edited cell lines.

2.7. Gene Targeting Using Adeno-Associated Virus

Adeno-associated virus (AAV) is a human parvovirus with a single-stranded DNA genome of 4.7 kb, which was originally identified as a contaminant of adenoviral preparations [ 155 ]. The genome is flanked on both sides by inverted terminal repeats (ITR) and contains two genes, rep and cap [ 156 , 157 ]. Different capsid proteins confer serotype and tissue-specific targeting of distinct AAVs, in vivo. AAV cannot replicate on its own, and requires a helper virus, such as adenovirus or herpes simplex virus (HSV), to provide essential proteins in trans. AAV is the only known virus to integrate into the human genome in a site-specific manner at the AAVS1 site on chromosome 19q13.3-qter [ 158 , 159 , 160 ]. Although the precise mechanism is not well understood, the Rep protein functions to tether the virus to the host genome through direct binding of the AAV ITR and the AAVS1 site [ 158 , 160 , 161 ]. In the recombinant AAV (rAAV) vector system, the rep and cap genes are removed from the packaged virus, resulting in the loss of site-specific integration into the AAVS1 site. Despite removal of Rep, it has been shown that rAAV can still integrate, albeit randomly, into the host genome, via nonhomologous recombination, at low frequencies [ 162 , 163 , 164 ]. Furthermore, numerous clinical trials, to date, have shown that rAAV integration is safe and has no genotoxicity [ 165 , 166 , 167 ]. However, this “safety” is controversial, owing to preclinical studies suggesting genotoxicity in mouse models [ 168 , 169 , 170 , 171 ]. More studies are needed to understand the cellular impact of rAAV integration.

rAAVs have been used to deliver one or two CRISPR guide RNAs (gRNAs), in cells and model animals, by taking advantage of different rAAV serotypes to target specific cells or tissue types. Owing to the packaging capacity of rAAV, SpCas9 must be delivered as a separate virus, unlike lentivirus, which can be delivered as an “all-in-one” CRISPR/Cas9 vector. However, alternate, smaller Cas9s can be packaged into rAAVs [ 172 ]. Furthermore, rAAVs can be used to deliver repair templates or single-stranded donor oligonucleotides (ssODNs) for homology-directed repair (HDR), relying on the single-stranded nature of the AAV genome [ 173 , 174 ]. It has also been observed that rAAVs can integrate into the genome at CRISPR/Cas9-induced breaks in various cultured mouse tissue types, including neurons and muscle [ 175 ]. This observation goes against the notion of rAAVs integrating only at the AAVS1 locus, and should be considered when analyzing and characterizing rAAV-mediated CRISPR-edited cells.

3. Conclusions

There are many approaches to inserting new genetic information into chromosomes in cells and animals. At this time, the most appealing method is single copy gene insertion at a defined locus. This approach has numerous advantages, with respect to reproducible transgene expression. Random insertion transgenesis has been effectively used to probe gene function in mouse models [ 176 ]. It is generally accepted that random insertion requires a spontaneous chromosome break [ 176 ]. Recent NGS data suggest that the repair mechanism resembles chromothripsis [ 118 , 177 ]. In addition to unintended gene disruptions owing to chromosome damage, the random insertion of transgenes exposes them to “position effects” in which their expression is controlled by neighboring genes [ 118 , 178 ]. Ideally, the insertion of reporter cDNAs in the genome results in single copy transgene insertions in defined loci in such a way that endogenous genes are not disrupted, and reporters are placed under the control of specific endogenous promoters [ 179 ]. The application of CRISPR/Cas9 technology to this problem shows that it can be used to achieve these goals [ 63 , 82 , 180 ]. The development of CRISPR/Cas9 base editing technology shows that it is possible to make single-nucleotide changes in the genome [ 181 , 182 , 183 , 184 ]. Base editors have the advantage that double-strand chromosome breaks are not produced, thus lessening the chances of undesirable mutations in the genome. A novel approach to small insertions in the genome, which uses an RNA donor sequence fused to the sgRNA in combination with a reverse transcriptase fused to a Cas9 nickase, also avoids the need to produce double-strand breaks on chromosomes. This approach is referred to as “prime editing” [ 185 ]. CRISPR technology that avoids chromosome breaks, while making changes to the genome, is extremely important in clinical applications where unintended changes can adversely affect patients. These advanced versions of CRISPR technology will be important for future research.

The desire to apply CRISPR/Cas9 for the targeted insertion of transgenes is reflected in the profusion of methods directed towards this purpose [ 63 , 108 , 110 , 112 , 186 , 187 ]. Each method was successfully used to engineer mouse and rat genomes ( Table 1 ). Each method was shown to be more cost-effective and rapid than the application of mouse or rat ES cell technology. For the practitioner of the art, the question remains: which method is most efficient? That is to say, which method minimizes the number of animals needed for zygote production and maximizes the number of gene-targeted founders? One approach to this question is to compare the transgenic efficiency of each method [ 188 ]. The results in Table 1 show that the highest efficiency experiments were obtained when long single-stranded DNA donors and Cas9 ribonucleoproteins were used to produce genetically engineered mice. All methods are very effective compared with traditional methods of gene targeting in zygotes. Perhaps future avenues to even more efficient gene targeting lie in the application of small molecule activators for HDR [ 189 , 190 , 191 ].

Table 1. Analysis of targeting vector knockin by CRISPR/Cas9 in mouse and rat zygotes.

1 Conditional: a critical exon was flanked by loxP sites, so as to produce a Cre-dependent knockout allele. Reporter: an exogenous coding sequence, such as that of a fluorescent protein, was inserted.

2 RNP: ribonucleoprotein; Cas9 protein was complexed with guide RNA. Cas9 mRNA: in vitro transcribed mRNA from a plasmid containing Cas9, mixed with guide RNA. Cas9-mSa: in vitro transcribed mRNA from a plasmid containing Cas9 fused to monomeric streptavidin.

3 ssDNA: single-stranded DNA repair template. BioPCR: PCR was used to prepare biotinylated PCR amplicons. dsDNA: circular double-stranded DNA repair template. HMEJ: homology-mediated end joining; circular double-stranded DNA repair template incorporating sgRNA targets that flank homology arms. Tild: linear double-stranded DNA repair template. AAV: an adeno-associated virus vector donor was cultured with zygotes that had been loaded with Cas9 RNP by electroporation.

4 Efficiency was calculated as the number of genetically engineered mice or rats produced per 100 zygotes treated with CRISPR/Cas9 reagents and transferred to pseudopregnant females.
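The efficiency measure defined in footnote 4 is a simple ratio, and a short sketch may make the comparison between methods concrete. The function name and the example counts below are hypothetical and are not taken from Table 1.

```python
def knockin_efficiency(founders: int, zygotes_transferred: int) -> float:
    """Genetically engineered animals obtained per 100 zygotes transferred."""
    if zygotes_transferred <= 0:
        raise ValueError("zygotes_transferred must be positive")
    return 100.0 * founders / zygotes_transferred


# Hypothetical experiment: 12 knockin founders from 150 transferred zygotes.
print(knockin_efficiency(12, 150))  # 8.0
```

Expressing every experiment on the same per-100-zygotes scale is what allows methods with very different cohort sizes to be ranked against one another.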

Author Contributions

Conceptualization, T.L.S. Writing—review and editing, T.M.L.; H.C.K.; and, T.L.S. All authors have read and agreed to the published version of the manuscript.

This research was supported by Institutional Funds from the University of Michigan Biomedical Research Core Facilities.

Conflicts of Interest

The authors declare no conflict of interest.

Genetics journal retracts papers from China due to human rights concerns

15 February 2024

Read: Guardian

  • Open access
  • Published: 19 February 2024

Genomic data in the All of Us Research Program

The All of Us Research Program Genomics Investigators

Nature (2024)

  • Genetic variation
  • Genome-wide association studies

Comprehensively mapping the genetic basis of human disease across diverse individuals is a long-standing goal for the field of human genetics 1 , 2 , 3 , 4 . The All of Us Research Program is a longitudinal cohort study aiming to enrol a diverse group of at least one million individuals across the USA to accelerate biomedical research and improve human health 5 , 6 . Here we describe the programme’s genomics data release of 245,388 clinical-grade genome sequences. This resource is unique in its diversity as 77% of participants are from communities that are historically under-represented in biomedical research and 46% are individuals from under-represented racial and ethnic minorities. All of Us identified more than 1 billion genetic variants, including more than 275 million previously unreported genetic variants, more than 3.9 million of which had coding consequences. Leveraging linkage between genomic data and the longitudinal electronic health record, we evaluated 3,724 genetic variants associated with 117 diseases and found high replication rates across both participants of European ancestry and participants of African ancestry. Summary-level data are publicly available, and individual-level data can be accessed by researchers through the All of Us Researcher Workbench using a unique data passport model with a median time from initial researcher registration to data access of 29 hours. We anticipate that this diverse dataset will advance the promise of genomic medicine for all.

Comprehensively identifying genetic variation and cataloguing its contribution to health and disease, in conjunction with environmental and lifestyle factors, is a central goal of human health research 1 , 2 . A key limitation in efforts to build this catalogue has been the historic under-representation of large subsets of individuals in biomedical research including individuals from diverse ancestries, individuals with disabilities and individuals from disadvantaged backgrounds 3 , 4 . The All of Us Research Program (All of Us) aims to address this gap by enrolling and collecting comprehensive health data on at least one million individuals who reflect the diversity across the USA 5 , 6 . An essential component of All of Us is the generation of whole-genome sequence (WGS) and genotyping data on one million participants. All of Us is committed to making this dataset broadly useful, not only by democratizing access to it across the scientific community but also by returning value to the participants themselves through individual DNA results, such as genetic ancestry, hereditary disease risk and pharmacogenetics according to clinical standards, for those who wish to receive these research results.

Here we describe the release of WGS data from 245,388 All of Us participants and demonstrate the impact of this high-quality data in genetic and health studies. We carried out a series of data harmonization and quality control (QC) procedures and conducted analyses characterizing the properties of the dataset including genetic ancestry and relatedness. We validated the data by replicating well-established genotype–phenotype associations including low-density lipoprotein cholesterol (LDL-C) and 117 additional diseases. These data are available through the All of Us Researcher Workbench, a cloud platform that embodies and enables programme priorities, facilitating equitable data and compute access while ensuring responsible conduct of research and protecting participant privacy through a passport data access model.

The All of Us Research Program

To accelerate health research, All of Us is committed to curating and releasing research data early and often 6 . Less than five years after national enrolment began in 2018, this fifth data release includes data from more than 413,000 All of Us participants. Summary data are made available through a public Data Browser, and individual-level participant data are made available to researchers through the Researcher Workbench (Fig. 1a and Data availability).

Figure 1

a, The All of Us Research Hub contains a publicly accessible Data Browser for exploration of summary phenotypic and genomic data. The Researcher Workbench is a secure cloud-based environment of participant-level data in a Controlled Tier that is widely accessible to researchers. b, All of Us participants have rich phenotype data from a combination of physical measurements, survey responses, EHRs, wearables and genomic data. Dots indicate the presence of the specific data type for the given number of participants. c, Overall summary of participants under-represented in biomedical research (UBR) with data available in the Controlled Tier. The All of Us logo in a is reproduced with permission of the National Institutes of Health’s All of Us Research Program.

Participant data include a rich combination of phenotypic and genomic data (Fig. 1b ). Participants are asked to complete consent for research use of data, sharing of electronic health records (EHRs), donation of biospecimens (blood or saliva, and urine), in-person provision of physical measurements (height, weight and blood pressure) and surveys initially covering demographics, lifestyle and overall health 7 . Participants are also consented for recontact. EHR data, harmonized using the Observational Medical Outcomes Partnership Common Data Model 8 ( Methods ), are available for more than 287,000 participants (69.42%) from more than 50 health care provider organizations. The EHR dataset is longitudinal, with a quarter of participants having 10 years of EHR data (Extended Data Fig. 1 ). Data include 245,388 WGSs and genome-wide genotyping on 312,925 participants. Sequenced and genotyped individuals in this data release were not prioritized on the basis of any clinical or phenotypic feature. Notably, 99% of participants with WGS data also have survey data and physical measurements, and 84% also have EHR data. In this data release, 77% of individuals with genomic data identify with groups historically under-represented in biomedical research, including 46% who self-identify with a racial or ethnic minority group (Fig. 1c , Supplementary Table 1 and Supplementary Note ).

Scaling the All of Us infrastructure

The genomic dataset generated from All of Us participants is a resource for research and discovery and serves as the basis for return of individual health-related DNA results to participants. Consequently, the US Food and Drug Administration determined that All of Us met the criteria for a significant risk device study. As such, the entire All of Us genomics effort from sample acquisition to sequencing meets clinical laboratory standards 9 .

All of Us participants were recruited through a national network of partners, starting in 2018, as previously described 5 . Participants may enrol through All of Us-funded health care provider organizations or direct volunteer pathways and all biospecimens, including blood and saliva, are sent to the central All of Us Biobank for processing and storage. Genomics data for this release were generated from blood-derived DNA. The programme began return of actionable genomic results in December 2022. As of April 2023, approximately 51,000 individuals were sent notifications asking whether they wanted to view their results, and approximately half have accepted. Return continues on an ongoing basis.

The All of Us Data and Research Center maintains all participant information and biospecimen ID linkage to protect participant confidentiality, and coded identifiers (participant and aliquot level) are used to track each sample through the All of Us genomics workflow. This workflow facilitates weekly automated aliquot and plating requests to the Biobank, supplies relevant metadata for the sample shipments to the Genome Centers, and contains a feedback loop to inform action on samples that fail QC at any stage. Further, the consent status of each participant is checked before sample shipment to confirm that they are still active. Although all participants with genomic data are consented for the same general research use category, the programme accommodates different preferences for the return of genomic data to participants, and only data for those individuals who have consented for return of individual health-related DNA results are distributed to the All of Us Clinical Validation Labs for further evaluation and health-related clinical reporting. All participants in All of Us who choose to get health-related DNA results have the option to schedule a genetic counselling appointment to discuss their results. Individuals with positive findings who choose to obtain results are required to schedule an appointment with a genetic counsellor to receive those findings.

Genome sequencing

To satisfy the requirements for clinical accuracy, precision and consistency across DNA sample extraction and sequencing, the All of Us Genome Centers and Biobank harmonized laboratory protocols, established standard QC methodologies and metrics, and conducted a series of validation experiments using previously characterized clinical samples and commercially available reference standards 9 . Briefly, PCR-free barcoded WGS libraries were constructed with the Illumina Kapa HyperPrep kit. Libraries were pooled and sequenced on the Illumina NovaSeq 6000 instrument. After demultiplexing, initial QC analysis is performed with the Illumina DRAGEN pipeline (Supplementary Table 2 ) leveraging lane, library, flow cell, barcode and sample level metrics as well as assessing contamination, mapping quality and concordance to genotyping array data independently processed from a different aliquot of DNA. The Genome Centers use these metrics to determine whether each sample meets programme specifications and then submit sequencing data to the Data and Research Center for further QC, joint calling and distribution to the research community ( Methods ).

This effort to harmonize sequencing methods, multi-level QC and use of identical data processing protocols mitigated the variability in sequencing location and protocols that often leads to batch effects in large genomic datasets 9 . As a result, the data are not only of clinical-grade quality, but also consistent in coverage (≥30× mean) and uniformity across Genome Centers (Supplementary Figs. 1 – 5 ).

Joint calling and variant discovery

We carried out joint calling across the entire All of Us WGS dataset (Extended Data Fig. 2 ). Joint calling leverages information across samples to prune artefact variants, which increases sensitivity, and enables flagging samples with potential issues that were missed during single-sample QC 10 (Supplementary Table 3 ). Scaling conventional approaches to whole-genome joint calling beyond 50,000 individuals is a notable computational challenge 11 , 12 . To address this, we developed a new cloud variant storage solution, the Genomic Variant Store (GVS), which is based on a schema designed for querying and rendering variants in which the variants are stored in GVS and rendered to an analysable variant file, as opposed to the variant file being the primary storage mechanism (Code availability). We carried out QC on the joint call set on the basis of the approach developed for gnomAD 3.1 (ref.  13 ). This included flagging samples with outlying values in eight metrics (Supplementary Table 4 , Supplementary Fig. 2 and Methods ).

To calculate the sensitivity and precision of the joint call dataset, we included four well-characterized samples. We sequenced the National Institute of Standards and Technology reference materials (DNA samples) from the Genome in a Bottle consortium 13 and carried out variant calling as described above. We used the corresponding published set of variant calls for each sample as the ground truth in our sensitivity and precision calculations 14 . The overall sensitivity for single-nucleotide variants was over 98.7% and precision was more than 99.9%. For short insertions or deletions, the sensitivity was over 97% and precision was more than 99.6% (Supplementary Table 5 and Methods ).
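The two accuracy figures quoted above follow from the standard definitions: sensitivity is the share of truth-set variants that were called, and precision is the share of called variants that are in the truth set. The sketch below illustrates those definitions on toy data. The variant tuples and the function name are hypothetical, and a real benchmark would also restrict the comparison to high-confidence regions and normalize variant representation, which this sketch does not attempt.

```python
from typing import NamedTuple, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


class Accuracy(NamedTuple):
    sensitivity: float
    precision: float


def call_accuracy(truth: Set[Variant], calls: Set[Variant]) -> Accuracy:
    """Compare a call set with a ground-truth set by exact variant match.

    Both sets are assumed to be non-empty.
    """
    true_pos = len(truth & calls)    # called and in the truth set
    false_neg = len(truth - calls)   # in the truth set but missed
    false_pos = len(calls - truth)   # called but absent from the truth set
    return Accuracy(
        sensitivity=true_pos / (true_pos + false_neg),
        precision=true_pos / (true_pos + false_pos),
    )


# Toy data, not real reference-material calls.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "GA"), ("chr2", 900, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "GA"), ("chr3", 10, "A", "C")}

result = call_accuracy(truth, calls)
print(result.sensitivity, result.precision)  # 0.75 0.75
```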

The joint call set included more than 1 billion genetic variants. We annotated the joint call dataset on the basis of functional annotation (for example, gene symbol and protein change) using Illumina Nirvana 15 . We defined coding variants as those inducing an amino acid change on a canonical ENSEMBL transcript and found 272,051,104 non-coding and 3,913,722 coding variants that have not been described previously in dbSNP 16 v153 (Extended Data Table 1 ). A total of 3,912,832 (99.98%) of the coding variants are rare (allelic frequency < 0.01) and the remaining 883 (0.02%) are common (allelic frequency > 0.01). Of the coding variants, 454 (0.01%) are common in one or more of the non-European computed ancestries in All of Us, rare among participants of European ancestry, and have an allelic number greater than 1,000 (Extended Data Table 2 and Extended Data Fig. 3 ). The distributions of pathogenic, or likely pathogenic, ClinVar variant counts per participant, stratified by computed ancestry, filtered to only those variants that are found in individuals with an allele count of <40 are shown in Extended Data Fig. 4 . The potential medical implications of these known and new variants with respect to variant pathogenicity by ancestry are highlighted in a companion paper 17 . In particular, we find that the European ancestry subset has the highest rate of pathogenic variation (2.1%), which was twice the rate of pathogenic variation in individuals of East Asian ancestry 17 . The lower frequency of variants in East Asian individuals may be partially explained by the fact that the sample size in that group is small and that there may be knowledge bias in the variant databases that reduces the number of findings in some of the less-studied ancestry groups.
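The rare/common split and the ancestry-specific filter described above can be illustrated with a few lines of code. This is a sketch under stated assumptions: the ancestry labels, the example counts and the function names are hypothetical, the handling of a frequency of exactly 0.01 is a choice made here because the text leaves it unassigned, and applying the allele-number floor within the non-European ancestry is one reading of the filter.

```python
from typing import Dict, Tuple

RARE_THRESHOLD = 0.01       # allele-frequency cut-off quoted in the text
MIN_ALLELE_NUMBER = 1000    # allele-number floor quoted in the text


def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Alternate-allele frequency: observed copies over called alleles."""
    return allele_count / allele_number


def frequency_class(af: float) -> str:
    # The text defines rare as AF < 0.01 and common as AF > 0.01; treating
    # AF == 0.01 as common is an assumption made only for this sketch.
    return "rare" if af < RARE_THRESHOLD else "common"


def common_outside_europe_only(counts: Dict[str, Tuple[int, int]]) -> bool:
    """True if rare in 'eur' but common in another ancestry with enough alleles."""
    eur_ac, eur_an = counts["eur"]
    if frequency_class(allele_frequency(eur_ac, eur_an)) != "rare":
        return False
    return any(
        an > MIN_ALLELE_NUMBER and allele_frequency(ac, an) > RARE_THRESHOLD
        for label, (ac, an) in counts.items()
        if label != "eur"
    )


# Hypothetical (allele count, allele number) pairs per computed ancestry.
example = {"eur": (5, 200_000), "afr": (900, 50_000), "eas": (0, 4_000)}
print(common_outside_europe_only(example))  # True
```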

Genetic ancestry and relatedness

Genetic ancestry inference confirmed that 51.1% of the All of Us WGS dataset is derived from individuals of non-European ancestry. Briefly, the ancestry categories are based on the same labels used in gnomAD 18 . We trained a classifier on a 16-dimensional principal component analysis (PCA) space of a diverse reference based on 3,202 samples and 151,159 autosomal single-nucleotide polymorphisms. We projected the All of Us samples into the PCA space of the training data, based on the same single-nucleotide polymorphisms from the WGS data, and generated categorical ancestry predictions from the trained classifier ( Methods ). Continuous genetic ancestry fractions for All of Us samples were inferred using the same PCA data, and participants’ patterns of ancestry and admixture were compared to their self-identified race and ethnicity (Fig. 2 and Methods ). Continuous ancestry inference carried out using genome-wide genotypes yields highly concordant estimates.
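The project-then-classify procedure described above can be sketched in a few lines. The example assumes that principal axes (loadings) and per-SNP means have already been estimated from the reference panel, and it uses a nearest-centroid rule as a stand-in for whatever classifier was actually trained, which the text does not specify. All names, labels and numbers are hypothetical, and the toy space has 2 dimensions and 3 SNPs rather than 16 dimensions and 151,159 SNPs.

```python
import math
from typing import Dict, List, Sequence


def project(genotypes: Sequence[float], means: Sequence[float],
            loadings: Sequence[Sequence[float]]) -> List[float]:
    """Project one sample onto principal axes fitted on the reference panel."""
    centered = [g - m for g, m in zip(genotypes, means)]
    return [sum(c * w for c, w in zip(centered, axis)) for axis in loadings]


def nearest_centroid(point: Sequence[float],
                     centroids: Dict[str, Sequence[float]]) -> str:
    """Assign the label whose reference centroid is closest in PC space."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical reference-derived quantities.
snp_means = [1.0, 1.0, 1.0]
pc_loadings = [[0.6, 0.8, 0.0],   # PC1
               [0.0, 0.0, 1.0]]   # PC2
reference_centroids = {"group_a": [1.5, -1.0], "group_b": [-1.0, 1.0]}

sample = [2, 2, 0]  # alternate-allele counts at the three SNPs
pcs = project(sample, snp_means, pc_loadings)
print(nearest_centroid(pcs, reference_centroids))  # group_a
```

The point of projecting into a space fitted on the reference data alone is that the new samples cannot shift the axes on which the classifier was trained.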

Figure 2

a, b, Uniform manifold approximation and projection (UMAP) representations of All of Us WGS PCA data with self-described race (a) and ethnicity (b) labels. c, Proportion of genetic ancestry per individual in six distinct and coherent ancestry groups defined by Human Genome Diversity Project and 1000 Genomes samples.

Kinship estimation confirmed that All of Us WGS data consist largely of unrelated individuals with about 85% (215,107) having no first- or second-degree relatives in the dataset (Supplementary Fig. 6 ). As many genomic analyses leverage unrelated individuals, we identified the smallest set of samples that are required to be removed from the remaining individuals that had first- or second-degree relatives and retained one individual from each kindred. This procedure yielded a maximal independent set of 231,442 individuals (about 94%) with genome sequence data in the current release ( Methods ).
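Selecting a set of mutually unrelated samples can be viewed as finding an independent set in a graph whose edges join first- or second-degree relatives. The sketch below uses a simple greedy heuristic that repeatedly drops the sample with the most remaining relatives. This is an illustrative stand-in, not the procedure from the Methods, and a greedy pass is not guaranteed to return the largest possible set. Sample identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def unrelated_subset(samples: Iterable[str],
                     related_pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Greedily remove samples until no related pair remains."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in related_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    kept = set(samples)
    while True:
        # Sample with the most relatives still kept; ties broken by name
        # so that the result is deterministic.
        worst = max(kept, key=lambda s: (len(neighbours[s] & kept), s),
                    default=None)
        if worst is None or not (neighbours[worst] & kept):
            return kept
        kept.remove(worst)


# Hypothetical cohort: B is related to A and C, D is related to E, F to nobody.
cohort = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(unrelated_subset(cohort, pairs)))  # ['A', 'C', 'D', 'F']
```

Removing B first resolves two relationships at the cost of one sample, which is why the most-connected sample is the natural first candidate for removal.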

Genetic determinants of LDL-C

As a measure of data quality and utility, we carried out a single-variant genome-wide association study (GWAS) for LDL-C, a trait with well-established genomic architecture ( Methods ). Of the 245,388 WGS participants, 91,749 had one or more LDL-C measurements. The All of Us LDL-C GWAS identified 20 well-established genome-wide significant loci, with minimal genomic inflation (Fig. 3 , Extended Data Table 3 and Supplementary Fig. 7 ). We compared the results to those of a recent multi-ethnic LDL-C GWAS in the National Heart, Lung, and Blood Institute (NHLBI) TOPMed study that included 66,329 ancestrally diverse (56% non-European ancestry) individuals 19 . We found a strong correlation between the effect estimates for NHLBI TOPMed genome-wide significant loci and those of All of Us (R² = 0.98, P < 1.61 × 10⁻⁴⁵; Fig. 3 , inset). Notably, the per-locus effect sizes observed in All of Us are decreased compared to those in TOPMed, which is in part due to differences in the underlying statistical model, differences in the ancestral composition of these datasets and differences in laboratory value ascertainment between EHR-derived data and epidemiology studies. A companion manuscript extended this work to identify common and rare genetic associations for three diseases (atrial fibrillation, coronary artery disease and type 2 diabetes) and two quantitative traits (height and LDL-C) in the All of Us dataset and identified very high concordance with previous efforts across all of these diseases and traits 20 .
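A high correlation between two sets of effect estimates is compatible with one set being systematically smaller than the other, which is the pattern described above. The sketch below computes a squared Pearson correlation on toy effect estimates in which the second study's values are uniformly attenuated. Treating the reported statistic as a squared Pearson correlation is an assumption, and the numbers are hypothetical.

```python
from typing import Sequence


def pearson_r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-variant effect estimates (beta) from two studies; the
# second set is the first scaled by 0.8 to mimic attenuated effect sizes.
beta_study_1 = [0.10, -0.20, 0.30, 0.05, -0.15]
beta_study_2 = [0.8 * b for b in beta_study_1]

r2 = pearson_r_squared(beta_study_1, beta_study_2)
print(round(r2, 6))  # 1.0
```

Because correlation is insensitive to a uniform rescaling, the slope of the inset comparison, and not R² alone, is what reveals the smaller per-locus effects.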

Figure 3

Manhattan plot demonstrating robust replication of 20 well-established LDL-C genetic loci among 91,749 individuals with 1 or more LDL-C measurements. The red horizontal line denotes the genome-wide significance threshold of P = 5 × 10⁻⁸. Inset, effect estimate (β) comparison between NHLBI TOPMed LDL-C GWAS (x axis) and All of Us LDL-C GWAS (y axis) for the subset of 194 independent variants clumped (window 250 kb, r² 0.5) that reached genome-wide significance in NHLBI TOPMed.

Genotype-by-phenotype associations

As another measure of data quality and utility, we tested replication rates of previously reported phenotype–genotype associations in the five predicted genetic ancestry populations present in the Phenotype/Genotype Reference Map (PGRM): AFR, African ancestry; AMR, Latino/admixed American ancestry; EAS, East Asian ancestry; EUR, European ancestry; SAS, South Asian ancestry. The PGRM contains published associations in the GWAS catalogue in these ancestry populations that map to International Classification of Diseases-based phenotype codes 21 . This replication study specifically looked across 4,947 variants, calculating replication rates for powered associations in each ancestry population. The overall replication rates for associations powered at 80% were: 72.0% (18/25) in AFR, 100% (13/13) in AMR, 46.6% (7/15) in EAS, 74.9% (1,064/1,421) in EUR, and 100% (1/1) in SAS. With the exception of the EAS ancestry results, these powered replication rates are comparable to those of the published PGRM analysis, in which the replication rates of several single-site EHR-linked biobanks range from 76% to 85%. These results demonstrate the utility of the data. They also highlight opportunities for further work on understanding the specifics of the All of Us population and the potential contribution of gene–environment interactions to genotype–phenotype mapping, and they motivate the development of methods for multi-site EHR phenotype data extraction, harmonization and genetic association studies.
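Each replication rate above is the number of replicated associations divided by the number of associations that were adequately powered in that ancestry group. The sketch below recomputes the rates from the counts quoted in the text; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

# (replicated, powered) association counts quoted in the text.
replication_counts: Dict[str, Tuple[int, int]] = {
    "AFR": (18, 25),
    "AMR": (13, 13),
    "EAS": (7, 15),
    "EUR": (1064, 1421),
    "SAS": (1, 1),
}


def replication_rate(replicated: int, powered: int) -> float:
    """Percentage of powered associations that replicated."""
    return 100.0 * replicated / powered


rates = {
    ancestry: replication_rate(replicated, powered)
    for ancestry, (replicated, powered) in replication_counts.items()
}

for ancestry, (replicated, powered) in replication_counts.items():
    print(f"{ancestry}: {rates[ancestry]:.1f}% ({replicated}/{powered})")
```

The denominators outside EUR are small (a single powered association in SAS, for example), so those percentages carry far more uncertainty than the EUR figure.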

More broadly, the All of Us resource highlights the opportunities to identify genotype–phenotype associations that differ across diverse populations 22 . For example, the Duffy blood group locus ( ACKR1 ) is more prevalent in individuals of AFR ancestry and individuals of AMR ancestry than in individuals of EUR ancestry. Although the phenome-wide association study of this locus highlights the well-established association of the Duffy blood group with lower white blood cell counts both in individuals of AFR and AMR ancestry 23 , 24 , it also revealed genetic-ancestry-specific phenotype patterns, with minimal phenotypic associations in individuals of EAS ancestry and individuals of EUR ancestry (Fig. 4 and Extended Data Table 4 ). Conversely, rs9273363 in the HLA-DQB1 locus is associated with increased risk of type 1 diabetes 25 , 26 and diabetic complications across ancestries, but only associates with increased risk of coeliac disease in individuals of EUR ancestry (Extended Data Fig. 5 ). Similarly, the TCF7L2 locus 27 strongly associates with increased risk of type 2 diabetes and associated complications across several ancestries (Extended Data Fig. 6 ). Association testing results are available in Supplementary Dataset 1 .

Figure 4

Results of genetic-ancestry-stratified phenome-wide association analysis among unrelated individuals highlighting ancestry-specific disease associations across the four most common genetic ancestries of participants. Bonferroni-adjusted phenome-wide significance threshold (<2.88 × 10⁻⁵) is plotted as a red horizontal line. AFR (n = 34,037, minor allele fraction (MAF) 0.82); AMR (n = 28,901, MAF 0.10); EAS (n = 32,55, MAF 0.003); EUR (n = 101,613, MAF 0.007).

The cloud-based Researcher Workbench

All of Us genomic data are available in a secure, access-controlled cloud-based analysis environment: the All of Us Researcher Workbench. Unlike traditional data access models that require per-project approval, access in the Researcher Workbench is governed by a data passport model based on a researcher’s authenticated identity, institutional affiliation, and completion of self-service training and compliance attestation 28 . After gaining access, a researcher may create a new workspace at any time to conduct a study, provided that they comply with all Data Use Policies and self-declare their research purpose. This information is regularly audited and made accessible publicly on the All of Us Research Projects Directory. This streamlined access model is guided by the principles that: participants are research partners and maintaining their privacy and data security is paramount; their data should be made as accessible as possible for authorized researchers; and we should continually seek to remove unnecessary barriers to accessing and using All of Us data.

For researchers at institutions with an existing institutional data use agreement, access can be gained as soon as they complete the required verification and compliance steps. As of August 2023, 556 institutions have agreements in place, allowing more than 5,000 approved researchers to actively work on more than 4,400 projects. The median time for a researcher from initial registration to completion of these requirements is 28.6 h (10th percentile: 48 min, 90th percentile: 14.9 days), a fraction of the weeks to months it can take to assemble a project-specific application and have it reviewed by an access board with conventional access models.

Given that the size of the project’s phenotypic and genomic dataset is expected to reach 4.75 PB in 2023, the use of a central data store and cloud analysis tools will save funders an estimated US$16.5 million per year when compared to the typical approach of allowing researchers to download genomic data. Storing one copy per institution of this data at 556 registered institutions would cost about US$1.16 billion per year. By contrast, storing a central cloud copy costs about US$1.14 million per year, a 99.9% saving. Importantly, cloud infrastructure also democratizes data access particularly for researchers who do not have high-performance local compute resources.
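The percentage saving quoted above can be checked directly from the two annual storage figures. The sketch below does that arithmetic and also derives an implied cost per terabyte for the central copy. The per-terabyte figure is not stated in the text and assumes 1 PB = 1,000 TB.

```python
# Annual storage costs quoted in the text, in US dollars.
cost_copies_per_institution = 1.16e9   # one copy at each registered institution
cost_central_cloud_copy = 1.14e6       # a single central cloud copy

saving_fraction = 1 - cost_central_cloud_copy / cost_copies_per_institution
print(f"{saving_fraction:.1%}")  # 99.9%

# Derived quantity, assuming the 4.75 PB dataset and 1 PB = 1,000 TB.
dataset_terabytes = 4.75 * 1000
central_cost_per_tb_year = cost_central_cloud_copy / dataset_terabytes
print(round(central_cost_per_tb_year))  # 240
```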

Here we present the All of Us Research Program’s approach to generating diverse clinical-grade genomic data at an unprecedented scale. We present the data release of about 245,000 genome sequences as part of a scalable framework that will grow to include genetic information and health data for one million or more people living across the USA. Our observations permit several conclusions.

First, the All of Us programme is making a notable contribution to improving the study of human biology through purposeful inclusion of under-represented individuals at scale 29 , 30 . Of the participants with genomic data in All of Us, 45.92% self-identified as a non-European race or ethnicity. This diversity enabled identification of more than 275 million new genetic variants across the dataset not previously captured by other large-scale genome aggregation efforts with diverse participants that have submitted variation to dbSNP v153, such as NHLBI TOPMed 31 freeze 8 (Extended Data Table 1 ). In contrast to gnomAD, All of Us permits individual-level genotype access with detailed phenotype data for all participants. Furthermore, unlike many genomics resources, All of Us is uniformly consented for general research use and enables researchers to go from initial account creation to individual-level data access in as little as a few hours. The All of Us cohort is significantly more diverse than those of other large contemporary research studies generating WGS data 32 , 33 . This enables a more equitable future for precision medicine (for example, through constructing polygenic risk scores that are appropriately calibrated to diverse populations 34 , 35 as the eMERGE programme has done leveraging All of Us data 36 , 37 ). Developing new tools and regulatory frameworks to enable analyses across multiple biobanks in the cloud to harness the unique strengths of each is an active area of investigation addressed in a companion paper to this work 38 .

Second, the All of Us Researcher Workbench embodies the programme’s design philosophy of open science, reproducible research, equitable access and transparency to researchers and to research participants 26 . Importantly, for research studies, no group of data users should have privileged access to All of Us resources based on anything other than data protection criteria. Although the All of Us Researcher Workbench initially targeted onboarding US academic, health care and non-profit organizations, it has recently expanded to international researchers. We anticipate further genomic and phenotypic data releases at regular intervals with data available to all researcher communities. We also anticipate additional derived data and functionality to be made available, such as reference data, structural variants and a service for array imputation using the All of Us genomic data.

Third, All of Us enables studying human biology at an unprecedented scale. The programmatic goal of sequencing one million or more genomes has required harnessing the output of multiple sequencing centres. Previous work has focused on achieving functional equivalence in data processing and joint calling pipelines 39 . To achieve clinical-grade data equivalence, All of Us required protocol equivalence at both sequencing production level and data processing across the sequencing centres. Furthermore, previous work has demonstrated the value of joint calling at scale 10 , 18 . The new GVS framework developed by the All of Us programme enables joint calling at extreme scales (Code availability). Finally, the provision of data access through cloud-native tools enables scalable and secure access and analysis to researchers while simultaneously enabling the trust of research participants and transparency underlying the All of Us data passport access model.

The clinical-grade sequencing carried out by All of Us enables not only research, but also the return of value to participants through clinically relevant genetic results and health-related traits to those who opt-in to receiving this information. In the years ahead, we anticipate that this partnership with All of Us participants will enable researchers to move beyond large-scale genomic discovery to understanding the consequences of implementing genomic medicine at scale.

The All of Us cohort

All of Us aims to engage a longitudinal cohort of one million or more US participants, with a focus on including populations that have historically been under-represented in biomedical research. Details of the All of Us cohort have been described previously 5 . Briefly, the primary objective is to build a robust research resource that can facilitate the exploration of biological, clinical, social and environmental determinants of health and disease. The programme will collect and curate health-related data and biospecimens, and these data and biospecimens will be made broadly available for research uses. Health data are obtained through the electronic medical record and through participant surveys. Survey templates can be found on our public website: https://www.researchallofus.org/data-tools/survey-explorer/ . Adults 18 years and older who have the capacity to consent and reside in the USA or a US territory at present are eligible. Informed consent for all participants is conducted in person or through an eConsent platform that includes primary consent, HIPAA Authorization for Research use of EHRs and other external health data, and Consent for Return of Genomic Results. The protocol was reviewed by the Institutional Review Board (IRB) of the All of Us Research Program. The All of Us IRB follows the regulations and guidance of the NIH Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly.

Data accessibility through a ‘data passport’

Authorization for access to participant-level data in All of Us is based on a ‘data passport’ model, through which authorized researchers do not need IRB review for each research project. The data passport is required for gaining data access to the Researcher Workbench and for creating workspaces to carry out research projects using All of Us data. At present, data passports are authorized through a six-step process that includes affiliation with an institution that has signed a Data Use and Registration Agreement, account creation, identity verification, completion of ethics training, and attestation to a data user code of conduct. Results reported follow the All of Us Data and Statistics Dissemination Policy, which, to protect participant privacy, disallows disclosure of group counts under 20 without prior approval 40 .

At present, All of Us gathers EHR data from about 50 health care organizations that are funded to recruit and enrol participants as well as transfer EHR data for those participants who have consented to provide them. Data stewards at each provider organization harmonize their local data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, and then submit it to the All of Us Data and Research Center (DRC) so that it can be linked with other participant data and further curated for research use. OMOP is a common data model standardizing health information from disparate EHRs to common vocabularies and organized into tables according to data domains. EHR data are updated from the recruitment sites and sent to the DRC quarterly. Updated data releases to the research community occur approximately once a year. Supplementary Table 6 outlines the OMOP concepts collected by the DRC quarterly from the recruitment sites.

Biospecimen collection and processing

Participants who consented to participate in All of Us donated fresh whole blood (4 ml EDTA and 10 ml EDTA) as a primary source of DNA. The All of Us Biobank managed by the Mayo Clinic extracted DNA from 4 ml EDTA whole blood, and DNA was stored at −80 °C at an average concentration of 150 ng µl⁻¹. The buffy coat isolated from 10 ml EDTA whole blood has been used for extracting DNA in the case of initial extraction failure or absence of 4 ml EDTA whole blood. The Biobank plated 2.4 µg DNA with a concentration of 60 ng µl⁻¹ in duplicate for array and WGS samples. The samples are distributed to All of Us Genome Centers weekly, and a negative (empty well) control and National Institute of Standards and Technology controls are incorporated every two months for QC purposes.

Genome Center sample receipt, accession and QC

On receipt of DNA sample shipments, the All of Us Genome Centers carry out an inspection of the packaging and sample containers to ensure that sample integrity has not been compromised during transport and to verify that the sample containers correspond to the shipping manifest. QC of the submitted samples also includes DNA quantification, using routine procedures to confirm volume and concentration (Supplementary Table 7 ). Any issues or discrepancies are recorded, and affected samples are put on hold until resolved. Samples that meet quality thresholds are accessioned in the Laboratory Information Management System, and sample aliquots are prepared for library construction processing (for example, normalized with respect to concentration and volume).

WGS library construction, sequencing and primary data QC

The DNA sample is first sheared using a Covaris sonicator and is then size-selected using AMPure XP beads to restrict the range of library insert sizes. Using the PCR Free Kapa HyperPrep library construction kit, enzymatic steps are completed to repair the jagged ends of DNA fragments, add proper A-base segments, and ligate indexed adapter barcode sequences onto samples. Excess adaptors are removed using AMPure XP beads for a final clean-up. Libraries are quantified using quantitative PCR with the Illumina Kapa DNA Quantification Kit and then normalized and pooled for sequencing (Supplementary Table 7 ).

Pooled libraries are loaded on the Illumina NovaSeq 6000 instrument. The data from the initial sequencing run are used to QC individual libraries and to remove non-conforming samples from the pipeline. The data are also used to calibrate the pooling volume of each individual library and re-pool the libraries for additional NovaSeq sequencing to reach an average coverage of 30×.

After demultiplexing, WGS analysis occurs on the Illumina DRAGEN platform. The DRAGEN pipeline consists of highly optimized algorithms for mapping, aligning, sorting, duplicate marking and haplotype variant calling and makes use of platform features such as compression and BCL conversion. Alignment uses the GRCh38dh reference genome. QC data are collected at every stage of the analysis protocol, providing high-resolution metrics required to ensure data consistency for large-scale multiplexing. The DRAGEN pipeline produces a large number of metrics that cover lane, library, flow cell, barcode and sample-level metrics for all runs as well as assessing contamination and mapping quality. The All of Us Genome Centers use these metrics to determine pass or fail for each sample before submitting the CRAM files to the All of Us DRC. For mapping and variant calling, all Genome Centers have harmonized on a set of DRAGEN parameters, which ensures consistency in processing (Supplementary Table 2 ).

Every step through the WGS procedure is rigorously controlled by predefined QC measures. Various control mechanisms and acceptance criteria were established during WGS assay validation. Specific metrics for reviewing and releasing genome data are: mean coverage (threshold of ≥30×), genome coverage (threshold of ≥90% at 20×), coverage of hereditary disease risk genes (threshold of ≥95% at 20×), aligned Q30 bases (threshold of ≥8 × 10¹⁰), contamination (threshold of ≤1%) and concordance to independently processed array data.
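The release thresholds listed above can be expressed as a simple per-sample check. The following is a minimal sketch rather than the Genome Centers' actual implementation: the `WgsMetrics` container and its field names are hypothetical, and because the text gives no numeric threshold for array concordance, that check is represented as a boolean supplied by the caller.

```python
from dataclasses import dataclass
from typing import List

# Release thresholds as listed in the text.
MIN_MEAN_COVERAGE = 30.0      # mean coverage of at least 30x
MIN_GENOME_AT_20X = 0.90      # fraction of the genome covered at 20x
MIN_HDR_GENES_AT_20X = 0.95   # fraction of hereditary disease risk genes at 20x
MIN_ALIGNED_Q30_BASES = 8e10  # aligned Q30 bases
MAX_CONTAMINATION = 0.01      # contamination of at most 1%


@dataclass
class WgsMetrics:
    """Hypothetical per-sample metrics record (not a DRAGEN output format)."""
    mean_coverage: float
    genome_frac_at_20x: float
    hdr_gene_frac_at_20x: float
    aligned_q30_bases: float
    contamination: float
    array_concordant: bool  # assumption: concordance is decided upstream


def failed_checks(m: WgsMetrics) -> List[str]:
    """Return the names of the checks a sample fails (empty list means pass)."""
    checks = {
        "mean_coverage": m.mean_coverage >= MIN_MEAN_COVERAGE,
        "genome_coverage": m.genome_frac_at_20x >= MIN_GENOME_AT_20X,
        "hdr_gene_coverage": m.hdr_gene_frac_at_20x >= MIN_HDR_GENES_AT_20X,
        "aligned_q30_bases": m.aligned_q30_bases >= MIN_ALIGNED_Q30_BASES,
        "contamination": m.contamination <= MAX_CONTAMINATION,
        "array_concordance": m.array_concordant,
    }
    return [name for name, ok in checks.items() if not ok]


good = WgsMetrics(32.1, 0.93, 0.97, 9.1e10, 0.002, True)
contaminated = WgsMetrics(32.1, 0.93, 0.97, 9.1e10, 0.02, True)
print(failed_checks(good), failed_checks(contaminated))
```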

Array genotyping

Samples are processed for genotyping at three All of Us Genome Centers (Broad, Johns Hopkins University and University of Washington). DNA samples are received from the Biobank and the process is facilitated by the All of Us genomics workflow described above. All three centres used an identical array product, scanners, resource files and genotype calling software for array processing to reduce batch effects. Each centre has its own Laboratory Information Management System that manages workflow control, sample and reagent tracking, and centre-specific liquid handling robotics.

Samples are processed using the Illumina Global Diversity Array (GDA) with Illumina Infinium LCG chemistry using the automated protocol and scanned on Illumina iSCANs with Automated Array Loaders. Illumina IAAP software converts raw data (IDAT files; 2 per sample) into a single GTC file per sample using the BPM file (defines strand, probe sequences and illumicode address) and the EGT file (defines the relationship between intensities and genotype calls). Files used for this data release are: GDA-8v1-0_A5.bpm, GDA-8v1-0_A1_ClusterFile.egt, gentrain v3, reference hg19 and gencall cutoff 0.15. The GDA array assays a total of 1,914,935 variant positions including 1,790,654 single-nucleotide variants, 44,172 indels, 9,935 intensity-only probes for CNV calling, and 70,174 duplicates (same position, different probes). Picard GtcToVcf is used to convert the GTC files to VCF format. Resulting VCF and IDAT files are submitted to the DRC for ingestion and further processing. The VCF file contains assay name, chromosome, position, genotype calls, quality score, raw and normalized intensities, B allele frequency and log R ratio values. Each genome centre is running the GDA array under Clinical Laboratory Improvement Amendments-compliant protocols. The GTC files are parsed and metrics are uploaded to in-house Laboratory Information Management Systems for QC review.

At batch level (each set of 96-well plates run together in the laboratory at one time), each genome centre includes positive control samples that are required to have >98% call rate and >99% concordance to existing data to approve release of the batch of data. At the sample level, the call rate and sex are the key QC determinants 41 . Contamination is also measured using BAFRegress 42 and reported out as metadata. Any sample with a call rate below 98% is repeated one time in the laboratory. Genotyped sex is determined by plotting normalized x versus normalized y intensity values for a batch of samples. Any sample discordant with ‘sex at birth’ reported by the All of Us participant is flagged for further detailed review and repeated one time in the laboratory. If several sex-discordant samples are clustered on an array or on a 96-well plate, the entire array or plate will have data production repeated. Samples identified with sex chromosome aneuploidies are also reported back as metadata (XXX, XXY, XYY and so on). A final processing status of ‘pass’, ‘fail’ or ‘abandon’ is determined before release of data to the All of Us DRC. An array sample will pass if the call rate is >98% and the genotyped sex and sex at birth are concordant (or the sex at birth is not applicable). An array sample will fail if the genotyped sex and the sex at birth are discordant. An array sample will have the status of abandon if the call rate is <98% after at least two attempts at the genome centre.
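The sample-level status rules in the paragraph above can be summarized in a few lines. This is an illustrative sketch only: the function name, the `pending` state for samples still awaiting a repeat, and the precedence of a low call rate over the sex check are our assumptions, and the text does not say how a call rate of exactly 98% is handled (here it is treated as not passing).

```python
from typing import Optional

CALL_RATE_MIN = 0.98  # pass requires a call rate strictly above 98%


def array_sample_status(call_rate: float,
                        genotyped_sex: str,
                        sex_at_birth: Optional[str],
                        attempts: int) -> str:
    """Return 'pass', 'fail', 'abandon' or the hypothetical 'pending'.

    sex_at_birth is None when the reported value is not applicable.
    """
    if call_rate > CALL_RATE_MIN:
        # Concordant (or not applicable) sex passes; discordant sex fails.
        if sex_at_birth is None or genotyped_sex == sex_at_birth:
            return "pass"
        return "fail"
    # Low call rate: abandoned only after at least two laboratory attempts.
    if attempts >= 2:
        return "abandon"
    return "pending"  # assumption: the sample is queued for its repeat


print(array_sample_status(0.995, "F", "F", 1))
print(array_sample_status(0.97, "M", "M", 2))
```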

Data from the arrays are used for participant return of genetic ancestry and non-health-related traits for those who consent, and they are also used to facilitate additional QC of the matched WGS data. Contamination is assessed in the array data to determine whether DNA re-extraction is required before WGS. Re-extraction is prompted by level of contamination combined with consent status for return of results. The arrays are also used to confirm sample identity between the WGS data and the matched array data by assessing concordance at 100 unique sites. To establish concordance, a fingerprint file of these 100 sites is provided to the Genome Centers to assess concordance with the same sites in the WGS data before CRAM submission.

Genomic data curation

As seen in Extended Data Fig. 2 , we generate a joint call set for all WGS samples and make these data available in their entirety and by sample subsets to researchers. A breakdown of the frequencies, stratified by computed ancestries for which we had more than 10,000 participants can be found in Extended Data Fig. 3 . The joint call set process allows us to leverage information across samples to improve QC and increase accuracy.

Single-sample QC

If a sample fails single-sample QC, it is excluded from the release and is not reported in this document. These tests detect sample swaps, cross-individual contamination and sample preparation errors. In some cases, we carry out these tests twice (at both the Genome Center and the DRC), for two reasons: to confirm internal consistency between sites; and to mark samples as passing (or failing) QC on the basis of the research pipeline criteria. The single-sample QC process accepts a higher contamination rate than the clinical pipeline (0.03 for the research pipeline versus 0.01 for the clinical pipeline), but otherwise uses identical thresholds. The list of specific QC processes, passing criteria, error modes addressed and an overview of the results can be found in Supplementary Table 3 .

Joint call set QC

During joint calling, we carry out additional QC steps using information that is available across samples including hard thresholds, population outliers, allele-specific filters, and sensitivity and precision evaluation. Supplementary Table 4 summarizes both the steps that we took and the results obtained for the WGS data. More detailed information about the methods and specific parameters can be found in the All of Us Genomic Research Data Quality Report 36 .

Batch effect analysis

We analysed cross-sequencing centre batch effects in the joint call set. To quantify the batch effect, we calculated Cohen’s d (ref.  43 ) for four metrics (insertion/deletion ratio, single-nucleotide polymorphism count, indel count and single-nucleotide polymorphism transition/transversion ratio) across the three genome sequencing centres (Baylor College of Medicine, Broad Institute and University of Washington), stratified by computed ancestry and seven regions of the genome (whole genome, high-confidence calling, repetitive, GC content of >0.85, GC content of <0.15, low mappability, the ACMG59 genes and regions of large duplications (>1 kb)). Using random batches as a control set, all comparisons had a Cohen’s d of <0.35. Here we report any Cohen’s d results >0.5, which we chose before this analysis and is conventionally the threshold of a medium effect size 44 .
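For readers unfamiliar with the statistic, Cohen's d is the difference between two group means divided by a pooled standard deviation. The sketch below uses the common pooled-variance form; the text does not state which variant was used, so this should be read as an assumption, and the function names are ours.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence

MEDIUM_EFFECT = 0.5  # reporting threshold chosen in the text


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample variance of the two groups."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def is_reportable(d: float) -> bool:
    """True when the absolute effect size exceeds the 0.5 threshold."""
    return abs(d) > MEDIUM_EFFECT


# Toy per-sample metric values from two hypothetical centres.
centre_a = [2.0, 4.0, 6.0]
centre_b = [1.0, 3.0, 5.0]
d = cohens_d(centre_a, centre_b)
print(d, is_reportable(d))
```

In practice each call would compare one metric (for example, indel count) between two centres within one ancestry group and one genomic region.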

We found that there was an effect size in indel counts (Cohen’s d of 0.53) in the entire genome, between Broad Institute and University of Washington, but this was being driven by repetitive and low-mappability regions. We found no batch effects with Cohen’s d of >0.5 in the ratio metrics or in any metrics in the high-confidence calling, low or high GC content, or ACMG59 regions. A complete list of the batch effects with Cohen’s d of >0.5 is found in Supplementary Table 8 .

Sensitivity and precision evaluation

To determine sensitivity and precision, we included four well-characterized control samples: the National Institute of Standards and Technology Genome in a Bottle samples HG-001, HG-003, HG-004 and HG-005. The samples were sequenced with the same protocol as All of Us. Of note, these samples were not included in data released to researchers. We used the corresponding published set of variant calls for each sample as the ground truth in our sensitivity and precision calculations. We use the high-confidence calling region, defined by Genome in a Bottle v4.2.1, as the source of ground truth. To be called a true positive, a variant must match the chromosome, position, reference allele, alternate allele and zygosity. In cases of sites with multiple alternative alleles, each alternative allele is considered separately. Sensitivity and precision results are reported in Supplementary Table 5 .
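The matching rule described above can be illustrated with set operations on variant records. This is a minimal sketch under stated assumptions: the `Variant` type and function name are hypothetical, both inputs are assumed to be already restricted to the high-confidence region, and multi-allelic sites are assumed to have been split into one record per alternative allele.

```python
from typing import Iterable, NamedTuple, Tuple


class Variant(NamedTuple):
    """One alternative allele at one site; all five fields must match."""
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str  # for example 'het' or 'hom'


def sensitivity_precision(calls: Iterable[Variant],
                          truth: Iterable[Variant]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of calls against the truth set."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # exact five-field matches
    fp = len(call_set - truth_set)   # called but absent from truth
    fn = len(truth_set - call_set)   # in truth but not called
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, precision


truth = [
    Variant("chr1", 100, "A", "G", "het"),
    Variant("chr1", 200, "C", "T", "hom"),
    Variant("chr2", 300, "G", "A", "het"),
    Variant("chr2", 400, "T", "C", "het"),
]
calls = truth[:3] + [Variant("chr2", 400, "T", "C", "hom")]  # wrong zygosity
sens, prec = sensitivity_precision(calls, truth)
print(sens, prec)
```

In the toy example the zygosity mismatch at chr2:400 counts as both a false positive and a false negative, which is why both values fall to 0.75.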

Genetic ancestry inference

We computed categorical ancestry for all WGS samples in All of Us and made these available to researchers. These predictions are also the basis for population allele frequency calculations in the Genomic Variants section of the public Data Browser. We used the high-quality set of sites to determine an ancestry label for each sample. The ancestry categories are based on the same labels used in gnomAD 18 , the Human Genome Diversity Project (HGDP) 45 and 1000 Genomes 1 : African (AFR); Latino/admixed American (AMR); East Asian (EAS); Middle Eastern (MID); European (EUR), composed of Finnish (FIN) and Non-Finnish European (NFE); Other (OTH), not belonging to one of the other ancestries or is an admixture; South Asian (SAS).

We trained a random forest classifier 46 on a training set of the HGDP and 1000 Genomes samples variants on the autosome, obtained from gnomAD 11 . We generated the first 16 principal components (PCs) of the training sample genotypes (using the hwe_normalized_pca in Hail) at the high-quality variant sites for use as the feature vector for each training sample. We used the truth labels from the sample metadata, which can be found alongside the VCFs. Note that we do not train the classifier on the samples labelled as Other. We use the label probabilities (‘confidence’) of the classifier on the other ancestries to determine ancestry of Other.
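The final step, turning classifier label probabilities into an ancestry label with Other as the fallback, might look like the sketch below. The confidence cut-off is not stated in this section, so the value used here is purely a placeholder; the function name and the lower-case label strings are also hypothetical.

```python
from typing import Mapping

# Placeholder only: the actual confidence cut-off is not given in this section.
MIN_CONFIDENCE = 0.75


def assign_ancestry(label_probs: Mapping[str, float],
                    min_confidence: float = MIN_CONFIDENCE) -> str:
    """Pick the most probable trained label, or 'oth' if confidence is low.

    label_probs maps each trained ancestry label to the classifier's
    probability for one sample; 'oth' is never a trained label.
    """
    best_label, best_prob = max(label_probs.items(), key=lambda kv: kv[1])
    return best_label if best_prob >= min_confidence else "oth"


confident = {"afr": 0.92, "amr": 0.03, "eas": 0.01, "eur": 0.02, "mid": 0.01, "sas": 0.01}
ambiguous = {"afr": 0.40, "amr": 0.35, "eas": 0.05, "eur": 0.10, "mid": 0.05, "sas": 0.05}
print(assign_ancestry(confident), assign_ancestry(ambiguous))
```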

To determine the ancestry of All of Us samples, we project the All of Us samples into the PCA space of the training data and apply the classifier. As a proxy for the accuracy of our All of Us predictions, we look at the concordance between the survey results and the predicted ancestry. The concordance between self-reported ethnicity and the ancestry predictions was 87.7%.

PC data from All of Us samples and the HGDP and 1000 Genomes samples were used to compute individual participant genetic ancestry fractions for All of Us samples using the Rye program. Rye uses PC data to carry out rapid and accurate genetic ancestry inference on biobank-scale datasets 47 . HGDP and 1000 Genomes reference samples were used to define a set of six distinct and coherent ancestry groups—African, East Asian, European, Middle Eastern, Latino/admixed American and South Asian—corresponding to participant self-identified race and ethnicity groups. Rye was run on the first 16 PCs, using the defined reference ancestry groups to assign ancestry group fractions to individual All of Us participant samples.

Relatedness

We calculated the kinship score using the Hail pc_relate function and reported any pairs with a kinship score above 0.1. The kinship score is half of the fraction of the genetic material shared (ranges from 0.0 to 0.5). We determined the maximal independent set 41 for related samples. Treating pairs with a kinship score greater than 0.1 as related, we identified a maximally unrelated set of 231,442 samples (94%).
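To make the idea of a maximal independent set concrete, the sketch below builds a relatedness graph from kinship pairs and greedily selects samples with the fewest remaining relatives. It is a simple illustrative heuristic, not the procedure used for the release; the function name and toy data are ours, and the quadratic loop would not scale to a biobank-sized cohort.

```python
from typing import Dict, Iterable, Set, Tuple

KINSHIP_THRESHOLD = 0.1  # pairs above this score are treated as related


def unrelated_subset(samples: Iterable[str],
                     kinship_pairs: Iterable[Tuple[str, str, float]],
                     threshold: float = KINSHIP_THRESHOLD) -> Set[str]:
    """Greedy maximal independent set on the relatedness graph.

    Every sample named in kinship_pairs is assumed to appear in samples.
    """
    relatives: Dict[str, Set[str]] = {s: set() for s in samples}
    for a, b, kinship in kinship_pairs:
        if kinship > threshold:
            relatives[a].add(b)
            relatives[b].add(a)

    remaining = set(relatives)
    keep: Set[str] = set()
    while remaining:
        # Choose the sample with the fewest remaining relatives;
        # ties are broken by sample ID so the result is deterministic.
        s = min(remaining, key=lambda x: (len(relatives[x] & remaining), x))
        keep.add(s)
        # Drop the chosen sample and all of its relatives from consideration.
        remaining -= relatives[s] | {s}
    return keep


samples = ["A", "B", "C", "D"]
pairs = [("A", "B", 0.25), ("B", "C", 0.25), ("C", "D", 0.05)]
kept = unrelated_subset(samples, pairs)
print(sorted(kept))
```

Because each discarded sample is a relative of some retained sample, no discarded sample could be added back without introducing a related pair, which is what makes the set maximal (though not necessarily the largest possible).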

LDL-C common variant GWAS

The phenotypic data were extracted from the Curated Data Repository (CDR, Control Tier Dataset v7) in the All of Us Researcher Workbench. The All of Us Cohort Builder and Dataset Builder were used to extract all LDL cholesterol measurements from the Lab and Measurements criteria in EHR data for all participants who have WGS data. The most recent measurements were selected as the phenotype and adjusted for statin use 19 , age and sex. A rank-based inverse normal transformation was applied for this continuous trait to increase power and deflate type I error. Analysis was carried out on the Hail MatrixTable representation of the All of Us WGS joint-called data including removing monomorphic variants, variants with a call rate of <95% and variants with extreme Hardy–Weinberg equilibrium values (P < 10⁻¹⁵). A linear regression was carried out with REGENIE 48 on variants with a minor allele frequency >5%, further adjusting for relatedness to the first five ancestry PCs. The final analysis included 34,924 participants and 8,589,520 variants.
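A rank-based inverse normal transformation replaces each phenotype value with the standard normal quantile of its rank. The sketch below is one common formulation using only the Python standard library; the rank offset `c` is an assumption (the text does not state which offset was used), and the function name is ours.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_inverse_normal(values: Sequence[float], c: float = 0.5) -> List[float]:
    """Map values to normal quantiles of their ranks (ties share the mean rank).

    c is the rank offset: 0.5 here, 3/8 gives the Blom variant.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based ranks in the run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


ldl = [130.0, 95.0, 110.0]  # toy values, not study data
z = rank_inverse_normal(ldl)
print(z)
```

The transformed values preserve the ordering of the inputs while forcing an approximately normal marginal distribution, which is why the method is widely used for skewed quantitative traits.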

Genotype-by-phenotype replication

We tested replication rates of known phenotype–genotype associations in three of the four largest populations: EUR, AFR and EAS. The AMR population was not included because they have no registered GWAS. This method is a conceptual extension of the original GWAS × phenome-wide association study, which replicated 66% of powered associations in a single EHR-linked biobank 49 . The PGRM is an expansion of this work by Bastarache et al., based on associations in the GWAS catalogue 50 in June 2020 (ref.  51 ). After directly matching the Experimental Factor Ontology terms to phecodes, the authors identified 8,085 unique loci and 170 unique phecodes that compose the PGRM. They showed replication rates in several EHR-linked biobanks ranging from 76% to 85%. For this analysis, we used the EUR- and AFR-based maps, considering only catalogue associations that were significant at P < 5 × 10⁻⁸.

The main tools used were the Python package Hail for data extraction, plink for genomic associations, and the R packages PheWAS and pgrm for further analysis and visualization. The phenotypes, participant-reported sex at birth, and year of birth were extracted from the All of Us CDR (Controlled Tier Dataset v7). These phenotypes were then loaded into a plink-compatible format using the PheWAS package, and related samples were removed by sub-setting to the maximally unrelated dataset (n = 231,442). Only samples with EHR data were kept, filtered by selected loci, annotated with demographic and phenotypic information extracted from the CDR and ancestry prediction information provided by All of Us, ultimately resulting in 181,345 participants for downstream analysis. The variants in the PGRM were filtered by a minimum population-specific allele frequency of >1% or population-specific allele count of >100, leaving 4,986 variants. Results for which there were at least 20 cases in the ancestry group were included. Then, a series of Firth logistic regression tests with phecodes as the outcome and variants as the predictor were carried out, adjusting for age, sex (for non-sex-specific phenotypes) and the first three genomic PC features as covariates. The PGRM was annotated with power calculations based on the case counts and reported allele frequencies. Power of 80% or greater was considered powered for this analysis.
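The inclusion criteria in the paragraph above (allele frequency or allele count, minimum case count, and the 80% power cut-off) can be sketched as simple predicates. The `Association` record and its field names are hypothetical and do not reflect the pgrm package's data structures; the power value is assumed to have been computed upstream.

```python
from dataclasses import dataclass
from typing import List

MIN_ALLELE_FREQ = 0.01  # population-specific allele frequency must exceed 1%
MIN_ALLELE_COUNT = 100  # or population-specific allele count must exceed 100
MIN_CASES = 20          # at least 20 cases in the ancestry group
MIN_POWER = 0.80        # power of 80% or greater counts as powered


@dataclass
class Association:
    """Hypothetical record for one variant-phecode pair in one ancestry group."""
    variant_id: str
    allele_freq: float
    allele_count: int
    n_cases: int
    power: float  # assumption: computed upstream from case counts and frequencies


def is_tested(a: Association) -> bool:
    """Variant passes the frequency/count filter and has enough cases."""
    frequent_enough = a.allele_freq > MIN_ALLELE_FREQ or a.allele_count > MIN_ALLELE_COUNT
    return frequent_enough and a.n_cases >= MIN_CASES


def is_powered(a: Association) -> bool:
    """Association is tested and meets the 80% power cut-off."""
    return is_tested(a) and a.power >= MIN_POWER


candidates: List[Association] = [
    Association("var1", 0.20, 5000, 300, 0.95),
    Association("var2", 0.005, 150, 40, 0.60),  # rare, but allele count > 100
    Association("var3", 0.004, 80, 500, 0.90),  # fails frequency and count
    Association("var4", 0.10, 2000, 12, 0.85),  # too few cases
]
tested = [a.variant_id for a in candidates if is_tested(a)]
powered = [a.variant_id for a in candidates if is_powered(a)]
print(tested, powered)
```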

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The All of Us Research Hub has a tiered data access data passport model with three data access tiers. The Public Tier dataset contains only aggregate data with identifiers removed. These data are available to the public through Data Snapshots ( https://www.researchallofus.org/data-tools/data-snapshots/ ) and the public Data Browser ( https://databrowser.researchallofus.org/ ). The Registered Tier curated dataset contains individual-level data, available only to approved researchers on the Researcher Workbench. At present, the Registered Tier includes data from EHRs, wearables and surveys, as well as physical measurements taken at the time of participant enrolment. The Controlled Tier dataset contains all data in the Registered Tier and additionally genomic data in the form of WGS and genotyping arrays, previously suppressed demographic data fields from EHRs and surveys, and unshifted dates of events. At present, Registered Tier and Controlled Tier data are available to researchers at academic institutions, non-profit institutions, and both non-profit and for-profit health care institutions. Work is underway to begin extending access to additional audiences, including industry-affiliated researchers. Researchers have the option to register for Registered Tier and/or Controlled Tier access by completing the All of Us Researcher Workbench access process, which includes identity verification and All of Us-specific training in research involving human participants ( https://www.researchallofus.org/register/ ). Researchers may create a new workspace at any time to conduct any research study, provided that they comply with all Data Use Policies and self-declare their research purpose. This information is made accessible publicly on the All of Us Research Projects Directory at https://allofus.nih.gov/protecting-data-and-privacy/research-projects-all-us-data .

Code availability

The GVS code is available at https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore . The LDL GWAS pipeline is available as a demonstration project in the Featured Workspace Library on the Researcher Workbench ( https://workbench.researchallofus.org/workspaces/aou-rw-5981f9dc/aouldlgwasregeniedsubctv6duplicate/notebooks ).

The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526 , 68–74 (2015).

Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577 , 179–189 (2020).

Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570 , 514–518 (2019).

Lewis, A. C. F. et al. Getting genetic ancestry right for science and society. Science 376 , 250–252 (2022).

All of Us Program Investigators. The “All of Us” Research Program. N. Engl. J. Med. 381 , 668–676 (2019).

Ramirez, A. H., Gebo, K. A. & Harris, P. A. Progress with the All of Us Research Program: opening access for researchers. JAMA 325 , 2441–2442 (2021).

Ramirez, A. H. et al. The All of Us Research Program: data quality, utility, and diversity. Patterns 3 , 100570 (2022).

Overhage, J. M., Ryan, P. B., Reich, C. G., Hartzema, A. G. & Stang, P. E. Validation of a common data model for active safety surveillance research. J. Am. Med. Inform. Assoc. 19 , 54–60 (2012).

Venner, E. et al. Whole-genome sequencing as an investigational device for return of hereditary disease risk and pharmacogenomic results as part of the All of Us Research Program. Genome Med. 14 , 34 (2022).

Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536 , 285–291 (2016).

Tiao, G. & Goodrich, J. gnomAD v3.1 New Content, Methods, Annotations, and Data Availability ; https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/ .

Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625 , 92–100 (2022).

Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37 , 561–566 (2019).

Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37 , 555–560 (2019).

Stromberg, M. et al. Nirvana: clinical grade variant annotator. In Proc. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 596 (Association for Computing Machinery, 2017).

Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29 , 308–311 (2001).

Venner, E. et al. The frequency of pathogenic variation in the All of Us cohort reveals ancestry-driven disparities. Commun. Biol. https://doi.org/10.1038/s42003-023-05708-y (2024).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Selvaraj, M. S. et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat. Commun. 13 , 5995 (2022).

Wang, X. et al. Common and rare variants associated with cardiometabolic traits across 98,622 whole-genome sequences in the All of Us research program. J. Hum. Genet. 68 , 565–570 (2023).

Bastarache, L. et al. The phenotype-genotype reference map: improving biobank data science through replication. Am. J. Hum. Genet. 110 , 1522–1533 (2023).

Bianchi, D. W. et al. The All of Us Research Program is an opportunity to enhance the diversity of US biomedical research. Nat. Med. https://doi.org/10.1038/s41591-023-02744-3 (2024).

Van Driest, S. L. et al. Association between a common, benign genotype and unnecessary bone marrow biopsies among African American patients. JAMA Intern. Med. 181 , 1100–1105 (2021).

Chen, M.-H. et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell 182 , 1198–1213 (2020).

Chiou, J. et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 594 , 398–402 (2021).

Hu, X. et al. Additive and interaction effects at three amino acid positions in HLA-DQ and HLA-DR molecules drive type 1 diabetes risk. Nat. Genet. 47 , 898–905 (2015).

Grant, S. F. A. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38 , 320–323 (2006).

All of Us Research Program. Framework for Access to All of Us Data Resources v1.1 (2021); https://www.researchallofus.org/wp-content/themes/research-hub-wordpress-theme/media/data&tools/data-access-use/AoU_Data_Access_Framework_508.pdf .

Abul-Husn, N. S. & Kenny, E. E. Personalized medicine and the power of electronic health records. Cell 177 , 58–69 (2019).

Mapes, B. M. et al. Diversity and inclusion for the All of Us research program: A scoping review. PLoS ONE 15 , e0234962 (2020).

Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590 , 290–299 (2021).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607 , 732–740 (2022).

Kurniansyah, N. et al. Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups. Nat. Commun. 14 , 3202 (2023).

Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55 , 549– 558 (2022).

Linder, J. E. et al. Returning integrated genomic risk and clinical recommendations: the eMERGE study. Genet. Med. 25 , 100006 (2023).

Lennon, N. J. et al. Selection, optimization and validation of ten chronic disease polygenic risk scores for clinical implementation in diverse US populations. Nat. Med. https://doi.org/10.1038/s41591-024-02796-z (2024).

Deflaux, N. et al. Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis. Nat. Commun. 14 , 5419 (2023).

Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9 , 4038 (2018).

Article   ADS   PubMed   PubMed Central   Google Scholar  

All of Us Research Program. Data and Statistics Dissemination Policy (2020); https://www.researchallofus.org/wp-content/themes/research-hub-wordpress-theme/media/2020/05/AoU_Policy_Data_and_Statistics_Dissemination_508.pdf .

Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet. Epidemiol. 34 , 591–602 (2010).

Jun, G. et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am. J. Hum. Genet. 91 , 839–848 (2012).

Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013).

Andrade, C. Mean difference, standardized mean difference (SMD), and their use in meta-analysis. J. Clin. Psychiatry 81 , 20f13681 (2020).

Cavalli-Sforza, L. L. The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6 , 333–340 (2005).

Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition (IEEE Computer Society Press, 2002).

Conley, A. B. et al. Rye: genetic ancestry inference at biobank scale. Nucleic Acids Res. 51 , e44 (2023).

Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53 , 1097–1103 (2021).

Denny, J. C. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotech. 31 , 1102–1111 (2013).

Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 , D1005–D1012 (2019).

Bastarache, L. et al. The Phenotype-Genotype Reference Map: improving biobank data science through replication. Am. J. Hum. Genet. 10 , 1522–1533 (2023).


Acknowledgements

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers (OT2 OD026549; OT2 OD026554; OT2 OD026557; OT2 OD026556; OT2 OD026550; OT2 OD026552; OT2 OD026553; OT2 OD026548; OT2 OD026551; OT2 OD026555); Interagency agreement AOD 16037; Federally Qualified Health Centers HHSN 263201600085U; Data and Research Center: U2C OD023196; Genome Centers (OT2 OD002748; OT2 OD002750; OT2 OD002751); Biobank: U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: U24 OD023163; Communications and Engagement: OT2 OD023205; OT2 OD023206; and Community Partners (OT2 OD025277; OT2 OD025315; OT2 OD025337; OT2 OD025276). In addition, the All of Us Research Program would not be possible without the partnership of its participants. All of Us and the All of Us logo are service marks of the US Department of Health and Human Services. E.E.E. is an investigator of the Howard Hughes Medical Institute. We acknowledge the foundational contributions of our friend and colleague, the late Deborah A. Nickerson. Debbie’s years of insightful contributions throughout the formation of the All of Us genomics programme are permanently imprinted, and she shares credit for all of the successes of this programme.

Author information

Authors and affiliations.

Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

Alexander G. Bick & Henry R. Condon

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA

Ginger A. Metcalf, Eric Boerwinkle, Richard A. Gibbs, Donna M. Muzny, Eric Venner, Kimberly Walker, Jianhong Hu, Harsha Doddapaneni, Christie L. Kovar, Mullai Murugan, Shannon Dugan & Ziad Khan

Vanderbilt Institute of Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA

Kelsey R. Mayo, Jodell E. Linder, Melissa Basford, Ashley Able, Ashley E. Green, Robert J. Carroll, Jennifer Zhang & Yuanyuan Wang

Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Lee Lichtenstein, Anthony Philippakis, Sophie Schwartz, M. Morgan T. Aster, Kristian Cibulskis, Andrea Haessly, Rebecca Asch, Aurora Cremer, Kylee Degatano, Akum Shergill, Laura D. Gauthier, Samuel K. Lee, Aaron Hatcher, George B. Grant, Genevieve R. Brandt, Miguel Covarrubias, Eric Banks & Wail Baalawi

Verily, South San Francisco, CA, USA

Shimon Rura, David Glazer, Moira K. Dillon & C. H. Albach

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA

Robert J. Carroll, Paul A. Harris & Dan M. Roden

All of Us Research Program, National Institutes of Health, Bethesda, MD, USA

Anjene Musick, Andrea H. Ramirez, Sokny Lim, Siddhartha Nambiar, Bradley Ozenberger, Anastasia L. Wise, Chris Lunt, Geoffrey S. Ginsburg & Joshua C. Denny

School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA

I. King Jordan, Shashwat Deepali Nagar & Shivam Sharma

Neuroscience Institute, Institute of Translational Genomic Medicine, Morehouse School of Medicine, Atlanta, GA, USA

Robert Meller

Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA

Mine S. Cicek & Stephen N. Thibodeau

Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA

Kimberly F. Doheny, Michelle Z. Mawhinney, Sean M. L. Griffith, Elvin Hsu, Hua Ling & Marcia K. Adams

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA

Evan E. Eichler, Joshua D. Smith, Christian D. Frazar, Colleen P. Davis, Karynne E. Patterson, Marsha M. Wheeler, Sean McGee, Mitzi L. Murray, Valeria Vasta, Dru Leistritz, Matthew A. Richardson, Aparna Radhakrishnan & Brenna W. Ehmen

Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

Evan E. Eichler

Broad Institute of MIT and Harvard, Cambridge, MA, USA

Stacey Gabriel, Heidi L. Rehm, Niall J. Lennon, Christina Austin-Tse, Eric Banks, Michael Gatzen, Namrata Gupta, Katie Larsson, Sheli McDonough, Steven M. Harrison, Christopher Kachulis, Matthew S. Lebo, Seung Hoan Choi & Xin Wang

Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA, USA

Gail P. Jarvik & Elisabeth A. Rosenthal

Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

Dan M. Roden

Department of Pharmacology, Vanderbilt University Medical Center, Nashville, TN, USA

Center for Individualized Medicine, Biorepository Program, Mayo Clinic, Rochester, MN, USA

Stephen N. Thibodeau, Ashley L. Blegen, Samantha J. Wirkus, Victoria A. Wagner, Jeffrey G. Meyer & Mine S. Cicek

Color Health, Burlingame, CA, USA

Scott Topper, Cynthia L. Neben, Marcie Steeves & Alicia Y. Zhou

School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA

Eric Boerwinkle

Laboratory for Molecular Medicine, Massachusetts General Brigham Personalized Medicine, Cambridge, MA, USA

Christina Austin-Tse, Emma Henricks & Matthew S. Lebo

Department of Laboratory Medicine and Pathology, University of Washington School of Medicine, Seattle, WA, USA

Christina M. Lockwood, Brian H. Shirts, Colin C. Pritchard, Jillian G. Buchan & Niklas Krumm

Manuscript Writing Group

  • Alexander G. Bick
  • Ginger A. Metcalf
  • Kelsey R. Mayo
  • Lee Lichtenstein
  • Shimon Rura
  • Robert J. Carroll
  • Anjene Musick
  • Jodell E. Linder
  • I. King Jordan
  • Shashwat Deepali Nagar
  • Shivam Sharma
  • Robert Meller

All of Us Research Program Genomics Principal Investigators

  • Melissa Basford
  • Eric Boerwinkle
  • Mine S. Cicek
  • Kimberly F. Doheny
  • Evan E. Eichler
  • Stacey Gabriel
  • Richard A. Gibbs
  • David Glazer
  • Paul A. Harris
  • Gail P. Jarvik
  • Anthony Philippakis
  • Heidi L. Rehm
  • Dan M. Roden
  • Stephen N. Thibodeau
  • Scott Topper

Biobank, Mayo

  • Ashley L. Blegen
  • Samantha J. Wirkus
  • Victoria A. Wagner
  • Jeffrey G. Meyer
  • Stephen N. Thibodeau

Genome Center: Baylor-Hopkins Clinical Genome Center

  • Donna M. Muzny
  • Eric Venner
  • Michelle Z. Mawhinney
  • Sean M. L. Griffith
  • Elvin Hsu
  • Marcia K. Adams
  • Kimberly Walker
  • Jianhong Hu
  • Harsha Doddapaneni
  • Christie L. Kovar
  • Mullai Murugan
  • Shannon Dugan
  • Ziad Khan
  • Richard A. Gibbs

Genome Center: Broad, Color, and Mass General Brigham Laboratory for Molecular Medicine

  • Niall J. Lennon
  • Christina Austin-Tse
  • Eric Banks
  • Michael Gatzen
  • Namrata Gupta
  • Emma Henricks
  • Katie Larsson
  • Sheli McDonough
  • Steven M. Harrison
  • Christopher Kachulis
  • Matthew S. Lebo
  • Cynthia L. Neben
  • Marcie Steeves
  • Alicia Y. Zhou
  • Scott Topper
  • Stacey Gabriel

Genome Center: University of Washington

  • Gail P. Jarvik
  • Joshua D. Smith
  • Christian D. Frazar
  • Colleen P. Davis
  • Karynne E. Patterson
  • Marsha M. Wheeler
  • Sean McGee
  • Christina M. Lockwood
  • Brian H. Shirts
  • Colin C. Pritchard
  • Mitzi L. Murray
  • Valeria Vasta
  • Dru Leistritz
  • Matthew A. Richardson
  • Jillian G. Buchan
  • Aparna Radhakrishnan
  • Niklas Krumm
  • Brenna W. Ehmen

Data and Research Center

  • Lee Lichtenstein
  • Sophie Schwartz
  • M. Morgan T. Aster
  • Kristian Cibulskis
  • Andrea Haessly
  • Rebecca Asch
  • Aurora Cremer
  • Kylee Degatano
  • Akum Shergill
  • Laura D. Gauthier
  • Samuel K. Lee
  • Aaron Hatcher
  • George B. Grant
  • Genevieve R. Brandt
  • Miguel Covarrubias
  • Melissa Basford
  • Alexander G. Bick
  • Ashley Able
  • Ashley E. Green
  • Jennifer Zhang
  • Henry R. Condon
  • Yuanyuan Wang
  • Moira K. Dillon
  • C. H. Albach
  • Wail Baalawi
  • Dan M. Roden

All of Us Research Demonstration Project Teams

  • Seung Hoan Choi
  • Elisabeth A. Rosenthal

NIH All of Us Research Program Staff

  • Andrea H. Ramirez
  • Sokny Lim
  • Siddhartha Nambiar
  • Bradley Ozenberger
  • Anastasia L. Wise
  • Chris Lunt
  • Geoffrey S. Ginsburg
  • Joshua C. Denny

Contributions

The All of Us Biobank (Mayo Clinic) collected, stored and plated participant biospecimens. The All of Us Genome Centers (Baylor-Hopkins Clinical Genome Center; Broad, Color, and Mass General Brigham Laboratory for Molecular Medicine; and University of Washington School of Medicine) generated and QCed the whole-genomic data. The All of Us Data and Research Center (Vanderbilt University Medical Center, Broad Institute of MIT and Harvard, and Verily) generated the WGS joint call set, carried out quality assurance and QC analyses and developed the Researcher Workbench. All of Us Research Demonstration Project Teams contributed analyses. The other All of Us Genomics Investigators and NIH All of Us Research Program Staff provided crucial programmatic support. Members of the manuscript writing group (A.G.B., G.A.M., K.R.M., L.L., S.R., R.J.C. and A.M.) wrote the first draft of this manuscript, which was revised with contributions and feedback from all authors.

Corresponding author

Correspondence to Alexander G. Bick.

Ethics declarations

Competing interests.

D.M.M., G.A.M., E.V., K.W., J.H., H.D., C.L.K., M.M., S.D., Z.K., E. Boerwinkle and R.A.G. declare that Baylor Genetics is a Baylor College of Medicine affiliate that derives revenue from genetic testing. Eric Venner is affiliated with Codified Genomics, a provider of genetic interpretation. E.E.E. is a scientific advisory board member of Variant Bio, Inc. A.G.B. is a scientific advisory board member of TenSixteen Bio. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Timothy Frayling and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Historic availability of EHR records in the All of Us v7 Controlled Tier Curated Data Repository (n = 413,457).

For better visibility, the plot shows growth starting in 2010.

Extended Data Fig. 2 Overview of the Genomic Data Curation Pipeline for WGS samples.

The Data and Research Center (DRC) performs additional single sample quality control (QC) on the data as it arrives from the Genome Centers. The variants from samples that pass this QC are loaded into the Genomic Variant Store (GVS), where we jointly call the variants and apply additional QC. We apply a joint call set QC process, which is stored with the call set. The entire joint call set is rendered as a Hail Variant Dataset (VDS), which can be accessed from the analysis notebooks in the Researcher Workbench. Subsections of the genome are extracted from the VDS and rendered in different formats with all participants. Auxiliary data can also be accessed through the Researcher Workbench. This includes variant functional annotations, joint call set QC results, predicted ancestry, and relatedness. Auxiliary data are derived from GVS (arrow not shown) and the VDS. The Cohort Builder directly queries GVS when researchers request genomic data for subsets of samples. Aligned reads, as cram files, are available in the Researcher Workbench (not shown). The graphics of the dish, gene and computer and the All of Us logo are reproduced with permission of the National Institutes of Health’s All of Us Research Program.

Extended Data Fig. 3 Proportion of allelic frequencies (AF), stratified by computed ancestry with over 10,000 participants.

Bar counts are not cumulative (e.g., “pop AF < 0.01” does not include “pop AF < 0.001”).

Extended Data Fig. 4 Distribution of pathogenic and likely pathogenic ClinVar variants.

Stratified by ancestry and filtered to only those variants with allele count (AC) < 40 across the 245,388 short-read WGS samples.

Extended Data Fig. 5 Ancestry-specific HLA-DQB1 (rs9273363) locus associations in 231,442 unrelated individuals.

Phenome-wide (PheWAS) associations highlight ancestry-specific consequences across ancestries.

Extended Data Fig. 6 Ancestry-specific TCF7L2 (rs7903146) locus associations in 231,442 unrelated individuals.

Phenome-wide (PheWAS) associations highlight diabetic consequences across ancestries.

Supplementary information

Supplementary information.

Supplementary Figs. 1–7, Tables 1–8 and Note.

Reporting Summary

Supplementary dataset 1.

Associations of ACKR1, HLA-DQB1 and TCF7L2 loci with all Phecodes stratified by genetic ancestry.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article.

The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature (2024). https://doi.org/10.1038/s41586-023-06957-x


Received: 22 July 2022

Accepted: 08 December 2023

Published: 19 February 2024

DOI: https://doi.org/10.1038/s41586-023-06957-x

