What is a Randomized Control Trial (RCT)?

Julia Simkus

Editor at Simply Psychology

BA (Hons) Psychology, Princeton University


Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education



A randomized control trial (RCT) is a type of study design that involves randomly assigning participants to either an experimental group or a control group to measure the effectiveness of an intervention or treatment.

Randomized Controlled Trials (RCTs) are considered the “gold standard” in medical and health research due to their rigorous design.


Control Group

A control group consists of participants who do not receive the experimental treatment or intervention; instead, they receive a placebo or a reference (standard) treatment. The control participants serve as a comparison group.

The control group is matched as closely as possible to the experimental group on characteristics such as age, gender, social class, and ethnicity.

Because the participants are randomly assigned, the characteristics between the two groups should be balanced, enabling researchers to attribute any differences in outcome to the study intervention.

Since researchers can be confident that any differences between the control and treatment groups are due solely to the effects of the treatments, scientists view RCTs as the gold standard for clinical trials.

Random Allocation

Random allocation and random assignment are terms used interchangeably in the context of a randomized controlled trial (RCT).

Both refer to assigning participants to different groups in a study (such as a treatment group or a control group) in a way that is completely determined by chance.

The process of random assignment controls for confounding variables, ensuring that any baseline differences between the groups are due to chance alone.

Without randomization, researchers might consciously or subconsciously assign patients to a particular group for various reasons.

Several methods can be used for randomization in a randomized controlled trial (RCT). Here are a few examples, followed by a short code sketch of the simplest case:

  • Simple Randomization: This is the simplest method, like flipping a coin. Each participant has an equal chance of being assigned to any group. This can be achieved using random number tables, computerized random number generators, or drawing lots or envelopes.
  • Block Randomization: In this method, participants are randomized within blocks, ensuring that each block has an equal number of participants in each group. This helps to balance the number of participants in each group at any given time during the study.
  • Stratified Randomization: This method is used when researchers want to ensure that certain subgroups of participants are equally represented in each group. Participants are divided into strata, or subgroups, based on characteristics like age or disease severity, and then randomized within these strata.
  • Cluster Randomization: In this method, groups of participants (like families or entire communities), rather than individuals, are randomized.
  • Adaptive Randomization: In this method, the probability of being assigned to each group changes based on the participants already assigned to each group. For example, if more participants have been assigned to the control group, new participants will have a higher probability of being assigned to the experimental group.
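As a concrete illustration, the base-R sketch below shows what simple randomization can look like in practice. The sample size, group labels, and random seed are arbitrary choices for this example, not values from the article.

```r
# Minimal sketch of simple randomization (illustrative values only).
set.seed(42)                              # fix the seed so the allocation is reproducible
n   <- 40                                 # hypothetical number of participants
ids <- sprintf("P%02d", 1:n)              # hypothetical participant IDs

# Coin-flip style: each participant independently has a 50/50 chance of either arm,
# so group sizes may differ slightly by chance.
coin_flip <- sample(c("Treatment", "Control"), size = n, replace = TRUE)
table(coin_flip)

# Shuffled fixed allocation: exactly n/2 participants per arm, in random order.
shuffled <- sample(rep(c("Treatment", "Control"), each = n / 2))
head(data.frame(id = ids, group = shuffled))
```

Block, stratified, and adaptive schemes build on the same idea but constrain the sequence in the ways described above.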

Computer software can generate random numbers or sequences that can be used to assign participants to groups in a simple randomization process.

For more complex methods like block, stratified, or adaptive randomization, computer algorithms can be used to consider the additional parameters and ensure that participants are assigned to groups appropriately.

Using a computerized system can also help to maintain the integrity of the randomization process by preventing researchers from knowing in advance which group a participant will be assigned to (a principle known as allocation concealment). This can help to prevent selection bias and ensure the validity of the study results.

Allocation Concealment

Allocation concealment is a technique to ensure the random allocation process is truly random and unbiased.

In an RCT, for example, allocation concealment keeps hidden which patients will receive the real medicine and which will receive a placebo until each patient has actually been enrolled.

It involves keeping the sequence of group assignments (i.e., who gets assigned to the treatment group and who gets assigned to the control group next) hidden from the researchers before a participant has enrolled in the study.

This helps to prevent the researchers from consciously or unconsciously selecting certain participants for one group or the other based on their knowledge of which group is next in the sequence.

Allocation concealment ensures that the investigator does not know in advance which treatment the next person will get, thus maintaining the integrity of the randomization process.

Blinding (Masking)

Blinding, or masking, refers to withholding information regarding the group assignments (who is in the treatment group and who is in the control group) from the participants, the researchers, or both during the study.

A blinded study prevents the participants from knowing about their treatment to avoid bias in the research. Any information that can influence the subjects is withheld until the completion of the research.

Blinding can be imposed on any participant in an experiment, including researchers, data collectors, evaluators, technicians, and data analysts.

Good blinding can reduce experimental biases arising from the subjects’ expectations, observer bias, confirmation bias, researcher bias, the observer’s effect on the participants, and other biases that may occur in a research test.

In a double-blind study, neither the participants nor the researchers know who is receiving the drug or the placebo. When a participant is enrolled, they are randomly assigned to one of the two groups. The medication they receive looks identical whether it’s the drug or the placebo.


Figure 1. Evidence-based medicine pyramid. The pyramid shape reflects, from bottom to top, the increasing quality and decreasing quantity of each study design in the body of published literature. For example, randomized controlled trials are higher quality and more labor-intensive to conduct, so fewer of them are published.

Prevents bias

In randomized control trials, participants must be randomly assigned to either the intervention group or the control group, such that each individual has an equal chance of being placed in either group.

This is meant to prevent selection bias and allocation bias and achieve control over any confounding variables to provide an accurate comparison of the treatment being studied.

Because the characteristics of patients that could influence the outcome are distributed at random between the groups, any differences in outcome can be explained only by the treatment.

High statistical power

Because the participants are randomized and the characteristics between the two groups are balanced, researchers can assume that if there are significant differences in the primary outcome between the two groups, the differences are likely to be due to the intervention.

This gives researchers confidence that randomized controlled trials have high statistical power compared with other types of study designs.

Since the focus of conducting a randomized control trial is eliminating bias, blinded RCTs can help minimize any unconscious information bias.

In a blinded RCT, the participants do not know which group they are assigned to or which intervention they are receiving. This blinding procedure should also apply to researchers, health care professionals, assessors, and investigators when possible.

“Single-blind” refers to an RCT where participants do not know the details of the treatment, but the researchers do.

“Double-blind” refers to an RCT where both participants and data collectors are masked to the assigned treatment.

Limitations

Costly and time-consuming.

Some interventions require years or even decades to evaluate, rendering them expensive and time-consuming.

It might take an extended period of time before researchers can identify a drug’s effects or discover significant results.

Requires large sample size

There must be enough participants in each group of a randomized control trial so researchers can detect any true differences or effects in outcomes between the groups.

Researchers cannot detect clinically important results if the sample size is too small.

Change in population over time

Because randomized control trials are longitudinal in nature, it is almost inevitable that some participants will not complete the study, whether due to death, migration, non-compliance, or loss of interest in the study.

This tendency is known as selective attrition and can threaten the statistical power of an experiment.

Randomized control trials are not always practical or ethical, and such limitations can prevent researchers from conducting their studies.

For example, a treatment could be too invasive, or administering a placebo instead of an actual drug during a trial for treating a serious illness could deprive a participant of their normal course of treatment. Without ethical approval, a randomized control trial cannot proceed.

Fictitious Example

An example of an RCT would be a clinical trial comparing a drug’s effect or a new treatment on a select population.

The researchers would randomly assign participants to either the experimental group or the control group and compare the differences in outcomes between those who receive the drug or treatment and those who do not.

Real-life Examples

  • Preventing illicit drug use in adolescents: Long-term follow-up data from a randomized control trial of a school population (Botvin et al., 2000).
  • A prospective randomized control trial comparing medical and surgical treatment for early pregnancy failure (Demetroulis et al., 2001).
  • A randomized control trial to evaluate a paging system for people with traumatic brain injury (Wilson et al., 2005).
  • Prehabilitation versus Rehabilitation: A Randomized Control Trial in Patients Undergoing Colorectal Resection for Cancer (Gillis et al., 2014).
  • A Randomized Control Trial of Right-Heart Catheterization in Critically Ill Patients (Guyatt, 1991).
  • Berry, R. B., Kryger, M. H., & Massie, C. A. (2011). A novel nasal excitatory positive airway pressure (EPAP) device for the treatment of obstructive sleep apnea: A randomized controlled trial. Sleep , 34, 479–485.
  • Gloy, V. L., Briel, M., Bhatt, D. L., Kashyap, S. R., Schauer, P. R., Mingrone, G., . . . Nordmann, A. J. (2013, October 22). Bariatric surgery versus non-surgical treatment for obesity: A systematic review and meta-analysis of randomized controlled trials. BMJ , 347.
  • Streeton, C., & Whelan, G. (2001). Naltrexone, a relapse prevention maintenance treatment of alcohol dependence: A meta-analysis of randomized controlled trials. Alcohol and Alcoholism, 36 (6), 544–552.

How Should an RCT be Reported?

Reporting of a Randomized Controlled Trial (RCT) should be done in a clear, transparent, and comprehensive manner to allow readers to understand the design, conduct, analysis, and interpretation of the trial.

The Consolidated Standards of Reporting Trials ( CONSORT ) statement is a widely accepted guideline for reporting RCTs.

Further Information

  • Cocks, K., & Torgerson, D. J. (2013). Sample size calculations for pilot randomized trials: a confidence interval approach. Journal of clinical epidemiology, 66(2), 197-201.
  • Kendall, J. (2003). Designing a research project: randomised controlled trials and their principles. Emergency medicine journal: EMJ, 20(2), 164.

Akobeng, A. K. (2005). Understanding randomised controlled trials. Archives of Disease in Childhood, 90, 840-844.

Bell, C. C., Gibbons, R., & McKay, M. M. (2008). Building protective factors to offset sexually risky behaviors among black youths: a randomized control trial. Journal of the National Medical Association, 100 (8), 936-944.

Bhide, A., Shah, P. S., & Acharya, G. (2018). A simplified guide to randomized controlled trials. Acta obstetricia et gynecologica Scandinavica, 97 (4), 380-387.

Botvin, G. J., Griffin, K. W., Diaz, T., Scheier, L. M., Williams, C., & Epstein, J. A. (2000). Preventing illicit drug use in adolescents: Long-term follow-up data from a randomized control trial of a school population. Addictive Behaviors, 25 (5), 769-774.

Demetroulis, C., Saridogan, E., Kunde, D., & Naftalin, A. A. (2001). A prospective randomized control trial comparing medical and surgical treatment for early pregnancy failure. Human Reproduction, 16 (2), 365-369.

Gillis, C., Li, C., Lee, L., Awasthi, R., Augustin, B., Gamsa, A., … & Carli, F. (2014). Prehabilitation versus rehabilitation: a randomized control trial in patients undergoing colorectal resection for cancer. Anesthesiology, 121 (5), 937-947.

Globas, C., Becker, C., Cerny, J., Lam, J. M., Lindemann, U., Forrester, L. W., … & Luft, A. R. (2012). Chronic stroke survivors benefit from high-intensity aerobic treadmill exercise: a randomized control trial. Neurorehabilitation and Neural Repair, 26 (1), 85-95.

Guyatt, G. (1991). A randomized control trial of right-heart catheterization in critically ill patients. Journal of Intensive Care Medicine, 6 (2), 91-95.

MediLexicon International. (n.d.). Randomized controlled trials: Overview, benefits, and limitations. Medical News Today. Retrieved from https://www.medicalnewstoday.com/articles/280574#what-is-a-randomized-controlled-trial

Wilson, B. A., Emslie, H., Quirk, K., Evans, J., & Watson, P. (2005). A randomized control trial to evaluate a paging system for people with traumatic brain injury. Brain Injury, 19 (11), 891-894.


Study Design 101: Randomized Controlled Trial


A study design that randomly assigns participants into an experimental group or a control group. As the study is conducted, the only expected difference between the control and experimental groups in a randomized controlled trial (RCT) is the outcome variable being studied.

Advantages

  • Good randomization will "wash out" any population bias
  • Easier to blind/mask than observational studies
  • Results can be analyzed with well known statistical tools
  • Populations of participating individuals are clearly identified

Disadvantages

  • Expensive in terms of time and money
  • Volunteer biases: the population that participates may not be representative of the whole
  • Loss to follow-up attributed to treatment

Design pitfalls to look out for

An RCT should be a study of one population only.

Was the randomization actually "random", or are there really two populations being studied?

The variables being studied should be the only variables between the experimental group and the control group.

Are there any confounding variables between the groups?

Fictitious Example

To determine how a new type of short wave UVA-blocking sunscreen affects the general health of skin in comparison to a regular long wave UVA-blocking sunscreen, 40 trial participants were randomly separated into equal groups of 20: an experimental group and a control group. All participants' skin health was then initially evaluated. The experimental group wore the short wave UVA-blocking sunscreen daily, and the control group wore the long wave UVA-blocking sunscreen daily.

After one year, the general health of the skin was measured in both groups and statistically analyzed. In the control group, wearing long wave UVA-blocking sunscreen daily led to improvements in general skin health for 60% of the participants. In the experimental group, wearing short wave UVA-blocking sunscreen daily led to improvements in general skin health for 75% of the participants.
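To make the comparison concrete, the sketch below shows one way these fictitious counts (15 of 20 versus 12 of 20 improved) might be analysed in R. The choice of a two-sample proportion test is our illustration and is not specified in the guide.

```r
# Fictitious sunscreen example: 15/20 improved with the experimental sunscreen,
# 12/20 improved with the control sunscreen (counts taken from the text above).
improved <- c(experimental = 15, control = 12)
n_group  <- c(experimental = 20, control = 20)

prop.test(improved, n_group)   # two-sample test of equal proportions, with a 95% CI
# With only 20 participants per group, a 75% vs 60% difference is unlikely to reach
# conventional statistical significance -- a reminder of the sample size limitation
# discussed elsewhere on this page.
```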

Real-life Examples

van Der Horst, N., Smits, D., Petersen, J., Goedhart, E., & Backx, F. (2015). The preventive effect of the nordic hamstring exercise on hamstring injuries in amateur soccer players: a randomized controlled trial. The American Journal of Sports Medicine, 43 (6), 1316-1323. https://doi.org/10.1177/0363546515574057

This article reports on the research investigating whether the Nordic Hamstring Exercise is effective in preventing both the incidence and severity of hamstring injuries in male amateur soccer players. Over the course of a year, there was a statistically significant reduction in the incidence of hamstring injuries in players performing the NHE, but for those injured, there was no difference in severity of injury. There was also a high level of compliance in performing the NHE in that group of players.

Natour, J., Cazotti, L., Ribeiro, L., Baptista, A., & Jones, A. (2015). Pilates improves pain, function and quality of life in patients with chronic low back pain: a randomized controlled trial. Clinical Rehabilitation, 29 (1), 59-68. https://doi.org/10.1177/0269215514538981

This study assessed the effect of adding pilates to a treatment regimen of NSAID use for individuals with chronic low back pain. Individuals who included the pilates method in their therapy took fewer NSAIDs and experienced statistically significant improvements in pain, function, and quality of life.

Related Formulas

  • Relative Risk
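The guide's formula page is not reproduced here, but the standard definition of relative risk from a 2x2 table, with a/b the treated participants with/without the outcome and c/d the control participants with/without the outcome, is:

```latex
\[
RR \;=\; \frac{a/(a+b)}{c/(c+d)}
\]
```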

Related Terms

Blinding/Masking

When the groups that have been randomly selected from a population do not know whether they are in the control group or the experimental group.

Causation

Being able to show that an independent variable directly causes the dependent variable. This is generally very difficult to demonstrate in most study designs.

Confounding Variables

Variables that cause/prevent an outcome from occurring outside of or along with the variable being studied. These variables render it difficult or impossible to distinguish the relationship between the variable and outcome being studied.

Correlation

A relationship between two variables, but not necessarily a causal relationship.

Double Blinding/Masking

When the researchers conducting a blinded study do not know which participants are in the control group or the experimental group.

Null Hypothesis

The statement that the relationship the researchers expect to find between the independent and dependent variables does not exist. To "reject the null hypothesis" is to conclude that there is a relationship between the variables.

Population/Cohort

A group whose members share the same characteristics (population). A cohort is such a group followed over time in a study.

Population Bias/Volunteer Bias

A sample may be skewed by those who are selected or self-selected into a study. If only certain portions of a population are considered in the selection process, the results of a study may have poor validity.

Randomization

Any of a number of mechanisms used to assign participants into different groups with the expectation that these groups will not differ in any significant way other than treatment and outcome.

Research (alternative) Hypothesis

The relationship between the independent and dependent variables that researchers believe they will prove through conducting a study.

Sensitivity

The relationship between what is considered a symptom of an outcome and the outcome itself; or the percent chance of not getting a false negative (see formulas).

Specificity

The relationship between not having a symptom of an outcome and not having the outcome itself; or the percent chance of not getting a false positive (see formulas).
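In terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these two quantities are conventionally written as:

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
```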

Type 1 error

Rejecting a null hypothesis when it is in fact true. This is also known as an error of commission.

Type 2 error

The failure to reject a null hypothesis when it is in fact false. This is also known as an error of omission.
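Equivalently, writing H_0 for the null hypothesis, the two error rates and statistical power are conventionally defined as:

```latex
\[
\alpha = P(\text{Type 1 error}) = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta = P(\text{Type 2 error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad
\text{Power} = 1 - \beta
\]
```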

Now test yourself!

1. Having a volunteer bias in the population group is a good thing because it means the study participants are eager and make the study even stronger.

a) True b) False

2. Why is randomization important to assignment in an RCT?

a) It enables blinding/masking
b) So causation may be extrapolated from results
c) It balances out individual characteristics between groups
d) a and c
e) b and c



Understanding randomised controlled trials

  • A K Akobeng
  • Correspondence to: Dr A K Akobeng, Department of Paediatric Gastroenterology, Central Manchester and Manchester Children’s University Hospitals, Booth Hall Children’s Hospital, Charlestown Road, Blackley, Manchester, M9 7AA, UK; tony.akobeng@cmmc.nhs.uk

The hierarchy of evidence in assessing the effectiveness of interventions or treatments is explained, and the gold standard for evaluating the effectiveness of interventions, the randomised controlled trial, is discussed. Issues that need to be considered during the critical appraisal of randomised controlled trials, such as assessing the validity of trial methodology and the magnitude and precision of the treatment effect, and deciding on the applicability of research results, are discussed. Important terminologies such as randomisation, allocation concealment, blinding, intention to treat, p values, and confidence intervals are explained.

  • CONSORT, consolidated standards of reporting trials
  • EBM, evidence based medicine
  • PCDAI, paediatric Crohn’s disease activity index
  • RCT, randomised controlled trial
  • evidence based medicine
  • hierarchy of evidence
  • randomised controlled trial
  • random allocation
  • critical appraisal

https://doi.org/10.1136/adc.2004.058222


In the first article of the series, 1 I described evidence based medicine (EBM) as a systematic approach to clinical problem solving, which allows the integration of the best available research evidence with clinical expertise and patient values. In this article, I will explain the hierarchy of evidence in assessing the effectiveness of interventions or treatments, and discuss the randomised controlled trial, the gold standard for evaluating the effectiveness of interventions.

HIERARCHY OF EVIDENCE

It is well recognised that some research designs are more powerful than others in their ability to answer research questions on the effectiveness of interventions. This notion has given rise to the concept of “hierarchy of evidence”. The hierarchy provides a framework for ranking evidence that evaluates health care interventions and indicates which studies should be given most weight in an evaluation where the same question has been examined using different types of study. 2

Figure 1 illustrates such a hierarchy. The ranking has an evolutionary order, moving from simple observational methods at the bottom, through to increasingly rigorous methodologies. The pyramid shape is used to illustrate the increasing risk of bias inherent in study designs as one goes down the pyramid. 3 The randomised controlled trial (RCT) is considered to provide the most reliable evidence on the effectiveness of interventions because the processes used during the conduct of an RCT minimise the risk of confounding factors influencing the results. Because of this, the findings generated by RCTs are likely to be closer to the true effect than the findings generated by other research methods. 4


Figure 1: Hierarchy of evidence for questions about the effectiveness of an intervention or treatment.

The hierarchy implies that when we are looking for evidence on the effectiveness of interventions or treatments, properly conducted systematic reviews of RCTs with or without meta-analysis or properly conducted RCTs will provide the most powerful form of evidence. 3 For example, if you want to know whether there is good evidence that children with meningitis should be given corticosteroids or not, the best articles to look for would be systematic reviews or RCTs.

WHAT IS A RANDOMISED CONTROLLED TRIAL?

An RCT is a type of study in which participants are randomly assigned to one of two or more clinical interventions. The RCT is the most scientifically rigorous method of hypothesis testing available, 5 and is regarded as the gold standard trial for evaluating the effectiveness of interventions. 6 The basic structure of an RCT is shown in fig 2.

Figure 2: The basic structure of a randomised controlled trial.

A sample of the population of interest is randomly allocated to one or another intervention and the two groups are followed up for a specified period of time. Apart from the interventions being compared, the two groups are treated and observed in an identical manner. At the end of the study, the groups are analysed in terms of outcomes defined at the outset. The results from, say, the treatment A group are compared with results from the treatment B group. As the groups are treated identically apart from the intervention received, any differences in outcomes are attributed to the trial therapy. 6

WHY A RANDOMISED CONTROLLED TRIAL?

The main purpose of random assignment is to prevent selection bias by distributing the characteristics of patients that may influence the outcome randomly between the groups, so that any difference in outcome can be explained only by the treatment. 7 Thus random allocation makes it more likely that there will be balancing of baseline systematic differences between intervention groups with regard to known and unknown factors—such as age, sex, disease activity, and duration of disease—that may affect the outcome.

APPRAISING A RANDOMISED CONTROLLED TRIAL

When you are reading an RCT article, the answers to a few questions will help you decide whether you can trust the results of the study and whether you can apply the results to your patient or population. Issues to consider when reading an RCT may be condensed into three important areas 8 :

the validity of the trial methodology;

the magnitude and precision of the treatment effect;

the applicability of the results to your patient or population.

A list of 10 questions that may be used for critical appraisal of an RCT in all three areas is given in box 1. 9

Box 1: Questions to consider when assessing an RCT 9

Did the study ask a clearly focused question?

Was the study an RCT and was it appropriately so?

Were participants appropriately allocated to intervention and control groups?

Were participants, staff, and study personnel blind to participants’ study groups?

Were all the participants who entered the trial accounted for at its conclusion?

Were participants in all groups followed up and data collected in the same way?

Did the study have enough participants to minimise the play of chance?

How are the results presented and what are the main results?

How precise are the results?

Were all important outcomes considered and can the results be applied to your local population?

ASSESSING THE VALIDITY OF TRIAL METHODOLOGY

Focused research question.

It is important that research questions be clearly defined at the outset. The question should be focused on the problem of interest, and should be framed in such a way that even somebody who is not a specialist in the field would understand why the study was undertaken.

Randomisation

Randomisation refers to the process of assigning study participants to experimental or control groups at random such that each participant has an equal probability of being assigned to any given group. 10 The main purpose of randomisation is to eliminate selection bias and balance known and unknown confounding factors in order to create a control group that is as similar as possible to the treatment group.

Methods for randomly assigning participants to groups, which limits bias, include the use of a table of random numbers and a computer program that generates random numbers. Methods of assignment that are prone to bias include alternating assignment or assignment by date of birth or hospital admission number. 10

In very large clinical trials, simple randomisation may lead to a balance between groups in the number of patients allocated to each of the groups, and in patient characteristics. However, in “smaller” studies this may not be the case. Block randomisation and stratification are strategies that may be used to help ensure balance between groups in size and patient characteristics. 11

Block randomisation

Block randomisation may be used to ensure a balance in the number of patients allocated to each of the groups in the trial. Participants are considered in blocks of, say, four at a time. Using a block size of four for two treatment arms (A and B) will lead to six possible arrangements of two As and two Bs (blocks):

AABB BBAA ABAB BABA ABBA BAAB

A random number sequence is used to select a particular block, which determines the allocation order for the first four subjects. In the same vein, treatments are allocated to the next four patients in the order specified by the next randomly selected block.
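A short R sketch of this procedure is given below; the number of blocks and the random seed are arbitrary choices for illustration.

```r
# Block randomization with a block size of four and two arms (A and B),
# using the six possible arrangements listed above.
set.seed(7)
blocks <- c("AABB", "BBAA", "ABAB", "BABA", "ABBA", "BAAB")

n_blocks   <- 5                                          # 5 blocks x 4 = 20 participants
chosen     <- sample(blocks, n_blocks, replace = TRUE)   # randomly selected blocks
allocation <- unlist(strsplit(chosen, split = ""))       # one letter per participant

allocation         # allocation order for participants 1 to 20
table(allocation)  # always 10 As and 10 Bs, so the arms stay balanced
```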

Stratification

While randomisation may help remove selection bias, it does not always guarantee that the groups will be similar with regard to important patient characteristics. 12 In many studies, important prognostic factors are known before the study. One way of trying to ensure that the groups are as identical as possible is to generate separate block randomisation lists for different combinations of prognostic factors. This method is called stratification or stratified block sampling. For example, in a trial of enteral nutrition in the induction of remission in active Crohn’s disease, potential stratification factors might be disease activity (paediatric Crohn’s disease activity index (PCDAI) ⩽25 v >25) and disease location (small bowel involvement v no small bowel involvement). A set of blocks could be generated for those patients who have PCDAI ⩽25 and have small bowel disease; those who have PCDAI ⩽25 and have no small bowel disease; those who have PCDAI >25 and have small bowel disease; and those who have PCDAI >25 and have no small bowel disease.
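Continuing the Crohn's disease example, a stratified scheme can be sketched by pre-generating a separate block-randomised list for each of the four strata. The number of blocks per stratum and the variable names below are illustrative assumptions.

```r
# Stratified block randomisation: one allocation list per combination of
# PCDAI category and small bowel involvement (illustrative only).
set.seed(11)
block_of_four <- function() sample(rep(c("A", "B"), each = 2))  # one random block of 4

strata <- expand.grid(pcdai       = c("<=25", ">25"),
                      small_bowel = c("yes", "no"),
                      stringsAsFactors = FALSE)

# Pre-generate three blocks (12 assignments) for each stratum; a fresh random
# list is drawn independently for every stratum.
allocation_lists <- lapply(seq_len(nrow(strata)), function(i)
  unlist(replicate(3, block_of_four(), simplify = FALSE)))
names(allocation_lists) <- paste(strata$pcdai, strata$small_bowel, sep = " / ")

allocation_lists[["<=25 / yes"]]   # allocation order for one stratum
```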

Allocation concealment

Allocation concealment is a technique that is used to help prevent selection bias by concealing the allocation sequence from those assigning participants to intervention groups, until the moment of assignment. The technique prevents researchers from consciously or unconsciously influencing which participants are assigned to a given intervention group. For instance, if the randomisation sequence shows that patient number 9 will receive treatment A, allocation concealment will prevent researchers or other health care professionals from manoeuvring to place another patient in position 9.

In a recent observational study, Schulz et al showed that in trials in which allocation was not concealed, estimates of treatment effect were exaggerated by about 41% compared with those that reported adequate allocation concealment. 13

A common way for concealing allocation is to seal each individual assignment in an opaque envelope. 10 However, this method may have disadvantages, and “distance” randomisation is generally preferred. 14 Distance randomisation means that assignment sequence should be completely removed from those who make the assignments. The investigator, on recruiting a patient, telephones a central randomisation service which issues the treatment allocation.
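The sketch below mimics this kind of "distance" randomisation in R: the full sequence is generated once and held centrally, and an assignment is released only when a patient is actually enrolled. The function and object names are our own illustration, not a real randomisation service.

```r
# Illustrative central randomisation "service": the sequence is pre-generated and
# kept hidden; the investigator learns each allocation only at the moment of enrolment.
set.seed(3)
.sequence <- sample(rep(c("A", "B"), each = 50))   # concealed allocation sequence
.position <- 0

issue_allocation <- function(patient_id) {
  .position <<- .position + 1                      # advance to the next slot
  data.frame(patient   = patient_id,
             position  = .position,
             treatment = .sequence[.position])
}

issue_allocation("patient-009")   # allocation revealed only at enrolment
```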

Although an RCT should, in theory, eliminate selection bias, there are instances where bias can occur. 15 You should not assume that a trial methodology is valid merely because it is stated to be an RCT. Any selection bias in an RCT invalidates the study design and makes the results no more reliable than an observational study. As Torgerson and Roberts have suggested, the results of a supposed RCT which has had its randomisation compromised by, say, poor allocation concealment may be more damaging than an explicitly unrandomised study, as bias in the latter is acknowledged and the statistical analysis and subsequent interpretation might have taken this into account. 14

There is always a risk in clinical trials that perceptions about the advantages of one treatment over another might influence outcomes, leading to biased results. This is particularly important when subjective outcome measures are being used. Patients who are aware that they are receiving what they believe to be an expensive new treatment may report being better than they really are. The judgement of a doctor who expects a particular treatment to be more effective than another may be clouded in favour of what he perceives to be the more effective treatment. When people analysing data know which treatment group was which, there can be the tendency to “overanalyse” the data for any minor differences that would support one treatment.

Knowledge of treatment received could also influence management of patients during the trial, and this can be a source of bias. For example, there could be the temptation for a doctor to give more care and attention during the study to patients receiving what he perceives to be the less effective treatment in order to compensate for perceived disadvantages.

To control for these biases, “blinding” may be undertaken. The term blinding (sometimes called masking) refers to the practice of preventing study participants, health care professionals, and those collecting and analysing data from knowing who is in the experimental group and who is in the control group, in order to avoid them being influenced by such knowledge. 16 It is important for authors of papers describing RCTs to state clearly whether participants, researchers, or data evaluators were or were not aware of assigned treatment.

In a study where participants do not know the details of the treatment but the researchers do, the term “single blind” is used. When both participants and data collectors (health care professionals, investigators) are kept ignorant of the assigned treatment, the term “double blind” is used. When, rarely, study participants, data collectors, and data evaluators such as statisticians are all blinded, the study is referred to as “triple blind”. 5

Recent studies have shown that blinding of patients and health care professionals prevents bias. Trials that were not double blinded yielded larger estimates of treatment effects than trials in which authors reported double blinding (odds ratios exaggerated, on average, by 17%). 17

It should be noted that, although blinding helps prevent bias, its effect in doing so is weaker than that of allocation concealment. 17 Moreover, unlike allocation concealment, blinding is not always appropriate or possible. For example, in a randomised controlled trial where one is comparing enteral nutrition with corticosteroids in the treatment of children with active Crohn’s disease, it may be impossible to blind participants and health care professionals to assigned intervention, although it may still be possible to blind those analysing the data, such as statisticians.

Intention to treat analysis

As stated earlier, the validity of an RCT depends greatly on the randomisation process. Randomisation ensures that known and unknown baseline confounding factors would balance out in the treatment and control groups. However, after randomisation, it is almost inevitable that some participants would not complete the study for whatever reason. Participants may deviate from the intended protocol because of misdiagnosis, non-compliance, or withdrawal. When such patients are excluded from the analysis, we can no longer be sure that important baseline prognostic factors in the two groups are similar. Thus the main rationale for random allocation is defeated, leading to potential bias.

To reduce this bias, results should be analysed on an “intention to treat” basis.

Intention to treat analysis is a strategy in the conduct and analysis of randomised controlled trials that ensures that all patients allocated to either the treatment or control groups are analysed together as representing that treatment arm whether or not they received the prescribed treatment or completed the study. 5 Intention to treat introduces clinical reality into research by recognising that for several reasons, not all participants randomised will receive the intended treatment or complete the follow up. 18

According to the revised CONSORT statement for reporting RCTs, authors of papers should state clearly which participants are included in their analyses. 19 The sample size per group, or the denominator when proportions are being reported, should be provided for all summary information. The main results should be analysed on the basis of intention to treat. Where necessary, additional analyses restricted only to participants who fulfilled the intended protocol (per protocol analyses) may also be reported.
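The difference between the two analyses can be made concrete with a small simulated data set; all of the variable names, sample sizes, and event rates below are invented for illustration.

```r
# Intention-to-treat (ITT) versus per-protocol analysis on simulated trial data.
set.seed(21)
trial <- data.frame(
  assigned  = rep(c("treatment", "control"), each = 100),    # randomised arm
  completed = rbinom(200, 1, 0.85),                          # 1 = followed the protocol
  improved  = rbinom(200, 1, rep(c(0.60, 0.45), each = 100)) # outcome, better in treatment
)

# Intention to treat: everyone is analysed in the arm they were randomised to.
itt <- with(trial, table(assigned, improved))
prop.test(itt[, "1"], rowSums(itt))

# Per protocol (supplementary only): restricted to protocol completers.
pp <- with(subset(trial, completed == 1), table(assigned, improved))
prop.test(pp[, "1"], rowSums(pp))
```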

Power and sample size calculation

The statistical power of an RCT is the ability of the study to detect a difference between the groups when such a difference exists. The power of a study is determined by several factors, including the frequency of the outcome being studied, the magnitude of the effect, the study design, and the sample size. 5 For an RCT to have a reasonable chance of answering the research question it addresses, the sample size must be large enough—that is, there must be enough participants in each group.

When the sample size of a study is too small, it may be impossible to detect any true differences in outcome between the groups. Such a study might be a waste of resources and potentially unethical. Frequently, however, small sized studies are published that claim no difference in outcome between groups without reporting the power of the studies. Researchers should ensure at the planning stage that there are enough participants to ensure that the study has a high probability of detecting as statistically significant the smallest effect that would be regarded as clinically important. 20
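As an illustration, base R's power.prop.test() turns assumed outcome rates, a significance level, and a target power into a required sample size. The 45% versus 60% rates below are arbitrary assumptions, not values from the article.

```r
# Sample size needed to detect a 45% vs 60% outcome rate with 80% power
# at a two-sided 5% significance level (illustrative inputs).
power.prop.test(p1 = 0.45, p2 = 0.60,
                sig.level = 0.05,
                power = 0.80)
# The reported "n" is the number of participants required *per group*; roughly
# speaking, halving the expected difference quadruples the required sample size.
```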

MAGNITUDE AND SIGNIFICANCE OF TREATMENT EFFECT

Once you have decided that the methodology of a study is valid within reason, the next step is to decide whether the results are reliable. Two things usually come into mind in making this decision—how big is the treatment effect, and how likely is it that the result obtained is due to chance alone?

Magnitude of treatment effect

Magnitude refers to the size of the measure of effect. Treatment effect in RCTs may be reported in various ways including absolute risk, relative risk, odds ratio, and number needed to treat. These measures of treatment effect and their advantages and disadvantages have recently been reviewed. 21 A large treatment effect may be more important than a small one.

Statistical significance

Statistical significance refers to the likelihood that the results obtained in a study were not due to chance alone. Probability (p) values and confidence intervals may be used to assess statistical significance.

A p value can be thought of as the probability that the observed difference between two treatment groups might have occurred by chance. The choice of a significance level is artificial but by convention, many researchers use a p value of 0.05 as the cut off for significance. What this means is that if the p value is less than 0.05, the observed difference between the groups is so unlikely to have occurred by chance that we reject the null hypothesis (that there is no difference) and accept the alternative hypothesis that there is a real difference between the treatment groups. When the p value is below the chosen cut off, say 0.05, the result is generally referred to as being statistically significant. If the p value is greater than 0.05, then we say that the observed difference might have occurred by chance and we fail to reject the null hypothesis. In such a situation, we are unable to demonstrate a difference between the groups and the result is usually referred to as not statistically significant.

Confidence intervals

The results of any study are estimates of what might happen if the treatment were to be given to the entire population of interest. When I test a new asthma drug on a randomly selected sample of children with asthma in the United Kingdom, the treatment effect I will get will be an estimate of the “true” treatment effect for the whole population of children with asthma in the country. The 95% confidence interval (CI) of the estimate will be the range within which we are 95% certain that the true population treatment effect will lie. It is most common to report 95% CI, but other intervals, such as 90% and 99% CI, may also be calculated for an estimate.

If the CI for a mean difference includes 0, then we have been unable to demonstrate a difference between the groups being compared (“not statistically significant”), but if the CI for a mean difference does not include 0, then a statistically significant difference between the groups has been shown. In the same vein, if the CI for relative risk or odds ratio for an estimate includes 1, then we have been unable to demonstrate a statistically significant difference between the groups being compared, and if it does not include 1, then there is a statistically significant difference.
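For a difference in means, the familiar large-sample form of the 95% confidence interval makes this interpretation explicit:

```latex
\[
\text{95\% CI} \;=\; (\bar{x}_1 - \bar{x}_2) \;\pm\; 1.96 \times SE(\bar{x}_1 - \bar{x}_2)
\]
```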

Confidence intervals versus p values

CIs convey more useful information than p values. CIs may be used to assess statistical significance, provide a range of plausible values for a population parameter, and give an idea of how precise the measured treatment effect is (see below). Authors of articles could report both p values and CIs. 22 However, if only one is to be reported, then it should be the CI, as the p value is less important and can be deduced from the CI; p values tell us little extra when CIs are known. 22, 23

Clinical significance

A statistically significant finding by itself can have very little to do with clinical practice and has no direct relation to clinical significance. Clinical significance reflects the value of the results to patients and may be defined as a difference in effect size between groups that could be considered to be important in clinical decision making, regardless of whether the difference is statistically significant or not. Magnitude and statistical significance are numerical calculations, but judgements about the clinical significance or clinical importance of the measured effect are relative to the topic of interest. 2 Judgements about clinical significance should take into consideration how the benefits and any adverse events of an intervention are valued by the patient.

PRECISION OF TREATMENT EFFECT

CI is important because it gives an idea about how precise an estimate is. The width of the interval indicates the precision of the estimate. The wider the interval, the less the precision. A very wide interval may indicate that more data should be collected before anything definite can be said about the estimate.

APPLYING RESULTS TO YOUR OWN PATIENTS

An important concept of EBM is that clinicians should make decisions about whether the valid results of a study are applicable to their patients. The fact that good evidence is available on a particular asthma treatment does not necessarily mean that all patients with asthma can or should be given that treatment. Some of the issues one needs to consider before deciding whether to incorporate a particular piece of research evidence into clinical practice are briefly discussed below.

Are the participants in the study similar enough to my patients?

If a particular drug has been found to be effective in adults with meningitis in the USA, you need to decide whether there is any biological, geographical, or cultural reason why that particular drug will not be effective in children with meningitis in the United Kingdom.

Do the potential side effects of the drug outweigh the benefits?

If a particular treatment is found to be effective in an RCT, you need to consider whether the reported or known side effects of the drug may outweigh its potential benefits to your patient. You may also need to consider whether an individual patient has any potential co-morbid condition which may alter the balance of benefits and risks. In such a situation, you may, after consultation with the patient or carers, decide not to offer the treatment.

Does the treatment conflict with the patient’s values and expectations?

Full information about the treatment should be given to the patient or carers, and their views on the treatment should be taken into account. A judgement should be made about how the patient and carers value the potential benefits of the treatment as against potential harms.

Is the treatment available and is my hospital prepared to fund it?

There will be no point in prescribing a treatment which cannot either be obtained in your area of work or which your hospital or practice is not in a position to fund, for whatever reason, including cost.

CONCLUSIONS

An RCT is the most rigorous scientific method for evaluating the effectiveness of health care interventions. However, bias could arise when there are flaws in the design and management of a trial. It is important for people reading medical reports to develop the skills for critically appraising RCTs, including the ability to assess the validity of trial methodology, the magnitude and precision of the treatment effect, and the applicability of results.

REFERENCES

1. Akobeng AK. Evidence-based child health. 1. Principles of evidence-based medicine. Arch Dis Child 2005;90:837–40.
2. Rychetnik L, Hawe P, Waters E, et al. A glossary for evidence based public health. J Epidemiol Community Health 2004;58:538–45.
3. Craig JV, Smyth RL. The evidence-based manual for nurses. London: Churchill Livingstone, 2002.
4. Evans D. Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions. J Clin Nurs 2003;12:77–84.
5. Last JM. A dictionary of epidemiology. New York: Oxford University Press, 2001.
6. McGovern DPB. Randomized controlled trials. In: McGovern DPB, Valori RM, Summerskill WSM, eds. Key topics in evidence based medicine. Oxford: BIOS Scientific Publishers, 2001:26–9.
7. Roberts C, Torgerson D. Randomisation methods in controlled trials. BMJ 1998;317:1301–10.
8. Sackett DL, Strauss SE, Richardson WS, et al. Evidence-based medicine: how to practice and teach EBM. London: Churchill-Livingstone, 2000.
9. Critical Appraisal Skills Programme. Appraisal tools. Oxford, UK. http://www.phru.nhs.uk/casp/rcts.htm (accessed 8 December 2004).
10. Lang TA, Secic M. How to report statistics in medicine. Philadelphia: American College of Physicians, 1997.
11. Altman DG, Bland JM. How to randomise. BMJ 1999;319:703–4.
12. Chia KS. Randomisation: magical cure for bias. Ann Acad Med Singapore 2000;29:563–4.
13. Schulz KF, Chalmers I, Hayes RJ, et al. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408–12.
14. Torgerson DJ, Roberts C. Randomisation methods: concealment. BMJ 1999;319:375–6.
15. Torgerson DJ, Torgerson CJ. Avoiding bias in randomised controlled trials in educational research. Br J Educ Stud 2003;51:36–45.
16. Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504.
17. Schulz KF. Assessing allocation concealment and blinding in randomised controlled trials: why bother? Evid Based Nurs 2000;5:36–7.
18. Summerskill WSM. Intention to treat. In: McGovern DPB, Valori RM, Summerskill WSM, eds. Key topics in evidence based medicine. Oxford: BIOS Scientific Publishers, 2001:105–7.
19. Altman DG, Schulz KF, Moher D, et al. The revised CONSORT statement for reporting randomised controlled trials: explanation and elaboration. Ann Intern Med 2001;134:663–94.
20. Devane D, Begley CM, Clarke M. How many do I need? Basic principles of sample size estimation. J Adv Nurs 2004;47:297–302.
21. Akobeng AK. Understanding measures of treatment effect in clinical trials. Arch Dis Child 2005;90:54–6.
22. Altman DG. Practical statistics for medical research. London: Chapman and Hall/CRC, 1991:152–78.
23. Coggon D. Statistics in clinical practice. London: BMJ Publishing Group, 1995.

Competing interests: none declared



Clinical Trials Design in Operative and Non Operative Invasive Procedures, pp 51–58

Overview of the Randomized Clinical Trial and the Parallel Group Design

  • Domenic J. Reda
  • First Online: 17 May 2017


The randomized clinical trial (RCT) is considered to be the gold standard for determining whether a therapeutic intervention is effective in treating the specific medical condition of interest. The advent of the RCT is relatively recent, with the first major trial reporting results shortly after World War II. However, the various design elements that provide the rigor of the RCT had developed over hundreds of years. In fact, the basic idea of learning from comparison existed long before the advent of the scientific method and the subsequent development of the RCT.

While randomization is a key component of the method, a poorly designed (despite randomization) and conducted trial can negate the benefits of randomization. Thus, blinding of the treatment assignment when possible, maintaining high follow-up rates for those randomized, adherence to the study protocol and attention to high data quality help preserve the benefit of randomization.

There are three factors that need to be considered when designing a trial. The first is the principle of equipoise, which essentially means that there is enough knowledge of the potential effectiveness of an experimental treatment to warrant further study, but there remains sufficient doubt about how the experimental treatment compares to an existing control (whether it is the current standard of care, placebo, or no intervention) to justify the trial. In fact, equipoise justifies randomization. The second is that the overall choice of treatments, assessments and other facets of the study design must adhere to ethical principles for human research. Finally, the trial must be feasible and have a reasonable expectation of successful completion.

  • Randomized control trial
  • Parallel group design




Reda, D.J. (2017). Overview of the Randomized Clinical Trial and the Parallel Group Design. In: Itani, K., Reda, D. (eds) Clinical Trials Design in Operative and Non Operative Invasive Procedures. Springer, Cham. https://doi.org/10.1007/978-3-319-53877-8_6

Randomized Controlled Trial

Establish causal effects with randomized controlled trials.

Randomized controlled trials are considered experiments due to the use of random selection and random assignment.

Randomized controlled trials are considered the "gold standard" experimental design

Randomized controlled trials can yield causal effects due to the use of random selection and random assignment.

[Figure: necessary methodological parts of a randomized controlled trial, including random assignment]


Designing, randomly assigning, and evaluating randomized control trials

Description.

RCT provides three important groups of functions: (a) functions for pre-processing the design of the RCT; (b) functions for assigning treatment status and checking for balance; and (c) functions for evaluating the impact of the RCT.

RCT helps you focus on the statistics of a randomized control trial rather than on the heavy programming lifting. It supports the whole process of designing and evaluating an RCT:

1. Clean and summarise the data in which you want to randomly assign treatment
2. Decide the share of observations that will go to the control group
3. Decide which variables to use for strata building
4. Robust random assignment by strata/blocks
5. Impact evaluation of all y's and heterogeneities

To learn more about RCT, start with the vignette: browseVignettes(package = "RCT")
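The package's own functions are listed below; as a rough illustration of the same workflow in base R (hypothetical data and variable names, not the RCT package's API), stratified assignment and a simple balance check might look like this:

    # Illustrative base-R sketch of steps 1-4 above (hypothetical data; see the
    # package vignette for RCT's own functions and their arguments).
    set.seed(1)
    d <- data.frame(id         = 1:200,
                    region     = sample(c("north", "south"), 200, replace = TRUE),
                    baseline_y = rnorm(200))

    # Random assignment within strata (here, region), with roughly half to control
    d$treat <- ave(seq_len(nrow(d)), d$region,
                   FUN = function(i) sample(rep(0:1, length.out = length(i))))

    # Quick balance check on a baseline covariate
    t.test(baseline_y ~ treat, data = d)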

RCT functions

  • treatment_assign: robust treatment assignment by strata/blocks
  • impact_eval: automated impact evaluation with heterogeneous treatment effects
  • balance_table: balance tables for any number of covariates
  • balance_regression: LPM of treatment status against covariates, with F-test
  • tau_min: computation of the minimum detectable effect between control and treatment units
  • tau_min_probability: computation of the minimum detectable effect between control and treatment units for dichotomous y-vars
  • summary_statistics: summary statistics of all numeric columns in your data
  • ntile_label: rank and divide observations into n groups, with labels

Isidoro Garcia Urquieta, [email protected]

Athey, Susan, and Guido W. Imbens (2017) "The Econometrics of Randomized Experiments". Handbook of Economic Field Experiments. https://arxiv.org/abs/1607.00698

Useful links: https://github.com/isidorogu/RCT Report bugs at https://github.com/isidorogu/RCT/issues

Introduction to Field Experiments and Randomized Controlled Trials


Have you ever been curious about the methods researchers employ to determine causal relationships among various factors, ultimately leading to significant breakthroughs and progress in numerous fields? In this article, we offer an overview of field experimentation and its importance in discerning cause and effect relationships. We outline how randomized experiments represent an unbiased method for determining what works. Furthermore, we discuss key aspects of experiments, such as intervention, excludability, and non-interference. To illustrate these concepts, we present a hypothetical example of a randomized controlled trial evaluating the efficacy of an experimental drug called Covi-Mapp.

Why experiments?

Every day, we find ourselves faced with questions of cause and effect. Understanding the driving forces behind outcomes is crucial, ranging from personal decisions like parenting strategies to organizational challenges such as effective advertising. This blog aims to provide a systematic introduction to experimentation, igniting enthusiasm for primary research and highlighting the myriad of experimental applications and opportunities available.

The challenge for those who seek to answer causal questions convincingly is to develop a research methodology that doesn't require identifying or measuring all potential confounders. Since no planned design can eliminate every possible systematic difference between treatment and control groups, random assignment emerges as a powerful tool for minimizing bias. In the contentious world of causal claims, randomized experiments represent an unbiased method for determining what works. Random assignment means participants are assigned to different groups or conditions in a study purely by chance. Basically, each participant has an equal chance to be assigned to a control group or a treatment group. 
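As a minimal sketch (hypothetical group labels and sample size), simple random assignment by chance takes only a few lines of R:

    # Each participant has an equal chance of being assigned to either group
    set.seed(42)                                  # reproducible example
    n <- 100                                      # hypothetical number of participants
    group <- sample(c("control", "treatment"), n, replace = TRUE)
    table(group)                                  # sizes will be roughly, not exactly, equal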

Field experiments, or randomized studies conducted in real-world settings, can take many forms. While experiments on college campuses are often considered lab studies, certain experiments on campus – such as those examining club participation – may be regarded as field experiments, depending on the experimental design. Ultimately, whether a study is considered a field experiment hinges on the definition of "the field."

Researchers may employ two main scenarios for randomization. The first involves gathering study participants and randomizing them at the time of the experiment. The second capitalizes on naturally occurring randomizations, such as the Vietnam draft lottery. 

Intervention, Excludability, and Non-Interference

Three essential features of any experiment are intervention, excludability, and non-interference. In a general sense, the intervention refers to the treatment or action being tested in an experiment. The excludability principle is satisfied when the only difference between the experimental and control groups is the presence or absence of the intervention. The non-interference principle holds when the outcome of one participant in the study does not influence the outcomes of other participants. Together, these principles ensure that the experiment is designed to provide unbiased and reliable results, isolating the causal effect of the intervention under study.

Omitted Variables and Non-Compliance

To ensure unbiased results, researchers must randomize as much as possible to minimize omitted variable bias. Omitted variables are factors that influence the outcome but are not measured or are difficult to measure. These unmeasured attributes, sometimes called confounding variables or unobserved heterogeneity, must be accounted for to guarantee accurate findings.

Non-compliance can also complicate experiments. One-sided non-compliance occurs when individuals assigned to a treatment group don't receive the treatment (failure to treat), while two-sided non-compliance occurs when some subjects assigned to the treatment group go untreated or individuals assigned to the control group receive the treatment. Addressing these issues at the design level by implementing a blind or double-blind study can help mitigate potential biases.

Achieving Precision through Covariate Balance

To ensure the control and treatment groups are comparatively similar in all relevant aspects, particularly when the sample size (n) is small, it is essential to achieve covariate balance. Covariance measures the association between two variables, while a covariate is a factor that influences the outcome variable. By balancing covariates, we can more accurately isolate the effects of the treatment, leading to improved precision in our findings.
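A hedged sketch of what a covariate balance check might look like, using simulated data and hypothetical covariates (age and gender): if randomization worked, baseline covariates should not systematically predict treatment status.

    # Simulated data purely for illustration
    set.seed(7)
    df <- data.frame(treat = rbinom(200, 1, 0.5),
                     age   = rnorm(200, mean = 40, sd = 10),
                     male  = rbinom(200, 1, 0.5))

    t.test(age ~ treat, data = df)                 # compare mean age across arms
    summary(lm(treat ~ age + male, data = df))     # covariates should not jointly predict assignment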

Fictional Example of Randomized Controlled Trial of Covi-Mapp for COVID-19 Management

Let's explore a fictional example to better understand experiments: a one-week randomized controlled trial of the experimental drug Covi-Mapp for managing COVID-19. The control group receives the standard care for COVID-19 patients, while the treatment group receives the standard care plus Covi-Mapp. The outcome of interest is whether patients have cough symptoms on day 7, as subsiding cough symptoms are an encouraging sign in COVID-19 recovery. We'll measure the presence of cough on day 0 and day 7, as well as temperature on day 0 and day 7. Gender is also tracked.

In this Covi-Mapp example, the intervention is the Covi-Mapp drug, the excludability principle is satisfied if the only difference in patient care between the groups is the drug administration, and the non-interference principle holds if one patient's outcome doesn't affect another's.

First, let's assume we have a dataset containing the relevant information for each patient, including cough status on day 0 and day 7, temperature on day 0 and day 7, treatment assignment, and gender. We'll read the data and explore the dataset:
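The original post's code is not reproduced here; below is a minimal sketch, assuming a CSV file and column names that match the variables described above (the file name and most column names are hypothetical, apart from treat_covid_mapp and male, which appear in the results discussed later):

    covid <- read.csv("covi_mapp_trial.csv")   # hypothetical file name
    str(covid)      # assumed columns: cough_day0, cough_day7, temp_day0, temp_day7,
                    #                  treat_covid_mapp (0/1), male (0/1)
    summary(covid)
    table(covid$treat_covid_mapp)              # patients per arm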

Simple treatment effect of the experimental drug

Without any covariates, let's first look at the estimated effect of the treatment on the presence of cough on day 7. The estimated proportion of patients with a cough on day 7 in the control group (not receiving the experimental drug) is 0.847458. In other words, about 84.7% of patients in the control group are expected to have a cough on day 7, all else being equal. The estimated effect of the experimental drug on the presence of cough on day 7 is -0.238. This means that, on average, receiving the experimental drug reduces the proportion of patients with a cough on day 7 by 23.8 percentage points compared to the control group.
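A difference in means like this can be estimated with a linear probability model; the sketch below assumes the hypothetical column names introduced above:

    mod_simple <- lm(cough_day7 ~ treat_covid_mapp, data = covid)
    summary(mod_simple)
    # (Intercept)      ~ day-7 cough rate in the control group (reported above as about 0.847)
    # treat_covid_mapp ~ change in that rate attributable to the drug (about -0.238 above)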

We know that a patient's initial condition would affect the final outcome. If the patient has a cough and a fever on day 0, they might not fare well with the treatment. To better understand the treatment's effect, let's add these covariates:
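A sketch of the covariate-adjusted model, again with the hypothetical column names assumed earlier (day-7 temperature is deliberately left out, for the reason discussed below):

    mod_adj <- lm(cough_day7 ~ treat_covid_mapp + cough_day0 + temp_day0, data = covid)
    summary(mod_adj)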

The output shows the results of a linear regression model estimating the effect of the experimental drug (treat_covid_mapp) on the presence of cough on day 7, adjusting for cough on day 0 and temperature on day 0. The experimental drug significantly reduces the probability of cough on day 7, by approximately 16.6 percentage points compared to the control group (p-value = 0.046242). The presence of cough on day 0 does not significantly predict the presence of cough on day 7 (p-value = 0.717689). A one-unit increase in temperature on day 0 is associated with a 20.6 percentage point increase in the probability of cough on day 7, and this effect is statistically significant (p-value = 0.009859).

Should we add day-7 temperature as a covariate? If we did, we might find that the treatment is no longer statistically significant, because the temperature on day 7 could itself be affected by the treatment. It is a post-treatment variable, and conditioning on it undermines the experiment, since we would be adjusting for something the intervention may have caused.

However, we'd like to investigate whether the treatment affects men and women differently. Since we collected gender as part of the study, we can check for a heterogeneous treatment effect (HTE) for male vs. female participants. The experimental drug has a marginally significant effect on the outcome variable for females, reducing it by approximately 23.1 percentage points (p-value = 0.05391).
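One way to estimate such a heterogeneous effect is with an interaction term between treatment and gender; a sketch under the same hypothetical column names:

    mod_hte <- lm(cough_day7 ~ treat_covid_mapp * male, data = covid)
    summary(mod_hte)
    # treat_covid_mapp       ~ effect of the drug for patients with male == 0
    # treat_covid_mapp:male  ~ how much that effect differs for patients with male == 1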

Which group, those coded as male == 0 or male == 1, has better health outcomes (cough) in control? What about in treatment? How does this help to contextualize any heterogeneous treatment effect that might have been estimated?

Stargazer is a popular R package that enables users to create well-formatted tables and reports for statistical analysis results.
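For example, the models sketched above could be reported side by side (assuming stargazer has been installed from CRAN):

    library(stargazer)
    stargazer(mod_simple, mod_adj, mod_hte, type = "text")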

Looking at this regression report, we see that males in control have a temperature of 102, while females in control have a temperature of 98.6 (very nearly a normal temperature). So, in control, males are worse off. In treatment, males have a temperature of 102 - 2.59 = 99.41; while this is closer to a normal temperature, it is still elevated. Females in treatment have a temperature of 98.5 - .32 = 98.18, which is slightly below a normal temperature and better than an elevated one. It appears that the treatment has a stronger effect among male participants than among female participants because males are more sick at baseline.

In conclusion, experimentation offers a fascinating and valuable avenue for primary research, allowing us to address causal questions and enhance our understanding of the world around us. Covariate control helps to isolate the causal effect of the treatment on the outcome variable, ensuring that the observed effect is not driven by confounding factors. Proper control of covariates enhances the internal validity of the study and ensures that the estimated treatment effect is an accurate representation of the true causal relationship. By exploring and accounting for subgroups in the data, researchers can identify whether the treatment has different effects on different groups, such as men and women or younger and older individuals. This information can be critical for making informed policy decisions and developing targeted interventions that maximize the benefits for specific groups. The ongoing investigation of experimental methodologies and their potential applications represents a compelling and significant area of inquiry.

Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation . W. W. Norton.


“Data Science 241. Experiments and Causal Inference.” UC Berkeley School of Information , https://www.ischool.berkeley.edu/courses/datasci/241

IDR Explains | Randomised Controlled Trials (RCTs)

In just seven minutes and six questions, get a quick introduction to RCTs: how they work, why they're used, and what some of the criticisms and challenges around them are.

With Esther Duflo, Abhijit Banerjee, and Michael Kremer winning the 2019 Nobel Memorial Prize in Economic Sciences, there is renewed interest and discourse around randomised controlled trials (RCTs).

But RCTs are complex, and there seem to be lingering questions around the subject. From conceptual queries to the ethics of RCTs, here are some of those questions, answered.

What is an RCT and how does it work?  

[Textbox: the Seva Mandir randomised controlled trial example]

Why are RCTs used?

There are often various factors at play within development programmes, and narrowing down which variables most significantly affect outcomes can be challenging. RCTs are therefore used to zero in on which aspects of the programme are effecting change and creating impact.

Measuring impact often involves comparisons: to what extent has the programme affected a group or community compared to if it had never been implemented in the first place? Because this is difficult to measure directly, the control group serves as an indicator of what the absence of the programme would look like (referred to as the 'counterfactual').

While there are other methods of examining this, RCTs are generally considered to be more rigorous and unbiased, and are less dependent on the assumptions that other evaluation techniques sometimes need to make. What also distinguishes RCTs from other impact evaluations is that participation is determined randomly, before programme implementation begins.

Who are the stakeholders in an RCT?


What are some of the ethical considerations with RCTs?

RCTs have been criticised on the grounds that 'randomistas' (as RCT researchers are often called) are willing to sacrifice the well-being of participants in order to 'learn'. Who participates in an RCT is also an ethical question that researchers must consider: it is often pointed out that, due to randomisation, people who need a certain treatment do not receive it, while others receive a treatment they do not need.

Randomisation could also lead to potential conflict. If households within a particular village, for example, are randomly selected to receive a particular intervention while others remain in the control group, it could lead to disruption within the community. The lack of attention to the question of human agency is another limitation of RCTs.

Having Institutional Review Boards (IRBs) in place has become the norm for these kinds of studies, in order to protect the rights and welfare of participants. But these bodies are largely self-regulating, and beyond anecdotal evidence, it isn't clear how well they have worked for development RCTs.

What are some of the challenges with RCTs?

The debate around ethics aside, there are also certain design-based challenges that RCTs face. Here are some considerations to keep in mind:

a. What level to randomise at

The nature of the programme or intervention usually guides the researcher in deciding what level to randomise at. For example, if chlorine pills to treat contaminated water are being distributed, random assignment to control and treatment groups at a household level might not be viable. Apart from ethics (giving one family pills for their water source but denying their neighbour), feasibility also needs to be considered. If the community drinks from a common tank of water, treating this tank would automatically make randomisation at a household or individual level unfeasible. Even if households had individual sources of water, logistically, screening out control group households while distributing pills could be inconvenient, and ensuring that treatment group participants don't share their pills with control group neighbours is difficult.

It may also not be politically feasible to randomise at a household level. Political leaders may demand that all members of their community receive assistance, and this demand could come from the community itself as well. Proponents of RCTs like Duflo and Kremer say that "all too often development policy is based on fads, and randomised evaluations could allow it to be based on evidence". But not everyone is convinced that randomisation is infallible. According to economist Pranab Bardhan, "it is very hard to ensure true randomness in setting up treatment and control groups. So even within the domain of an RCT, impurities emanate from design, participation, and implementation problems."

b. Threats to data collection

Statistically, the larger the sample size, the more representative it is of the population. Even when a sample size is large enough, if respondents drop out during the data collection phase the results are susceptible to attrition bias. Attrition, and failure by evaluators to collect data, diminishes the size of the sample, reducing the 'generalisability' of the study. And if attrition is skewed towards either the treatment or the control group, rather than occurring at a roughly equal pace in both, the validity of the findings will be compromised.

Spillovers and crossovers also affect data collection. Spillovers occur when individuals in the control group are indirectly affected by the treatment. For example, if the intervention involves vaccinations, once a significant share of the population is vaccinated and becomes immune to a disease, 'herd immunity' could end up protecting individuals who did not receive vaccinations as well. Individuals who cross over, on the other hand, find themselves directly affected by the treatment; for example, a parent might transfer their child from a control group school to a treatment group school. Non-compliance refers to instances where individuals within the treatment group choose not to participate. Statistical interventions can be used to produce valid results despite these problems, but they come with certain assumptions, many of which randomisation aims to avoid in the first place.

c. Uncertain internal and external validity

Internal validity refers to the extent to which a study establishes a relationship between a treatment and an outcome. Randomisation can help establish a cause-effect relationship, but internal validity depends largely on the procedures of a study and how meticulously it is performed.

External validity, or generalisability, is more difficult to obtain. This refers to whether the same programme would have the same impact if replicated with a different target population, or if scaled up. While the internal validity of RCTs is well recognised as being rigorous, Nancy Cartwright and Angus Deaton question the external validity of RCTs. They call this the 'transportation' problem, where "demonstrating that a treatment works in one situation is exceedingly weak evidence that it will work in the same way elsewhere." Cartwright also points out that the rigour demanded to achieve internal validity is hardly ever found in establishing external validity.

What does an RCT cost?

RCTs are known to be prohibitively expensive, but since they have come to be synonymous with 'hard evidence', numerous governments and nonprofits have invested in them, and donors have been willing to fund them. While it is difficult to estimate exact figures, there are certain 'line item' categories that contribute to an RCT's cost structure:

  • Staff costs: Principal investigators, professors, field research associates, and a chain of other people all participate in pre-study evaluations, implementation, and post-study evaluation. Their fees, salaries, travel costs, and accommodation costs must be taken into consideration.
  • Data collection: There are multiple rounds of data collection, typically baseline, midline, and endline studies. Depending on the kind of programme being evaluated, there may be multiple midlines and endlines over a particular period of time. These studies also have several sub-costs: staffing, training, technology, and incentives for survey participants, amongst others.
  • Intervention costs: Depending on the programme being evaluated, intervention costs will differ. For example, medical treatments, or anything that requires input from the implementing organisation, would increase costs, as opposed to an RCT that studies state-run programmes such as subsidies or direct benefit transfers (DBTs).
  • Overheads and utilities: Office space, utilities, laptops, survey printouts, and other miscellaneous costs also contribute significantly to an RCT's cost structure.

The complexity and scale of the RCT, along with factors such as sample size and the design and duration of the study, determine how large each of these buckets is in and of itself.


Insights in this explainer have been sourced from J-PAL’s Introduction to Evaluations , in consultation with other sources.  

Ayesha Marfatia contributed to this article.  



India Development Review (IDR) is India’s first independent online media platform for leaders in the development community. Our mission is to advance knowledge on social impact in India. We publish ideas, opinion, analysis, and lessons from real-world practice.


Designing a research project: randomised controlled trials and their principles

J M Kendall. Correspondence to: Dr J M Kendall, North Bristol NHS Trust, Frenchay Hospital, Frenchay Park Road, Bristol BS16 1LE, UK; frenchayed{at}cableinet.co.uk

The sixth paper in this series discusses the design and principles of randomised controlled trials.


https://doi.org/10.1136/emj.20.2.164


The randomised control trial (RCT) is a trial in which subjects are randomly assigned to one of two groups: one (the experimental group) receiving the intervention that is being tested, and the other (the comparison group or control) receiving an alternative (conventional) treatment (fig 1). The two groups are then followed up to see if there are any differences between them in outcome. The results and subsequent analysis of the trial are used to assess the effectiveness of the intervention, which is the extent to which a treatment, procedure, or service does patients more good than harm. RCTs are the most stringent way of determining whether a cause-effect relation exists between the intervention and the outcome. 1

Figure 1: The randomised control trial.

This paper discusses various key features of RCT design, with particular emphasis on the validity of findings. There are many potential errors associated with health services research, but the main ones to be considered are bias, confounding, and chance. 2

Bias is the deviation of results from the truth, due to systematic error in the research methodology. Bias occurs in two main forms: (a) selection bias , which occurs when the two groups being studied differ systematically in some way, and (b) observer/information bias , which occurs when there are systematic differences in the way information is being collected for the groups being studied.

A confounding factor is some aspect of a subject that is associated both with the outcome of interest and with the intervention of interest. For example, if older people are less likely to receive a new treatment, and are also more likely for unrelated reasons to experience the outcome of interest (for example, admission to hospital), then any observed relation between the intervention and the likelihood of experiencing the outcome would be confounded by age.

Chance is a random error appearing to cause an association between an intervention and an outcome. The most important design strategy to minimise random error is to have a large sample size.

These errors have an important impact on the interpretation and generalisability of the results of a research project. The beauty of a well planned RCT is that these errors can all be effectively reduced or designed out (see box 1). The appropriate design strategies will be discussed below.

Box 1 Features of a well designed RCT

The sample to be studied will be appropriate to the hypothesis being tested, so that any results are appropriately generalisable. The study will recruit sufficient patients to allow it to have a high probability of detecting a clinically important difference between treatments if a difference truly exists.

There will be effective (concealed) randomisation of the subjects to the intervention/control groups (to eliminate selection bias and minimise confounding variables).

Both groups will be treated identically in all respects except for the intervention being tested and to this end patients and investigators will ideally be blinded to which group an individual is assigned.

The investigator assessing outcome will be blinded to treatment allocation.

Patients are analysed within the group to which they were allocated, irrespective of whether they experienced the intended intervention (intention to treat analysis).

Analysis focuses on testing the research question that initially led to the trial (that is, according to the a priori hypothesis being tested), rather than "trawling" to find a significant difference.

GETTING STARTED: DEVELOPING A PROTOCOL FROM THE INITIAL HYPOTHESIS

Analytical studies need a hypothesis that specifies an anticipated association between predictor and outcome variables (or no association, as in a null hypothesis ), so that statistical tests of significance can be performed. 3 Good hypotheses are specific and formulated in advance of commencement (a priori) of the study. Having chosen a subject to research and specifically a hypothesis to be tested, preparation should be thorough and is best documented in the form of a protocol that will outline the proposed methodology. This will start with a statement of the hypothesis to be tested, for example: “...that drug A is more efficacious in reducing the diastolic blood pressure than drug B in patients with moderate essential hypertension.” An appropriate rationale for the study will follow with a relevant literature review, which is focused on any existing evidence relating to the condition or interventions to be studied.

The subject to be addressed should be of clinical, social, or economic significance to afford relevance to the study, and the hypothesis to be evaluated must contain outcomes that can be accurately measured. The subsequent study design (population sampling, randomisation, applying the intervention, outcome measures, analysis, etc) will need to be defined to permit a true evaluation of the hypothesis being tested. In practice, this will be the best compromise between what is ideal and what is practical.

Writing a thorough and comprehensive protocol in the planning stage of the research project is essential. Peer review of a written protocol allows others to criticise the methodology constructively at a stage when appropriate modification is possible. Seeking advice from experienced researchers, particularly involving a local research and development support unit, or some other similar advisory centre, can be very beneficial. It is far better to identify and correct errors in the protocol at the design phase than to try to adjust for them in the analysis phase. Manuscripts rarely get rejected for publication because of inappropriate analysis, which is remediable, but rather because of design flaws.

There are several steps in performing an RCT, all of which need to be considered while developing a protocol. The first is to choose an appropriate (representative) sample of the population from which to recruit. Having measured relevant baseline variables, the next task is to randomise subjects into one of two (or more) groups, and subsequently to perform the intervention as appropriate to the assignment of the subject. The pre-defined outcome measures will then be recorded and the findings compared between the two groups, with appropriate quality control measures in place to assure quality data collection. Each of these steps, which can be tested in a pilot study, has implications for the design of the trial if the findings are to be valid. They will now be considered in turn.

CHOOSING THE RIGHT POPULATION

This part of the design is crucial because poor sampling will undermine the generalisability of the study or, even worse, reduce the validity if sampling bias is introduced. 4 The task begins with deciding what kind of subjects to study and how to go about recruiting them. The target population is that population to which it is intended to apply the results. It is important to set inclusion and exclusion criteria defining target populations that are appropriate to the research hypothesis. These criteria are also typically set to make the researchers’ task realistic, for within the target population there must be an accessible/appropriate sample to recruit.

The sampling strategy used will determine whether the sample actually studied is representative of the target population. For the findings of the study to be generalisable to the population as a whole, the sample must be representative of the population from which it is drawn. The best design is consecutive sampling from the accessible population (taking every patient who meets the selection criteria over the specified time period). This may produce an excessively large sample from which, if necessary, a subsample can be randomly drawn. If the inclusion criteria are broad, it will be easy to recruit study subjects and the findings will be generalisable to a comparatively large population. Exclusion criteria need to be defined and will include such subjects who have conditions which may contraindicate the intervention to be tested, subjects who will have difficulty complying with the required regimens, those who cannot provide informed consent, etc.

Summary: population sampling

The study sample must be representative of the target population for the findings of the study to be generalisable.

Inclusion and exclusion criteria will determine who will be studied from within the accessible population.

The most appropriate sampling strategy is normally consecutive sampling, although stratified sampling may legitimately be required.

A sample size calculation and pilot study will permit appropriate planning in terms of time and money for the recruitment phase of the main study.

Follow CONSORT guidelines on population sampling. 6

In designing the inclusion criteria, the investigator should consider the outcome to be measured; if this is comparatively rare in the population as a whole, then it would be appropriate to recruit at random or consecutively from populations at high risk of the condition in question ( stratified sampling). The subsamples in a stratified sample will draw disproportionately from groups that are less common in the population as a whole, but of particular relevance to the investigator.

Other forms of sampling, where subjects are recruited because they are easily accessible or appropriate (convenience or judgmental sampling), will have advantages in terms of cost, time, and logistics, but may produce a sample that is not representative of the target population, and it is likely to be difficult to define exactly who has and has not been included.

Having determined an appropriate sample to recruit, it is necessary to estimate the size of the sample required to allow the study to detect a clinically important difference between the groups being compared. This is performed by means of a sample size calculation . 5 As clinicians, we must be able to specify what we would consider to be a clinically significant difference in outcome. Given this information, or an estimate of the effect size based on previous experience (from the literature or from a pilot study), and the design of the study, a statistical adviser will be able to perform an appropriate sample size calculation. This will determine the required sample size to detect the pre-determined clinically significant difference to a certain degree of power. As previously mentioned, early involvement of an experienced researcher or research support unit in the design stage is essential in any RCT.
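As a hedged illustration (the effect sizes and standard deviation below are made-up numbers, not taken from any real trial), base R's power functions show the flavour of such a calculation:

    # Patients per group needed to detect a fall in event rate from 40% to 30%,
    # with 80% power at a two-sided 5% significance level
    power.prop.test(p1 = 0.40, p2 = 0.30, sig.level = 0.05, power = 0.80)

    # For a continuous outcome, e.g. a 5 mm Hg difference in diastolic blood
    # pressure with a standard deviation of 10 mm Hg
    power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)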

After deciding on the population to be studied and the sample size required, it will now be possible to plan the appropriate amount of time (and money) required to collect the data necessary. A limited pilot of the methods is essential to gauge recruitment rate and address in advance any practical issues that may arise once data collection in the definitive study is underway. Pilot studies will guide decisions about designing approaches to recruitment and outcome measurement. A limited pilot study will give the investigator an idea of what the true recruitment rate will be (not just the number of subjects available, but also their willingness to participate). It may be even more helpful in identifying any methodological issues related to applying the intervention or measuring outcome variables (see below), which can be appropriately addressed.

RANDOMISATION: THE CORNERSTONE OF THE RCT

Various baseline characteristics of the subjects recruited should be measured at the stage of initial recruitment into the trial. These will include basic demographic observations, such as name, age, sex, hospital identification, etc, but more importantly should include any important prognostic factors. It will be important at the analysis stage to show that these potential confounding variables are equally distributed between the two groups; indeed, it is usual practice when reporting an RCT to demonstrate the integrity of the randomisation process by showing that there is no significant difference between baseline variables (following CONSORT guidelines). 6

The random assignment of subjects to one or another of two groups (differing only by the intervention to be studied) is the basis for measuring the marginal difference between these groups in the relevant outcome. Randomisation should equally distribute any confounding variables between the two groups, although it is important to be aware that differences in confounding variables may arise through chance.

Randomisation is one of the cornerstones of the RCT 7 and a true random allocation procedure should be used. It is also essential that treatment allocations are concealed from the investigator until recruitment is irrevocable, so that bias (intentional or otherwise) cannot be introduced at the stage of assigning subjects to their groups. 8 The production of computer generated sets of random allocations, by a research support unit (who will not be performing data collection) in advance of the start of the study, which are then sealed in consecutively numbered opaque envelopes, is an appropriate method of randomisation. Once the patient has given consent to be included in the trial, he/she is then irreversibly randomised by opening the next sealed envelope containing his/her assignment.
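A minimal sketch of how such a computer-generated allocation list might be produced (the arm labels and sample size are hypothetical):

    # 80 subjects, exactly 40 per arm, in a random order; each row corresponds
    # to one consecutively numbered sealed envelope
    set.seed(2025)
    allocation <- sample(rep(c("intervention", "control"), each = 40))
    head(data.frame(envelope = 1:80, arm = allocation), 10)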

An alternative method, particularly for larger, multicentre trials is to have a remote randomisation facility. The clinician contacts this facility by telephone when he is ready to randomise the next patient; the initials and study number of the patient are read to the person performing the randomisation, who records it and then reads back the randomisation for that subject.

Studies that involve small to moderate sample sizes (for example, less than 50 per group) may benefit from “blocked” and/or “stratified” randomisation techniques. These methods will balance (where chance alone might not) the groups in terms of the number of subjects they contain, and in the distribution of potential confounding variables (assuming, of course, that these variables are known before the onset of the trial). They are the design phase alternative to statistically adjusting for confounding variables in the analysis phase, and are preferred if the investigator intends to carry out subgroup analysis (on the basis of the stratification variable).

Blocked randomisation is a technique used to ensure that the number of subjects assigned to each group is equally distributed. Randomisation is set up in blocks of a pre-determined set size (for example 6, 8, 10, etc). Randomisation for a block size of 10 would proceed normally until five assignments had been made to one group, and then the remaining assignments would be to the other group until the block of 10 was complete. This means that for a sample size of 80 subjects, exactly 40 would be assigned to each group. Block size must be blinded from the investigator performing the study and, if the study is non-blinded, the block sizes should vary randomly (otherwise the last allocation(s) in a block would, in effect, be unconcealed).
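A sketch of blocked randomisation in base R (the block size and sample size are illustrative only):

    # Within every block of 10, exactly 5 subjects go to each arm
    set.seed(3)
    block_of_10 <- function() sample(rep(c("A", "B"), each = 5))
    allocation  <- unlist(replicate(8, block_of_10(), simplify = FALSE))  # 80 subjects
    table(allocation)              # exactly 40 A and 40 B overall
    table(head(allocation, 10))    # and 5 of each within the first block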

Stratified randomisation is a technique for ensuring that an important baseline variable (potential confounding factor) is more evenly distributed between the two groups than chance alone might otherwise assure. In examining the effect of a treatment for cardiac failure, for example, the degree of existing cardiac failure will be a baseline variable predicting outcome, and so it is important that this is the same in the two groups. To achieve this, the sample can be stratified at baseline into patients with mild, moderate, or severe cardiac failure, and then randomisation occurs within each of these “strata”. There is a limited number of baseline variables that can be balanced by stratification because the numbers of patients within a stratum are reduced. In the above example, to stratify also for age, previous infarction, and the co-existence of diabetes would be impractical.
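A sketch of stratified randomisation for the cardiac failure example above (the patient numbers per stratum are made up):

    set.seed(4)
    patients <- data.frame(id       = 1:90,
                           severity = rep(c("mild", "moderate", "severe"), each = 30))

    # Randomise separately within each severity stratum
    patients$arm <- NA_character_
    for (s in unique(patients$severity)) {
      idx <- which(patients$severity == s)
      patients$arm[idx] <- sample(rep(c("A", "B"), length.out = length(idx)))
    }

    table(patients$severity, patients$arm)   # 15 A and 15 B within each stratum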

Summary: randomisation

The random assignment of subjects into one of two groups is the basis for establishing a causal interpretation for an intervention.

Effective randomisation will minimise confounding variables that exist at the time of randomisation.

Randomisation must be concealed from the investigator.

Blocked randomisation may be appropriate for smaller trials to ensure equal numbers in each group.

Stratified randomisation will ensure that a potential baseline confounding variable is equally distributed between the two groups.

Analysis of results should occur based on the initial randomisation, irrespective of what may subsequently actually have happened to the subject (that is, “intention to treat analysis”).

Sample attrition ("drop outs"), once subjects have consented and been randomised, may be an important factor. Patients may refuse to continue with the trial, they may be lost to analysis for whatever reason, and there may be changes in the protocol (or mistakes) subsequent to randomisation, even resulting in the patient receiving the wrong treatment. This is, in fact, not that uncommon: a patient randomised to have a minimally invasive procedure may need to progress to an open operation, for example, or a patient assigned to medical treatment may require surgery at a later stage. In the RCT, the analysis must include an unbiased comparison of the groups produced by the process of randomisation, based on all the people who were randomised; this is known as analysis by intention to treat. Intention to treat analysis depends on having outcomes for all subjects, so even if patients "drop out", it is important to try to keep them in the trial if only for outcome measurement. This avoids the introduction of bias as a consequence of potentially selectively dropping patients from previously randomised/balanced groups.
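A toy simulation (entirely made-up numbers) illustrates the idea that analysis follows the randomised assignment rather than the treatment actually received:

    set.seed(5)
    n        <- 200
    assigned <- rbinom(n, 1, 0.5)                             # arm allocated at randomisation
    received <- ifelse(assigned == 1, rbinom(n, 1, 0.9), 0)   # 10% of the intervention arm goes untreated
    outcome  <- rnorm(n, mean = 1 + 0.5 * received)           # benefit only if treatment is received

    summary(lm(outcome ~ assigned))   # intention-to-treat: compares groups as randomised
    summary(lm(outcome ~ received))   # "as-treated" comparison; in real trials this is vulnerable to selection bias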

APPLYING THE INTERVENTION AND MEASURING OUTCOME: THE IMPORTANCE OF BLINDING

After randomisation there will be two (or more) groups, one of which will receive the test intervention and another (or more) which receives a standard intervention or placebo. Ideally, neither the study subjects, nor anybody performing subsequent measurements and data collection, should be aware of the study group assignment. Effective randomisation will eliminate confounding by variables that exist at the time of randomisation. Without effective blinding, if subject assignment is known by the investigator, bias can be introduced because extra attention may be given to the intervention group (intended or otherwise). 8 This would introduce variables into one group not present in the other, which may ultimately be responsible for any differences in outcome observed. Confounding can therefore also occur after randomisation. Double blinding of the investigator and patient (for example, by making the test treatment and standard/placebo treatments appear the same) will eliminate this kind of confounding, as any extra attentions should be equally spread between the two groups (with the exception, as for randomisation, of chance maldistributions).

While the ideal study design will be double blind, this is often difficult to achieve effectively, and is sometimes not possible (for example, surgical interventions). Where blinding is possible, complex (and costly) arrangements need to be made to manufacture placebo that appears similar to the test drug, to design appropriate and foolproof systems for packaging and labelling, and to have a system to permit rapid unblinding in the event of any untoward event causing the patient to become unwell. The hospital pharmacy can be invaluable in organising these issues. Blinding may break down subsequently if the intervention has recognisable side effects. The effectiveness of the blinding can be systematically tested after the study is completed by asking investigators to guess treatment assignments; if a significant proportion are able to correctly guess the assignment, then the potential for this as a source of bias should be considered.

Summary: intervention and outcome

Blinding at the stage of applying the intervention and measuring the outcome is essential if bias (intentional or otherwise) is to be avoided.

The subject and the investigator should ideally be blinded to the assignment (double blind), but even where this is not possible, a blinded third party can measure outcome.

Blinding is achieved by making the intervention and the control appear similar in every respect.

Blinding can break down for various reasons, but this can be systematically assessed.

Continuous outcome variables have the advantage over dichotomous outcome variables of increasing the power of a study, permitting a smaller sample size.

Once the intervention has been applied, the groups will need to be followed up and various outcome measures will be performed to evaluate the effect or otherwise of that intervention. The outcome measures to be assessed should be appropriate to the research question, and must be ones that can be measured accurately and precisely. Continuous outcome variables (quantified on an infinite arithmetic scale, for example, time) have the advantage over dichotomous outcome variables (only two categories, for example, dead or alive) of increasing the power of a study, permitting a smaller sample size. It may be desirable to have several outcome measures evaluating different aspects of the results of the intervention. It is also necessary to design outcome measures that will detect the occurrence of specified adverse effects of the intervention.

It is important to emphasise, as previously mentioned, that the person measuring the outcome variables (as well as the person applying the intervention) should be blinded to the treatment group of the subject to prevent the introduction of bias at this stage, particularly when the outcome variable requires any judgement on the part of the observer. Even if it has not been possible to blind the administration of the intervention, it should be possible to design the study so that outcome measurement is performed by someone who is blinded to the original treatment assignment.

QUALITY CONTROL

A critical aspect of clinical research is quality control. Quality control is often overlooked during data collection, a potentially tedious and repetitive phase of the study, which may lead subsequently to errors because of missing or inaccurate measurements. Essentially, quality control issues occur in clinical procedures, measuring outcomes, and handling data. Quality control begins in the design phase of the study when the protocol is being written and is first evaluated in the pilot study, which will be invaluable in testing the proposed sampling strategy, methods for data collection and subsequent data handling.

Once the methods part of the protocol is finalised, an operations manual can be written that specifically defines how to recruit subjects, perform measurements, etc. This is essential when there is more than one investigator, as it will standardise the actions of all involved. After allowing all those involved to study the operations manual, there will be the opportunity to train (and subsequently certify) investigators to perform various tasks uniformly.

Ideally, any outcome measurement taken on a patient should be precise and reproducible; it should not depend on the observer who took the measurement. 4 It is well known, for example, that some clinicians in their routine medical practice record consistently higher blood pressure values than others. Such interobserver variation in the setting of a clinical trial is clearly unacceptable and steps must be taken to avoid it. It may be possible, if the trial is not too large, for all measurements to be performed by the same observer, in which case the problem is avoided. However, it is often necessary to use multiple observers, especially in multicentre trials. Training sessions should be arranged to ensure that observers (and their equipment) can produce the same measurements in any given subject. Repeat sessions may be necessary if the trial is of long duration. You should try to use as few observers as possible without exhausting the available staff. The trial should be designed so that any interobserver variability cannot bias the results by having each observer evaluate patients in all treatment groups.

Inevitably, there will be a principal investigator; this person will be responsible for assuring the quality of data measurement through motivation, appropriate delegation of responsibility, and supervision. An investigators’ meeting before the study starts and regular visits to the team members or centres by the principal investigator during data collection, permit communication, supervision, early detection of problems, feedback and are good for motivation.

Quality control of data management begins before the start of the study and continues during the study. Forms to be used for data collection should be appropriately designed to encourage the collection of good quality data. They should be user friendly, self explanatory, clearly formatted, and collect only data that is needed. They can be tested in the pilot. Data will subsequently need to be transcribed onto a computer database from these forms. The database should also be set up so that it is similar in format to the forms, allowing for easy transcription of information. The database can be pre-prepared to accept only variables within given permissible ranges and that are consistent with previous entries and to alert the user to missing values. Ideally, data should be entered in duplicate, with the database only accepting data that are concordant with the first entry; this, however, is time consuming, and it may be adequate to check randomly selected forms with a printout of the corresponding datasheet to ensure transcription error is minimal, acting appropriately if an unacceptably high number of mistakes are discovered.

Once the main phase of data collection has begun, you should try to make as few changes to the protocol as possible. In an ideal world, the pilot study will have identified any issues that will require a modification of the protocol, but inevitably some problem, minor or major, will arise once the study has begun. It is better to leave any minor alterations that are considered “desirable” but not necessary and resist the inclination to make changes. Sometimes, more substantive issues are highlighted and protocol modification is necessary to strengthen the study. These changes should be documented and disseminated to all the investigators (with appropriate changes made to the operations manual and any re-training performed as necessary). The precise date that the revision is implemented is noted, with a view to separate analysis of data collected before and after the revision, if this is considered necessary by the statistical advisor. Such revisions to the protocol should only be undertaken if, after careful consideration, it is felt that making the alteration will significantly improve the findings, or not changing the protocol will seriously jeopardise the project. These considerations have to be balanced against the statistical difficulty in analysis after protocol revision.

...SOME FINAL THOUGHTS

A well designed, methodologically sound RCT evaluating an intervention provides strong evidence of a cause-effect relation if one exists; it is therefore powerful in changing practice to improve patient outcome, this being the ultimate goal of research on therapeutic effectiveness. Conversely, poorly designed studies are dangerous because of their potential to influence practice based on flawed methodology. As discussed above, the validity and generalisability of the findings are dependent on the study design.

Summary: quality control

An inadequate approach to quality control will lead to potentially significant errors due to missing or inaccurate results.

An operations manual will allow standardisation of all procedures to be performed.

To reduce interobserver variability in outcome measurement, training can be provided to standardise procedures in accordance with the operations manual.

Data collection forms should be user friendly, self explanatory, and clearly formatted, with only truly relevant data being collected.

Subsequent data transfer onto a computerised database can be safeguarded with various measures to reduce transcription errors.

Protocol revisions after study has started should be avoided if at all possible, but, if necessary, should be appropriately documented and dated to permit separate analysis.

Early involvement of the local research support unit is essential in developing a protocol. Subsequent peer review and ethical committee review will ensure that it is well designed, and a successful pilot will ensure that the research goals are practical and achievable.

Delegate tasks to those who have the expertise; for example, allow the research support unit to perform the randomisation, leave the statistical analysis to a statistician, and let a health economist advise on any cost analysis. Networking with the relevant experts is invaluable in the design phase and will contribute considerably to the final credence of the findings.

Finally, dissemination of the findings through publication is the final peer review process and is vital to help others act on the available evidence. Writing up the RCT at completion, like developing the protocol at inception, should be thorough and detailed 9 (following CONSORT guidelines 6 ), with emphasis not just on findings, but also on methodology. Potential limitations or sources of error should be discussed so that the readership can judge for themselves the validity and generalisability of the research. 10

  • 1. Sibbald B, Roland M. Understanding controlled trials: why are randomised controlled trials important? BMJ 1998;316:201.
  • 2. Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley, 1984.
  • 3. Hulley SB, Cummings SR. Designing clinical research: an epidemiological approach. Chicago: Williams and Wilkins, 1988.
  • 4. Bowling A. Research methods in health: investigating health and health services. Buckingham: Open University Press, 1997.
  • 5. Lowe D. Planning for medical research: a practical guide to research methods. Middlesbrough: Astraglobe, 1993.
  • 6. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomised controlled trials: the CONSORT statement. JAMA 1996;276:637–9.
  • 7. Altman DG. Randomisation. BMJ 1991;302:1481–2.
  • 8. Schulz KF, Chalmers I, Hayes RJ, et al. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408–12.
  • 9. The Standards of Reporting Trials Group. A proposal for structured reporting of randomised controlled trials. JAMA 1994;272:1926–31.
  • 10. Altman DG. Better reporting of randomised controlled trials: the CONSORT statement. BMJ 1996;313:570–1.

Further reading

  • Sackett DL, Haynes RB, Guyatt GH, et al . Clinical epidemiology: a basic science for clinical medicine. 2nd edn. Toronto: Little, Brown, 1991.
  • Sackett DL, Richardson WS, Rosenberg W, et al . Evidence-based medicine: how to practice and teach EBM. Edinburgh: Churchill Livingstone, 1997.
  • Polgar S. Introduction to research in health sciences. 2nd edn. Edinburgh: Churchill Livingstone, 1991.
  • Bland M. An introduction to medical statistics. Oxford: Oxford Medical Publications, 1987.


7 Randomized Controlled Trials (RCTs)


A randomized controlled trial (RCT) is an experimental form of impact evaluation in which the population receiving the programme or policy intervention is chosen at random from the eligible population, and a control group is also chosen at random from the same eligible population. It tests the extent to which specific, planned impacts are being achieved. The distinguishing feature of an RCT is the random assignment of units (e.g. people, schools, villages, etc.) to the intervention or control groups. One of its strengths is that it provides a very powerful response to questions of causality, helping evaluators and programme implementers to know that what is being achieved is as a result of the intervention and not anything else.


DOWNLOAD REPORT

random assignment rct

TÉLÉCHARGER LE RAPPORT

Descargar informe.


RESEARCH RANDOMIZER

Random sampling and random assignment made easy.

Research Randomizer is a free resource for researchers and students in need of a quick way to generate random numbers or assign participants to experimental conditions. This site can be used for a variety of purposes, including psychology experiments, medical trials, and survey research.

GENERATE NUMBERS

In some cases, you may wish to generate more than one set of numbers at a time (e.g., when randomly assigning people to experimental conditions in a "blocked" research design). If you wish to generate multiple sets of random numbers, simply enter the number of sets you want, and Research Randomizer will display all sets in the results.

Specify how many numbers you want Research Randomizer to generate in each set. For example, a request for 5 numbers might yield the following set of random numbers: 2, 17, 23, 42, 50.

Specify the lowest and highest value of the numbers you want to generate. For example, a range of 1 up to 50 would only generate random numbers between 1 and 50 (e.g., 2, 17, 23, 42, 50). Enter the lowest number you want in the "From" field and the highest number you want in the "To" field.

Selecting "Yes" means that any particular number will appear only once in a given set (e.g., 2, 17, 23, 42, 50). Selecting "No" means that numbers may repeat within a given set (e.g., 2, 17, 17, 42, 50). Please note: Numbers will remain unique only within a single set, not across multiple sets. If you request multiple sets, any particular number in Set 1 may still show up again in Set 2.

Sorting your numbers can be helpful if you are performing random sampling, but it is not desirable if you are performing random assignment. To learn more about the difference between random sampling and random assignment, please see the Research Randomizer Quick Tutorial.

Place Markers let you know where in the sequence a particular random number falls (by marking it with a small number immediately to the left).

  • With Place Markers Off, your results will look something like this: Set #1: 2, 17, 23, 42, 50; Set #2: 5, 3, 42, 18, 20. This is the default layout Research Randomizer uses.
  • With Place Markers Within, your results will look something like this: Set #1: p1=2, p2=17, p3=23, p4=42, p5=50; Set #2: p1=5, p2=3, p3=42, p4=18, p5=20. This layout allows you to know instantly that the number 23 is the third number in Set #1, whereas the number 18 is the fourth number in Set #2. Notice that with this option, the Place Markers begin again at p1 in each set.
  • With Place Markers Across, your results will look something like this: Set #1: p1=2, p2=17, p3=23, p4=42, p5=50; Set #2: p6=5, p7=3, p8=42, p9=18, p10=20. This layout allows you to know that 23 is the third number in the sequence and 18 is the ninth number over both sets. As discussed in the Quick Tutorial, this option is especially helpful for doing random assignment by blocks.
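As a rough illustration of the workflow described above, the following sketch mimics what the form does for one common request: several sets of random numbers drawn from a range, with or without repeats, optionally sorted. It is a hedged reimplementation for illustration only, not Research Randomizer's actual code.

```python
import random

def generate_sets(num_sets, numbers_per_set, low, high, unique=True, sort_each=False):
    """Return `num_sets` lists of random integers between `low` and `high` (inclusive)."""
    sets = []
    for _ in range(num_sets):
        if unique:
            drawn = random.sample(range(low, high + 1), k=numbers_per_set)  # no repeats within a set
        else:
            drawn = [random.randint(low, high) for _ in range(numbers_per_set)]  # repeats allowed
        sets.append(sorted(drawn) if sort_each else drawn)
    return sets

# Example: two sets of 5 unique numbers from 1 to 50, unsorted (suitable for random assignment).
for i, s in enumerate(generate_sets(num_sets=2, numbers_per_set=5, low=1, high=50), start=1):
    print(f"Set #{i}:", s)
```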



Understanding and misunderstanding randomized controlled trials

Angus Deaton

Princeton University, NBER, and University of Southern California

Nancy Cartwright

Durham University and UC San Diego

Abstract

Randomized Controlled Trials (RCTs) are increasingly popular in the social sciences, not only in medicine. We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial. Demanding ‘external validity’ is unhelpful because it expects too much of an RCT while undervaluing its potential contribution. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not ‘what works’, but ‘why things work’.

Introduction

Randomized controlled trials (RCTs) are widely encouraged as the ideal methodology for causal inference. This has long been true in medicine (e.g., for drug trials by the FDA). A notable exception is the recent paper by Frieden (2017), ex-director of the U.S. Centers for Disease Control and Prevention, who lists key limitations of RCTs as well as a range of contexts where RCTs, even when feasible, are dominated by other methods; earlier critiques in medicine include Feinstein and Horwitz (1997), Concato, Shah, and Horwitz (2000), Rawlins (2008), and Concato (2013). It is also increasingly true in other health sciences and across the social sciences, including psychology, economics, education, political science, and sociology. Among both researchers and the general public, RCTs are perceived to yield causal inferences and estimates of average treatment effects (ATEs) that are more reliable and more credible than those from any other empirical method. They are taken to be largely exempt from the myriad problems that characterize observational studies, to require minimal substantive assumptions and little or no prior information, and to be largely independent of 'expert' knowledge that is often regarded as manipulable, politically biased, or otherwise suspect. They are also sometimes felt to be more resistant to researcher and publisher degrees of freedom (for example through p-hacking, selective analyses, or publication bias) than non-randomized studies, given that trial registration and pre-specified analysis plans are mandatory or at least the norm.

We argue that any special status for RCTs is unwarranted. Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what is already known. When little prior knowledge is available, no method is likely to yield well-supported conclusions. This paper is not a criticism of RCTs in and of themselves, nor does it propose any hierarchy of evidence, nor attempt to identify good and bad studies. Instead, we will argue that, depending on what we want to discover, why we want to discover it, and what we already know, there will often be superior routes of investigation and, for a great many questions where RCTs can help, a great deal of other work—empirical, theoretical, and conceptual—needs to be done to make the results of an RCT serviceable.

Our arguments are intended not only for those who are innocent of the technicalities of causal inference; they also aim to offer something to those who are well versed in the field. Most of what is in the paper is known to someone in some subject. But what epidemiology knows is not what is known by economics, or political science, or sociology, or philosophy, and the reverse. The literatures on RCTs in these areas are overlapping but often quite different; each uses its own language, and different understandings and misunderstandings characterize different fields and different kinds of projects. We highlight issues arising across a range of disciplines where we have observed misunderstanding among serious researchers and research users, even if not shared by all experts in those fields. Although we aim for a broad cross-disciplinary perspective, we will, given our own disciplinary backgrounds, be most at home with how these issues arise in economics and how they have been treated by philosophers.

We present two sets of arguments. The first is an enquiry into the idea that ATEs estimated from RCTs are likely to be closer to the truth than those estimated in other ways. The second explores how to use the results of RCTs once we have them.

In the first section, our discussion runs in familiar statistical terms of bias and precision, or efficiency, or expected loss. Unbiasedness means being right on average, where the average is taken over an infinite number of repetitions using the same set of subjects in the trial, but with no limits on how far any one estimate is from the truth. Precision means being close to the truth on average; an estimator that is far from the truth in one direction half of the time and equally far from the truth in the other direction half of the time is unbiased, but it is imprecise. We review the difference between balance of covariates in expectation versus balance in a single run of the experiment (sometimes called 'random confounding' or 'realized confounding' in epidemiology; see for instance Greenland and Mansournia (2015) or VanderWeele (2012)) and the related distinction between precision and unbiasedness. These distinctions should be well known wherever RCTs are conducted or RCT results are used, though much of the discussion is, if not confused, unhelpfully imprecise. Even less recognized are problems with statistical inference, and especially the threat to significance testing posed when there is an asymmetric distribution of individual treatment effects in the study population.

The second section describes several different ways to use the evidence from RCTs. The types of use we identify have analogues, with different labels, across disciplines. This section stresses the importance, for using RCT results, of being clear about the hypothesis at stake and the purpose of the investigation. It argues that in the usual literature, which stresses extrapolation and generalization, RCTs are both undersold and oversold. Oversold because extrapolating or generalizing RCT results requires a great deal of additional information that cannot come from RCTs; undersold, because RCTs can serve many more purposes than predicting that results obtained in a trial population will hold elsewhere.

One might be tempted to label the two sections 'Internal validity' and 'External validity'. We resist this, especially in the way that external validity is often characterized. RCTs are undersold when external validity means that 'the same ATE holds in this new setting', or 'the ATE from the trial holds generally', or even that the ATE in a new setting can be calculated in some reasonable way from that in the study population. RCT results can be useful much more broadly. RCTs are oversold when their non-parametric and theory-free nature, which is arguably an advantage in estimation or internal validity, is used as an argument for their usefulness. The lack of structure is often a disadvantage when we try to use the results outside of the context in which the results were obtained; credibility in estimation can lead to incredibility in use. You cannot know how to use trial results without first understanding how the results from RCTs relate to the knowledge that you already possess about the world, and much of this knowledge is obtained by other methods. Once RCTs are located within this broader structure of knowledge and inference, and when they are designed to enhance it, they can be enormously useful, not just for warranting claims of effectiveness but for scientific progress more generally. Cumulative science is difficult; so too is reliable prediction about what will happen when we act.

Nothing we say in the paper should be taken as a general argument against RCTs; we simply try to challenge unjustifiable claims and expose misunderstandings. We are not against RCTs, only magical thinking about them. The misunderstandings are important because they contribute to the common perception that RCTs always provide the strongest evidence for causality and for effectiveness and because they detract from the usefulness of RCT evidence as part of more general scientific projects. In particular, we do not try to rank RCTs versus other methods. What methods are best to use and in what combinations depends on the exact question at stake, the kind of background assumptions that can be acceptably employed, and what the costs are of different kinds of mistakes. By getting clear in Section 1 just what an RCT, qua RCT, can and cannot deliver, and laying out in Section 2 a variety of ways in which the information secured in an RCT can be used, we hope to expose how unavailing is the ‘head-to-head between methods’ discourse that often surrounds evidence-ranking schemes.

Section 1: Do RCTs give good estimates of average treatment effects?

We start from a trial sample, a collection of subjects that will be allocated randomly to either the treatment or control arm of the trial. This 'sample' might be, but rarely is, a random sample from some population of interest. More frequently, it is selected in some way, for example restricted to those willing to participate, or is simply a convenience sample that is available to those conducting the trial. Given random allocation to treatments and controls, the data from the trial allow the identification of the two (marginal) distributions, F_1(Y_1) and F_0(Y_0), of outcomes Y_1 and Y_0 in the treated and untreated cases within the trial sample. The ATE estimate is the difference in means of the two distributions and is the focus of much of the literature in social science and medicine.

Policy makers and researchers may be interested in features of the two marginal distributions and not simply the ATE, which is our main focus here. For example, if Y is disease burden, measured perhaps in QALYs, public health officials may be interested in whether a treatment reduced inequality in disease burden, or in what it did to the 10th or 90th percentiles of the distribution, even though different people occupy those percentiles in the treatment and control distributions. Economists are routinely concerned with the 90/10 ratio in the income distribution, and in how a policy might affect it (see Bitler et al. (2006) for a related example in US welfare policy). Cancer trials standardly use the median difference in survival, which compares the times until half the patients have died in each arm. More comprehensively, policy makers may wish to compare expected utilities for treated and untreated under the two distributions and consider optimal, expected-utility maximizing treatment rules conditional on the characteristics of subjects (see Manski (2004) and Manski and Tetenov (2016); Bhattacharya and Dupas (2012) give an application). These other kinds of information are important, but we focus on ATEs and do not consider these other uses of RCTs further in this paper.

1.1 Estimating average treatment effects

A useful way to think about the estimation of treatment effects is to use a schematic linear causal model of the form:

Y_i = β_i T_i + Σ_{j=1}^{J} γ_j x_{ij},   (1)

where Y_i is the outcome for unit i (which may be a person, a village, a hospital ward), T_i is a dichotomous (1, 0) treatment dummy indicating whether or not i is treated, and β_i is the individual treatment effect of the treatment on i: it represents (or regulates) how much a value t of T contributes to the outcome Y for individual i. The x's are observed or unobserved other linear causes of the outcome, and we suppose that (1) captures a minimal set of causes of Y_i sufficient to fix its value. J may be (very) large. The unrestricted heterogeneity of the individual treatment effects, β_i, allows the possibility that the treatment interacts with the x's or other variables, so that the effects of T can depend on (be modified by) any other variables. Note that we do not need i subscripts on the γ's that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original x's as new x's. Given that the x's can be unobservable, this is not restrictive. Usage here differs across fields; we shall typically refer to factors other than T represented on the right-hand side of (1) by the term covariates, while noting that these include both what are sometimes labelled the 'independently operating causes' (represented by the x's) as well as 'effect modifiers' when they interact with the β's, a case we shall return to below. They may also capture the possibility that there are different baselines for different observations.

We can connect (1) with the counterfactual approach, often referred to as the Rubin Causal Model, now common in epidemiology and increasingly so in economics (see Rubin (2005), or Hernán (2004) for an exposition for epidemiologists, and Freedman (2006) for the history). To illustrate, suppose that T is dichotomous. For each unit i there will be two possible outcomes, typically labelled Y_{i0} and Y_{i1}, the former occurring if there is no treatment at the time in question, the latter if the unit is treated. By inspection of (1), the differences between the two outcomes, Y_{i1} − Y_{i0}, are the individual treatment effects, β_i, which are typically different for different units. No unit can be both treated and untreated at the same time, so only one or other of the outcomes occurs, but not both; the other is counterfactual, so that individual treatment effects are in principle unobservable.

The basic theorem from this setup is a remarkable one. It states that the average treatment effect is the average outcome in the treatment group minus the average outcome in the control group, so that, while we cannot observe the individual treatment effects, we can observe their mean. The estimate of the average treatment effect is simply the difference between the means in the two groups, and it has a standard error that can be estimated using the statistical theory that applies to the difference of two means, on which more below. The difference in means is an unbiased estimator of the mean treatment effect. The theorem is remarkable because it requires so few assumptions, although it relies on the fact that the mean is a linear operator, so that the difference in means is the mean of differences. No similar fact is true for other statistics, such as medians, percentiles, or variances of treatment effects, none of which can be identified from an RCT without substantive further assumptions; see Deaton (2010, 439) for a simple exposition. Otherwise, no model is required, no assumptions about covariates, confounders, or other causes are needed, the treatment effects can be heterogeneous, and nothing is required about the shapes of statistical distributions other than the existence of the counterfactual outcome values.
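A small simulation can make the theorem concrete. The sketch below uses made-up potential outcomes with heterogeneous individual treatment effects (all parameter values are arbitrary assumptions): averaged over many hypothetical re-randomizations of the same trial sample, the difference in means sits on top of the sample ATE, even though any single randomization can miss it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed trial sample with heterogeneous, unobservable individual treatment effects beta_i.
n = 200
y0 = rng.normal(size=n)                          # potential outcomes without treatment
beta = rng.normal(loc=1.0, scale=2.0, size=n)    # individual treatment effects (toy values)
y1 = y0 + beta                                   # potential outcomes with treatment
true_ate = beta.mean()                           # the sample ATE we are trying to estimate

estimates = []
for _ in range(5000):                            # hypothetical repeated randomizations of the same sample
    treated = rng.permutation(n) < n // 2        # randomly assign half to treatment
    estimates.append(y1[treated].mean() - y0[~treated].mean())  # difference in observed means

print(f"true sample ATE: {true_ate:.3f}")
print(f"mean of estimates over re-randomizations: {np.mean(estimates):.3f}")
print(f"spread (std) of single-trial estimates: {np.std(estimates):.3f}")
```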

Dawid (2000) argues that the existence of counterfactuals is a metaphysical assumption that cannot be confirmed (or refuted) by any empirical evidence, and is controversial because, under some circumstances, there is an unresolvable arbitrariness to causal inference, something that is not true of (1), for example. See also the arguments by the empiricist philosopher Reichenbach (1954), reissued as Reichenbach (1976). In economics, the case for the counterfactual approach is eloquently made by Imbens and Wooldridge (2009, Introduction), who emphasize the benefits of a theory-free specification with almost unlimited heterogeneity in treatment effects. Heckman and Vytlacil (2007, Introduction) are equally eloquent on the drawbacks, noting that the counterfactual approach often leaves us in the dark about the exact nature of the treatment, so that the treatment effects can be difficult to link to invariant quantities that would be useful elsewhere (invariant in the sense of Hurwicz (1966)).

Consider an experiment that aims to tell us something about the treatment effects; this might or might not use randomization. Either way, we can represent the treatment group as having T_i = 1 and the control group as having T_i = 0. Given the study (or trial) sample, subtracting the average outcomes among the controls from the average outcomes among the treatments, we get

Ȳ_1 − Ȳ_0 = β̄_1 + (S̄_1 − S̄_0),   (2)

where Ȳ_1 and Ȳ_0 are the mean outcomes among the treated and the controls, β̄_1 is the average of the individual treatment effects β_i among the treated, and S̄_1 and S̄_0 are the averages, over the treated and the controls respectively, of the combined effect of the other causes, Σ_j γ_j x_{ij}.

The first term on the right-hand side of (2), which is the ATE in the treated population in the trial sample, is generally the quantity of interest in choosing to conduct an RCT, but the second term, or error term, which is the net average difference in the other causes across the two groups, will generally be non-zero and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely (and less onerously) when the sum of their net differences, S̄_1 − S̄_0, is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effects among the treated, so that we have the ultimate precision in that we know the truth in the trial sample, at least in this linear case. As always, the 'truth' here refers to the trial sample, and it is always important to be aware that the trial sample may not be representative of the population that is ultimately of interest, including the population from which the trial sample comes; any such extension requires further argument.

How do we get balance, or something close to it? In a laboratory experiment, where there is usually much prior knowledge of the other causes, the experimenter has a good chance of controlling (or subtracting away the effects of) the other causes, aiming to ensure that the last term in (2) is close to zero. Failing such knowledge and control, an alternative is matching, which is frequently used in non-randomized statistical, medical (case-control), and econometric studies (see Heckman et al. (1997)). For each subject, a matched subject is found that is as close as possible on all suspected causes, so that, once again, the last term in (2) can be kept small. When we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are unknown or unobservable causes that have important effects, neither laboratory control nor matching offers protection.

What does randomization do? Suppose that no correlations of the x's with Y are introduced post-randomization, for example by subjects not accepting their assignment, or by treatment protocols differing from those used for controls. With this assumption, randomization provides orthogonality of the treatment to the other causes represented in equation (1): since the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (2) is zero in expectation. The expectation is taken over repeated randomizations on the trial sample, each with its own allocation of treatments and controls. Assuming that our caveat holds, the last term in (2) will be zero when averaged over this infinite number of (entirely hypothetical) replications, and the average of the estimated ATEs will be the true ATE in the trial sample. So β̄_1 is an unbiased estimate of the ATE among the treated in the trial sample, and this is so whether or not the causes are observed. Unbiasedness does not require us to know anything about covariates, confounders, or other causes, though it does require that they not change after randomization so as to become correlated with the treatment, an important caveat to which we shall return.

In any one trial, the difference in means is the average treatment effect among those treated plus a term that reflects the randomly generated imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in randomization that limits its size, though, as we discuss below, it will tend to be smaller in larger samples. In any single trial, the chance of randomization can over-represent an important excluded cause (or causes) in one arm over the other, in which case there will be a difference between the means of the two groups that is not caused by the treatment. In epidemiology, this is sometimes referred to as 'random confounding' or 'realized confounding', a phenomenon that was recognized by Fisher in his agricultural trials. (An instructive example of perfect random confounding is constructed by Greenland (1990).)
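To see 'realized confounding' in miniature, consider the hedged toy example below: the treatment does nothing, but a rare, powerful cause of the outcome (assumed here, with arbitrary numbers) may by chance be split unevenly across the arms of a single randomization, pushing the one-shot estimate away from zero.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 50
rare_cause = rng.random(n) < 0.06                   # a rare prognostic factor (about 6% of subjects)
outcome = rng.normal(size=n) + 20.0 * rare_cause    # the rare factor raises the outcome a lot
# The treatment has no effect at all in this toy example, so the true ATE is exactly zero.

treated = rng.permutation(n) < n // 2               # one particular random allocation
estimate = outcome[treated].mean() - outcome[~treated].mean()

print("rare-cause carriers, treatment arm:", int(rare_cause[treated].sum()))
print("rare-cause carriers, control arm:  ", int(rare_cause[~treated].sum()))
print(f"single-trial estimate of a true zero effect: {estimate:.2f}")
```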

If we were to repeat the trial many times, the over-representation of the unbalanced causes will sometimes be in the treatments and sometimes in the controls. The imbalance will vary over replications of the trial, and although we cannot see this from our single trial, we should be able to capture its effects on our estimate of the ATE from an estimated standard error. This was Fisher's insight: not that randomization balanced covariates between treatments and controls but that, conditional on the caveat that no post-randomization correlation with covariates occurs, randomization provides the basis for calculating the size of the error. Getting the standard error and the associated significance statements right is of the greatest importance; therein lies the virtue of randomization, not that it yields precise estimates through balance.

1.2 Misunderstandings: claiming too much

Exactly what randomization does is frequently lost in the practical and popular literature. There is often confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation on the other, which is what randomization contributes. If we knew enough about the problem to be able to control well, that is what we would (and should) do. Randomization is an alternative when we do not know enough to control, but it is generally inferior to good control when we do. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance or, in epidemiological language, about random or realized confounding on the one hand and confounding in expectation on the other. These misunderstandings are not so much among the researchers, who will usually give a correct account when pressed. They come from imprecise statements by researchers that are taken literally by the lay audience that the researchers are keen to reach, and are reaching with increasing success.

Such a misunderstanding is well captured by a quote from the second edition of the online manual on impact evaluation jointly issued by the Inter-American Development Bank and the World Bank (the first, 2011 edition is similar):

We can be confident that our estimated impact constitutes the true impact of the program, since we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes. Gertler et al. (2016 , 69)

This statement is false, because it confuses actual balance in any single trial with balance in expectation over many (hypothetical) trials. If it were true, and if all factors were indeed controlled (and no imbalances were introduced post randomization), the difference would be an exact measure of the average treatment effect among the treated in the trial population (at least in the absence of measurement error). We should not only be confident of our estimate but, as the quote says, we would know that it is the truth. Note that the statement contains no reference to sample size; we get the truth by virtue of balance, not from a large number of observations.

There are many similar quotes in the economics literature. From the medical literature, here is one from a distinguished psychiatrist who is deeply skeptical of the use of evidence from RCTs:

The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure—or make it highly probable—that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test. Kramer (2016 , 18)

Claims are made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC (originally the Manpower Demonstration Research Corporation), which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods, and concludes that it was because "they wanted to learn the truth," Gueron and Rolston (2013, 429). There are many statements of the form "We know that [project X] worked because it was evaluated with a randomized trial," Dynarski (2015).

It is common to treat the ATE from an RCT as if it were the truth, not just in the trial sample but more generally. In economics, a famous example is LaLonde's (1986) study of labor market training programs, whose results were at odds with a number of previous non-randomized studies. The paper prompted a large-scale re-examination of the observational studies to try to bring them into line, though it now seems just as likely that the differences lie in the fact that the different study results apply to different populations (Heckman et al. (1999)). With heterogeneous treatment effects, the ATE is only as good as the study sample from which it was obtained. (See Longford and Nelder (1999), who are concerned with the same issue in regulating pharmaceuticals; we return to this in discussing support factors and moderator variables in Section 2.2.) In epidemiology, Davey Smith and Ebrahim (2002) state that "observational studies propose, RCTs dispose." Another good example is the RCT of hormone replacement therapy (HRT) for post-menopausal women. HRT had previously been supported by positive results from a high-quality and long-running observational study, but the RCT was stopped in the face of excess deaths in the treatment group. The negative result of the RCT led to widespread abandonment of the therapy, which might (or might not) have been a mistake (see Vandenbroucke (2009) and Frieden (2017)). Yet the medical and popular literature routinely states that the RCT was right and the earlier study wrong, simply because the earlier study was not randomized. The gold standard or 'truth' view does harm when it undermines the obligation of science to reconcile RCT results with other evidence in a process of cumulative understanding.

The false belief in automatic precision suggests that we need pay no attention to the other causes in (1) or (2) . Indeed, Gerber and Green (2012 , 5), in their standard text for RCTs in political science, note that RCTs are the successful resolution of investigators’ need for “a research strategy that does not require them to identify, let alone measure, all potential confounders.” But the RCT strategy is only successful if we are happy with estimates that are arbitrarily far from the truth, just so long as the errors cancel out over a series of imaginary experiments. In reality, the causality that is being attributed to the treatment might, in fact, be coming from an imbalance in some other cause in our particular trial; limiting this requires serious thought about possible covariates.

1.3 Sample size, balance, and precision

The literature on the precision of ATEs estimated from RCTs goes back to the very beginning. Gosset (writing as ‘Student’) never accepted Fisher’s arguments for randomization in agricultural field trials and argued convincingly that his own non-random designs for the placement of treatment and controls yielded more precise estimates of treatment effects (see Student (1938) and Ziliak (2014) ). Gosset worked for Guinness where inefficiency meant lost revenue, so he had reasons to care, as should we. Fisher won the argument in the end, not because Gosset was wrong about efficiency, but because, unlike Gosset’s procedures, randomization provides a sound basis for statistical inference, and thus for judging whether an estimated ATE is different from zero by chance. Moreover, Fisher’s blocking procedures can limit the inefficiency from randomization (see Yates (1939) ). Gosset’s reservations were echoed much later in Savage’s (1962) comment that a Bayesian should not choose the allocation of treatments and controls at random but in such a way that, given what else is known about the topic and the subjects, their placement reveals the most to the researcher. We return to this below.

At the time of randomization, and in the absence of post-randomization changes in other causes, a trial is more likely to be balanced when the sample size is large. As the sample size tends to infinity, the means of the x's in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples. As Fisher (1926) noted: "Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves," quoted in Morgan and Rubin (2012, 1263). Even with very large sample sizes, if there is a large number of causes, balance on each cause may be infeasible. Vandenbroucke (2004) notes that there are three billion base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence. It is true, as (2) makes clear, that we do not need balance on each cause individually, only on their net effect, the term S̄_1 − S̄_0. But consider the human genome base pairs. Out of all those billions, only one might be important, and if that one is unbalanced, the results of a single trial can be 'randomly confounded' and far from the truth. Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes. Of course, lack of balance in the net effect of either observables or non-observables in (2) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE (see Senn (2013) for a particularly clear statement), although it does clarify the importance of having credible standard errors, on which more below.

Having run an RCT, it makes good sense to examine any available covariates for balance between the treatments and controls; if we suspect that an observed variable x is a possible cause, and its means in the two groups are very different, we should treat our results with appropriate suspicion. In practice, researchers often carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate for unbiasedness if we are concerned that the random number generator might have failed, or if we are worried that non-blinded subjects have systematically undermined the allocation. Otherwise, supposing that no post-randomization correlations are introduced, unbiasedness is guaranteed by the randomization, whatever the test shows, and the test is not informative about the balance that would lead to precision; Begg (1990, 223) notes, "(I)t is a test of a null hypothesis that is known to be true. Therefore, if the test turns out to be significant it is, by definition, a false positive." The CONSORT 2010 updated statement (guideline 15) notes: "Unfortunately significance tests of baseline differences are still common; they were reported in half of 50 RCTs published in leading general journals in 1997." We have not systematically examined the practice across other social sciences, but it is standard in economics, even in high-quality studies in leading journals, such as Banerjee et al. (2015), published in Science.

Of course, it is always good practice to look for imbalances between observed covariates in any single trial using some more appropriate distance measure, for example the normalized difference in means (Imbens and Wooldridge (2009), equation (3)). Similarly, it would have been good practice for Fisher to abandon a randomization in which there were clear patterns in the (random) distribution of plots across the field, even though the treatment and control plots were random selections that, by construction, could not differ 'significantly' using the standard (incorrect) balance test. Whether such imbalances should be seen as undermining the estimate of the ATE depends on our priors about which covariates are likely to be important, and how important, which is (not coincidentally) the same thought experiment that is routinely undertaken in observational studies when we worry about confounding.
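One such diagnostic is sketched below: the normalized difference in covariate means, scaled by a pooled measure of spread, in the spirit of the Imbens and Wooldridge suggestion. The scaling convention and the rule-of-thumb threshold mentioned in the comments are common choices assumed for illustration, not anything mandated by the text.

```python
import numpy as np

def normalized_difference(x_treat, x_control):
    """Difference in covariate means scaled by a pooled measure of spread.
    Values far from zero (a common rule of thumb is |nd| > 0.25) flag covariates
    worth worrying about; this is a diagnostic, not a significance test."""
    x_treat, x_control = np.asarray(x_treat, float), np.asarray(x_control, float)
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treat.mean() - x_control.mean()) / pooled_sd

# Toy usage with made-up baseline ages in the two arms.
rng = np.random.default_rng(3)
age_treat, age_control = rng.normal(40, 10, 120), rng.normal(42, 10, 120)
print(f"normalized difference in age: {normalized_difference(age_treat, age_control):.2f}")
```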

One procedure to improve balance is to adapt the design before randomization, for example by stratification. Fisher, who, as the quote above illustrates, was well aware of the loss of precision from randomization, argued for 'blocking' (stratification) in agricultural trials or for using Latin squares, both of which restrict the amount of imbalance. Stratification, to be useful, requires some prior understanding of the factors that are likely to be important, and so it takes us away from the 'no knowledge required' or 'no priors accepted' appeal of RCTs; it requires thinking about and measuring confounders. But as Scriven (1974, 69) notes: "(C)ause hunting, like lion hunting, is only likely to be successful if we have a considerable amount of relevant background knowledge." Cartwright (1994, Chapter 2) puts it even more strongly: "No causes in, no causes out." Stratification in RCTs, as in other forms of sampling, is a standard method for using background knowledge to increase the precision of an estimator. It has the further advantage that it allows for the exploration of different ATEs in different strata, which can be useful in adapting or transporting the results to other locations (see Section 2).
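For concreteness, here is a minimal sketch of stratified (blocked) randomization: units are grouped by a prognostic covariate and half of each stratum is assigned to treatment at random, so that imbalance on that covariate is restricted by construction. The strata and subject list are hypothetical.

```python
import random
from collections import defaultdict

random.seed(11)  # reproducible toy example

# Hypothetical subjects with one prognostic covariate used to form strata.
subjects = [{"id": i, "severity": random.choice(["mild", "moderate", "severe"])}
            for i in range(60)]

strata = defaultdict(list)
for s in subjects:
    strata[s["severity"]].append(s["id"])

assignment = {}
for stratum, ids in strata.items():
    random.shuffle(ids)                 # randomize order within the stratum
    half = len(ids) // 2
    for sid in ids[:half]:
        assignment[sid] = "treatment"   # first half of the shuffled stratum
    for sid in ids[half:]:
        assignment[sid] = "control"     # remainder (one extra control if the stratum is odd-sized)

for stratum, ids in strata.items():
    treated = sum(assignment[i] == "treatment" for i in ids)
    print(f"{stratum}: {treated} treated out of {len(ids)}")
```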

Stratification is not possible if there are too many covariates, or if each has many values, so that there are more cells than can be filled given the sample size. With five covariates, and ten values on each, and no priors to limit the structure, we would have 100,000 possible strata. Filling these is well beyond the sample sizes in most trials. An alternative that works more generally is to re-randomize. If the randomization gives an obvious imbalance on known covariates (treatment plots all on one side of the field, all the treatment clinics in one region, too many rich and too few poor in the control group), we try again, and keep trying until we get a balance measured as a small enough distance between the means of the observed covariates in the two groups. Morgan and Rubin (2012) suggest the Mahalanobis D-statistic be used as a criterion and use Fisher's randomization inference (to be discussed further below) to calculate standard errors that take the re-randomization into account. An alternative, widely adopted in practice, is to adjust for covariates by running a regression (or covariance) analysis, with the outcome on the left-hand side and the treatment dummy and the covariates as explanatory variables, including possible interactions between covariates and treatment dummies. Freedman (2008) shows that the adjusted estimate of the ATE is biased in finite samples, with the bias depending on the correlation between the squared treatment effect and the covariates. Accepting some bias in exchange for greater precision will often make sense, though it certainly undermines any gold standard argument that relies on unbiasedness without consideration of precision.
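A bare-bones version of the re-randomization procedure described above might look like the sketch below; the covariates, threshold, and scaling convention are assumptions for illustration, and in practice the acceptance rule should be reflected in the subsequent (randomization) inference, as Morgan and Rubin discuss.

```python
import numpy as np

def mahalanobis_between_means(X, treated):
    """Mahalanobis-type distance between the covariate mean vectors of the two arms."""
    diff = X[treated].mean(axis=0) - X[~treated].mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    n1, n0 = treated.sum(), (~treated).sum()
    scale = (n1 * n0) / (n1 + n0)       # scaling in the spirit of Morgan and Rubin (up to convention)
    return float(scale * diff @ cov_inv @ diff)

def rerandomize(X, threshold, rng, max_tries=10_000):
    """Draw equal-sized allocations until the covariate imbalance is acceptably small."""
    n = X.shape[0]
    for _ in range(max_tries):
        treated = rng.permutation(n) < n // 2
        if mahalanobis_between_means(X, treated) <= threshold:
            return treated
    raise RuntimeError("no acceptable allocation found; loosen the threshold")

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))           # three observed baseline covariates (toy data)
allocation = rerandomize(X, threshold=2.0, rng=rng)
print("treated units:", int(allocation.sum()))
```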

1.4 Should we randomize?

The tension between randomization and precision that goes back to Fisher, Gosset, and Savage has been reopened in recent papers by Kasy (2016), Banerjee et al. (BCS) (2016), and Banerjee et al. (BCMS) (2017).

The trade-off between bias and precision can be formalized in several ways, for example by specifying a loss or utility function that depends on how a user is affected by deviations of the estimate of the ATE from the truth and then choosing an estimator or an experimental design that minimizes expected loss or maximizes expected utility. As Savage (1962 , 34) noted, for a Bayesian, this involves allocating treatments and controls in “the specific layout that promised to tell him the most,” but without randomization . Of course, this requires serious and perhaps difficult thought about the mechanisms underlying the ATE, which randomization avoids. Savage also notes that several people with different priors may be involved in an investigation and that individual priors may be unreliable because of “vagueness and temptation to self-deception,” defects that randomization may alleviate, or at least evade. BCMS (2017) provide a proof of a Bayesian no-randomization theorem, and BCS (2016) provide an illustration of a school administrator who has long believed that school outcomes are determined, not by school quality, but by parental background, and who can learn the most by placing deprived children in (supposed) high-quality schools and privileged children in (supposed) low-quality schools, which is the kind of study setting to which case study methodology is well attuned. As BCS note, this allocation would not persuade those with different priors, and they propose randomization as a means of satisfying skeptical observers. As this example shows, it is not always necessary to encode prior information into a set of formal prior probabilities, though thought about what we are trying to learn is always required.

Several points are important. First, the anti-randomization theorem is not a justification of any non-randomized design, for example, one that allows selection on unobservables, but only of the optimal design that is most informative. According to Chalmers (2001) and Bothwell and Podolsky (2016) , the development of randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT in medicine—the streptomycin trial—because it prevented doctors selecting patients on the basis of perceived need (or against perceived need, leaning over backward as it were), an argument recently echoed by Worrall (2007) . Randomization serves this purpose, but so do other non-discretionary schemes; what is required is that hidden information should not be allowed to affect the allocation as would happen, for example, if subjects could choose their own assignments.

Second, the ideal rules by which units are allocated to treatment or control depend on the covariates and on the investigators' priors about how they affect the outcomes. This opens up all sorts of methods of inference that are long familiar but that are excluded by pure randomization. For example, what philosophers call the hypothetico-deductive method works by using theory to make a prediction that can be taken to the data for potential falsification (as in the school example above). This is the way that physicists learn, as do other researchers when they use theory to derive predictions that can be tested against the data, perhaps in an RCT, but more frequently not. As Lakatos (1970) (among others) has stressed, some of the most fruitful research advances are generated by the puzzles that result when the data fail to match such theoretical predictions. In economics, good examples include the equity premium puzzle, various purchasing power parity puzzles, the Feldstein-Horioka puzzle, the consumption smoothness puzzle, the puzzle of why in India, where malnourishment is widespread, rapid income growth has been accompanied by a fall in calories consumed, and many others.

Third, randomization, by ignoring prior information from theory and from covariates, is wasteful and even unethical when it unnecessarily exposes people, or unnecessarily many people, to possible harm in a risky experiment. Worrall (2008) documents the (extreme) case of ECMO (Extracorporeal Membrane Oxygenation), a new treatment for newborns with persistent pulmonary hypertension that was developed in the 1970s by intelligent and directed trial and error within a well-understood theory of the disease and a good understanding of how the oxygenator should work. In early experimentation by the inventors, mortality was reduced from 80 to 20 percent. The investigators felt compelled to conduct an RCT, albeit with an adaptive ‘play-the-winner’ design in which each success in an arm increased the probability of the next baby being assigned to that arm. One baby received conventional therapy and died, 11 received ECMO and lived. Even so, a standard randomized controlled trial was thought necessary. With a stopping rule of four deaths, four more babies (out of ten) died in the control group and none of the nine who received ECMO.

Fourth, the non-random methods use prior information, which is why they do better than randomization. This is both an advantage and a disadvantage, depending on one's perspective. If prior information is not widely accepted, or is seen as non-credible by those we are seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed, this is why BCMS (2017) recommend randomized designs, including in medicine and in development economics. They develop a theory of an investigator who is facing an adversarial audience who will challenge any prior information and can even potentially veto results based on it (think of administrative agencies such as the FDA or journal referees). The experimenter trades off his or her own desire for precision (and preventing possible harm to subjects), which would require prior information, against the wishes of the audience, who want nothing to do with those priors. Even then, the approval of the audience is only ex ante; once the fully randomized experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a fair test because important other causes were not balanced. Among doctors who use RCTs, and especially meta-analysis, such arguments are (appropriately) common (see Kramer (2016)). We return to this topic in Section 2.1.

Today, when the public has come to question expert prior knowledge, RCTs will flourish. In cases where there is good reason to doubt the good faith of experimenters, randomization will indeed be an appropriate response. But we believe such a simplistic approach is destructive for scientific endeavor (which is not the purpose of the FDA) and should be resisted as a general prescription in scientific research. Previous knowledge needs to be built on and incorporated into new knowledge, not discarded. The systematic refusal to use prior knowledge and the associated preference for RCTs are recipes for preventing cumulative scientific progress. In the end, it is also self-defeating. To quote Rodrik (D. Rodrik, personal communication, April 6, 2016) “the promise of RCTs as theory-free learning machines is a false one.”

1.5 Statistical inference in RCTs

The estimated ATE in a simple RCT is the difference in the means between the treatment and control groups. When covariates are allowed for, as in most RCTs in economics, the ATE is usually estimated from the coefficient on the treatment dummy in a regression that looks like (1), but with the heterogeneity in β ignored. Modern work calculates standard errors allowing for the possibility that residual variances may be different in the treatment and control groups, usually by clustering the standard errors, which is equivalent to the familiar two-sample standard error in the case with no covariates. Statistical inference is done with t-values in the usual way. Unfortunately, these procedures do not always give the right standard errors and, to reiterate, the value of randomization is that it permits inference about estimates of ATEs, not that it guarantees the quality of these estimates, so credible standard errors are essential in any argument for RCTs.

Looking back at (1), the underlying objects of interest are the individual treatment effects β_i for each of the individuals in the trial sample. Neither they nor their distribution G(β) is identified from an RCT; because RCTs make so few assumptions (which, in many cases, is their strength), they can identify only the mean of the distribution. In many observational studies, researchers are prepared to make more assumptions on functional forms or on distributions, and for that price other quantities of interest can be identified. Without these assumptions, inferences must be based on the difference in the two means, a statistic that is sometimes ill-behaved, as we discuss below. This ill-behavior has nothing to do with RCTs per se, but within RCTs, and their minimal assumptions, we cannot easily switch from the mean to some other quantity of interest.

Fisher proposed that statistical inference should be done using what has become known as 'randomization inference', a procedure that is as non-parametric as the RCT-based estimate of an ATE itself. To test the null hypothesis that β_i = 0 for all i, note that, under this null that the treatment has no effect on any individual, an estimated nonzero ATE can only be a consequence of the particular random allocation that generated it (assuming no difference in the distributions of covariates post-randomization). By tabulating all possible combinations of treatments and controls in our trial sample, and the ATE associated with each, we can calculate the exact distribution of the estimated ATE under the null. This allows us to calculate the probability of obtaining an estimate at least as large as our actual estimate when the treatment has no effect. This randomization test requires a finite sample, but it will work for any sample size (see Imbens and Wooldridge (2009) for an excellent account of the procedure).
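The sharp-null test described here is straightforward to code. The sketch below approximates the exact randomization distribution by Monte Carlo rather than enumerating every allocation; the data and parameters are made up for illustration.

```python
import numpy as np

def randomization_test(outcomes, treated, draws=10_000, seed=0):
    """Approximate Fisher's randomization test of the sharp null of no effect for anyone.
    Under that null, outcomes are fixed and only the labels are random, so we re-draw
    the labels many times and see how extreme the observed difference in means is."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, float)
    treated = np.asarray(treated, bool)
    observed = outcomes[treated].mean() - outcomes[~treated].mean()
    n_treated = int(treated.sum())
    count = 0
    for _ in range(draws):
        reshuffled = np.zeros(outcomes.size, dtype=bool)
        reshuffled[rng.choice(outcomes.size, size=n_treated, replace=False)] = True
        diff = outcomes[reshuffled].mean() - outcomes[~reshuffled].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / draws      # estimate and two-sided Monte Carlo p-value

# Toy usage with made-up trial data: 30 treated, 30 controls.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.5, 1, 30), rng.normal(0.0, 1, 30)])
t = np.array([True] * 30 + [False] * 30)
ate_hat, p_value = randomization_test(y, t)
print(f"estimated ATE: {ate_hat:.2f}, randomization p-value: {p_value:.3f}")
```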

Randomization inference can be used to test the null hypothesis that all of the treatment effects are zero, as in the above example, but it cannot be used to test the hypothesis that the average treatment effect is zero, which will often be of interest. In agricultural trials, and in medicine, the stronger (sharp) hypothesis that the treatment has no effect whatever is often of interest. But in many public health applications, we are content with improving average health, and in economic applications that involve money, such as welfare experiments or cost-benefit analyses, we are interested in whether the net effect of the treatment is positive or negative, and in these cases randomization inference cannot be used. None of this argues against its wider use in the social sciences when appropriate.

In cases where randomization inference cannot be used, we must construct tests for the differences in two means. Standard procedures will often work well, but there are two potential pitfalls. One, the 'Fisher-Behrens problem', comes from the fact that, when the two samples have different variances (which we typically want to permit), the t-statistic as usually calculated does not have the t-distribution. The second problem, which is much harder to address, occurs when the distribution of treatment effects is not symmetric (Bahadur and Savage (1956)). Neither pitfall is specific to RCTs, but RCTs force us to work with means in estimating treatment effects and, with only a few exceptions in the literature, social scientists who use RCTs appear to be unaware of the difficulties.

In the simple case of comparing two means in an RCT, inference is usually based on the two-sample t-statistic, which is computed by dividing the ATE by the estimated standard error whose square is given by

se² = s_1²/n_1 + s_0²/n_0,   (3)

where 0 refers to controls and 1 to treatments, so that there are n_1 treatments and n_0 controls, s_1² and s_0² are the sample variances of the outcomes in the two arms, and Ȳ_1 and Ȳ_0 are the two means. As has long been known, the 't-statistic' based on (3) is not distributed as Student's t if the two variances (treatment and control) are not identical but has the Behrens-Fisher distribution. In extreme cases, when one of the variances is zero, the t-statistic has effective degrees of freedom half of the nominal degrees of freedom, so that the test statistic has thicker tails than allowed for, and there will be too many rejections when the null is true.
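The sketch below computes the standard error in (3) for made-up data and contrasts a pooled (equal-variance) t-test with Welch's unequal-variance version, one standard response to the problem; it assumes SciPy is available, and the numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y1 = rng.normal(1.0, 3.0, 40)   # treatment outcomes (toy data, larger variance)
y0 = rng.normal(0.0, 0.5, 40)   # control outcomes (toy data, smaller variance)

ate_hat = y1.mean() - y0.mean()
se_sq = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size   # formula (3)
print(f"ATE = {ate_hat:.2f}, t = {ate_hat / np.sqrt(se_sq):.2f}")

# Welch's t-test does not assume equal variances and adjusts the degrees of freedom.
welch = stats.ttest_ind(y1, y0, equal_var=False)
pooled = stats.ttest_ind(y1, y0, equal_var=True)
print(f"Welch p-value:  {welch.pvalue:.3f}")
print(f"pooled p-value: {pooled.pvalue:.3f}")
```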

Young (2017) argues that this problem is worse when the trial results are analyzed by regressing outcomes not only on the treatment dummy but also on additional covariates, and when using clustered or robust standard errors. When the design matrix is such that the maximal influence is large, which is likely if the distribution of the covariates is skewed so that for some observations outcomes have large influence on their own predicted values, there is a reduction in the effective degrees of freedom for the t-value(s) of the average treatment effect(s), leading to spurious findings of significance. Young looks at 2,027 regressions reported in 53 RCT papers in the American Economic Association journals and recalculates the significance of the estimates using randomization inference applied to the authors' original data. In 30 to 40 percent of the estimated treatment effects in individual equations with coefficients that are reported as significant, he cannot reject the null of no effect for any observation; the fraction of spuriously significant results increases further when he simultaneously tests for all results in each paper. These spurious findings come in part from issues of multiple-hypothesis testing, both within regressions with several treatments and across regressions. Within regressions, treatments are largely orthogonal, but authors tend to emphasize significant t-values even when the corresponding F-tests are insignificant. Across equations, results are often strongly correlated, so that, at worst, different regressions are reporting variants of the same result, thus spuriously adding to the 'kill count' of significant effects. At the same time, the pervasiveness of observations with high influence generates spurious significance on its own.

These issues are now being taken more seriously, at least in economics. In addition to Young (2017) , Imbens and Kolesár (2016) provide practical advice for dealing with the Fisher-Behrens problem, and the best current practice tries to be careful about multiple hypothesis testing. Yet it remains the case that many of the results reported in the literature are spuriously significant.

Spurious significance also arises when the distribution of treatment effects contains outliers or, more generally, is not symmetric. Standard t –tests break down in distributions with enough skewness (see Lehmann and Romano (2005 , 466–8)). How difficult is it to maintain symmetry? And how badly is inference affected when the distribution of treatment effects is not symmetric? One important example is expenditures on healthcare. Most people have zero expenditure in any given period, but among those who do incur expenditures, a few individuals spend huge amounts that account for a large share of the total. Indeed, in the famous Rand health experiment (see Manning, et al. (1987 , 1988) ), there is a single very large outlier. The authors realize that the comparison of means across treatment arms is fragile, and, although they do not see their problem exactly as described here, they obtain their preferred estimates using an approach that is explicitly designed to model the skewness of expenditures. Another example comes from economics, where many trials have outcomes valued in money. Does an anti-poverty innovation—for example microfinance—increase the incomes of the participants? Income itself is not symmetrically distributed, and this might also be true of the treatment effects if there are a few people who are talented but credit-constrained entrepreneurs and who have treatment effects that are large and positive, while the vast majority of borrowers fritter away their loans, or at best make positive but modest profits. A recent summary of the literature is consistent with this (see Banerjee, Karlan, and Zinman (2015) ).

In some cases, it will be appropriate to deal with outliers by trimming, transforming, or eliminating observations that have large effects on the estimates. But if the experiment is a project evaluation designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break the program. Transformations, such as taking logarithms, may help to produce symmetry, but they change the nature of the question being asked; a cost-benefit analysis or a costing of healthcare reform must be done in dollars, not in log dollars.

We consider an example that illustrates what can happen in a realistic but simplified case; the full results are reported in the Appendix. We imagine a population of individuals, each with a treatment effect β i . The parent population mean of the treatment effects is zero, but there is a long tail of positive values; we use a left-shifted lognormal distribution. This could be a healthcare expenditure trial or a microfinance trial, where there is a long positive tail of rare individuals who incur very high costs or who can do amazing things with credit while most people cost nothing in the period studied or cannot use the credit effectively. A trial sample of 2 n individuals is randomly drawn from the parent population and is randomly split between n treatments and n controls. Within each trial sample, whose true ATE will generally differ from zero because of the sampling, we run many RCTs and tabulate the values of the ATE for each.

Using standard t-tests, the hypothesis that the ATE is zero (which is true in the parent distribution) is rejected between 14 percent (n = 25) and 6 percent (n = 500) of the time. These rejections come from two separate issues, both of which are relevant in practice: (a) that the ATE in the trial sample differs from the ATE in the parent population of interest, and (b) that the t-values are not distributed as t in the presence of outliers. The problem cases are when the trial sample happens to contain one or more outliers, something that is always a risk given the long positive tail of the parent distribution. When this happens, everything depends on whether the outlier is among the treatments or the controls; in effect, the outliers become the sample, reducing the effective number of degrees of freedom. In extreme cases, one of which is illustrated in Figure A.1, the distribution of estimated ATEs is bimodal, depending on the group to which the outlier is assigned. When the outlier is in the treatment group, the dispersion across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject the null using the standard table of t-values. The over-rejections come from cases when the outlier is in the control group, the outcomes are not so dispersed, and the t-values can be large, negative, and significant. While these cases of bimodal distributions may not be common and depend on the existence of large outliers, they illustrate the process that generates the over-rejections and spurious significance. Note that there is no remedy through randomization inference here, given that our interest is in the hypothesis that the average treatment effect is zero.
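The following sketch is our own, loosely in the spirit of the simulation just described; the lognormal shape parameter, sample size, and outcome noise are illustrative choices, not those used in the Appendix. It draws individual treatment effects from a shifted lognormal with parent mean zero and records how often a nominal 5 percent Welch t-test rejects the (true) hypothesis that the ATE is zero.

```python
# Our own illustrative sketch of the over-rejection mechanism: treatment effects are
# lognormal, shifted so their parent mean is zero, so the null of a zero ATE is true
# even though the distribution has a long right tail.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 25                      # per-arm sample size (illustrative)
n_trials = 10_000
sigma = 1.5                 # lognormal shape; larger means a heavier right tail

rejections = 0
for _ in range(n_trials):
    beta = rng.lognormal(mean=0.0, sigma=sigma, size=2 * n) - np.exp(sigma**2 / 2)
    base = rng.normal(0.0, 1.0, size=2 * n)      # baseline outcomes
    order = rng.permutation(2 * n)               # random split into two arms
    treat, ctrl = order[:n], order[n:]
    y_treat = base[treat] + beta[treat]          # treated units reveal their effect
    y_ctrl = base[ctrl]
    if stats.ttest_ind(y_treat, y_ctrl, equal_var=False).pvalue < 0.05:
        rejections += 1

print(f"rejection rate of the true null at nominal 5%: {rejections / n_trials:.3f}")
```

In runs of this kind the rejection rate tends to sit above the nominal 5 percent, and inspection of individual draws containing a large outlier reproduces the asymmetry described above: dispersed, insignificant estimates when the outlier lands in the treatment arm, and large negative 'significant' t-values when it lands in the control arm.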

Our reading of the literature on RCTs in social and public health policy areas suggests that they are not exempt from these concerns. Many trials are run on (sometimes very) small samples, they have treatment effects where asymmetry is hard to rule out—especially when the outcomes are in money—and they often give results that are puzzling, or at least not easily interpreted theoretically. In the context of development studies, neither Banerjee and Duflo (2012) nor Karlan and Appel (2011) , who cite many RCTs, raise concerns about misleading inference, implicitly treating all results as reliable. Some of these results contradict standard theory. No doubt there are behaviors in the world that are inconsistent with conventional economics, and some can be explained by standard biases in behavioral economics, but it would also be good to be suspicious of the significance tests before accepting that an unexpected finding is well-supported and that theory must be revised. Replication of results in different settings may be helpful, if they are the right kind of places (see our discussion in Section 2). Yet it hardly solves the problem given that the asymmetry may be in the same direction in different settings, that it seems likely to be so in just those settings that are sufficiently like the original trial setting to be of use for inference about the population of interest, and that the ‘significant’ t –values will show departures from the null in the same direction. This, then, replicates the spurious findings.

1.6 Familiar threats to unbiasedness

It is of great importance to note that randomization, by itself, is not sufficient to guarantee unbiasedness if post-randomization differences are permitted to affect the two groups. This requires ‘policing’ of the experiment, for example by requiring that subjects, experimenters, and analysts are blinded and that differences in treatments or outcomes do not reveal their status to subjects. Familiar concerns about selection bias and the placebo, Pygmalion, Hawthorne, John Henry, and 'teacher/therapist' effects are widespread across studies of medical and social interventions. The difficulty of controlling for placebo effects can be especially acute in testing medical interventions (see Howick (2011) , Chapter 7 for a critical review), as is the difficulty in controlling both for placebo effects and the effects of therapist variables in testing psychological therapies. For instance, Pitman, et al. (2017) suggest how difficult it will be to identify just what a psychological therapy consists of; Kramer and Stiles (2015) treat the ‘responsiveness’ problem of categorizing therapist responses to emerging context; and there has been a lively debate about whether cognitive mechanisms of change are responsible for the effectiveness of cognitive therapy for depression based on data that shows the changes in symptoms occur mainly before the cognitive techniques are brought into play ( Ilardi and Craighead (1999) , Vittengl et al. (2014) ).

Many social and economic trials, medical trials, and public health trials are neither blinded nor sufficiently controlled for other sources of bias, and indeed many cannot be, and a sufficient defense that unbiasedness is not undermined is rarely offered. Generally, it is recommended to extend blinding beyond participants and investigators to include those who measure outcomes and those who analyze the data, all of whom may be affected by both conscious and unconscious bias. The need for blinding in those who assess outcomes is particularly important in cases where outcomes are not determined by strictly prescribed procedures whose application is transparent and checkable but instead require elements of judgment.

Beyond the need to control for ‘psychological’ or ‘placebo’ effects, blinding of trial participants is important in cases where there is no compulsion, so that people who are randomized into the treatment group are free to choose to refuse treatment. In many cases it is reasonable to suppose that people choose to participate if it is in their interest to do so. In consequence, those who estimate (consciously or unconsciously) that their gain is not high enough to offset the perceived drawbacks of compliance with the treatment protocol may avoid it. The selective acceptance of treatment limits the analyst’s ability to learn about people who decline treatment but who would have to accept it if the policy were implemented. In these cases, both the intention-to-treat estimator and the ‘as treated’ estimator that compares the treated and the untreated are affected by the kind of selection effects that randomization is designed to eliminate.

So, blinding matters for unbiasedness and is very often missing (see also Hernán et al. (2013) ). This is not to say that one should assume without argument that non-blinding at any point will introduce bias. That is a matter to be assessed case-by-case. But the contrary cannot be automatically assumed. This brings to the fore the trade-off between using an RCT-based estimate that may well be biased, and in ways we do not have good ideas how to deal with, versus one from an observational study where blinding may have been easier, or some of these sources of bias may be missing or where we may have a better understanding of how to correct for them. For instance, blinding is sometimes automatic in observational studies, e.g. from administrative records. (See for example Horwitz et al. 2017 for a discussion of the complications of analyzing the result in the large Women’s Health Trial when it was noted that due to the presence of side effects of the treatment “blinding was broken for nearly half of the HRT users but only a small percentage of the placebo users” [1248].)

Lack of blinding is not the only source of post-randomization bias. Subsequent treatment decisions can differ, and treatments and controls may be handled in different places, or by differently trained practitioners, or at different times of day, and these differences can bring with them systematic differences in the other causes to which the two groups are exposed. These can, and should, be guarded against. But doing so requires an understanding of what these causally relevant factors might be.

1.7 A summary

What do the arguments of this section mean about the importance of randomization and the interpretation that should be given to an estimated ATE from a randomized trial?

First, we should be sure that an unbiased estimate of an ATE for the trial population is likely to be useful enough to warrant the costs of running the trial.

Second, since randomization does not ensure orthogonality, to conclude that an estimate is unbiased, warrant is required that there are no significant post-randomization correlates with the treatment.

Third, the inference problems reviewed here cannot just be presumed away. When there is substantial heterogeneity, the ATE in the trial sample can be quite different from the ATE in the population of interest, even if the trial is randomly selected from that population; in practice, the relationship between the trial sample and the population is often obscure (see Longford and Nelder (1999) ).

Fourth, beyond that, in many cases the statistical inference will be fine, but serious attention should be given to the possibility that there are outliers in treatment effects, something that knowledge of the problem can suggest and where inspection of the marginal distributions of treatments and controls may be informative. For example, if both are symmetric, it seems unlikely (though certainly not impossible) that the treatment effects are highly skewed. Measures to deal with the Fisher-Behrens problem should be used, and randomization inference should be considered when it is appropriate to the hypothesis of interest.

All of this can be regarded as recommendations for improvement to current practice, not a challenge to it. More fundamentally, we strongly contest the often-expressed idea that the ATE calculated from an RCT is automatically reliable, that randomization automatically controls for unobservables, or worst of all, that the calculated ATE is true. If, by chance, it is close to the truth, the truth we are referring to is the truth in the trial sample only . To make any inference beyond that requires arguments of the kind we consider in the next section. We have also argued that, depending on what we are trying to measure and what we want to use that measure for, there is no presumption that an RCT is the best means of estimating it. That too requires an argument, not a presumption.

Section 2: Using the results of randomized controlled trials

2.1 Introduction

Suppose we have estimated an ATE from a well-conducted RCT on a trial sample, and our standard error gives us reason to believe that the effect did not come about by chance. We thus have good warrant that the treatment causes the effect in our trial sample, up to the limits of statistical inference. What are such findings good for? The literature discussing RCTs has paid more attention to obtaining results than to considering what can justifiably be done with them. There is insufficient theoretical and empirical work to guide us on how, and for what purposes, to use the findings. What there is tends to focus on the conditions under which the same results hold outside of the original settings or how they might be adapted for use elsewhere, with almost no attention to how they might be used for formulating, testing, understanding, or probing hypotheses beyond the immediate relation between the treatment and the outcome investigated in the study. Yet it cannot be that knowing how to use results is less important than knowing how to demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously established effect whose applicability is justified by a loose declaration of similarity warrants little. If trials are to be useful, we need paths to their use that are as carefully constructed as are the trials themselves.

The argument for the ‘primacy of internal validity’ made by Shadish, Cook, and Campbell (2002) may be reasonable as a warning that bad RCTs are unlikely to generalize, although as Cook (2014) notes, “inferences about internal validity are inevitably probabilistic.” Moreover, the primacy statement is sometimes incorrectly taken to imply that results of an internally valid trial will automatically, or often, apply ‘as is’ elsewhere, or that this should be the default assumption failing arguments to the contrary, as if a parameter, once well established, can be expected to be invariant across settings. The invariance assumption is often made in medicine, for example, where it is sometimes plausible that a particular procedure or drug works the same way everywhere, though its effects cannot be the same at all stages of the disease. More generally, Horton (2000) gives a strong dissent and Rothwell (2005) provides arguments on both sides of the question. We should also note the recent movement to ensure that testing of drugs includes women and minorities, because members of those groups suppose that the results of trials on mostly healthy young white males do not apply to them, as well as the increasing call for pragmatic trials, as in Williams et al. (2015): “[P]ragmatic trials … ask ‘we now know it can work, but how well does it work in real world clinical practice?’”

Our approach to the use of RCT results is based on the observation that whether, and in what ways, an RCT result is evidence depends on exactly what the hypothesis is for which the result is supposed to be evidence, and that what kinds of hypotheses these will be depends on the purposes to be served. This should in turn affect the design of the trial itself. This is recognized in the medical literature in the distinction between explanatory and pragmatic trials and the proposals to adapt trial design to the question asked, as for example in Patsopoulos (2011, 218): “The explanatory trial is the best design to explore if and how an intervention works,” whereas in a pragmatic trial “The research question under investigation is whether an intervention actually works in real life.” It is also reflected in, for example, Rothman et al. (2013, 1013), whom we echo in arguing that simple extrapolation is not the sole purpose to which RCT results can be put: “The mistake is to think that statistical inference is the same as scientific inference.” We shall distinguish a number of different purposes and discuss how, and when, RCTs can serve them: (a) simple extrapolation and simple generalization, (b) drawing lessons about the population enrolled in the trial, (c) extrapolation with adjustment, (d) estimating what happens if we scale up, (e) predicting the results of treatment on the individual, and (f) building and testing theory.

This list is hardly exhaustive. We noted in Section 1.4 one further use that we do not pursue here: The widespread and largely uncritical belief that RCTs give the right answer permits them to be used as dispute-reconciliation mechanisms to resolve political conflicts. For example, at the Federal level in the US, prospective policies are vetted by the non-partisan Congressional Budget Office (CBO), which makes its own estimates of budgetary implications. Ideologues whose programs are scored poorly by the CBO have an incentive to support an RCT, not to convince themselves, but to convince opponents. Once again, RCTs are valuable when your opponents do not share your prior.

2.2 Simple extrapolation and simple generalization

Suppose a trial has (probabilistically) established a result in a specific setting. If ‘the same’ result holds elsewhere, it is said to have external validity . External validity may refer just to the replication of the causal connection or go further and require replication of the magnitude of the ATE. Either way, the result holds—everywhere, or widely, or in some specific elsewhere—or it does not.

This binary concept of external validity is often unhelpful because it asks the results of an RCT to satisfy a condition that is neither necessary nor sufficient for trials to be useful, and so both overstates and understates their value. It directs us toward simple extrapolation —whether the same result holds elsewhere—or simple generalization —it holds universally or at least widely—and away from more complex but equally useful applications of the results. The failure of external validity interpreted as simple generalization or extrapolation says little about the value of the results of the trial.

There are several reasons to resist reading external validity as simple extrapolation or generalization. First, there are uses of RCTs that do not require applying their results beyond the original context; we discuss these in Section 2.4. Second, there are often good reasons to expect that the results from a well-conducted, informative, and potentially useful RCT will not apply elsewhere in any simple way: without further understanding and analysis, even successful replication tells us little either for or against simple generalization, and does little to support the conclusion that the next implementation will work in the same way. Nor do failures of replication make the original result useless; we often learn much from coming to understand why replication failed and can use that knowledge in looking for how the factors that caused the original result might operate differently in different settings. Third, and particularly important for scientific progress, the RCT result can be incorporated into a network of evidence and hypotheses that test or explore claims that look very different from the results reported from the RCT. We shall give examples below of valuable uses for RCTs that are not externally valid in the (usual) sense that their results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense of holding everywhere, or everywhere in some specified domain.

The RAND health experiment (Manning et al. (1987, 1988)) provides an instructive story, if only because its results have permeated the academic and policy discussions about healthcare ever since. It was originally designed to test whether more generous insurance causes people to use more medical care and, if so, by how much. The incentive effects are hardly in doubt today; the immortality of the study comes rather from the fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the study population: medical expenditures fell by 0.1 to 0.2 percent for every one percent increase in the copayment, an elasticity of −0.1 to −0.2. According to Aron-Dine et al. (2013), it is this dimensionless and thus apparently exportable number that has been used ever since to discuss the design of healthcare policy; the elasticity has come to be treated as a universal constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and that it is unclear that it is firmly based on the original evidence. The simple direct exportability of the result was perhaps illusory.

The drive to export and generalize RCT results is at the core of the influential ‘what works’ movement across the medical and social sciences. At its most ambitious, this aims for universal reach. For example, in the development economics literature, Duflo and Kremer (2008, 93) argue that “credible impact evaluations are global public goods in the sense that they can offer reliable guidance to international organizations, governments, donors, and nongovernmental organizations (NGOs) beyond national borders.” Sometimes the results of a single RCT are advocated as having wide applicability, with especially strong endorsement when there is at least one replication.

Simple extrapolation is often used to move RCT results from one setting to another. Much of what is written in the ‘what works’ literature suggests that, unless there is evidence to the contrary, the direction and size of treatment effects can be transported from one place to another without serious adjustment. The Abdul Latif Jameel Poverty Action Lab (J-PAL) conducts RCTs around the world and summarizes findings in an attempt to reduce poverty by the use of “scientific evidence to inform policy.” Some of their reports convert results into a common cost-effectiveness measure. For example, Improving Student Participation: Which programs most effectively get children into school? classifies results into six categories: school travel time, subsidies and transfers, health, perceived returns, education quality, and gender-specific barriers; results are reported in the common unit of “additional years of education for US$100 spent.” “Health”, which is top-rated by far, includes two studies, “deworming” in Kenya (11.91) and “iron & vitamin A” in India (2.61); “perceived returns” to education has one study, in the Dominican Republic (0.23); “subsidies and transfers” includes the most studies, six, with results ranging from 0.17 for “secondary scholarships” in Ghana down to 0.01 for “CCT” (conditional cash transfers) in Mexico, with 0.09 and 0.07 for “CCT” in Malawi.

What can we conclude from such comparisons? A philanthropic donor interested in education, who assumes that marginal and average effects are the same, might learn that the best place to devote a marginal dollar is in Kenya, where it would be used for deworming. This is certainly useful, but it is not as useful as statements that deworming programs are everywhere more cost-effective than programs involving vitamin A or scholarships, or if not everywhere, at least over some domain, and it is these second kinds of comparison that would genuinely fulfill the promise of ‘finding out what works.’ But such comparisons only make sense if the results from one place can be relied on to apply in another, if the Kenyan results also hold in the Dominican Republic, Mexico, Ghana, or in some specific list of places.

What does J-PAL conclude? Here are two of their reported “Practical Implications”: “Conditional and unconditional cash transfers can increase school enrolment and attendance, but are expensive to implement...Eliminating small costs can have substantial impacts on school participation.” ‘Can’ here is admittedly an ambiguous word. It is certainly true in a logical sense that if a program has achieved a given result, then it can do so. But we suspect that the more natural sense for readers to take away is that the program ‘may well’ do so most other places, in the absence of special problems, or that that is at least the default assumption.

Trials, as is widely noted, often take place in artificial environments, which raises well-recognized problems for extrapolation. For instance, with respect to economic development, Drèze (J. Drèze, personal communication, November 8, 2017) notes, based on extensive experience in India, that “when a foreign agency comes in with its heavy boots and deep pockets to administer a ‘treatment,’ whether through a local NGO or government or whatever, there tends to be a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the ‘treators,’ often from abroad, and may not do so with the people who will work it in practice.

J-PAL’s manual for cost-effectiveness (Dhaliwal et al. (2012)) explains in (entirely appropriate) detail how to handle variation in costs across sites, noting variable factors such as population density, prices, exchange rates, discount rates, inflation, and bulk discounts. But it gives short shrift to cross-site variation in the size of ATEs, which also plays a key part in the calculation of cost effectiveness. The manual briefly notes that diminishing returns (or the last-mile problem) might be important in theory but argues that the baseline levels of outcomes are likely to be similar in the pilot and replication areas, so that the ATE can be safely assumed to apply as is. What is missing is a justification for extrapolating results: some understanding of when results can be extrapolated, when they cannot, or, better still, how they should be modified to make them applicable in a new setting. Without well-substantiated assumptions to support the projection of results, this is just induction by simple enumeration—swan 1 is white, swan 2 is white, …, so all swans are white; and, as Francis Bacon (1859, 1.105) taught, “…the induction that proceeds by simple enumerations is childish.”

Bertrand Russell’s chicken ( Russell (1912) ) provides an excellent example of the limitations to simple extrapolation from repeated successful replication. The bird infers, on repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for dinner. Though this chicken did not base her inference on an RCT, had we constructed one for her, we would have obtained the same result that she did. Her problem was not her methodology, but rather that she did not understand the social and economic structure that gave rise to the causal relations that she observed. (We shall return to the importance of the underlying structure for understanding what causal pathways are likely and what are unlikely below.)

The problems with simple extrapolation and simple generalization extend beyond RCTs, to both ‘fully controlled’ laboratory experiments and to most non-experimental findings. Our argument here is that evidence from RCTs is not automatically simply generalizable, and that its superior internal validity, if and when it exists, does not provide it with any unique invariance across context. That simple extrapolation and simple generalization are far from automatic also tells us why (even ideal) RCTs of similar interventions give different answers in different settings and the results of large RCTs may differ from the results of meta-analyses on the same treatment (as in LeLorier et al. (1997) ). Such differences do not necessarily reflect methodological failings and will hold across perfectly executed RCTs just as they do across observational studies.

Our arguments are not meant to suggest that extrapolation or even generalization is never reasonable. For instance, conditional cash transfers have worked for a variety of different outcomes in different places; they are often cited as a leading example of how an evaluation with strong internal validity leads to a rapid spread of the policy. Think through the causal chain that is required for CCTs to be successful: People must like money, they must like (or at least not object too much to) their children being educated and vaccinated, there must exist schools and clinics that are close enough and well enough staffed to do their job, and the government or agency that is running the scheme must care about the wellbeing of families and their children. That such conditions hold in a wide range of (although certainly not all) countries makes it unsurprising that CCTs ‘work’ in many replications, though they certainly will not work in places where the schools and clinics do not exist (e.g. Levy (2006)), nor in places where people strongly oppose education or vaccination. So, there are structural reasons why CCT results export where they do. Our objection is to the assumption that it is ‘natural’ that well-established results export; to the contrary, good reasons are needed to justify that they do.

To summarize. Establishing causality does nothing in and of itself to guarantee that the causal relation will hold in some new case, let alone in general. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE from the trial sample will apply anywhere else. The issue is worth mentioning only because of the enormous weight that is currently attached to policing the rigor with which causal claims are established, compared with the rigor devoted to all those further claims, often left unstated, that go into warranting the extrapolation or generalization of those relations.

2.3 Support factors and the ATE

The operation of a cause generally requires the presence of support factors (also known as ‘interactive variables’ or ‘moderators’), factors without which a cause that produces the targeted effect in one place, even though it may be present and have the capacity to operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality (Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a contribution to the outcome) is the kind of causality reflected in equation (1) . (See Rothman (1976 , 2012) for the same idea in epidemiology, which uses the term ‘causal pie’ to refer to a set of causes that are jointly but not separately sufficient for a contribution to an effect.) A standard example is a house burning down because the television was left on, although televisions do not operate in this way without support factors, such as wiring faults, the presence of tinder, and so on.

The value of the ATE depends on the distribution of the values of the ‘support factors’ necessary for T to contribute to Y. This becomes clear if we rewrite (1) in the form

\[
Y_i \;=\; \theta(w_i)\, T_i \;+\; \sum_{j} \gamma_j x_{ij} \tag{4}
\]

where the function θ(·) controls how a k-vector w_i of k ‘support factors’ affects individual i’s treatment effect β_i, so that β_i = θ(w_i). The support factors may include some of the x’s. Since the ATE is the average of the β_i’s, two populations will have the same ATE if and only if they have the same average for the net effect of the support factors necessary for the treatment to work, i.e. for the quantity θ(w_i) in front of T_i. These are, however, just the kind of factors that are likely to be differently distributed in different populations.
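A toy computation, with numbers of our own choosing, makes the point: hold the response function θ(·) fixed and vary only the distribution of a single binary support factor w across two populations.

```python
# Toy illustration (our own numbers): the same treatment and the same response
# function theta(w) yield different ATEs in two populations solely because the
# support factor w is distributed differently.
import numpy as np

def theta(w):
    # hypothetical individual treatment effect: the treatment works only
    # when the support factor is present (w = 1)
    return 2.0 * w

rng = np.random.default_rng(3)
pop_a = rng.binomial(1, 0.8, size=100_000)   # 80% of population A has the support factor
pop_b = rng.binomial(1, 0.2, size=100_000)   # only 20% of population B does

print(f"ATE in population A: {theta(pop_a).mean():.2f}")   # roughly 1.6
print(f"ATE in population B: {theta(pop_b).mean():.2f}")   # roughly 0.4
```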

Given that support factors will operate with different strengths and effectiveness in different places, it is not surprising that the size of the ATE differs from place to place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of countries of the standardized (divided by local standard deviation of the outcome) effects of CCTs on school attendance; all but four show the expected positive effect, and the range runs from −8 to +38 percentage points ( Vivalt (2016) ). Even in this leading case, where we might reasonably conclude that CCTs ‘work’ in getting children into school, it would be hard to calculate credible cost-effectiveness numbers or to come to a general conclusion about whether CCTs are more or less cost effective than other possible policies. Both costs and effect sizes can be expected to differ in new settings, just as they have in observed ones, making these predictions difficult.

AidGrade uses standardized measures of effect size, divided by the standard deviation of the outcome at baseline, as does the major multi-country study by Banerjee et al. (2015). But we might prefer measures that have an economic interpretation, such as J-PAL’s ‘additional years of education per US$100 spent’ (for example if a donor is trying to decide where to spend, as we noted). Nutrition might be measured by height, or by the log of height. Even if the ATE by one measure carries across, it will only do so using another measure if the relationship between the two measures is the same in both situations. This is exactly the sort of issue that a formal analysis forces us to think about: what reasons justify simple extrapolation, and how should predictions be adjusted when simple extrapolation is not justified? (Note also that the ATE in the original RCT can differ depending on whether the outcome is measured in levels or in logs; it is easy to construct examples where the two ATEs have different signs.)
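One such example, entirely of our own construction: suppose the treatment doubles nine small incomes and halves one large one. The ATE is then negative in levels but positive in logs.

```python
# Deliberately artificial example (our own numbers): the ATE is negative in levels
# but positive in logs, because the treatment doubles nine small incomes while
# halving one large income.
import numpy as np

y0 = np.array([1.0] * 9 + [100.0])   # potential outcomes without treatment
y1 = np.array([2.0] * 9 + [50.0])    # potential outcomes with treatment

ate_levels = (y1 - y0).mean()                  # (9*1 - 50) / 10 = -4.1
ate_logs = (np.log(y1) - np.log(y0)).mean()    # (9*log2 - log2) / 10 ≈ +0.55

print(f"ATE in levels: {ate_levels:+.2f}")
print(f"ATE in logs:   {ate_logs:+.3f}")
```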

The worry is not just that the distribution of values for the support factors in a new setting will differ from the distribution in the trial, but that the support factors themselves will differ, or indeed that there may be none at all in the new setting that can get the treatment to work there. Causal processes often require highly specialized economic, cultural, or social structures to enable them to work. Different structures will enable different processes with different causes and different support factors. Consider the Rube Goldberg machine that is rigged up so that flying a kite sharpens a pencil (Cartwright and Hardie (2012, 77)). The underlying structure affords a very specific form of (4) that will not describe causal processes elsewhere. The Rube Goldberg machine is an exaggerated example, but it makes transparent how unreliable simple extrapolation is likely to be when little knowledge of causal structure is available.

For more typical examples, consider systems design, where we aim to construct systems that will generate causal relations that we like and that will rule out causal relations that we do not like. Healthcare systems are designed to prevent nurses and doctors making errors; cars are designed so that drivers cannot start them in reverse; work schedules for pilots are designed so they do not fly too many consecutive hours without rest, because alertness and performance are compromised. In philosophy, a system of interacting parts that underpins causal processes and makes some possible and some impossible, some likely and some unlikely, is labelled a mechanism. (Note that this is only one of many meanings in philosophy and elsewhere for the term ‘mechanism’; in particular it is not ‘mechanism’ in the sense of the causal pathway from treatment to outcomes, which is another common use, for example in Suzuki et al. (2011).) Mechanisms are particularly important in understanding the explanation of causal processes in biology, and the philosophical literature is rife with biological examples, as in the account in the seminal Machamer et al. (2000) of how Shepherd (1988) uses biochemical mechanisms at chemical synapses to explain the process of transmitting electrical signals from one neuron to another. (See also Bechtel (2006), Craver (2007).) ‘Mechanism’ in this sense is not restricted to physical parts and their interactions and constraints but includes social, cultural, and economic arrangements, institutions, norms, habits, and individual psychology. (See, for example, Seckinelgin (2016) on the importance of context in determining the effectiveness of HIV/AIDS therapies.)

As in the Rube Goldberg machine and in the design of cars and work schedules, the physical, social, and economic structure and equilibrium may differ in ways that support, permit, or block different kinds of causal relations and thus render a trial in one setting useless in another. For example, a trial that relies on providing incentives for personal promotion is of no use in a state in which a political system locks people into their social and economic positions. Cash transfers that are conditional on parents taking their children to clinics cannot improve child health in the absence of functioning clinics. Policies targeted at men may not work for women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster; we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same in both a toaster and a car. If we misunderstand the setting, if we do not understand why the treatment in our RCT works, we run the same risks as Russell’s chicken. (See Little (2007) and Howick et al. (2013) for many of the difficulties in using claims about mechanistic structure to support extrapolation, and Parkkinen et al. (2018) defending the importance of mechanistic reasoning both for internal validity and for extrapolation.)

2.4 When RCTs speak for themselves: no extrapolation or generalization required

For some things we want to learn, an RCT is enough by itself. An RCT may provide a counterexample to a general theoretical proposition, either to the proposition itself (a simple refutation test) or to some consequence of it (a complex refutation test). An RCT may also confirm a prediction of a theory, and although this does not confirm the theory, it is evidence in its favor, especially if the prediction seems inherently unlikely in advance. This is all familiar territory, and there is nothing unique about an RCT; it is simply one among many possible testing procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating causality in some population can be thought of as proof of concept , that the treatment is capable of working somewhere (as in the remark from Curtis Meinert, prominent expert on clinical trial methodology: “There is no point in worrying whether a treatment works the same or differently in men and women until it has been shown to work in someone” (quoted in Epstein (2007 , 108))). This is one of the arguments for the importance of internal validity.

Nor is extrapolation called for when an RCT is used for evaluation, for example to satisfy donors that the project they funded achieved its aims in the population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be useful to the world at large (to be global public goods) requires arguments and guidelines that justify using the results in some way elsewhere; the global public good is not an automatic by-product of the Bank fulfilling its fiduciary responsibility. We need something, some regularity or invariance, and that something can rarely be recovered by simply generalizing across trials.

A third non-problematic and important use of an RCT is when the parameter of interest is the ATE in a well-defined population from which the trial sample is itself a random sample. In this case the sample average treatment effect (SATE) is an unbiased estimator of the population average treatment effect (PATE) that, by assumption, is our target (see Imbens (2004) for these terms). We refer to this as the ‘public health’ case; like many public health interventions, the target is the average, ‘population health,’ not the health of individuals. One major (and widely recognized) danger of this use of RCTs is that exporting results from (even a random) sample to the population will not go through in any simple way if the outcomes of individuals or groups of individuals change the behavior of others—which is common in social examples and in public health whenever there is a possibility of contagion.

2.5 Reweighting and stratifying

Many advocates of RCTs understand that ‘what works’ needs to be qualified to ‘what works under which circumstances’ and try to say something about what those circumstances might be, for example, by replicating RCTs in different places and thinking intelligently about the differences in outcomes when they find them. Sometimes this is done in a systematic way, for example by having multiple treatments within the same trial so that it is possible to estimate a ‘response surface’ that links outcomes to various combinations of treatments (see Greenberg and Schroder (2004) or Shadish et al. (2002) ). For example, the RAND health experiment had multiple treatments, allowing investigation of how much health insurance increased expenditures under different circumstances. Some of the negative income tax experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with the number of treatments and controls in each arm optimized to maximize precision of estimated response functions subject to an overall cost limit (see Conlisk (1973) ). Experiments on time-of-day pricing for electricity had a similar structure (see Aigner (1985) ).
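As a small illustration of what a response-surface design delivers, the following sketch (toy data and functional form of our own) fits outcomes as a function of a randomized treatment intensity rather than estimating a single ATE.

```python
# Toy sketch of a response-surface analysis of a multi-arm trial (our own data):
# outcomes are modeled as a function of a randomized treatment intensity, such as
# a copayment rate or a tax rate, rather than as a single treatment-control contrast.
import numpy as np

rng = np.random.default_rng(5)
arms = np.array([0.0, 0.25, 0.5, 0.75, 1.0])       # randomized intensities
intensity = rng.choice(arms, size=5_000)
outcome = 10.0 - 4.0 * intensity + 1.5 * intensity**2 + rng.normal(0, 1, 5_000)

# Fit a quadratic response surface linking outcomes to intensity.
coeffs = np.polyfit(intensity, outcome, deg=2)
print("fitted quadratic coefficients (highest degree first):", np.round(coeffs, 2))
```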

The experiments by MDRC have also been analyzed across cities in an effort to link city features to the results of the RCTs within them (see Bloom et al. (2005) ). Unlike the RAND and NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015), who finds, for the collection of trials she studied, that development-related RCTs run by government agencies typically find smaller (standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al. (2013) , who ran parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya, found similar results there. Note that these analyses have a different purpose from meta-analyses that assume that different trials estimate the same parameter up to noise and average in order to increase precision.

Statistical approaches are also widely used to adjust the results from a trial population to predict those in a target population; these are designed to deal with the fact that treatment effects vary systematically with variations in the support factors. One procedure is post-experimental stratification, which parallels post-survey stratification in sample surveys. The trial is broken up into sub-groups that have the same combination of known, observable w’s (age, race, gender, co-morbidities, for example), the ATEs within each of the subgroups are calculated, and they are then reassembled according to the configuration of w’s in the new context. This can be used to estimate the ATE in a new context, or to correct estimates to the parent population when the trial sample is not a random sample of the parent. Other methods can be used when there are too many w’s for stratification, for example by estimating the probability that each observation in the population is included in the trial sample as a function of the w’s, and then weighting each observation by the inverse of these propensity scores. Good references for these methods are Stuart et al. (2011) or, in economics, Angrist (2004) and Hotz et al. (2005).
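A minimal sketch of post-experimental stratification, on toy data of our own: stratum-specific ATEs estimated in the trial are reweighted by the stratum shares of a hypothetical target population rather than by their shares in the trial sample.

```python
# Sketch of post-experimental stratification (toy data, our own): estimate the ATE
# within strata defined by an observed support factor w, then reweight the stratum
# ATEs by the stratum shares of a target population rather than the trial sample.
import numpy as np

rng = np.random.default_rng(4)

# Trial sample: stratum w = 1 is over-represented relative to the target population.
n = 10_000
w = rng.binomial(1, 0.7, size=n)              # 70% of the trial sample has w = 1
t = rng.binomial(1, 0.5, size=n)              # random assignment
y = 1.0 + 2.0 * w * t + rng.normal(0, 1, n)   # treatment only works when w = 1

# Stratum-specific ATEs from the trial.
ate_by_stratum = {
    k: y[(w == k) & (t == 1)].mean() - y[(w == k) & (t == 0)].mean()
    for k in (0, 1)
}

trial_shares = {k: np.mean(w == k) for k in (0, 1)}
target_shares = {0: 0.8, 1: 0.2}              # hypothetical target population

ate_trial = sum(ate_by_stratum[k] * trial_shares[k] for k in (0, 1))
ate_target = sum(ate_by_stratum[k] * target_shares[k] for k in (0, 1))

print(f"ATE in the trial sample:       {ate_trial:.2f}")   # roughly 1.4
print(f"Reweighted ATE for the target: {ate_target:.2f}")  # roughly 0.4
```

The reweighting works here only because, by construction, the single stratifying variable w is the only support factor and it is observed in both populations; the caveats in the next paragraph explain why matters are rarely so simple in practice.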

These methods are often not applicable, however. First, reweighting works only when the observable factors used for reweighting include all (and only) genuine interactive causes (support/moderator factors). Second, as with any form of reweighting, the variables used to construct the weights must be present in both the original and new context. For example, if we are to carry a result forward in time, we may not be able to extrapolate from a period of low inflation to a period of high inflation; medical treatments that work in cold climates may not work in the tropics. As Hotz et al. (2005) note, it will typically be necessary to rule out such ‘macro’ effects, whether over time, or over locations. Third, reweighting also depends on the assumption that the same governing equation (4) covers both the trial and the target population.

Pearl and Bareinboim (2011 , 2014) and Bareinboim and Pearl (2013 , 2014) provide strategies for inferring information about new populations from trial results that are more general than reweighting. They suppose we have available both causal information and probabilistic information for population A (e.g. the experimental one), while for population B (the target) we have only (some) probabilistic information, and also that we know that certain probabilistic and causal facts are shared between the two and certain ones are not. They offer theorems describing what causal conclusions about population B are thereby fixed. Their work underlines the fact that exactly what conclusions about one population can be supported by information about another depends on exactly what causal and probabilistic facts they have in common. But as Muller (2015) notes, this, like the problem with simple reweighting, takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation—which is one of their strengths, supporting their credibility—but the benefit vanishes as soon as we try to carry their results to a new context.

This discussion leads to a number of points. First it underlines our previous arguments that we cannot get to general claims by simple generalization; there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter, nor that the kinds of interventions and outcomes we measure in typical RCTs participate in general causal relations.

Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or failing that, subgroup analysis, because it can provide information that may be useful for generalization or extrapolation. For example, Kremer and Holla (2009) note that, in their trials, school attendance is surprisingly sensitive to small subsidies, which they suggest is because there are a large number of students and parents who are on the (financial) margin between attending and not attending school; if this is indeed the mechanism for their results, a good variable for stratification would be distance from the relevant cutoff. We also need to know that this same mechanism works in any new target setting, as discussed at the end of Section 2.3.

Third, we need to be explicit about causal structure, even if that means more model building and more—or different—assumptions than advocates of RCTs are often comfortable with. We need something, some regularity or invariance, and that something can rarely be recovered by simply generalizing across trials. To be clear, modeling causal structure does not commit us to the elaborate and often incredible assumptions that characterize some structural modeling in economics, but there is no escape from thinking about the way things work; the why as well as the what.

Fourth, to use these techniques for reweighting and stratifying, we will need to know more than the results of the RCT itself, for example about differences in social, economic, and cultural structures and about the joint distributions of causal variables, knowledge that will often only be available through observational studies. We will also need external information, both theoretical and empirical, to settle on an informative characterization of the population enrolled in the RCT because how that population is described is commonly taken to be some indication of which other populations would yield similar results.

Many medical and psychological journals are explicit about this. For instance, the rules for submission recommended by the International Committee of Medical Journal Editors, ICMJE (2015 , 14) insist that article abstracts “Clearly describe the selection of observational or experimental participants (healthy individuals or patients, including controls), including eligibility and exclusion criteria and a description of the source population.” An RCT is conducted on a specific trial sample, somehow drawn from a population of specific individuals. The results obtained are features of that sample, of those very individuals at that very time, not any other population with any different individuals that might, for example, satisfy one of the infinite set of descriptions that the trial sample satisfies. If following the ICMJE advice is to produce warrantable extrapolation—simple or adjusted—from a trial population to some other, the descriptors for the trial population must be correctly chosen. As we have argued, they must pick out populations where the same form of equation (4) holds and that have approximately the same mean (or one that we know how to adjust) for the net effect of the support factors in the two populations.

This same issue is confronted already in study design. Apart from special cases, like post hoc evaluation for payment-for-results, we are not especially concerned to learn about the very individuals enrolled in the trial. Most experiments are, and should be, conducted with an eye to what the results can help us learn about other populations. This cannot be done without substantial assumptions about what might and what might not be relevant to the production of the outcome studied. So both intelligent study design and responsible reporting of study results involve substantial background assumptions.

Of course, this is true for all studies. But RCTs require special conditions if they are to be conducted at all and especially if they are to be conducted successfully—for example, local agreements, compliant subjects, affordable administrators, multiple blinding, people competent to measure and record outcomes reliably, a setting where random allocation is morally and politically acceptable, etc.—whereas observational data are often more readily and widely available. In the case of RCTs, there is danger that these kinds of considerations have too much effect. This is especially worrisome where the features that the trial sample should have are not justified, made explicit, or subjected to serious critical review.

The need for observational knowledge is one of many reasons why it is counter-productive to insist that RCTs are the gold standard or that some categories of evidence should be prioritized over others; these strategies leave us helpless in using RCTs beyond their original context. The results of RCTs must be integrated with other knowledge, including the practical wisdom of policymakers, if they are to be useable outside the context in which they were constructed.

Contrary to much practice in medicine as well as in economics, conflicts between RCTs and observational results need to be explained, for example by reference to the different characteristics of the different populations studied in each, a process that will sometimes yield important evidence, including on the range of applicability of the RCT results themselves. While the validity of the RCT will sometimes provide an understanding of why the observational study found a different answer, there is no basis (or excuse) for the common practice of dismissing the observational study simply because it was not an RCT and therefore must be invalid. It is a basic tenet of scientific advance that, as collective knowledge advances, new findings must be able to explain and be integrated with previous results, even results that are now thought to be invalid; methodological prejudice is not an explanation.

2.6 Using RCTs to build and test theory

RCT results, as with any well-established scientific claims, can be used in the familiar hypothetico-deductive way to test theory.

For example, one of the largest and most technically impressive of the development RCTs is by Banerjee et al. (2015), which tests a ‘graduation’ program designed to permanently lift extremely poor people from poverty by providing them with a gift of a productive asset (guinea pigs, pigs, sheep, goats, or chickens, depending on locale), training and support, and life-skills coaching, as well as support for consumption, saving, and health services. The idea is that this package of aid can help people break out of poverty traps in a way that would not be possible with one intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens died), found largely positive and persistent effects—with similar (standardized) effect sizes—for a range of outcomes (economic, mental and physical health, and female empowerment). One site apart, essentially everyone accepted their assignment. Replication of positive ATEs over such a wide range of places certainly provides proof of concept for such a scheme. Yet Bauchet et al. (2015) fail to replicate the result in South India, where the control group got access to much the same benefits (Heckman et al. (2000) call this ‘substitution bias’). Even so, the results are important because, although there is a longstanding interest in poverty traps, many economists have been skeptical of their existence or that they could be sprung by such aid-based policies. In this sense, the study is an important contribution to the theory of economic development; it tests a theoretical proposition and will (or should) change minds about it.

Economists have been combining theory and randomized controlled trials in a variety of other ways since the early experiments. The trials help build and test theory and theory in turn can answer questions about new settings and populations that we cannot answer by simple extrapolation or generalization of the trial results. We will outline a few economics examples to give a sense of how the interweaving of theory and results can work.

Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a simple static theory of labor supply. According to this, people choose how to divide their time between work and leisure in an environment in which they receive a minimum G if they do not work, and where they receive an additional amount (1 − t)w for each hour they work, where w is the wage rate and t is a tax rate. The trials assigned different combinations of G and t to different trial groups, so that the results traced out the labor supply function, allowing estimation of the parameters of preferences, which could then be used in a wide range of policy calculations, for example to raise revenue at minimum utility loss to workers.
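In this notation, the budget constraint that the experimental design varies can be written out explicitly (a standard formulation consistent with the description above, not a quotation from Orcutt and Orcutt):

\[
c \;=\; G + (1 - t)\,w\,h ,
\]

where c is consumption (income) and h is hours worked. Each trial arm assigns a different (G, t) pair, so the observed choices of h trace out the labor supply function h(G, (1 − t)w), which is what the subsequent policy calculations require.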

Following these early trials, there has been a continuing tradition of using trial results, together with the baseline data collected for the trial, to fit structural models that are to be used more generally. (Early examples include Moffitt (1979) on labor supply and Wise (1985) on housing; a more recent example is Heckman et al. (2013) for the Perry pre-school program. Development economics examples include Attanasio et al. (2012), Attanasio et al. (2015), Todd and Wolpin (2006), Wolpin (2013), and Duflo et al. (2012).) These structural models sometimes require formidable auxiliary assumptions on functional forms or the distributions of unobservables, but they have compensating advantages, including the ability to integrate theory and evidence, to make out-of-sample predictions, and to analyze welfare, and the use of RCT evidence allows the relaxation of at least some of the assumptions that are needed for identification. In this way, the structural models borrow credibility from the RCTs and in return help set the RCT results within a coherent framework. Without some such interpretation, the welfare implications of RCT results can be problematic; knowing how people in general (let alone just people in the trial population) respond to some policy is rarely enough to tell whether or not they are made better off (see Harrison (2014a, b)). Traditional welfare economics draws a link from preferences to behavior, a link that is respected in structural work but often lost in the ‘what works’ literature, and without which we have no basis for inferring welfare from behavior. What works is not equivalent to what should be.

Even simple theory can do much to interpret, to extend, and to use RCT results. In both the RAND Health Experiment and the negative income tax experiments, an immediate issue concerned the difference between short-run and long-run responses; indeed, differences between immediate and ultimate effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what would happen if consumers/workers were permanently faced with higher or lower prices/wages, but the trials could only run for a limited period. A temporarily high tax rate on earnings is effectively a ‘fire sale’ on leisure, so that the experiment provided an opportunity to take a vacation and make up the earnings later, an incentive that would be absent in a permanent scheme. How do we get from the short-run responses that come from the trial to the long-run responses that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers for the income tax experiments, as did Arrow (1975) for the RAND Health Experiment.

Arrow’s analysis illustrates how to use both structure and observational data in combination with results from one setting to predict results in another. He sets up a two-period model in which the price of medical care is lowered in the first period only, and shows how to derive what we want, which is the response in the first period if prices were lowered by the same proportion in both periods. The magnitude that we want is S, the compensated derivative of period-1 medical care with respect to identical increases in p1 and p2, the prices in periods 1 and 2. This is equal to s11 + s12, the sum of the derivatives of period-1 demand with respect to the two prices. The trial gives only s11. But if we have post-trial data on medical services for both treatments and controls, we can infer s21, the effect of the experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky symmetry, says that s12 = s21 and so allows Arrow to infer s12 and thus S. He contrasts this with Metcalf’s alternative solution, which makes different assumptions—that preferences over the two periods are intertemporally additive—in which case the long-run elasticity can be obtained from knowledge of the income elasticity of post-experimental medical care, which would have to come from an observational analysis.
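Written out, the logic of the argument is compact; the sketch below simply restates the preceding paragraph in its own notation.

```latex
% Sketch of the identification argument, in the notation of the text:
%   target:            S = s11 + s12  (compensated response of period-1 care
%                                      to an equal change in p1 and p2)
%   the trial gives:   s11
%   post-trial data:   s21 (effect of the period-1 price change on period-2 care)
%   Slutsky symmetry:  s12 = s21
\[
  S \;=\; s_{11} + s_{12} \;=\; s_{11} + s_{21}.
\]
```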

These two alternative approaches show how we can choose, based on our willingness to make assumptions and on the data that we have, a suitable combination of (elementary and transparent) theoretical assumptions and observational data in order to adapt and use trial results. Such analysis can also help design the original trial by clarifying what we need to know in order to use the results of a temporary treatment to estimate the permanent effects that we need. Ashenfelter provides a third solution, noting that the two-period model is formally identical to a two-person model, so that we can use information on two-person labor supply to tell us about the dynamics. In the RAND case, internal evidence suggests that short-run and long-run responses were not in fact very different, but Arrow’s analysis provides an illustration of how theory can form a bridge from what we get to what we want.

Theory can often allow us to reclassify new or unknown situations as analogous to situations where we already have background knowledge. In economics, one frequently useful way of doing this is when the new policy can be recast as equivalent to a change in the prices and incomes faced by respondents. The consequences of a new policy may be easier to predict if we can reduce it to equivalent changes in income and prices, whose effects are often well understood and well studied. Todd and Wolpin (2008) and Wolpin (2013) make this point and provide examples. In the labor supply case, an increase in the tax rate has the same effect as a decrease in the wage rate, so that we can rely on previous literature to predict what will happen when tax rates are changed. In the case of Mexico’s PROGRESA conditional cash transfer program, Todd and Wolpin note that the subsidies paid to parents if their children go to school can be thought of as a combination of a reduction in children’s wages and an increase in parents’ income, which allows them to predict the results of the conditional cash experiment with limited additional assumptions. If this works, as it partially does in their analysis, the trial helps consolidate previous knowledge and contributes to an evolving body of theory and empirical evidence, including evidence from trials.
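The reformulation can be written in one line. The sketch below uses our own notation and an intentionally simple static budget constraint, so it should be read as an illustration of the idea rather than as Todd and Wolpin’s model.

```latex
% Let e = 1 if the child attends school and 0 if the child works at wage w_c;
% y_0 is other household income; s is the transfer paid only when e = 1.
\[
  \underbrace{y_0 + w_c(1-e) + s\,e}_{\text{income under the program}}
  \;=\;
  \underbrace{(y_0 + s)}_{\text{higher unearned income}}
  \;+\;
  \underbrace{(w_c - s)(1-e)}_{\text{lower effective child wage}} .
\]
% So the conditional transfer acts like a cut of s in the child's wage plus an
% unconditional grant of s, which is why previously estimated wage and income
% responses can be used to predict the program's effects.
```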

The program of thinking about policy changes as equivalent to price and income changes has a long history in economics; much of rational choice theory can be so interpreted (see Deaton and Muellbauer (1980) for many examples). When this conversion is credible, and when a trial on some apparently unrelated topic can be modeled as equivalent to a change in prices and incomes, and when we can assume that people in different settings respond similarly to changes in prices and incomes, we have a readymade framework for incorporating the trial results into previous knowledge, as well as for extending the trial results and using them elsewhere. Of course, all depends on the validity and credibility of the theory; people may not in fact treat a tax increase as a decrease in the price of leisure, and behavioral economics is full of examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of behavioral economics by many of the current generation of researchers may account for their limited willingness to use conventional choice theory in this way. Unfortunately, behavioral economics does not yet offer a replacement for the general framework of choice theory that is so useful in this regard.

Theory can also help with a problem we raised in the summary of Section 1: people who are randomized into the treatment group may refuse treatment. When theory is good enough to indicate how to represent the gains and losses on which trial participants are likely to base their compliance decisions, analysis can sometimes help us adjust the trial estimates back to what we would like to know.

2.6 Scaling up: using the average for populations

Many RCTs are small-scale and local, carried out, for example, in a few schools, clinics, or farms in a particular geographic, cultural, and socio-economic setting. If an intervention is judged successful, according to a cost-effectiveness criterion for example, it becomes a candidate for scaling up: applying the same intervention over a much larger area, often a whole country, or sometimes even beyond, as when some treatment is considered for all relevant World Bank projects. Predicting the same results at scale as in the trial is a case of simple extrapolation. We discuss it separately, however, because it can raise special problems. The fact that the intervention might work differently at scale has long been noted in the economics literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and Duflo (2009).

In medicine, biological interactions between people are less common than are social interactions in social science, but they can still be important. Infectious diseases are a well-known example, where immunization programs affect the dynamics of disease transmission through herd immunity (see Fine and Clarkson (1986) and Manski (2013, 52)). The social and economic setting also affects how drugs are actually used, and the same issues can arise; the distinction between efficacy and effectiveness in clinical trials is in part a recognition of this fact. We want here to emphasize the pervasiveness of such effects, as well as to note again that this should not be taken as an argument against using RCTs, but only against the idea that effects at scale are likely to be the same as in the trial.

An example of what are often called ‘general equilibrium effects’ comes from agriculture. Suppose an RCT demonstrates that, in the study population, a new way of using fertilizer has a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared with those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic—as is usually thought to be the case, at least in the short run—cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. In this case, the scaled-up effect is opposite in sign to the trial effect. The problem is not with the trial results, which can be usefully incorporated into a more comprehensive market model that embeds the responses estimated by the trial. The problem arises only if we assume that the aggregate looks like the individual. That other ingredients of the aggregate model must come from observational studies should not be a criticism, even for those who favor RCTs; it is simply the price of doing serious analysis.
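A back-of-the-envelope calculation makes the sign reversal concrete. The numbers below (a demand elasticity of 0.4 and a 20% yield gain) are illustrative assumptions, not estimates from any trial.

```python
# A toy general-equilibrium calculation with made-up numbers.
# Demand for cocoa is iso-elastic, Q = A * P**(-eps), with eps < 1 (inelastic).

eps = 0.4                      # assumed price elasticity of demand
P0, Q0 = 1.0, 100.0            # normalize the initial price and world output
A = Q0 * P0**eps               # demand scale implied by the normalization

yield_gain = 0.20              # the treatment raises yields by 20%

# Trial scale: a handful of treated farmers do not move the world price.
treated_revenue_trial = (1 + yield_gain) * P0          # per unit of baseline output
print(f"Trial-scale revenue per treated farmer (index): {treated_revenue_trial:.2f}")

# At scale: every farmer adopts, world output rises, and the price falls
# until the market clears.
Q1 = (1 + yield_gain) * Q0
P1 = (A / Q1) ** (1 / eps)
revenue_ratio = (Q1 * P1) / (Q0 * P0)
print(f"Price after scale-up: {P1:.2f}  (was {P0:.2f})")
print(f"Aggregate farm revenue relative to baseline: {revenue_ratio:.2f}")
# With eps < 1 the revenue ratio is below 1: the scaled-up effect on incomes
# has the opposite sign to the trial effect.
```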

There are many possible interventions that alter supply or demand whose effect, in aggregate, will change a price or a wage that is held constant in the original RCT. Indeed, any trial that changes the quantities that people demand or supply—including labor supply—must, as a matter of logic, affect other people, because the new demand has to be met, or the new supply accommodated. In the language of the Rubin causal model, this is a failure of SUTVA, the stable unit treatment value assumption. Of course, each unit may be too small to have any perceptible effect by itself, so SUTVA holds to a high degree of approximation in the trial, but once we aggregate to the population, the effects will often be large enough to modify or reverse the result from the trial. For example, expanding education will change the relative supplies of skilled and unskilled labor, with implications for relative wage rates. Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics, which will change prices or waiting lines, or both. There are interactions between people that will operate only at scale. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools (see the contrasting studies of Angrist et al. (2002) and Hsieh and Urquiola (2006)). Educational or training programs may benefit those who are treated but harm those left behind; Crépon et al. (2014) recognize the issue and show how to adapt an RCT to deal with it.

Much of economics is concerned with analyzing equilibria, most obviously in the equilibrium of supply and demand. Multiple causal mechanisms are reconciled by the adjustment of some variable, such as a price. RCTs will often be useful in analyzing one or other mechanism, in which the equilibrating variable is held constant, and the results of those RCTs can be used to analyze and predict the equilibrium effects of policies. But the results of implementing policies will often look very different from the trial results, as in the cocoa example above. If, as is often argued, economics is about the analysis of equilibrium, simple extrapolation of the results of an RCT will rarely be useful. Note that we are making no claim about the success of economic models, either in analysis or prediction. But the analysis of equilibrium is a matter of logical consistency without which we are left with contradictory propositions.

2.7 Drilling down: using the average for individuals

Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at the level of individual units, even individual units that were included in the trial. A well-conducted RCT delivers an ATE for the trial population but, in general, that average does not apply to everyone. It is not true, for example, as argued in the American Medical Association’s Users’ guide to the medical literature, that “if the patient would have been enrolled in the study had she been there—that is she meets all of the inclusion criteria and doesn’t violate any of the exclusion criteria—there is little question that the results are applicable” (see Guyatt et al. (1994, 60)). Even more misleading are the often-heard statements that an RCT with an average treatment effect insignificantly different from zero has shown that the treatment works for no one.
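A small simulation makes the point. The distribution of individual effects below is entirely made up, chosen only so that the average is close to zero while most individual effects are not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative only: individual treatment effects drawn so that the mean is
# close to zero while many individuals are helped and many are harmed.
n = 100_000
benefit = rng.normal(loc=2.0, scale=0.5, size=n)    # effect for a 'responder'
harm    = rng.normal(loc=-0.5, scale=0.2, size=n)   # effect for everyone else
responder = rng.random(n) < 0.2                     # 20% respond strongly
effect = np.where(responder, benefit, harm)

print(f"ATE (what a well-run RCT estimates): {effect.mean():+.3f}")
print(f"Share with a positive effect:        {(effect > 0).mean():.1%}")
print(f"Share with a negative effect:        {(effect < 0).mean():.1%}")
# A mean near zero does not show that the treatment 'works for no one', and a
# positive mean would not show that it works for everyone.
```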

These issues are familiar to physicians practicing evidence-based medicine, whose guidelines require “integrating individual clinical expertise with the best available external clinical evidence from systematic research” (Sackett et al. (1996, 71)). Exactly what this means is unclear; physicians know much more about their patients than is allowed for in the ATE from the RCT (though, once again, stratification in the trial is likely to be helpful), and they often have intuitive expertise from long practice that can help them identify features in a particular patient that may influence the effectiveness of a given treatment for that patient (see Horwitz (1996)). But there is an odd balance struck here. These judgments are deemed admissible in discussion with the individual patient, but they don’t add up to evidence to be made publicly available, with the usual cautions about credibility, by the standards adopted by most EBM sites. It is also true that physicians can have prejudices and ‘knowledge’ that might be anything but. Clearly, there are situations where forcing practitioners to follow the average will do better, even for individual patients, and others where the opposite is true (see Kahneman and Klein (2009)). Horwitz et al. (2017) propose that medical practice should move from evidence-based medicine to what they call medicine-based evidence, in which all individual case histories are assembled and matched to provide a basis for deviation from the means of RCTs.

The question of whether averages are useful to individuals arises throughout social science research. Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an RCT of a classroom innovation. The innovation is successful on average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt at St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and ask how St Joseph’s could have known that it was a failure without benefit of ‘rigorous’ evidence. Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and similar academic standing, might not St Joseph’s experience be more relevant to what might happen at St Mary’s than is the positive average from the RCT? And might it not be a good idea for the teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and why? They may be able to observe the mechanism of the failure, if such it was, and figure out whether the same problems would apply for them, or whether they might be able to adapt the innovation to make it work for them, perhaps even more successfully than the positive average in the trial.

Once again, these questions are unlikely to be easily answered in practice; but, as with exportability, there is no serious alternative to trying. Assuming that the average works for you will often be wrong, and it will at least sometimes be possible to do better; for instance, by judicious use of theory, reasoning by analogy, process tracing, identification of mechanisms, sub-group analysis, or recognizing various symptoms that a causal pathway is possible, as in Bradford Hill (1965) (see also Cartwright (2015), Reiss (2017), and Humphreys and Jacobs (2017)). As in the medical case, the advice to individual schools often lacks specificity. For example, the U.S. Institute of Education Sciences has provided a “user-friendly” guide to practices supported by rigorous evidence (U.S. Department of Education (2003)). The advice, which is similar to recommendations throughout the evidence-based social and health policy literature, is that the intervention be demonstrated effective through well-designed RCTs in more than one site and that “the trials should demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, 17). No operational definition of “similar” is provided.

Conclusions

It is useful to respond to two challenges that are often put to us, one from medicine and one from social science. The medical challenge is, “If you are being prescribed a new drug, wouldn’t you want it to have been through an RCT?” The second (related) challenge is, “OK, you have highlighted some of the problems with RCTs, but other methods have all of those problems, plus problems of their own.” We believe that we have answered both of these in the paper but that it is helpful to recapitulate.

The medical challenge is about you, a specific person, so that one answer would be that you may be different from the average, and you are entitled to, and ought to, ask for theory and evidence about whether it will work for you. This would be in the form of a conversation between you and your physician, who knows a lot about you. You would want to know how this class of drug is supposed to work and whether that mechanism is likely to work for you. Is there any evidence from other patients, especially patients like you, with your condition and in your circumstances, or are there suggestions from theory? What scientific work has been done to identify what support factors matter for success with this kind of drug? If the only information available is from the pharmaceutical company, whose priors and financial interests might have somehow influenced the results, an RCT might seem like a good idea. But even then, and although knowledge of the mean effect among some group is certainly of value, you might give little weight to an RCT given how its participants were selected, or where there is little information about whether the outcomes are relevant to you. Recall that many new drugs are prescribed ‘off-label’, for a purpose for which they were not tested, and, beyond that, that many new drugs are administered in the absence of RCT evidence because you are actually being enrolled in one. For patients whose last chance is to participate in a trial of some new drug, this is exactly the sort of conversation you should have with your physician (followed by one asking her to reveal whether you are in the active arm, so that you can switch if not), and such conversations need to take place for all prescriptions that are new to you. In these conversations, the results of an RCT may have marginal value. If your physician tells you that she endorses evidence-based medicine, and that the drug will work for you because an RCT has shown that ‘it works’, it is time to find a physician who knows that you and the average are not the same.

The second challenge claims that other methods are always dominated by an RCT; as one of our referees put it, echoing Churchill, “RCTs are horrible, except when compared to the alternatives.” We believe that this challenge is not well-formulated. Dominated for answering what question, for what purposes? The chief advantage of the RCT is that it can, if well-conducted, give an unbiased estimate of an ATE in a study (trial) sample and thus provide evidence that the treatment caused the outcome in some individuals in that sample. Note that ‘well-conducted’ rules out all of the things that almost always occur in practice, including attrition, intentional lack of blinding or unintentional unblinding, and other post-randomization confounding and selection biases (see Hernán et al. (2013)). If an unbiased estimate of the ATE is what you want, and there is little background knowledge available, and the price is right, then an RCT may be the best choice. As to other questions, the RCT result can be part—but usually only a small part—of the defense of (a) a general claim, (b) a claim that the treatment will cause that outcome for some other individuals, (c) a claim about what the ATE will be in some other population, or even (d) a claim about something very different that the RCT result tests. But RCTs do little for these enterprises on their own. What is the best overall package of research work for tackling these questions—most cost-effective and most likely to produce correct results—depends on what we know and on what different kinds of research will cost.

There are examples where an RCT does better than an observational study, and these seem to be the cases that come to mind for defenders of RCTs. For example, regressions of whether people who get Medicaid do better or worse than people with private insurance are vitiated by gross differences in the other characteristics of the two populations. But it is a long step from that to saying that an RCT can solve the problem, let alone that it is the only way to solve the problem. It will not only be expensive per subject, but it can only enroll a selected and almost certainly unrepresentative study sample, it can be run only temporarily, and the recruitment to the experiment will necessarily be different from recruitment in a scheme that is permanent and open to the full qualified population. The subjects in the trial are likely to find out whether or not they are in the treatment arm, either because the treatment itself prevents blinding, or because side-effects or differences in protocol reveal their status; subjects may differentially leave the trial given this information. None of this removes the blemishes of the observational study, but there are many methods of mitigating its difficulties, so that, in the end, an observational study with credible corrections and a more relevant and much larger study sample—today often the complete population of interest through administrative records, where blinding and selection issues are absent—may provide a better estimate.

The medical community seems slow and reluctant to embrace other reliable methods of causal inference. The Academy of Medical Sciences (2017, 4), in its review of sources of evidence on the efficacy and effectiveness of medicines, agrees with us that “The type of evidence, and the methods needed to analyse that evidence, will depend on the research question being asked.” Still, it does not mention methods widely used in the social and economic sciences, such as instrumental variables, econometric modelling, deduction from theory, causal Bayesian nets, process tracing, or qualitative comparative analysis. Each of these has its strengths and weaknesses, each allows causal inference though not all allow an estimate of effect size, and each—as with every method—requires causal background knowledge as input in order to draw causal conclusions. But in the face of widespread unblinding and the increasing cost of RCTs, it is wasteful not to make use of these methods. Everything has to be judged on a case-by-case basis. There is no valid argument for a lexicographic preference for RCTs.

There is also an important line of enquiry that goes not only beyond RCTs, but beyond the ‘method of differences’ that is common to RCTs, regressions, or any form of controlled or uncontrolled comparison. The hypothetico-deductive method confronts theory-based deductions with the data—either observational or experimental. As noted above, economists routinely use theory to tease out a new implication that can be taken to the data, and there are also good examples in medicine. One is Bleyer and Welch’s (2012) demonstration of the limited effectiveness of mammography screening; the data do not show the compensating changes in early and late stage breast-cancer incidence that would accompany the large-scale introduction of successful screening. This is a topic where RCTs have been indecisive and controversial, if only because they are 20–30 years old and therefore outdated relative to the current rapidly changing environment (see Marmot et al. (2013)). Such uses of the hypothetico-deductive method are different from what seems to be usually meant by an ‘observational study,’ in which groups are compared with questionable controls for confounders, and where randomization, in spite of its inadequacies, is arguably better.

RCTs are the ultimate in non-parametric estimation of average treatment effects in trial samples because they make so few assumptions about heterogeneity, causal structure, choice of variables, and functional form. RCTs are often convenient ways to introduce experimenter-controlled variance—if you want to see what happens, then kick it and see, twist the lion’s tail—but note that many experiments, including many of the most important (and Nobel Prize winning) experiments in economics, do not and did not use randomization (see Harrison (2013) , Svorencik (2015) ). But the credibility of the results, even internally, can be undermined by unbalanced covariates and by excessive heterogeneity in responses, especially when the distribution of effects is asymmetric, where inference on means can be hazardous. Ironically, the price of the credibility in RCTs is that we can only recover the mean of the distribution of treatment effects, and that only for the trial sample. Yet, in the presence of outliers in treatment effects or in covariates, reliable inference on means is difficult. And randomization in and of itself does nothing unless the details are right; purposive selection into the experimental population, like purposive selection into and out of assignment, undermines inference in just the same way as does selection in observational studies. Lack of blinding, whether of participants, investigators, data collectors, or analysts, undermines inference, akin to a failure of exclusion restrictions in instrumental variable analysis.
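To illustrate the warning about asymmetric treatment effects and inference on means, here is a small simulation of our own construction (the effect distribution and sample sizes are arbitrary): randomization delivers an unbiased estimate of the ATE, yet the usual confidence interval can undercover when a few large responders dominate the mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small trial in which treatment effects are highly right-skewed.
n_per_arm, n_sims = 50, 5000
sigma = 1.5
true_ate = np.exp(sigma**2 / 2)          # mean of a lognormal(0, sigma) effect

cover = 0
for _ in range(n_sims):
    y0 = rng.normal(size=n_per_arm)                              # control outcomes
    y1 = rng.normal(size=n_per_arm) + rng.lognormal(0.0, sigma, n_per_arm)
    diff = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / n_per_arm + y0.var(ddof=1) / n_per_arm)
    cover += abs(diff - true_ate) <= 1.96 * se                   # nominal 95% interval

print(f"True ATE: {true_ate:.2f}")
print(f"Coverage of the nominal 95% interval: {cover / n_sims:.1%}")
# Randomization makes the difference in means unbiased for the ATE, but with
# asymmetric effects and modest samples the usual interval can fall short of
# its nominal coverage, as the text warns.
```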

The lack of structure can be seriously disabling when we try to use RCT results outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof of concept. Beyond that, the results cannot be used to help make predictions beyond the trial sample without more structure, without more prior information, and without having some idea of what makes treatment effects vary from place to place or time to time. There is no option but to commit to some causal structure if we are to know how to use RCT evidence out of the original context. Simple generalization and simple extrapolation do not cut the mustard. This is true of any study, experimental or observational. But researchers who use observational studies are familiar with, and routinely work with, the sort of assumptions that RCTs claim to (but do not) avoid, so that if the aim is to use empirical evidence, any credibility advantage that RCTs have in estimation is no longer operative. And because RCTs tell us so little about why results happen, they are at a disadvantage relative to studies that use a wider range of prior information and data to help nail down mechanisms.

Yet once that commitment has been made, RCT evidence can be extremely useful, pinning down part of a structure, helping to build stronger understanding and knowledge, and helping to assess welfare consequences. As our examples show, this can often be done without committing to the full complexity of what are often thought of as structural models. Yet without the structure that allows us to place RCT results in context, or to understand the mechanisms behind those results, not only can we not transport whether ‘it works’ elsewhere, but we cannot do one of the standard tasks of economics, which is to say whether the intervention is actually welfare improving. Without knowing why things happen and why people do things, we run the risk of worthless casual (‘fairy story’) causal theorizing and have given up on one of the central tasks of economics and other social sciences.

We must back away from the refusal to theorize, from the exultation in our ability to handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we are prepared to make assumptions, and to say what we know, making statements that will be incredible to some, the credibility of the RCT does us very little good.

Supplementary Material

Acknowledgments.

We acknowledge helpful discussions with many people over the several years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia, and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Jim Heckman, Bo Honoré, Chuck Manski, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1. We have benefited from generous comments on an earlier version by Christopher Adams, Tim Besley, Chris Blattman, Sylvain Chassang, Jishnu Das, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Williams, Richard Zeckhauser, and Steve Ziliak. We are also grateful for editorial assistance from Donal Khosrowski, Cheryl Lancaster, and Tamlyn Munslow. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U), the Spencer Foundation, and the National Science Foundation (award 1632471). Deaton acknowledges financial support from the National Institute on Aging through the National Bureau of Economic Research, Grants 5R01AG040629 and P01AG05842 and through Princeton University’s Roybal Center, Grant P30 AG024928.

Contributor Information

Angus Deaton, Princeton University, NBER, and University of Southern California.

Nancy Cartwright, Durham University and UC San Diego.

  • Abdul Latif Jameel Poverty Action Lab, MIT. [Retrieved August 21, 2017]; 2017 from: https://www.povertyactionlab.org/about-j-pal .
  • Academy of Medical Sciences. Sources of evidence for assessing the safety, efficacy, and effectiveness of medicines. 2017 Retrieved from https://acmedsci.ac.uk/file-download/86466482 .
  • Aigner DJ. The residential electricity time-of-use pricing experiments. What have we learned? In: Wise DA, Hausman JA, editors. Social experimentation. Chicago, Il: Chicago University Press for National Bureau of Economic Research; 1985. pp. 11–54. [ Google Scholar ]
  • Angrist JD. Treatment effect heterogeneity in theory and practice. Economic Journal. 2004; 114 :C52–C83. [ Google Scholar ]
  • Angrist JD, Bettinger E, Bloom E, King E, Kremer M. Vouchers for private schooling in Colombia: evidence from a randomized natural experiment. American Economic Review. 2002; 92 (5):1535–58. [ Google Scholar ]
  • Aron-Dine A, Einav L, Finkelstein A. The RAND health insurance experiment, three decades later. Journal of Economic Perspectives. 2013; 27 (1):197–222. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Arrow KJ. Document No. P-5546. Santa Monica, CA: Rand Corporation; 1975. Two notes on inferring long run behavior from social experiments. [ Google Scholar ]
  • Ashenfelter O. The labor supply response of wage earners. In: Palmer JL, Pechman JA, editors. Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment. Washington, DC: The Brookings Institution; 1978. pp. 109–38. [ Google Scholar ]
  • Attanasio O, Meghir C, Santiago A. Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA. Review of Economic Studies. 2012; 79 (1):37–66. [ Google Scholar ]
  • Attanasio O, Cattan S, Fitzsimons E, Meghir C, Rubio-Codina M. Estimating the production function for human capital: results from a randomized controlled trial in Colombia (Working Paper W15/06) London: Institute for Fiscal Studies; 2015. [ Google Scholar ]
  • Bacon F. Novum Organum. In: Ellis RL, Spedding J, editors. The Philosophical Works of Francis Bacon. London, England: Longmans; 1859. [ Google Scholar ]
  • Bahadur RR, Savage LJ. The non-existence of certain statistical procedures in nonparametric problems. Annals of Mathematical Statistics. 1956; 25 :1115–22. [ Google Scholar ]
  • Banerjee A, Chassang S, Montero S, Snowberg E. A theory of experimenters (Working Paper 23867) Cambridge, MA: National Bureau of Economic Research; 2017. [ Google Scholar ]
  • Banerjee A, Chassang S, Snowberg E. Decision theoretic approaches to experiment design and external validity (Working Paper 22167) Cambridge, MA: National Bureau of Economic Research; 2016. [ Google Scholar ]
  • Banerjee A, Duflo E. The experimental approach to development economics. Annual Review of Economics. 2009; 1 :151–78. [ Google Scholar ]
  • Banerjee A, Duflo E. Poor economics: a radical rethinking of the way to fight global poverty. New York, NY: Public Affairs; 2012. [ Google Scholar ]
  • Banerjee A, Duflo E, Goldberg N, Karlan D, Osei R, Parienté W, Shapiro J, Thuysbaert B, Udry C. A multifaceted program causes lasting progress for the very poor: Evidence from six countries. Science. 2015; 348 (6236):1260799. [ PubMed ] [ Google Scholar ]
  • Banerjee A, Karlan D, Zinman J. Six randomized evaluations of microcredit: Introduction and further steps. American Economic Journal: Applied Economics. 2015; 7 (1):1–21. [ Google Scholar ]
  • Bareinboim E, Pearl J. A general algorithm for deciding transportability of experimental results. Journal of Causal Inference. 2013; 1 (1):107–34. [ Google Scholar ]
  • Bareinboim E, Pearl J. Transportability from multiple environments with limited experiments: Completeness results. In: Welling M, Ghahramani Z, Cortes C, Lawrence N, editors. Advances in neural information processing systems. Vol. 27. 2014. pp. 280–8. [ Google Scholar ]
  • Bauchet J, Morduch J, Ravi S. Failure vs displacement: Why an innovative anti-poverty program showed no net impact in South India. Journal of Development Economics. 2015; 116 :1–16. [ Google Scholar ]
  • Bechtel W. Discovering cell mechanisms: The creation of modern cell biology. Cambridge, England: Cambridge University Press; 2006. [ Google Scholar ]
  • Begg CB. Significance tests of covariance imbalance in clinical trials. Controlled Clinical Trials. 1990; 11 (4):223–5. [ PubMed ] [ Google Scholar ]
  • Bhattacharya D, Dupas P. Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics. 2012; 167 (1):168–96. [ Google Scholar ]
  • Bitler MP, Gelbach JB, Hoynes HW. What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review. 2006; 96 (4):988–1012. [ Google Scholar ]
  • Bleyer A, Welch HG. Effect of three decades of screening mammography on breast-cancer incidence. New England Journal of Medicine. 2012; 367 :1998–2005. [ PubMed ] [ Google Scholar ]
  • Bloom HS, Hill CJ, Riccio JA. Modeling cross-site experimental differences to find out why program effectiveness varies. In: Bloom HS, editor. Learning more from social experiments: Evolving analytical approaches. New York, NY: Russell Sage; 2005. [ Google Scholar ]
  • Bold T, Kimenyi M, Mwabu G, Ng’ang’a A, Sandefur J. Scaling up what works: Experimental evidence on external validity in Kenyan education (Working Paper 321) Washington, DC: Center for Global Development; 2013. [ Google Scholar ]
  • Bothwell LE, Podolsky SH. The emergence of the randomized, controlled trial. New England Journal of Medicine. 2016; 375 (6):501–4. [ PubMed ] [ Google Scholar ]
  • Cartwright N. Nature’s capacities and their measurement. Oxford, England: Clarendon Press; 1994. [ Google Scholar ]
  • Cartwright N. Single case causes: What is evidence and why (CHESS Working paper 2015-02) 2015 Retrieved from: https://www.dur.ac.uk/resources/chess/CHESSWP_2015_02.pdf .
  • Cartwright N, Hardie J. Evidence based policy: A practical guide to doing it better. Oxford, England: Oxford University Press; 2012. [ Google Scholar ]
  • Chalmers I. Comparing like with like: Some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments. International Journal of Epidemiology. 2001; 30 :1156–64. [ PubMed ] [ Google Scholar ]
  • Concato J. Study design and ‘evidence’ in patient-oriented research. American Journal of Respiratory and Critical Care Medicine. 2013; 187 (11):1167–72. [ PubMed ] [ Google Scholar ]
  • Concato J, Shah N, Horwitz RI. Randomized, controlled, trials, observational studies, and the hierarchy of research designs. New England Journal of Medicine. 2000; 342 (25):1887–92. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Conlisk J. Choice of response functional form in designing subsidy experiments. Econometrica. 1973; 41 (4):643–56. [ Google Scholar ]
  • CONSORT 2010, 15. Baseline data. [Retrieved November 9, 2017]; from: http://www.consort-statement.org/checklists/view/32--consort-2010/510-baseline-data .
  • Cook TD. Generating causal knowledge in the policy sciences: External validity as a task of both multi-attribute representation and multi-attribute extrapolation. Journal of Policy Analysis and Management. 2014; 33 (2):527–36. [ Google Scholar ]
  • Craver C. Explaining the brain: Mechanisms and the mosaic unity of neuroscience. Oxford, England: Clarendon Press; 2007. [ Google Scholar ]
  • Crépon B, Duflo E, Gurgand M, Rathelot R, Zamora P. Do labor market policies have displacement effects? Evidence from a clustered randomized experiment. Quarterly Journal of Economics. 2014; 128 (2):531–80. [ Google Scholar ]
  • Davey-Smith G, Ibrahim S. Data dredging, bias, or confounding. British Medical Journal. 2002; 325 :1437–8. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Dawid AP. Causal inference without counterfactuals. Journal of the American Statistical Association. 2000; 95 (450):407–24. [ Google Scholar ]
  • Deaton A. Instruments, randomization, and learning about development. Journal of Economic Literature. 2010; 48 (2):424–55. [ Google Scholar ]
  • Deaton A, Muellbauer J. Economics and consumer behavior. New York, NY: Cambridge University Press; 1980. [ Google Scholar ]
  • Dhaliwal I, Duflo E, Glennerster R, Tulloch C. Comparative cost-effectiveness analysis to inform policy in developing countries: A general framework with applications for education. Abdul Latif Jameel Poverty Action Lab, MIT. 2012 Dec 3; Retrieved from: http://www.povertyactionlab.org/publication/cost-effectiveness .
  • Duflo E, Hanna R, Ryan SP. Incentives work: Getting teachers to come to school. American Economic Review. 2012; 102 (4):1241–78. [ Google Scholar ]
  • Duflo E, Kremer M. Use of randomization in the evaluation of development effectiveness. In: Easterly W, editor. Reinventing foreign aid. Washington, DC: Brookings; 2008. pp. 93–120. [ Google Scholar ]
  • Dynarski S. Helping the poor in education: The power of a simple nudge. New York Times. 2015 Jan 18; p BU6. Retrieved from: https://www.nytimes.com/2015/01/18/upshot/helping-the-poor-in-higher-education-the-power-of-a-simple-nudge.html .
  • Epstein S. Inclusion: The politics of difference in medical research. Chicago, Il: Chicago University Press; 2007. [ Google Scholar ]
  • Feinstein AR, Horwitz RI. Problems in the ‘evidence’ of ‘evidence-based medicine. American Journal of Medicine. 1997; 103 :529–35. [ PubMed ] [ Google Scholar ]
  • Fine PEM, Clarkson JA. Individual versus public priorities in the determination of optimal vaccination policies. American Journal of Epidemiology. 1986; 124 (6):1012–20. [ PubMed ] [ Google Scholar ]
  • Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain. 1926; 33 :503–13. [ Google Scholar ]
  • Freedman DA. Statistical models for causation: What inferential leverage do they provide? Evaluation Review. 2006; 30 (6):691–713. [ PubMed ] [ Google Scholar ]
  • Freedman DA. On regression adjustments to experimental data. Advances in Applied Mathematics. 2008; 40 :180–93. [ Google Scholar ]
  • Frieden TR. Evidence for health decision making—beyond randomized, controlled trials. New England Journal of Medicine. 2017; 377 :465–75. [ PubMed ] [ Google Scholar ]
  • Garfinkel I, Manski CF. Introduction. In: Garfinkel I, Manski CF, editors. Evaluating welfare and training programs. Cambridge, MA: Harvard University Press; 1992. pp. 1–22. [ Google Scholar ]
  • Gerber AS, Green DP. Field experiments. New York, NY: Norton; 2012. [ Google Scholar ]
  • Gertler PJ, Martinez S, Premand P, Rawlings LB, Vermeersch CMJ. Impact evaluation in practice. 2. Washington, DC: Inter-American Development Bank and World Bank; 2016. [ Google Scholar ]
  • Greenberg D, Shroder M, Onstott M. The social experiment market. Journal of Economic Perspectives. 1999; 13 (3):157–72. [ Google Scholar ]
  • Greenland S. Randomization, statistics, and causal inference. Epidemiology. 1990; 1 (6):421–9. [ PubMed ] [ Google Scholar ]
  • Greenland S, Mansournia MA. Limitations of individual causal models, causal graphs, and ignorability assumptions, as illustrated by random confounding and design unfaithfulness. European Journal of Epidemiology. 2015; 30 :1101–1110. [ PubMed ] [ Google Scholar ]
  • Gueron JM, Rolston H. Fighting for reliable evidence. New York, NY: Russell Sage; 2013. [ Google Scholar ]
  • Guyatt G, Sackett DL, Cook DJ, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature II: How to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients? Journal of the American Medical Association. 1994; 271 (1):59–63. [ PubMed ] [ Google Scholar ]
  • Harrison GW. Field experiments and methodological intolerance. Journal of Economic Methodology. 2013; 20 (2):103–17. [ Google Scholar ]
  • Harrison GW. Impact evaluation and welfare evaluation. European Journal of Development Research. 2014a; 26 :39–45. [ Google Scholar ]
  • Harrison GW. Cautionary notes on the use of field experiments to address policy issues. Oxford Review of Economic Policy. 2014b; 30 (4):753–63. [ Google Scholar ]
  • Heckman JJ. Randomization and social policy evaluation. In: Manski CF, Garfinkel I, editors. Evaluating welfare and training programs. Cambridge, MA: Harvard University Press; 1992. pp. 547–70. [ Google Scholar ]
  • Heckman JJ, Hohmann N, Smith J, with the assistance of Khoo M. Substitution and dropout bias in social experiments: A study of an influential social experiment. Quarterly Journal of Economics. 2000; 115 (2):651–94. [ Google Scholar ]
  • Heckman JJ, Ichimura H, Todd PE. Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. Review of Economic Studies. 1997; 64 (4):605–54. [ Google Scholar ]
  • Heckman JJ, Lalonde RJ, Smith JA. The economics and econometrics of active labor markets. In: Ashenfelter O, Card D, editors. Handbook of labor economics. 3A. Amsterdam, Netherlands: North-Holland; 1999. pp. 1866–2097. [ Google Scholar ]
  • Heckman JJ, Pinto R, Savelyev P. Understanding the mechanisms through which an influential early childhood program boosted adult outcomes. American Economic Review. 2013; 103 (6):2052–86. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Heckman JJ, Vytlacil EJ. Econometric evaluation of social programs, Part 1: Causal models, structural models, and econometric policy evaluation. In: Heckman JJ, Leamer EE, editors. Handbook of econometrics. 6B. Amsterdam, Netherlands: North-Holland; 2007. pp. 4779–874. [ Google Scholar ]
  • Hernán MA. A definition of causal effect for epidemiological research. Journal of Epidemiology and Community Health. 2004; 58 (4):265–71. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hernán MA, Hernández-Diaz S, Robins JM. Randomized trials analyzed as observational studies. Annals of Internal Medicine. 2013; 159 (8):560–2. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hill AB. The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine. 1965; 58 (5):295–300. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Horton R. Common sense and figures: The rhetoric of validity in medicine. Bradford Hill memorial lecture 1999. Statistics in Medicine. 2000; 19 :3149–64. [ PubMed ] [ Google Scholar ]
  • Horwitz RI. The dark side of evidence based medicine. Cleveland Clinic Journal of Medicine. 1996; 63 (6):320–3. [ PubMed ] [ Google Scholar ]
  • Horwitz RI, Hayes-Conroy A, Caricchio R, Singer BH. From evidence-based medicine to medicine-based evidence. American Journal of Medicine. 2017; 130 (11):1246–50. [ PubMed ] [ Google Scholar ]
  • Hotz VJ, Imbens GW, Mortimer JH. Predicting the efficacy of future training programs using past experience at other locations. Journal of Econometrics. 2005; 125 :241–70. [ Google Scholar ]
  • Howick J. The Philosophy of Evidence-Based Medicine. Chichester, England: Wiley-Blackwell; 2011. [ Google Scholar ]
  • Howick J, Glasziou JP, Aronson JK. Problems with using mechanisms to solve the problem of extrapolation. Theoretical Medicine and Bioethics. 2013; 34 :275–91. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hsieh C, Urquiola M. The effects of generalized school choice on achievement and stratification: Evidence from Chile’s voucher program. Journal of Public Economics. 2006; 90 :1477–1503. [ Google Scholar ]
  • Humphreys M, Jacobs A. Qualitative inference from causal models. [Retrieved November 27, 2017]; Draft manuscript (version 0.2) 2017 from: http://www.columbia.edu/~mh2245/qualdag.pdf .
  • Hurwicz L. On the structural form of interdependent systems. Studies in logic and the foundations of mathematics. 1966; 44 :232–9. [ Google Scholar ]
  • Ilardi SS, Craighead WE. Rapid early response, cognitive modification, and nonspecific factors in cognitive behavior therapy for depression: A reply to Tang and DeRubeis. Clinical Psychology: Science and Practice. 1999; 6 :295–99. [ Google Scholar ]
  • Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics. 2004; 86 (1):4–29. [ Google Scholar ]
  • Imbens GW, Kolesár M. Robust standard errors in small samples: Some practical advice. Review of Economics and Statistics. 2016; 98 (4):701–12. [ Google Scholar ]
  • Imbens GW, Wooldridge JM. Recent developments in the econometrics of program evaluation. Journal of Economic Literature. 2009; 47 (1):5–86. [ Google Scholar ]
  • International Committee of Medical Journal Editors. [Retrieved August 20, 2016]; Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals. 2015 from: http://www.icmje.org/icmje-recommendations.pdf . [ PubMed ]
  • Kahneman D, Klein G. Conditions for intuitive expertise: A failure to disagree. American Psychologist. 2009; 64 (6):515–26. [ PubMed ] [ Google Scholar ]
  • Karlan D, Appel J. More than good intentions: how a new economics is helping to solve global poverty. New York, NY: Dutton; 2011. [ Google Scholar ]
  • Kasy M. Why experimenters might not want to randomize, and what they could do instead. Political Analysis. 2016; 24 (3):324–338. doi: 10.1093/pan/mpw012. [ CrossRef ] [ Google Scholar ]
  • Kramer P. Ordinarily well: The case for antidepressants. New York, NY: Farrar, Straus, and Giroux; 2016. [ Google Scholar ]
  • Kramer U, Stiles WB. The responsiveness problem in psychotherapy: A review of proposed solutions. Clinical Psychology: Science and Practice. 2015; 22 (3):277–95. [ Google Scholar ]
  • Kremer M, Holla A. Improving education in the developing world: What have we learned from randomized evaluations? Annual Review of Economics. 2009; 1 :513–42. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lakatos I. Falsification and the methodology of scientific research programmes. In: Lakatos I, Musgrave A, editors. Criticism and the growth of knowledge: Proceedings of the international colloquium in the philosophy of science, London. Cambridge, England: Cambridge University Press; 1970. pp. 91–106. [ CrossRef ] [ Google Scholar ]
  • Lalonde RJ. Evaluating the econometric evaluations of training programs with experimental data. American Economic Review. 1986; 76 (4):604–20. [ Google Scholar ]
  • Lehmann EL, Romano JP. Testing statistical hypotheses. 3. New York, NY: Springer; 2005. [ Google Scholar ]
  • LeLorier J, Grégoire G, Benhaddad A, Lapierre J, Derderian F. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. New England Journal of Medicine. 1997; 337 :536–42. [ PubMed ] [ Google Scholar ]
  • Levy S. Progress against poverty: sustaining Mexico’s Progresa-Oportunidades program. Washington, DC: Brookings; 2006. [ Google Scholar ]
  • Little D. Across the boundaries: Extrapolation in biology and social science. Oxford, England: Oxford University Press; 2007. [ Google Scholar ]
  • Longford NT, Nelder JA. Statistics versus statistical science in the regulatory process. Statistical Medicine. 1999; 18 :2311–20. [ PubMed ] [ Google Scholar ]
  • Machamer P, Darden L, Craver C. Thinking about mechanisms. Philosophy of Science. 2000; 67 :1–25. [ Google Scholar ]
  • Mackie JL. The cement of the universe: a study of causation. Oxford, England: Oxford University Press; 1974. [ Google Scholar ]
  • Manning WG, Newhouse JP, Duan N, Keeler E, Leibowitz A. Health insurance and the demand for medical care: Evidence from a randomized experiment. American Economic Review. 1987; 77 (3):251–77. [ PubMed ] [ Google Scholar ]
  • Manning WG, Newhouse JP, Duan N, Keeler E, Benjamin B, Leibowitz A, Marquis MA, Zwanziger J. Health insurance and the demand for medical care: Evidence from a randomized experiment. Santa Monica, CA: RAND; 1988. [ PubMed ] [ Google Scholar ]
  • Manski CF. Treatment rules for heterogeneous populations. Econometrica. 2004; 72 (4):1221–46. [ Google Scholar ]
  • Manski CF. Public policy in an uncertain world: Analysis and decisions. Cambridge, MA: Harvard University Press; 2013. [ Google Scholar ]
  • Manski CF, Tetenov A. Sufficient trial size to inform clinical practice. PNAS. 2016; 113 (38):10518–23. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Marmot MG, Altman DG, Cameron DA, Dewar JA, Thomson SG, Wilcox M the Independent UK panel on breast cancer screening. The benefits and harms of breast cancer screening: An independent review. British Journal of Cancer. 2013; 108 (11):2205–40. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Metcalf CE. Making inferences from controlled income maintenance experiments. American Economic Review. 1973; 63 (3):478–83. [ Google Scholar ]
  • Moffitt R. The labor supply response in the Gary experiment. Journal of Human Resources. 1979; 14 (4):477–87. [ Google Scholar ]
  • Moffitt R. Evaluation methods for program entry effects. In: Manski C, Garfinkel I, editors. Evaluating welfare and training programs. Cambridge, MA: Harvard University Press; 1992. pp. 231–52. [ Google Scholar ]
  • Morgan KL, Rubin DB. Rerandomization to improve covariate balance in experiments. Annals of Statistics. 2012; 40 (2):1263–82. [ Google Scholar ]
  • Muller SM. Causal interaction and external validity: Obstacles to the policy relevance of randomized evaluations. World Bank Economic Review. 2015; 29 :S217–S225. [ Google Scholar ]
  • Orcutt GH, Orcutt AG. Incentive and disincentive experimentation for income maintenance policy purposes. American Economic Review. 1968; 58 (4):754–72. [ Google Scholar ]
  • Parkkinen V-P, Wallmann C, Wilde M, Clarke B, Illari P, Kelly MP, Norrell C, Russo F, Shaw B, Williamson J. Evaluating evidence of mechanisms in medicine: Principles and procedures. New York, NY: Springer; 2008. [ PubMed ] [ Google Scholar ]
  • Patsopoulos NA. A pragmatic view on pragmatic trials. Dialogues in Clinical Neuroscience. 2011; 13 (2):217–24. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Pearl J, Bareinboim E. Transportability of causal and statistical relations: A formal approach. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press; 2011. pp. 247–254. [ Google Scholar ]
  • Pearl J, Bareinboim E. External validity: From do-calculus to transportability across populations. Statistical Science. 2014; 29 (4):579–95. [ Google Scholar ]
  • Pitman SR, Hilsenroth MJ, Goldman RE, Levy SR, Siegel DF, Miller R. Therapeutic technique of APA master therapists: Areas of difference and integration across theoretical orientations. Professional Psychology: Research and Practice. 2017; 48 (3):156–66. [ Google Scholar ]
  • Rawlins M. De testimonio: On the evidence for decisions about the use of therapeutic interventions. The Lancet. 2008; 372 :2152–61. [ PubMed ] [ Google Scholar ]
  • Reichenbach H. Nomological statements and admissible operations. Amsterdam, Netherlands: North-Holland; 1954. [ Google Scholar ]
  • Reichenbach H. Laws, modalities and counterfactuals, with a foreword by W.C. Salmon. Berkeley and Los Angeles, CA: University of California Press; 1976. [ Google Scholar ]
  • Reiss J. Against external validity (CHESS Working Paper 2017-03) 2017 Retrieved from: https://www.dur.ac.uk/resources/chess/CHESSK4UWP_2017_03_Reiss.pdf .
  • Rothman KJ. Causes. American Journal of Epidemiology. 1976; 104 (6):587–92. [ PubMed ] [ Google Scholar ]
  • Rothman KJ. Epidemiology. An introduction. 2. New York, NY: Oxford University Press; 2012. [ Google Scholar ]
  • Rothman KJ, Gallacher JEJ, Hatch EE. Why representativeness should be avoided. International Journal of Epidemiology. 2013; 42 (4):1012–1014. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Rothwell PM. External validity of randomized controlled trials: ‘To whom do the results of the trial apply’ Lancet. 2005; 365 :82–93. [ PubMed ] [ Google Scholar ]
  • Rubin DB. Causal inference using potential outcomes: Design, modeling, decisions. 2004 Fisher Lecture. Journal of the American Statistical Association. 2005; 100 (469):322–331. [ Google Scholar ]
  • Russell B. The problems of philosophy. 1912. Reprint, Rockville, MD: Arc Manor; 2008. [ Google Scholar ]
  • Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: What it is and what it isn’t. British Medical Journal. 1996; 312 :71–2. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Savage LJ. Subjective probability and statistical practice. In: Barnard GA, Cox GA, editors. The Foundations of Statistical Inference. London, England: Methuen; 1962. pp. 9–35. [ Google Scholar ]
  • Scriven M. Evaluation perspectives and procedures. In: Popham WJ, editor. Evaluation in education—current applications. Berkeley, CA: McCutchan Publishing Corporation; 1974. pp. 68–84. [ Google Scholar ]
  • Seckinelgin H. The politics of global AIDS: institutionalization of solidarity, exclusion of context. Social Aspects of HIV, vol. 3. Switzerland: Springer International Publishing; 2017. [ Google Scholar ]
  • Senn S. Seven myths of randomization in clinical trials. Statistics in Medicine. 2013; 32 :1439–50. [ PubMed ] [ Google Scholar ]
  • Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin; 2002. [ Google Scholar ]
  • Shepherd GM. Neurobiology. 2. New York, NY: Oxford University Press; 1988. [ Google Scholar ]
  • Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society A. 2011; 174 (2):369–86. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Student (Gosset, W. S.) Comparison between balanced and random arrangements of field plots. Biometrika. 1938; 29 (3/4):363–78. [ Google Scholar ]
  • Suzuki E, Yamamoto E, Tsuda T. Identification of operating mediation and mechanism in the sufficient-component cause framework. European Journal of Epidemiology. 2011; 26 :347–57. [ PubMed ] [ Google Scholar ]
  • Svorencik A. The experimental turn in economics: a history of experimental economics. Utrecht School of Economics, Dissertation Series #29. 2015 Retrieved from: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026 .
  • Todd PE, Wolpin KJ. Assessing the impact of a school subsidy program in Mexico: Using a social experiment to validate a dynamic behavioral model of child schooling and fertility. American Economic Review. 2006; 96 (5):1384–1417. [ PubMed ] [ Google Scholar ]
  • Todd PE, Wolpin KJ. Ex ante evaluation of social programs. Annales d’Economie et de la Statistique. 2008; 91/92 :263–91. [ Google Scholar ]
  • U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. Identifying and implementing educational practices supported by rigorous evidence: A user friendly guide. Washington, DC: Institute of Education Sciences; 2003. [ Google Scholar ]
  • Van der Weele TJ. Confounding and effect modification: Distribution and measure. Epidemiologic Methods. 2012; 1 :55–82. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Vandenbroucke JP. When are observational studies as credible as randomized controlled trials? The Lancet. 2004; 363 :1728–31. [ PubMed ] [ Google Scholar ]
  • Vandenbroucke JP. The HRT controversy: Observational studies and RCTs fall in line. The Lancet. 2009; 373 :1233–5. [ PubMed ] [ Google Scholar ]
  • Vittengl JR, Clark LA, Thase ME, Jarrett RB. Are improvements in cognitive content and depressive symptoms correlates or mediators during Acute-Phase Cognitive Therapy for Recurrent Major Depressive Disorder? International Journal of Cognitive Therapy. 2014; 7 (3):255–71. doi: 10.1521/ijct.2014.7.3.251. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Vivalt E. How much can we generalize from impact evaluations? [retrieved, Nov. 28, 2017]; NYU, unpublished. 2016 Retrieved from: http://evavivalt.com/wp-content/uploads/2014/12/Vivalt_JMP_latest.pdf .
  • Williams HC, Burden-Teh E, Nunn AJ. What is a pragmatic clinical trial? Journal of Investigative Dermatology. 2015; 135 (6):1–3. [ PubMed ] [ Google Scholar ]
  • Wise DA. A behavioral model versus experimentation: The effects of housing subsidies on rent. In: Brucker P, Pauly R, editors. Methods of Operations Research. Vol. 50. Verlag, Germany: Anton Hain; 1985. pp. 441–89. [ Google Scholar ]
  • Wolpin KI. The limits of inference without theory. Cambridge, MA: MIT Press; 2013. [ Google Scholar ]
  • Worrall J. Evidence in medicine and evidence-based medicine. Philosophy Compass. 2007; 2/6 :981–1022. [ Google Scholar ]
  • Worrall J. Evidence and ethics in medicine. Perspectives in Biology and Medicine. 2008; 51 (3):418–31. [ PubMed ] [ Google Scholar ]
  • Yates F. The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments. Biometrika. 1939; 30 (3/4):440–66. [ Google Scholar ]
  • Young A. Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results (Working Paper) London School of Economics. 2017 Retrieved from: http://personal.lse.ac.uk/YoungA/ChannellingFisher.pdf .
  • Ziliak ST. Balanced versus randomized field experiments in economics: Why W.S. Gosset aka ‘Student’ matters. Review of Behavioral Economics. 2014; 1 :167–208. [ Google Scholar ]

Further Reading

The brief excerpts below summarise how a range of other sources describe random assignment and randomized controlled trials.

  1. How to Do Random Allocation (Randomization)

    Random allocation is a technique that assigns individuals to treatment and control groups entirely by chance, with no regard to the researchers' wishes or to patients' condition or preference. This allows researchers to balance all known and unknown factors that may affect results across the treatment and control groups.
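
    As an illustration of allocation determined entirely by chance, here is a minimal Python sketch (a hypothetical example written for this page, not code from the source above) that shuffles a list of participant IDs and splits it evenly into a treatment and a control group.

      import random

      def randomly_allocate(participant_ids, seed=None):
          """Allocate participants to two equal groups entirely by chance."""
          rng = random.Random(seed)      # seed used only to make the example reproducible
          ids = list(participant_ids)
          rng.shuffle(ids)               # chance alone determines the ordering
          midpoint = len(ids) // 2
          return {"treatment": ids[:midpoint], "control": ids[midpoint:]}

      # Example with 10 hypothetical participants
      groups = randomly_allocate(range(1, 11), seed=42)
      print(groups["treatment"], groups["control"])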

  2. Randomized Controlled Trial (RCT) Overview

    A randomized controlled trial (RCT) is a prospective experimental design that randomly assigns participants to an experimental or control group. RCTs are the gold standard for establishing causal relationships and ruling out confounding variables and selection bias.

  3. Randomized Control Trial (RCT)

    Random allocation and random assignment are terms used interchangeably in the context of a randomized controlled trial (RCT). Both refer to assigning participants to different groups in a study (such as a treatment group or a control group) in a way that is completely determined by chance.

  4. The random allocation process: two things you need to know

    The key to the RCT lies in the random allocation process. When done correctly in a large enough sample, random allocation is an effective measure for reducing bias. In this article we describe the random allocation process and the two steps that make it up.

  5. Research Guides: Study Design 101: Randomized Controlled Trial

    A study design that randomly assigns participants into an experimental group or a control group. As the study is conducted, the only expected difference between the control and experimental groups in a randomized controlled trial (RCT) is the outcome variable being studied. Good randomization will "wash out" any population bias.

  6. Issues in Outcomes Research: An Overview of Randomization Techniques

    One critical component of clinical trials that strengthens results is random assignment of participants to control and treatment groups. Although randomization appears to be a simple concept, issues of balancing sample sizes and controlling the influence of covariates a priori are important.
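
    To make the balancing idea concrete, here is a short Python sketch (an illustrative assumption, not code from the cited article) of permuted-block randomization, which keeps the two arms the same size after every completed block.

      import random

      def block_randomize(n_participants, block_size=4, seed=None):
          """Assign participants in permuted blocks so arm sizes stay balanced."""
          assert block_size % 2 == 0, "block size must be even for two equal arms"
          rng = random.Random(seed)
          assignments = []
          while len(assignments) < n_participants:
              # each block holds equal numbers of the two labels, in a random order
              block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
              rng.shuffle(block)
              assignments.extend(block)
          return assignments[:n_participants]

      print(block_randomize(10, block_size=4, seed=1))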

  7. Understanding randomised controlled trials

    The main purpose of random assignment is to prevent selection bias by distributing the characteristics of patients that may influence the outcome randomly between the groups, ... An RCT is the most rigorous scientific method for evaluating the effectiveness of health care interventions. However, bias could arise when there are flaws in the ...

  8. An introduction to the use of randomised control trials

    Random assignment to the project and control groups overcomes selection bias which will otherwise occur from programme placement or self-selection. Conducting an RCT requires decisions regarding the unit of assignment, the number of 'treatment arms' and what, if anything, will be provided to the control group and when.

  9. Overview of the Randomized Clinical Trial and the Parallel ...

    An RCT is an intervention study in which the treatment assignment is random rather than systematic. Randomization confers several benefits. It removes the potential for bias in the allocation of participants to the intervention or the control group.

  10. Randomized Controlled Trial

    Randomized controlled trials (RCTs) are considered a gold standard for measuring an intervention's effect through random assignment of individuals to an intervention or a control arm. Specifically, the randomization process ensures that any differences between the groups observed at the end of the trial are due only to the treatment administered.

  11. Randomized Controlled Trial

    The randomized controlled trial (RCT) is considered the "gold standard" experimental research design. Randomized controlled trials allow researchers to establish causal associations between predictor, confounding, and outcome variables. ... Random assignment is where randomly selected participants are randomly assigned to either the ...

  12. Assessing the Overall Validity of Randomised Controlled Trials

    In contrast to Cartwright's (2010, 63) view of the ideal RCT, random assignment cannot thus ensure, in real RCTs, 'that other possible reasons for dependencies and independencies between cause and effect under test will be distributed identically in the treatment and control wings' (emphasis added). Instead randomisation, when we ...

  13. Randomized controlled trial

    A randomized controlled trial (or randomized control trial; RCT) is a form of scientific experiment used to control factors not under direct experimental control. Examples of RCTs are clinical trials that compare the effects of drugs, surgical techniques, medical devices, diagnostic procedures or other medical treatments.

  14. R: Designing, random assigning and evaluating Randomized Control

    The RCT package helps you focus on the statistics of randomized control trials rather than the heavy programming lifting, and supports the whole process of designing and evaluating an RCT: (1) clean and summarise the data in which you want to randomly assign treatment; (2) decide the share of observations that will go to the control group; (3) ...
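
    The R package itself is not reproduced here; the Python sketch below (hypothetical helper names, written only to illustrate the workflow the description outlines) assigns a chosen share of observations to a control group and then summarises a baseline variable by arm as a crude balance check.

      import random
      from statistics import mean

      def assign_treatment(records, control_share=0.5, seed=None):
          """Randomly mark a given share of records as 'control' and the rest as 'treatment'."""
          rng = random.Random(seed)
          records = [dict(r) for r in records]   # copy so the input is left untouched
          rng.shuffle(records)
          n_control = round(len(records) * control_share)
          for i, rec in enumerate(records):
              rec["arm"] = "control" if i < n_control else "treatment"
          return records

      def baseline_balance(records, variable):
          """Compare the mean of a baseline variable across arms."""
          by_arm = {}
          for rec in records:
              by_arm.setdefault(rec["arm"], []).append(rec[variable])
          return {arm: mean(values) for arm, values in by_arm.items()}

      # Hypothetical participant data with a baseline age variable
      participants = [{"id": i, "age": 20 + (i % 15)} for i in range(60)]
      assigned = assign_treatment(participants, control_share=0.4, seed=7)
      print(baseline_balance(assigned, "age"))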

  15. Randomised controlled trials—the gold standard for effectiveness

    Randomized controlled trials (RCTs) are prospective studies that measure the effectiveness of a new intervention or treatment. Although no study is likely on its own to prove causality, randomization reduces bias and provides a rigorous tool to examine cause-effect relationships between an intervention and outcome.

  16. Introduction to Field Experiments and Randomized Controlled Trials

    Random assignment means participants are assigned to different groups or conditions in a study purely by chance. Basically, each participant has an equal chance to be assigned to a control group or a treatment group. Field experiments, or randomized studies conducted in real-world settings, can take many forms.

  17. Everything you need to know about Randomised Controlled Trials

    Participants in an RCT are randomly assigned to different groups—control groups and treatment groups. The concept of a control group and treatment group has roots in clinical trials, and the method of random assignment to these groups was developed through agricultural experiments in the 1920s.

  18. Designing a research project: randomised controlled trials and their

    The sixth paper in this series discusses the design and principles of randomised controlled trials.

  19. Randomized Controlled Trials (RCTs)

    A randomized controlled trial (RCT) is an experimental form of impact evaluation in which the population receiving the programme or policy intervention is chosen at random from the eligible population, and a control group is also chosen at random from the same eligible population.

  20. Research Randomizer

    Research Randomizer is a free resource for researchers and students in need of a quick way to generate random numbers or assign participants to experimental conditions. This site can be used for a variety of purposes, including psychology experiments, medical trials, and survey research.
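
    The tool itself is web-based; as a rough stand-in, the Python sketch below (an assumption for illustration, not Research Randomizer's own code) draws a set of unique random numbers and then places shuffled participants into numbered conditions.

      import random

      def draw_unique_numbers(how_many, low, high, seed=None):
          """Draw unique random numbers in [low, high], e.g. to select participant IDs."""
          rng = random.Random(seed)
          return sorted(rng.sample(range(low, high + 1), how_many))

      def assign_to_conditions(participant_ids, n_conditions, seed=None):
          """Cycle randomly shuffled participants through numbered conditions."""
          rng = random.Random(seed)
          ids = list(participant_ids)
          rng.shuffle(ids)
          return {pid: (i % n_conditions) + 1 for i, pid in enumerate(ids)}

      selected = draw_unique_numbers(how_many=5, low=1, high=50, seed=3)
      print(selected)
      print(assign_to_conditions(selected, n_conditions=2, seed=3))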

  21. Randomized controlled trials

    Randomized controlled trials (RCTs) are the hallmark of evidence-based medicine and form the basis for translating research data into clinical practice. This review summarizes commonly applied designs and quality indicators of RCTs to provide guidance in interpreting and critically evaluating clinical research data.

  22. Random assignment

    Random assignment or random placement is an experimental technique for assigning human participants or animal subjects to different groups in an experiment (e.g., a treatment group versus a control group) using randomization, such as by a chance procedure (e.g., flipping a coin) or a random number generator.
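
    To show such a chance procedure in code, here is a minimal Python sketch (illustrative only) of per-participant coin-flip assignment; unlike the shuffle-based allocation sketched earlier, the two group sizes can drift apart by chance in small samples.

      import random

      def coin_flip_assignment(participant_ids, seed=None):
          """Assign each participant independently by a simulated coin flip."""
          rng = random.Random(seed)
          return {pid: ("treatment" if rng.random() < 0.5 else "control")
                  for pid in participant_ids}

      assignment = coin_flip_assignment(range(1, 11), seed=0)
      print(assignment)
      counts = {arm: list(assignment.values()).count(arm) for arm in ("treatment", "control")}
      print(counts)   # group sizes need not be equal with independent flips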

  23. Understanding and misunderstanding randomized controlled trials

    Randomized Controlled Trials (RCTs) are increasingly popular in the social sciences, not only in medicine. We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation.