The experimental science community is in the middle of a 'credibility revolution'. Widespread concerns over the replicability of published results in fields including psychology (Nelson et al. 2018) and economics (Camerer et al. 2016) have prompted researchers to reconsider the way they formulate hypotheses, collect and analyse data, and interpret their findings (see, for example, Ioannidis 2005, Simmons et al. 2011, Maniadis et al. 2014, Brodeur et al. 2016, and Munafó et al. 2017).
It's not just replicability
While it is of first-order importance to embrace research practices that increase reproducibility, experimental economists also must tackle the generalisability and applicability of the evidence they produce (Banerjee et al. 2017). After all, we are not only interested in ensuring that the same experiment yields the same outcome when it is repeated. Ideally, we would like to be able to generalise our findings to different contexts, and produce insights that contribute to economic theory and policy discussions.
These considerations have re-ignited a constructive debate that focuses on the credibility, generalisability, and relevance of findings in experimental economics. We are proposing 12 recommendations that summarise research practices that we should all do more of (Czibor et al. 2019). This is intended as a one-stop-shop for researchers in the design phase of their experiment.
While we compiled the list with experimental economics in mind, we believe many of the suggestions are relevant in other fields of experimental science, and for scholars pursuing estimates using observational data.
We cover four threats to generalisability: the representativeness of the population, non-random selection into the experiment, treatment non-compliance, and characteristics of the experiment that may affect behaviour. It is important that we do not merely acknowledge these issues ex post as potential limitations of our study. They must guide decisions in the design phase that affect the type of data we generate.
What does that mean in practice?
We advocate for conducting more natural field experiments (NFEs). They are covert, and so can mitigate biases stemming from self-selection into the experiment and experimenter demand effects, and typically involve the population of interest. NFEs offer a unique combination of control and realism.
We argue that lab and field experiments, as well as naturally occurring data, are complements in the production of scientific knowledge. For instance, we can begin by documenting an effect among students in the lab, then test its generalisability by repeating the experiment with different tasks and populations. Alternatively, we could first evaluate a programme in a field experiment, then use additional lab experiments to test mechanisms that may explain what we observed in the field (Stoop et al. 2012).
The second challenge is the informativeness of our findings – in other words, how to design experiments that optimise learning. This requires a critical look at the practice of basing our inference solely on p-values – an approach that leads to a high false-positive rate (Ioannidis 2005) and ignores the economic significance of findings (Ziliak and McCloskey 2004). This practice is especially dangerous in combination with specification searching (Simmons et al. 2011, Brodeur et al. 2016) and multiple hypothesis testing (List et al. 2016).
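To make the false-positive concern concrete, the following is a minimal sketch of a positive-predictive-value calculation in the spirit of the argument in Ioannidis (2005): the share of statistically significant findings that reflect a true effect, given the prior probability that a tested hypothesis is true, the power of the study, and the significance level. The function name and the numerical values (a prior of 0.1, power of 0.2 to 0.8) are illustrative assumptions, not figures taken from the studies cited here, and the calculation ignores specification searching and other biases, which would lower the result further.

```python
def post_study_probability(prior, power, alpha=0.05):
    """Share of significant findings that reflect a true effect.

    prior: assumed probability that a tested hypothesis is true
    power: probability of detecting a true effect (1 - type II error rate)
    alpha: significance level (type I error rate)
    """
    if not (0 <= prior <= 1 and 0 < power <= 1 and 0 < alpha <= 1):
        raise ValueError("prior must lie in [0, 1]; power and alpha in (0, 1]")
    true_positives = power * prior          # true effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative assumption: one in ten tested hypotheses is true.
    for power in (0.2, 0.5, 0.8):
        psp = post_study_probability(prior=0.1, power=power)
        print(f"power={power:.1f}: post-study probability={psp:.2f}")
```

Under these assumed inputs, a significant result from a low-powered study is more likely to be a false positive than a true one, which is one reason a p-value alone is a weak basis for inference.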
We discuss two ways of dealing with this issue.
We advocate for more replication studies to increase the credibility of research findings. In Czibor et al. (2019), we consider ways to incentivise such studies.
We also draw attention to the importance of statistical power in determining the informativeness of experiments, and present ways to increase power for a given experimental budget, such as using within-subject designs (where the same participant is exposed to different treatments, with their order randomised) when appropriate, and collecting baseline characteristics to perform blocked randomisation (partitioning our sample to subgroups along relevant variables, and randomising within these groups).
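Blocked randomisation as described above can be sketched in a few lines. This is a minimal illustration rather than a recommended implementation: the function name, the participant fields (`id`, `female`, `high_baseline`) and the two-arm design are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def blocked_assignment(units, block_key, arms=("control", "treatment"), seed=0):
    """Randomise units to arms separately within each block.

    units: list of dicts, each with a unique "id" (hypothetical field name)
    block_key: function mapping a unit to its block, e.g. baseline traits
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_key(unit)].append(unit)

    assignment = {}
    for key in sorted(blocks):
        members = blocks[key]
        rng.shuffle(members)  # random order within the block
        # Random starting arm, so leftover units in odd-sized blocks
        # do not always go to the same arm.
        offset = rng.randrange(len(arms))
        for i, unit in enumerate(members):
            assignment[unit["id"]] = arms[(i + offset) % len(arms)]
    return assignment


if __name__ == "__main__":
    # Hypothetical baseline data collected before the experiment.
    participants = [
        {"id": i, "female": i % 2 == 0, "high_baseline": i % 3 == 0}
        for i in range(12)
    ]
    result = blocked_assignment(
        participants, lambda u: (u["female"], u["high_baseline"])
    )
    for pid in sorted(result):
        print(pid, result[pid])
```

Because assignment alternates over a shuffled list within each block, the arms differ in size by at most one unit per block, so treatment and control groups are balanced on the blocking variables by construction.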
Finally, we highlight issues related to the policy relevance of experimental economics results. Even perfectly credible and reproducible findings may not inform policy discussions if they focus exclusively on short-term impacts, leave mechanisms unexplored, and fail to consider scalability. Studying longer-term outcomes lets us check that promising results don’t fade quickly, and is also important because it may take time for general equilibrium effects to emerge. For example, an intervention may lead to short-term improvements for the participants, but in the long run transform the market in ways that are harmful for everyone.
Experiments that document an effect but do not explicitly study the underlying mechanisms leave a lot of potential gains on the table. Theory can help us address the generalisability threats discussed above by explicitly modelling the participation and compliance decision. Specifying a model, then using experimental variation to identify its deep structural parameters, allows us to extrapolate our results to different contexts and interventions. We can also design experiments to test the predictions of a theory, or to run a horse race between competing models. If we can understand why we observe a phenomenon, we are in a better position to advise policy.
Theory also becomes important when we consider the 'science of scaling': a systematic treatment of the issues that arise when a small-scale, short-run programme is rolled out on a much larger scale. As Banerjee et al. (2017) demonstrate, scaling up a programme is not a trivial undertaking. It requires careful planning and testing in the design phase to avoid a 'voltage drop' (Al-Ubaydli et al. 2017), whereby the scaled-up programme’s effect is smaller than the effect measured in the original, small-scale evaluation.
These considerations include general equilibrium effects (including the reaction of politicians to programmes), and potential biases stemming from sample selection (the original pilot often includes a 'convenience' rather than a representative sample), site selection (researchers might choose to run their experiments in places where they are 'easier' to implement), and piloting (including the fact that as a programme grows larger, it needs to recruit additional workers, who may be less skilled or motivated than the ones already hired; see Davis et al. 2017). Researchers need to 'backward induct' and address scaling-related challenges in the design phase. If they do not, they may not create programmes that work well at scale.
Grounds for optimism
Researchers are becoming increasingly aware of these challenges, and are taking important steps to improve the quality of their research. The profession as a whole is embracing new standards of evidence (an example being the success of the preregistration movement). We want to contribute to this positive change and to prompt other experimental researchers to join the debate and share the challenges they have identified and their suggestions for overcoming them, so that step by step we can work together to improve the quality of scientific research.
References
Al-Ubaydli, O, J A List, D LoRe, and D Suskind (2017), "Scaling for Economists: Lessons from the Non-Adherence Problem in the Medical Literature", Journal of Economic Perspectives 31(4): 125–144.
Banerjee, A, R Banerji, J Berry, E Duflo, H Kannan, S Mukherji, M Shotland, and M Walton (2017), "From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application", Journal of Economic Perspectives 31(4): 73–102.
Brodeur, A, M Lé, M Sangnier, and Y Zylberberg (2016), "Star Wars: The empirics strike back", American Economic Journal: Applied Economics, 8(1): 1–32.
Camerer, C F, A Dreber, E Forsell, T Ho, J Huber, M Johannesson, M Kirchler, J Almenberg, A Altmejd, T Chan, E Heikensten, F Holzmeister, T Imai, S Isaksson, G Nave, T Pfeiffer, M Razen, and H Wu (2016), "Evaluating replicability of laboratory experiments in economics", Science 351(6280): 1433–1436.
Czibor, E, D Jimenez-Gomez, and J A List (2019), "The Dozen Things Experimental Economists Should Do (More of)", NBER working paper 25451.
Davis, J, J Guryan, K Hallberg, and J Ludwig (2017), "The Economics of Scale-Up", NBER working paper 23925.
Ioannidis, J P A (2005), "Why most published research findings are false", PLoS Medicine 2(8): 0696–0701.
List, J A, A M Shaikh, and Y Xu (2016), "Multiple Hypothesis Testing in Experimental Economics", NBER working paper 21875.
Maniadis, Z, F Tufano, and J A List (2014), "One swallow doesn’t make a summer: New evidence on anchoring effects", The American Economic Review 104(1): 277–290.
Munafó, M R, B A Nosek, D V M Bishop, K S Button, C D Chambers, N Percie du Sert, U Simonsohn, and E-J Wagenmakers (2017), "A manifesto for reproducible science", Nature Human Behaviour 1: 1–9.
Nelson, L D, J Simmons, and U Simonsohn (2018), "Psychology’s renaissance", Annual Review of Psychology 69: 511–534.
Simmons, J P, L D Nelson, and U Simonsohn (2011), "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant", Psychological Science 22(11): 1359–1366.
Stoop, J, C N Noussair, and D Van Soest (2012), "From the lab to the field: Cooperation among fishermen", Journal of Political Economy 120(6): 1027–1056.