VoxEU Column Frontiers of economic research Productivity and Innovation

Scientific experimentation with generative AI

Scientific experimentation is not only essential for the progress of knowledge in the social sciences; it is also the bedrock upon which technological revolutions are built and policies are crafted. This column describes how multiple actors, from researchers to entrepreneurs and policymakers, can revolutionise their experimental practice by integrating generative artificial intelligence into it, while also democratising scientific education and fostering evidence-based and critical thinking across society.

The recent emergence of generative artificial intelligence (AI) – applications of large language models (LLMs) capable of generating novel content (Bubeck et al. 2023) – has become a focal point of economic policy discourse (Matthews 2023), capturing the attention of the EU, the US Senate and the United Nations. This radical innovation, led by new specialised AI labs like OpenAI and Anthropic and supported financially by traditional ‘big tech’ such as Microsoft and Amazon, is not merely a theoretical marvel; it is already reshaping markets, from the creative to the health industries, among many others. However, we are only on the cusp of its full potential for the economy (Brynjolfsson and McAfee 2017, Acemoglu et al. 2021, Acemoglu and Johnson 2023) and humanity's future overall (Bommasani et al. 2022).

One domain poised for seismic change, albeit still in its nascent stages, is scientific knowledge production across social sciences and economics (Korinek 2023). In particular, experimental methods are central to the progress of knowledge in the social sciences (List 2011), but their relevance goes beyond academia; they are the bedrock upon which technological revolutions are built (Levitt and List 2009) and policies are crafted (Athey and Imbens 2019, Al-Ubaydli et al. 2021). As we elaborate in our recent paper (Charness et al. 2023), the integration of generative AI into scientific experimentation is not just promising; it can revolutionise the online experimentation of multiple actors, from researchers to entrepreneurs and policymakers, in different and scalable ways. Not only can it be easily deployed in different organisations, but it also democratises scientific education and fosters evidence-based and critical thinking across society (Athey and Luca 2019).

We identify three pivotal areas where AI can significantly augment online experiments – design, implementation, and data analysis – permitting longstanding scientific issues surrounding online experiments (Athey 2015), such as measurement errors (Gillen et al. 2019) and overall violations of the four exclusion restrictions (List 2023), to be overcome at scale.

First, in experimental design, LLMs can generate novel hypotheses by evaluating existing literature, current events, and seminal problems in a field (Davies et al. 2021). Their extensive training enables the models to recommend appropriate methodologies to isolate causal relationships, such as economic games or market simulations. Furthermore, they can assist in determining sample size (Ludwig et al. 2021), ensuring statistical robustness, and in crafting clear and concise instructions (Saunders et al. 2022), which are vital for ensuring the highest scientific value of experiments (Charness et al. 2004). They can also transform plain English into different coding languages, easing the transition from design to working interface (Chen et al. 2021) and allowing experiments to be deployed across different settings, which bears on the reliability of experimental results across different populations (Snowberg and Yariv 2021).
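To make the sample-size step concrete, here is a minimal sketch of the kind of calculation an LLM could be asked to produce or check. It assumes a two-arm experiment comparing means with a two-sided test and uses the standard normal approximation; the function name and default values are illustrative assumptions and are not taken from the studies cited above.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per arm in a two-arm experiment.

    effect_size is the standardised mean difference (Cohen's d).
    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    which slightly understates the exact t-test requirement.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = standard_normal.inv_cdf(power)          # quantile matching the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: {sample_size_per_arm(d)} participants per arm")
```

Under these assumptions, smaller expected effects require sharply larger samples, which is why pinning down the effect size before launch matters for statistical robustness.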

Second, during implementation, LLMs can offer real-time chatbot support to participants, ensuring comprehension and compliance. Recent evidence from Eloundou et al. (2023), Noy and Zhang (2023), and Brynjolfsson et al. (2023) shows, in different settings, that granting humans access to AI-powered chat assistants can significantly increase their productivity. AI assistance allows human support staff to provide faster and higher-quality responses to a more extensive customer base. This technique can be carried over to experimental research, where participants might need clarification on instructions or have other questions. The scalability of LLMs allows for the simultaneous monitoring of multiple participants, maintaining data quality by detecting live engagement levels, cheating, or erroneous responses. This can be done by automating the deployment of JavaScript algorithms already used in some experiments (Jabarian and Sartori 2020), which is usually too costly to implement at scale. In addition, automating the data collection process through chat assistants reduces the risk of experimenter bias or demand characteristics that influence participant behaviour, resulting in a more reliable evaluation of research questions (Fréchette et al. 2022).
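As a rough illustration of what automated monitoring of engagement could look like, the sketch below applies three simple rule-based checks to a submitted page. It is written in Python rather than JavaScript and is not the algorithm used in the experiments cited above; the `Response` fields, the thresholds, and the flag names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    participant: str
    seconds_on_page: float        # hypothetical field: time spent before submitting
    answers: List[int]            # hypothetical field: scale answers given on one page
    attention_check_passed: bool  # hypothetical field: result of an embedded check


def engagement_flags(response: Response, min_seconds: float = 5.0) -> List[str]:
    """Return a list of data-quality flags for one submitted page."""
    flags = []
    # Submitting faster than a plausible reading time suggests low engagement.
    if response.seconds_on_page < min_seconds:
        flags.append("too_fast")
    # Giving the identical answer to every item (three or more) suggests straightlining.
    if len(response.answers) >= 3 and len(set(response.answers)) == 1:
        flags.append("straightlining")
    if not response.attention_check_passed:
        flags.append("failed_attention_check")
    return flags


if __name__ == "__main__":
    suspicious = Response("p1", 2.0, [3, 3, 3, 3], False)
    attentive = Response("p2", 30.0, [1, 4, 2], True)
    print(engagement_flags(suspicious))
    print(engagement_flags(attentive))
```

In a live study, flags like these would more plausibly trigger a clarifying message from the chat assistant than an automatic exclusion, leaving the final judgement to the researcher.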

Third, in the data analysis phase, LLMs can employ state-of-the-art natural language processing (NLP) techniques to explore new variables, such as participant sentiments or engagement levels. For instance, applying NLP techniques to live chat logs from experiments can yield insights into participant behaviour, uncertainty, and cognitive processes. LLMs can also automate data pre-processing, conduct statistical tests, and generate visualisations, allowing researchers to focus on substantive tasks. During data pre-processing, language models can distill pertinent details from chat logs, organise the data into an analysis-friendly format, and manage any incomplete or missing entries. Beyond these tasks, such models can perform content analysis – identifying and categorising frequently expressed concerns of participants; analysing sentiments and emotions conveyed; and gauging the efficacy of instructions, responses, and interactions.
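The pre-processing and content-analysis steps described above can be sketched as a small pipeline. The example below uses plain keyword matching as a stand-in for an LLM classifier, so it only shows the shape of the workflow: drop missing entries, normalise the text, and count concerns by category. The category names, keywords, and log fields are hypothetical.

```python
from collections import Counter

# Hypothetical concern categories and trigger keywords (assumptions for illustration).
CATEGORIES = {
    "instructions": ("instruction", "understand", "unclear"),
    "payment": ("pay", "bonus", "reward"),
    "technical": ("error", "crash", "load"),
}


def clean_logs(raw_logs):
    """Drop entries whose message is missing or blank and lower-case the rest."""
    cleaned = []
    for entry in raw_logs:
        text = (entry.get("message") or "").strip()
        if not text:
            continue  # skip incomplete entries
        cleaned.append({
            "participant": entry.get("participant", "unknown"),
            "message": text.lower(),
        })
    return cleaned


def categorise(message):
    """Return every category whose keywords appear in the message, else 'other'."""
    labels = [
        name
        for name, keywords in CATEGORIES.items()
        if any(keyword in message for keyword in keywords)
    ]
    return labels or ["other"]


def concern_counts(raw_logs):
    """Count how often each concern category appears across all usable messages."""
    counts = Counter()
    for entry in clean_logs(raw_logs):
        counts.update(categorise(entry["message"]))
    return counts


if __name__ == "__main__":
    logs = [
        {"participant": "p1", "message": "I do not understand the instructions"},
        {"participant": "p2", "message": "When will the bonus be paid?"},
        {"participant": "p3", "message": None},
        {"participant": "p4", "message": "The page gave an error"},
    ]
    print(concern_counts(logs).most_common())
```

Replacing `categorise` with a call to a language model would keep the rest of the pipeline unchanged, which makes it easy to compare model-assigned labels with a transparent rule-based baseline.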

However, the integration of LLMs into scientific research has its challenges. There are inherent risks of biases in their training data and algorithms (Kleinberg et al. 2018). Researchers must be vigilant in auditing these models for discrimination or skew. Privacy concerns are also paramount, given the vast amounts of data, including sensitive participant information, that these models process. Moreover, as LLMs become increasingly adept at generating persuasive text, the risk of deception and of the spread of misinformation looms large (Lazer et al. 2018, Pennycook et al. 2021). Over-reliance on standardised prompts could potentially stifle human creativity, necessitating a balanced approach that leverages AI capabilities and human ingenuity.

In summary, while integrating AI into scientific research necessitates a cautious approach to mitigate risks such as bias and privacy concerns, the potential benefits are monumental. LLMs offer a unique opportunity to instil a culture of experimentation in firms and policy at scale, allowing for systematic, data-driven decision-making instead of reliance on intuition, which can increase workers’ productivity. In policymaking, they can facilitate the piloting of policy options through low-cost randomised trials, thereby enabling an iterative, evidence-based approach. If these risks are judiciously managed, generative AI offers an invaluable toolkit for conducting more prolific, transparent, and data-driven experimentation, without diminishing the essential role of human creativity and discretion.


Acemoglu, D and S Johnson (2023), Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity, Hachette UK.

Acemoglu, D, D Autor, J Hazell and P Restrepo (2021), “AI and jobs: Evidence from US vacancies”, 3 March.

Al-Ubaydli, O, M S Lee, J A List, C L Mackevicius, and D Suskind (2021), “How can experiments play a greater role in public policy? Twelve proposals from an economic model of scaling”, Behavioural Public Policy 5(1): 2–49.

Athey, S (2015), “Machine learning and causal inference for policy evaluation”, in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Athey, S and G W Imbens (2019), “Machine learning methods that economists should know about”, Annual Review of Economics 11: 685–725.

Athey, S and M Luca (2019), “Economists (and economics) in tech companies”, Journal of Economic Perspectives 33: 209–30.

Bommasani, R, D A Hudson, E Adeli et al. (2022), “On the opportunities and risks of foundation models”, arXiv preprint, arXiv:2108.07258.

Brynjolfsson, E, D Li, and L R Raymond (2023), “Generative AI at work”, NBER Working Paper 31161.

Brynjolfsson, E and A McAfee (2017), “Artificial intelligence, for real”, Harvard Business Review 1: 1–31.

Bubeck, S, V Chandrasekaran, R Eldan et al. (2023), “Sparks of artificial general intelligence: Early experiments with GPT-4”, arXiv preprint, arXiv:2303.12712.

Charness, G, G R Fréchette, and J H Kagel (2004), “How robust is laboratory gift exchange?”, Experimental Economics 7: 189–205.

Charness, G, B Jabarian, and J A List (2023), “Generation Next: Experimentation with AI”, NBER Working Paper 31679.

Chen, M, J Tworek, H Jun et al. (2021), “Evaluating large language models trained on code”, arXiv preprint, arXiv:2107.03374.

Davies, A, P Veličković, L Buesing et al. (2021), “Advancing mathematics by guiding human intuition with AI”, Nature 600: 70–74.

Eloundou, T, S Manning, P Mishkin, and D Rock (2023), “GPTs are GPTs: An early look at the labor market impact potential of large language models”, arXiv preprint, arXiv:2303.10130.

Fréchette, G R, K Sarnoff, and L Yariv (2022), “Experimental economics: Past and future”, Annual Review of Economics 14: 777–794.

Gillen, B, E Snowberg, and L Yariv (2019), “Experimenting with measurement error: Techniques with applications to the Caltech Cohort Study”, Journal of Political Economy 127: 1826–1863.

Jabarian, B and E Sartori (2020), “Critical thinking and storytelling”, arXiv preprint, arXiv:2303.16422.

Kleinberg, J, J Ludwig, S Mullainathan, and A Rambachan (2018), “Algorithmic fairness”, AEA Papers and Proceedings 108: 22–27.

Korinek, A (2023), “Language models and cognitive automation for economic research”, NBER Working Paper 30957.

Lazer, D M J, M A Baum, Y Benkler et al. (2018), “The science of fake news”, Science 359: 1094–1096.

Levitt, S D and J A List (2009), “Field experiments in economics: The past, the present, and the future”, European Economic Review 53(1): 1–18.

List, J A (2011), “Why economists should conduct field experiments and 14 tips for pulling one off”, Journal of Economic Perspectives 25(3): 3–16.

List, J A (2023), A course in experimental economics, University of Chicago Press.

Matthews, D (2023), “The AI rules that US policymakers are considering, explained”, 1 August.

Noy, S and W Zhang (2023), “Experimental evidence on the productivity effects of generative artificial intelligence”, Science 381: 187–192.

Pennycook, G, Z Epstein, M Mosleh, A A Arechar, D Eckles and D G Rand (2021), “Shifting attention to accuracy can reduce misinformation online”, Nature 592: 590–595.

Saunders, W, C Yeh, J Wu, S Bills, L Ouyang, J Ward and J Leike (2022), “Self-critiquing models for assisting human evaluators”, arXiv preprint, arXiv:2206.05802.

Snowberg, E and L Yariv (2021), “Testing the waters: Behavior across participant pools”, American Economic Review 111: 687–719.