VoxEU Column Industrial organisation Frontiers of economic research

Machine learning as a natural experiment: Application to fashion e-commerce

Machine learning algorithms are increasingly being used in decision making. Web companies, ride-hailing services, and courts rely on algorithms to supply content, set prices, and estimate recidivism rates. This column introduces a method for predicting counterfactual performance of new algorithms using data from older algorithms as a natural experiment. When applied to a fashion e-commerce service, the method increased the click-through rate and improved the recommendation algorithm.

Decision making using prediction by machine learning (ML) algorithms is becoming increasingly widespread (Athey and Imbens 2017, Mullainathan and Spiess 2017). For instance, Amazon, Facebook, Google, Microsoft, Netflix, and other web companies apply machine learning to problems such as personalising ads and content (movies, music, news, etc.), determining prices, and ranking search results. The prices set by ride-hailing services such as Uber, Lyft, and DiDi are also determined by proprietary algorithms that use information about supply and demand at each point in time and location (Cohen et al. 2016).

The use of ML algorithms in decision making is expanding beyond the digital world into high-stakes real-world settings such as court and bail decisions. COMPAS, software developed by Northpointe (now Equivant), uses supervised machine learning to predict defendants' recidivism risk. The resulting risk predictions are already used in practice by many judges in the US (Kleinberg et al. 2017). Other emergent areas are personnel recruitment systems, predictive policing, and medical diagnostics (Hoffman et al. 2017, Horton 2017, Shapiro 2017, Rajkomar et al. 2019). Table 1 summarises some of these examples.

Table 1 Examples of decision making based on ML algorithms



Non-ML algorithms are also popular in public policy. For instance, matching and assignment algorithms are used in school choice and admissions systems (Abdulkadiroğlu et al. 2017, 2020, Narita 2020a), entry-level labour markets, and organ transplant markets around the world. Auction algorithms are widespread in many other settings, from government bond markets and wholesale markets to online advertising and second-hand goods markets. In other public policy areas, algorithmic rules are used to determine eligibility for benefits (Currie and Gruber 1996, Mahoney 2015). Such market design and policy eligibility rules are also algorithmic decision making (Table 2).

Table 2 Examples of algorithm-based public policy decision making



An important part of algorithmic decision making is to predict the performance of new decision-making algorithms that have not yet been used. With accurate performance prediction, the algorithm can be iteratively improved. A method of performance prediction that immediately comes to mind is a randomised experiment (RCT, A/B test), in which an old algorithm and a new algorithm are randomly assigned to users and compared. However, RCTs are time-consuming and expensive, and they raise potential ethical issues (Narita 2020b). Is there a way to predict performance using only the data that is naturally generated by past algorithms without resorting to RCTs?

We propose a method that uses the data accumulated by past algorithms to predict the counterfactual performance of a new algorithm. As detailed in Narita and Yata (2021), this method is based on the following observation (Figure 1): when an algorithm is used to make a decision, the algorithm-generated data will almost always contain natural experiments (quasi-randomly assigned instruments) in which the decision is made (quasi-)randomly conditional on algorithm inputs. For instance, many probabilistic reinforcement learning and bandit algorithms are almost RCTs themselves as they randomise the decision (Li et al. 2010, Precup 2000).
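To illustrate the simplest case, in which the old algorithm randomises its decisions, the sketch below applies inverse propensity weighting, a standard off-policy evaluation technique, to simulated logs. It is not the estimator of Narita and Yata (2021). The click probabilities, the two policies, and all function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical setup: two user segments (contexts) and two items (actions).
# CLICK_PROB[context][action] is the true click probability, unknown in practice.
CLICK_PROB = {0: [0.10, 0.02], 1: [0.03, 0.12]}


def old_policy_prob(context, action):
    """Old algorithm: shows either item with probability 0.5."""
    return 0.5


def new_policy_prob(context, action):
    """New algorithm: always shows item 0 to segment 0 and item 1 to segment 1."""
    return 1.0 if action == context else 0.0


def simulate_logs(n, rng):
    """Generate (context, action, reward, propensity) records from the old algorithm."""
    logs = []
    for _ in range(n):
        context = rng.randrange(2)
        action = rng.randrange(2)  # uniform draw, matching old_policy_prob
        reward = 1 if rng.random() < CLICK_PROB[context][action] else 0
        logs.append((context, action, reward, old_policy_prob(context, action)))
    return logs


def ipw_value(logs, target_prob):
    """Inverse propensity weighted estimate of the target policy's mean reward."""
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action than the old policy was.
        total += reward * target_prob(context, action) / propensity
    return total / len(logs)


rng = random.Random(0)
logs = simulate_logs(100_000, rng)

estimate = ipw_value(logs, new_policy_prob)
# Known only because the data are simulated; segments are equally likely.
true_value = 0.5 * CLICK_PROB[0][0] + 0.5 * CLICK_PROB[1][1]
logged_ctr = sum(reward for _, _, reward, _ in logs) / len(logs)

print(f"click rate under old algorithm: {logged_ctr:.4f}")
print(f"IPW estimate for new algorithm: {estimate:.4f}")
print(f"true value of new algorithm:    {true_value:.4f}")
```

The estimate uses only data logged by the old algorithm, yet it should land close to the click rate the new algorithm would achieve, because the old algorithm's randomisation acts as a built-in experiment.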

Figure 1 Why algorithm is experiment


As a less obvious example, consider an algorithm that makes a selection based on whether or not some variable predicted by supervised learning exceeds some criterion value. In this case, cases just above and just below the criterion value have almost the same predicted value, yet they receive different decisions almost by chance, depending on whether the criterion value happens to be cleared. This is a regression discontinuity-style local natural experiment (Bundorf et al. 2019, Cowgill 2018).
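A minimal sketch of this threshold logic, on simulated data, is given below. It fits a separate line to the observations on each side of the criterion value and reads off the jump at the cutoff, a textbook local linear regression discontinuity estimate. The cutoff, the effect size, the bandwidth, and the outcome model are all assumptions made for illustration and do not come from the applications cited above.

```python
import random

CUTOFF = 0.5        # hypothetical criterion value used by the algorithm
TRUE_EFFECT = 0.3   # hypothetical effect of the decision, known only in simulation


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Jump in the outcome at the cutoff, using observations within the bandwidth."""
    left_x, left_y, right_x, right_y = [], [], [], []
    for score, outcome in zip(scores, outcomes):
        if cutoff - bandwidth <= score < cutoff:
            left_x.append(score - cutoff)
            left_y.append(outcome)
        elif cutoff <= score <= cutoff + bandwidth:
            right_x.append(score - cutoff)
            right_y.append(outcome)
    # Scores are centred at the cutoff, so each intercept is the fitted
    # outcome exactly at the criterion value.
    intercept_left, _ = fit_line(left_x, left_y)
    intercept_right, _ = fit_line(right_x, right_y)
    return intercept_right - intercept_left


rng = random.Random(0)
scores, outcomes, decisions = [], [], []
for _ in range(50_000):
    score = rng.random()            # stands in for the ML-predicted variable
    treated = score >= CUTOFF       # the algorithm's decision rule
    outcome = score + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0.0, 0.1)
    scores.append(score)
    outcomes.append(outcome)
    decisions.append(treated)

# Naive comparison of all treated and untreated cases mixes the effect of the
# decision with the effect of having a higher score.
treated_mean = sum(y for y, d in zip(outcomes, decisions) if d) / sum(decisions)
untreated_mean = sum(y for y, d in zip(outcomes, decisions) if not d) / (len(decisions) - sum(decisions))
naive = treated_mean - untreated_mean

rd_effect = rd_estimate(scores, outcomes, CUTOFF, bandwidth=0.1)

print(f"naive difference in means: {naive:.3f}")
print(f"local RD estimate:         {rd_effect:.3f}")
print(f"true effect:               {TRUE_EFFECT:.3f}")
```

In this simulation the naive comparison overstates the effect because selected cases also have higher scores, while the comparison restricted to the neighbourhood of the cutoff should stay close to the true effect.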

These natural experiments can be used for a variety of purposes. They can be used to measure the treatment effects of different decisions; or they can be used to predict how a new decision-making algorithm is likely to perform when introduced. We formalise this observation for a general algorithm and develop a method to improve the algorithm using only the data naturally generated by the algorithm.

Potential applications of this method range widely from business to policy. As a concrete application, we deploy our method to improve the design of the fashion e-commerce service ZOZOTOWN. ZOZOTOWN is the largest fashion e-commerce platform in Japan, with an annual gross merchandise value of over $3 billion. The founder of this company is famous for purchasing SpaceX’s first civilian ticket to the moon for about $700 million. 

In this application, which we detail in Saito et al. (2020), we increased the click-through rate of fashion recommendations made by ZOZOTOWN by about 40%. We also succeeded in finding ways to further improve the recommendation algorithm (Figure 2). The recommendation data and the code used in this implementation are open source and available on GitHub.

Figure 2 Performance comparison of the old (right) and new (left) algorithms


Authors’ note: The main research on which this column is based (Narita et al. 2020c) first appeared as a Discussion Paper of the Research Institute of Economy, Trade and Industry (RIETI) of Japan.


Abdulkadiroğlu, A, J D Angrist, Y Narita and P A Pathak (2017), “Research Design Meets Market Design: Using Centralized Assignment for Impact Evaluation”, Econometrica 85(5): 1373–1432. 

Abdulkadiroğlu, A, J D Angrist, Y Narita and P A Pathak (2020), “Breaking Ties: Regression Discontinuity Design Meets Market Design”, Working Paper.  

Athey, S and G W Imbens (2017), “The State of Applied Econometrics: Causality and Policy Evaluation”, Journal of Economic Perspectives 31(2): 3–32. 

Bundorf, K, M Polyakova and M Tai-Seale (2019), “How Do Humans Interact with Algorithms? Experimental Evidence from Health Insurance”, NBER Working Paper No. 25976. 

Cohen, P, R Hahn, J Hall, S Levitt and R Metcalfe (2016), “Using Big Data to Estimate Consumer Surplus: The Case of Uber”, NBER Working Paper No. 22627.

Cowgill, B (2018), “The Impact of Algorithms on Judicial Discretion: Evidence from Regression Discontinuities”, Working Paper.

Currie, J and J Gruber (1996), “Health Insurance Eligibility, Utilization of Medical Care, and Child Health”, Quarterly Journal of Economics 111(2): 431–466. 

Hoffman, M, L B Kahn and D Li (2017), “Discretion in Hiring”, Quarterly Journal of Economics 133(2): 765–800. 

Horton, J J (2017), “The Effects of Algorithmic Labor Market Recommendations: Evidence from a Field Experiment”, Journal of Labor Economics 35(2): 345–385. 

Kleinberg, J, H Lakkaraju, J Leskovec, J Ludwig and S Mullainathan (2017), “Human Decisions and Machine Predictions”, Quarterly Journal of Economics 133(1): 237–293. 

Li, L, W Chu, J Langford and R E Schapire (2010), “A Contextual-Bandit Approach to Personalized News Article Recommendation”, Proceedings of the 19th International Conference on the World Wide Web (WWW): 661–670. 

Mahoney, N (2015), “Bankruptcy as Implicit Health Insurance”, American Economic Review 105(2): 710–746.

Mullainathan, S and J Spiess (2017), “Machine Learning: An Applied Econometric Approach”, Journal of Economic Perspectives 31(2): 87–106.

Narita, Y (2020a), “A Theory of Quasi-Experimental Evaluation of School Quality”, Management Science. 

Narita, Y (2020b), “Incorporating Ethics and Welfare into Randomized Experiments”, Proceedings of the National Academy of Sciences of the United States of America 118(1). 

Narita, Y, S Aihara, Y Saito, M Matsutani and K Yata (2020c), “Machine Learning as Natural Experiment: Method and Deployment at Japanese Firms”, RIETI Discussion Paper.

Narita, Y and K Yata (2021), “Algorithm is Experiment: Machine Learning, Market Design, and Policy Eligibility Rules”, Working Paper. 

Precup, D (2000), “Eligibility Traces for Off-Policy Policy Evaluation”, ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning: 759–766. 

Rajkomar, A, J Dean and I Kohane (2019), “Machine Learning in Medicine”, The New England Journal of Medicine 380(14): 1347–1358.

Saito, Y, S Aihara, M Matsutani and Y Narita (2020), “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation”, arXiv: 2008.07146. 

Shapiro, A (2017), “Reform Predictive Policing”, Nature News 541(7638): 458–460.
