The recent availability of large datasets, combined with advances in statistics, machine learning, and econometrics, has generated interest in predictive models with many possible predictors. In 2018, a researcher who wants to forecast the future growth rate of US GDP, for example, can draw on hundreds of potentially useful predictive variables, such as aggregate and sectoral employment, prices, and interest rates, among many others.
In this type of 'big data' situation, standard estimation techniques – such as ordinary least squares (OLS) or maximum likelihood – perform poorly. To understand why, consider the extreme case of an OLS regression with as many regressors as observations. The in-sample fit of this model would be perfect, but its out-of-sample performance would be embarrassingly bad. More formally, the proliferation of regressors magnifies estimation uncertainty, producing inaccurate out-of-sample predictions. As a consequence, inference methods aimed at dealing with this curse of dimensionality have become increasingly popular.
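The extreme case can be seen in a few lines of simulation (a hypothetical toy example, not from the paper): regressing pure noise on as many noise regressors as there are observations yields a perfect in-sample fit, yet out-of-sample the fitted model does worse than simply guessing the mean.

```python
import numpy as np

# Toy illustration (hypothetical data): OLS with as many regressors as
# observations fits the training sample perfectly but predicts terribly.
rng = np.random.default_rng(0)
n = 50
X_train = rng.standard_normal((n, n))   # 50 observations, 50 regressors
y_train = rng.standard_normal(n)        # pure noise: no regressor truly matters
X_test = rng.standard_normal((n, n))
y_test = rng.standard_normal(n)

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(y, y_hat):
    """R^2 relative to predicting the sample mean."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2(y_train, X_train @ beta))  # ~1.0: perfect in-sample fit
print(r2(y_test, X_test @ beta))    # negative: worse than guessing the mean
```

With a square regressor matrix, OLS interpolates the training data exactly, so all apparent fit is estimation noise that does not generalise.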
Ng (2013) and Chernozhukov et al. (2017) suggested that these methodologies can be divided into two broad classes:
- Sparse modelling techniques. These focus on selecting, from a much larger pool of possible regressors, a small set of explanatory variables with the highest predictive power. The popular LASSO and its variants, for instance, belong to this class of estimators that yield sparse representations of predictive models (Tibshirani 1996; see Belloni et al. 2011 for a recent survey and examples of big data applications of these methodologies in economics).
- Dense modelling techniques. At the opposite end of the spectrum, these techniques recognise that all possible explanatory variables might matter for prediction, although the impact of some may be small. This insight justifies the use of shrinkage or regularisation techniques, which prevent overfitting by forcing parameter estimates to be small when sample information is weak. Factor analysis and ridge regression are standard examples of dense statistical modelling (Pearson 1901, Tikhonov 1963; see Stock and Watson 2002 or De Mol et al. 2008 for big data applications of these techniques in economics).
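The contrast between the two classes can be made concrete with a small simulation (a hypothetical example; the penalty level `lam` is an arbitrary illustrative choice, not a tuned value). Ridge shrinks every coefficient toward zero but sets none exactly to zero, whereas the LASSO's soft-thresholding update zeroes most irrelevant coefficients exactly:

```python
import numpy as np

# Hypothetical example: 20 candidate predictors, only 3 truly matter
rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

lam = 20.0  # penalty level (arbitrary choice for illustration)

# Ridge (dense): closed form, shrinks all coefficients but zeroes none
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# LASSO (sparse): cyclic coordinate descent with soft-thresholding
beta_lasso = np.zeros(p)
col_ss = (X ** 2).sum(axis=0)
for _ in range(200):
    for j in range(p):
        resid_j = y - X @ beta_lasso + X[:, j] * beta_lasso[j]  # partial residual
        rho = X[:, j] @ resid_j
        beta_lasso[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]

print(int(np.sum(np.abs(beta_ridge) < 1e-8)))  # 0: ridge keeps every predictor
print(int(np.sum(np.abs(beta_lasso) < 1e-8)))  # most irrelevant coefficients are exactly zero
```

This is the fundamental distinction stressed below: shrinkage makes coefficients small, while variable selection sets some of them identically to zero.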
While similar in spirit, these two approaches might differ in their predictive accuracy. In addition, there is a fundamental distinction between a dense model with shrinkage, which pushes some coefficients to be small, and a sparse model with variable selection, which sets some coefficients identically to zero. Low-dimensional, sparse models may also appear easier to interpret economically, which is an attractive property for researchers.
Before even starting to discuss whether these structural interpretations are warranted – in most cases they are not, given the predictive nature of the models – it is important to address whether the data are informative enough to favour sparse models and rule out dense ones.
Sparse or dense modelling?
We proposed to shed light on these issues by estimating a model that encompasses both sparse and dense approaches (Giannone et al. 2017). Our main result was that sparse predictive models are rarely preferred in economics. A clearer pattern of sparsity only emerges when a researcher strongly favours low-dimensional models a priori.
We developed a variant of the 'spike-and-slab' model, originally proposed by Mitchell and Beauchamp (1988). The objective was to predict a variable of interest – say, GDP growth – using many predictors, for example a large number of macroeconomic indicators.
The model postulates that only some of the predictors are relevant. The unknown fraction of relevant predictors – denote it by q – is a key object of interest, since it represents model size. Note, however, that if we tried to conduct inference on model size in this simple framework, we would never estimate it to be very large, because high-dimensional models without regularisation suffer from the curse of dimensionality discussed above. Therefore, to make our sparse–dense bake-off fairer, we also allowed for shrinkage: whenever a predictor was deemed relevant, its impact on the response variable was prevented from being too large, to avoid overfitting. We then conducted Bayesian inference on model size and the degree of shrinkage.
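A minimal sketch of the inferential idea (hypothetical data and settings; a conjugate g-prior slab stands in for the paper's shrinkage prior, and exhaustive enumeration of models stands in for its sampler): each inclusion pattern receives a marginal likelihood, and averaging over patterns yields the posterior probability of inclusion of each predictor and a posterior over model size.

```python
import numpy as np
from itertools import combinations

# Hypothetical toy data: 8 candidate predictors, only the first two relevant
rng = np.random.default_rng(2)
n, p, g = 200, 8, 100.0     # g scales the slab variance (arbitrary choice here)
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)
y = y - y.mean()

def log_marglik(idx):
    """Log marginal likelihood of the model using predictors `idx`,
    under Zellner's g-prior slab (up to a constant common to all models)."""
    k = len(idx)
    if k == 0:
        return 0.0
    Xm = X[:, list(idx)]
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - Xm @ b) ** 2) / (y @ y)
    return -0.5 * k * np.log(1 + g) - 0.5 * (n - 1) * np.log(1 - g / (1 + g) * r2)

# Enumerate all 2^p inclusion patterns under a uniform prior over patterns
models = [idx for k in range(p + 1) for idx in combinations(range(p), k)]
logml = np.array([log_marglik(m) for m in models])
post = np.exp(logml - logml.max())
post /= post.sum()

# Posterior inclusion probability of each predictor (the 'heat map' quantity)
pip = np.array([post[[j in m for m in models]].sum() for j in range(p)])
print(np.round(pip, 2))  # near 1 for predictors 0 and 1, small for the rest

# Posterior distribution of model size (the analogue of q, scaled by p)
size_post = np.bincount([len(m) for m in models], weights=post, minlength=p + 1)
print(np.round(size_post, 2))
```

In this truly sparse toy setup the posterior concentrates on small models; the paper's point is that in real economic datasets it typically does not.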
Applications in macro, finance, and micro
We estimated our model on six popular 'big' datasets that have been used for predictive analyses with large information in the fields of macroeconomics, finance, and microeconomics.
In our macroeconomic applications, we investigated the predictability of aggregate economic activity in the US (Stock and Watson 2002) and the determinants of economic growth in a cross-section of countries (Barro and Lee 1994, Belloni et al. 2011).
In finance, we studied the predictability of the US equity premium (Welch and Goyal 2008), and the factors that explain the cross-sectional variation of US stock returns (Freyberger et al. 2017).
In our microeconomic analyses, we investigated the factors behind the decline in the crime rate in a cross-section of US states (Donohue and Levitt 2001, Belloni et al. 2014), and the determinants of rulings in the matter of government takings of private property in US judicial circuits (Chen and Yeh 2012, Belloni et al. 2012).
Table 1 reports some details of our six applications. They covered a broad range of configurations, in terms of types of data – time-series, cross-section and panel data – and sample sizes relative to the number of predictors.
Table 1 Summary details of the empirical applications
Source: Giannone et al. (2017).
Result 1: No clear pattern of sparsity
The first key result delivered by our Bayesian inferential procedure was that, in all applications but one, the data do not support sparse model representations. To illustrate this point, Figure 1 plots the posterior distribution of the fraction of relevant predictors (q) in our six empirical applications. Only in the case of Micro 1 is this posterior concentrated around very low values. In all other applications, larger values of q are more likely, suggesting that including more than a handful of predictors is preferable in order to improve forecasting accuracy. In the cases of Macro 2 and Finance 1, for example, the preferred specification is the dense model with all predictors (q = 1).
Figure 1 Posterior density of the fraction of relevant predictors (q)
Source: Giannone et al. (2017).
Even more surprisingly, our posterior results were inconsistent with clear sparsity patterns even when the posterior density of q concentrated around values smaller than 1, as in the Macro 1, Finance 2, and Micro 2 cases. To show this point, Figure 2 plots the posterior probability of inclusion of each predictor in the six empirical applications.
In the 'heat maps' of this figure, each vertical stripe corresponds to a possible predictor, and darker shades denote higher probabilities of inclusion. The most straightforward subplot to interpret is from Micro 1. This is a truly sparse model, in which the 39th regressor is selected 65% of the time, and all other predictors are rarely included.
Figure 2 Heat maps of the probabilities of inclusion of each predictor
Source: Giannone et al. (2017).
The remaining five applications, however, do not exhibit a distinct pattern of sparsity, because all predictors appear relevant with non-negligible probability. Consider, for example, the case of Macro 1, in which the best-fitting models are those with q around 0.25, according to Figure 1. Figure 2, however, suggests that there is considerable uncertainty about which specific group of predictors should be selected: many different models using about 25% of the predictors deliver very similar predictive accuracy. As a consequence, it is difficult to characterise any single representation of the predictive model as sparse.
Result 2: More sparsity only with an a priori bias in favour of small models
Our second important result was that clearer sparsity patterns emerge only when the researcher has a strong a priori bias in favour of predictive models with a small number of regressors. To demonstrate this point, we re-estimated our model forcing q to be very small (more formally, we used an extremely tight prior centred on very low values of q).
Figure 3 shows the posterior probabilities of inclusion obtained with this alternative estimation. Relative to our baseline, these heat maps have much larger light-coloured areas, indicating that many more coefficients are systematically excluded, and revealing clearer patterns of sparsity in all six applications.
Put differently, when the model is forced to be low-dimensional, the data are better at identifying a few powerful predictors. When model size is not fixed a priori, model uncertainty is pervasive.
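This mechanism can be sketched in a small self-contained simulation (hypothetical data and settings; a conjugate g-prior slab and exhaustive model enumeration stand in for the paper's actual prior and sampler): reweighting the same marginal likelihoods by a tight Bernoulli(q) inclusion prior with low q mechanically shifts posterior mass toward much smaller models.

```python
import numpy as np
from itertools import combinations

# Hypothetical example: two highly correlated copies of the same signal,
# so many models of moderate size fit almost equally well
rng = np.random.default_rng(3)
n, p, g, q = 200, 8, 100.0, 0.05
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)  # near-duplicate predictor
y = X[:, 0] + rng.standard_normal(n)
y = y - y.mean()

def log_marglik(idx):
    """Log marginal likelihood under a Zellner g-prior slab (up to a constant)."""
    k = len(idx)
    if k == 0:
        return 0.0
    Xm = X[:, list(idx)]
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - Xm @ b) ** 2) / (y @ y)
    return -0.5 * k * np.log(1 + g) - 0.5 * (n - 1) * np.log(1 - g / (1 + g) * r2)

models = [idx for k in range(p + 1) for idx in combinations(range(p), k)]
logml = np.array([log_marglik(m) for m in models])
sizes = np.array([len(m) for m in models])

def expected_size(logprior):
    """Posterior expected number of included predictors for a given log prior."""
    w = np.exp(logml + logprior - (logml + logprior).max())
    w /= w.sum()
    return float(w @ sizes)

flat = np.zeros(len(models))                               # uniform over patterns
tight = sizes * np.log(q) + (p - sizes) * np.log(1 - q)    # tight Bernoulli(q) prior
print(expected_size(flat), expected_size(tight))  # the tight prior yields smaller models
```

The data are unchanged between the two runs; only the prior differs, which is why the resulting sparsity reflects the researcher's a priori bias rather than sample information.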
Figure 3 Heat maps of the probabilities of inclusion of each predictor when models are forced to be low dimensional
Source: Giannone et al. (2017).
Summing up, strong prior beliefs favouring low-dimensional models appear to be necessary to support sparse representations. In most cases, the idea that the data are informative enough to identify sparse predictive models might be an illusion.
Authors’ note: The views expressed in this paper are those of the authors and do not necessarily reflect views at the ECB, the Eurosystem, the Federal Reserve Bank of New York, or the Federal Reserve System.
Barro, R J and J-W Lee (1994), 'Sources of economic growth', Carnegie-Rochester Conference Series on Public Policy 40: 1–46.
Belloni, A, V Chernozhukov, and C Hansen (2011), 'Inference for high-dimensional sparse econometric models', in Advances in Economics and Econometrics, World Congress of Econometric Society 2010.
Belloni, A, D L Chen, V Chernozhukov, and C Hansen (2012), 'Sparse models and methods for optimal instruments with an application to eminent domain', Econometrica 80: 2369–2429.
Belloni, A, V Chernozhukov, and C Hansen (2014), 'Inference on treatment effects after selection among high-dimensional controls', The Review of Economic Studies 81: 608–650.
Chen, D L and S Yeh (2012), 'Growth under the shadow of expropriation? The economic impacts of eminent domain', mimeo, Toulouse School of Economics.
Chernozhukov, V, C Hansen, and Y Liao (2017), 'A lava attack on the recovery of sums of dense and sparse signals', Annals of Statistics 45: 39–76.
De Mol, C, D Giannone, and L Reichlin (2008), 'Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components?' Journal of Econometrics 146: 318–328.
Donohue, J J and S D Levitt (2001), 'The impact of legalized abortion on crime', The Quarterly Journal of Economics 116: 379–420.
Freyberger, J, A Neuhierl, and M Weber (2017), 'Dissecting Characteristics Nonparametrically', NBER Working Paper no. 23227.
Giannone, D, M Lenza, and G E Primiceri (2017), 'Economic Predictions with Big Data: The Illusion of Sparsity', CEPR Discussion Paper no. 12256.
Hastie, T, R Tibshirani, and M Wainwright (2015), Statistical learning with sparsity, CRC press.
Mitchell, T J and J J Beauchamp (1988), 'Bayesian Variable Selection in Linear Regression', Journal of the American Statistical Association 83: 1023–1032.
Ng, S (2013), 'Variable Selection in Predictive Regressions', in G Elliott and A Timmermann (eds.), Handbook of Economic Forecasting, Vol. 2, Elsevier.
Pearson, K (1901), 'On lines and planes of closest fit to systems of points in space', Philosophical Magazine Series 6(2): 559–572.
Stock, J H and M W Watson (2002), 'Forecasting Using Principal Components from a Large Number of Predictors', Journal of the American Statistical Association 97: 147–162.
Tibshirani, R (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B (Methodological) 58: 267–288.
Tikhonov, A N (1963), 'Solution of Incorrectly Formulated Problems and the Regularization Method', Soviet Math. Dokl. 5: 1035–1038.
Welch, I and A Goyal (2008), 'A Comprehensive Look at The Empirical Performance of Equity Premium Prediction', Review of Financial Studies 21: 1455–1508.