The evaluation of econometric models

A one-day workshop was held on Friday, 23 March to discuss methods of evaluating econometric models generally, though the evaluation of large macroeconometric models was a particular concern. The workshop, which was part of the Centre's programme on Developments in Applied Economic Theory and Econometrics, was chaired by Programme Director Grayham Mizon. It was attended by 24 people, including academic econometricians, economists from HM Treasury, NEDO and the Civil Service College, and members of the NIESR and LBS modelling teams.

In his introductory remarks, Mizon noted that econometric models had been built and used since the mid-1960s, but for obvious reasons more effort had been put into their construction than into the development and application of methods for evaluating and comparing such models. Attention should now turn to the adequacy of commonly used methods of model evaluation (e.g. comparison of forecast accuracy and dynamic simulation performance) and to the alternatives that might be explored. This was particularly important since the funding of many of the large macroeconometric modelling teams had been made more secure by the allocations of the Consortium on Macroeconomic Research. Furthermore, the creation of the Economic and Social Research Council's Macroeconomic Modelling Bureau at Warwick University, with funding from the Consortium, should allow easier access to the major models and permit more ambitious model evaluation and comparison exercises. With these remarks in mind, the participants heard papers by David Hendry, Trevor Breusch, Hashem Pesaran and Noxy Dastoor, which were then commented on by Ron Smith, Alberto Holly, Len Gill and Grayham Mizon before a general discussion was opened. Some of the material was rather technical, but the issues addressed were of wide general relevance.

Hendry emphasised the need for funding of research on model evaluation because some methods currently used were inadequate. His paper concentrated on the statistical considerations relevant to model evaluation, so that attention could be focussed on a coherent set of problems and potential solutions. While other non-statistical criteria are clearly relevant, econometric evaluation would always be an important part of model assessment. If they were to be useful, models must be subjected to a wide range of statistical tests - e.g. for serial correlation, heteroscedasticity, instrument validity and constancy of parameters - and to checks of whether their random errors are innovations (shocks) relative to the information set being used.
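
By way of illustration (not part of the workshop material), the sketch below computes two of the standard diagnostic statistics on a simulated, well-specified regression: the Durbin-Watson statistic for first-order serial correlation and the Breusch-Pagan LM statistic for heteroscedasticity. The data, the plain least-squares implementation and all numerical choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data standing in for a single estimated equation of a model.
n = 200
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])            # regressors with intercept
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS estimate
e = y - X @ beta                                # residuals

# Durbin-Watson statistic: close to 2 when the residuals show no
# first-order serial correlation.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Breusch-Pagan LM statistic: regress the squared residuals on the
# regressors; n * R^2 is asymptotically chi-squared (2 df here) under
# the null of homoscedasticity.
g = e ** 2
gamma, *_ = np.linalg.lstsq(X, g, rcond=None)
r2 = 1 - np.sum((g - X @ gamma) ** 2) / np.sum((g - g.mean()) ** 2)
lm = n * r2

print(f"Durbin-Watson: {dw:.2f} (about 2 under the null)")
print(f"Breusch-Pagan LM: {lm:.2f} (chi-squared, 2 df, under the null)")
```

On this well-specified simulation both statistics should be unremarkable; the point is only the mechanics of computing them, not any verdict on a real model.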

Hendry pointed out the special role of such tests in modelling and model selection. Since an acceptable model for a particular purpose would by design satisfy the battery of test statistics employed to select it, the results of such tests cannot be interpreted in the usual way. Rather they should be regarded as descriptive statistics, which characterize the adequacy of the modelling strategy. For example, if models must exhibit constant parameters in order to be acceptable, the particular value taken by the parameter constancy test statistic for the finally reported model simply reflects the stringency with which this particular selection criterion was applied. A genuine test of the constancy of a model's parameters could only be carried out using data which were not employed in the model selection process.
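
One concrete form that such a genuine test can take is a Chow predictive-failure test on data held out of the selection sample. The following sketch (illustrative only; simulated data, not from the workshop) estimates an equation on the first part of a stable sample and asks whether the held-out remainder is consistent with the selected parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stable process: the first n1 observations stand in for
# the sample used during model selection; the last n2 are held out.
n1, n2, k = 150, 30, 3
n = n1 + n2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=n)

# Estimate on the selection sample only.
b1, *_ = np.linalg.lstsq(X[:n1], y[:n1], rcond=None)
rss1 = np.sum((y[:n1] - X[:n1] @ b1) ** 2)

# Re-estimate on the full sample.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ b) ** 2)

# Chow predictive-failure F statistic: large values indicate that the
# held-out data are poorly explained by the earlier parameters.
f_stat = ((rss - rss1) / n2) / (rss1 / (n1 - k))
print(f"Predictive-failure F({n2}, {n1 - k}) = {f_stat:.2f}")
```

Because the held-out data played no role in selecting the model, a large value of this statistic is evidence against parameter constancy rather than a by-product of the selection strategy.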

In his comments on Hendry's paper, Ron Smith challenged the view that one should "test, test and test again", arguing that it is virtually impossible to determine the exact statistical properties of all the tests taken together. General discussion revealed that most participants were not persuaded that it was feasible or preferable to adopt the alternative procedure: a formal decision-theoretic approach to model selection, with the costs and benefits of the relevant model characteristics clearly specified in a loss function.

Breusch and Dastoor analysed particular types of significance test statistic in more detail. An increasingly common approach is to use the specification tests devised by Hausman, which assess the adequacy of a model by comparing an efficient estimate of the key model parameters with another estimate which would remain consistent for the same parameters if the appropriate model were more general than the one being entertained. Such tests can be compared with classical (e.g. likelihood ratio) tests of the hypothesis which yields the model of interest as a special case of a more general model. For situations where they differ, it has been suggested that Hausman tests are better because they focus precisely on the requirement that the estimators of key parameters have desirable properties. Breusch compared the properties of Hausman and classical tests against the explicit objective of good parameter estimates, and concluded that the superiority claimed for Hausman tests was ill-founded. In discussion it was suggested that while Breusch's argument was convincing, Hausman tests could nevertheless be useful as general tests of model adequacy. Dastoor maintained that the Cox non-nested test statistic, which is useful for comparing pairs of models neither of which is a special case of the other - a very common situation with macroeconometric models - can also be interpreted as a classical test within the framework of a general model which embeds the competing models. Whilst this point was uncontroversial, there was strong disagreement with the view that a general model used for this purpose need not be economically sensible.
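
The logic of the Hausman contrast can be shown in a minimal simulated example (hedged: this is a textbook single-regressor illustration, not material from the workshop). OLS is efficient if the regressor is exogenous but inconsistent if it is correlated with the error, while the instrumental-variables estimator remains consistent in both cases; the test compares the two.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical setup: x is correlated with the error u, so OLS is
# inconsistent for beta; z is a valid instrument, so IV stays consistent.
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.6 * u + rng.normal(scale=0.5, size=n)
y = 1.0 * x + u                                 # true beta = 1

b_ols = (x @ y) / (x @ x)                       # efficient under exogeneity
b_iv = (z @ y) / (z @ x)                        # consistent either way

# Homoscedastic variance formulas for the single-regressor case,
# using the error variance from the consistent (IV) fit.
s2 = np.mean((y - b_iv * x) ** 2)
v_ols = s2 / (x @ x)
v_iv = s2 * (z @ z) / (z @ x) ** 2

# Hausman statistic: squared contrast over the difference in variances;
# asymptotically chi-squared with 1 df when x is exogenous.
h = (b_iv - b_ols) ** 2 / (v_iv - v_ols)
print(f"OLS {b_ols:.3f}, IV {b_iv:.3f}, Hausman chi2(1) = {h:.1f}")
```

With the endogeneity built into this simulation, the OLS estimate is biased away from the true value and the contrast is large; under exogeneity the two estimates would coincide asymptotically and the statistic would be small.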

The question of how to compare alternative models which are not necessarily special cases of each other was treated by Pesaran. He discussed the use of information criteria for measuring the "closeness" of models and provided a taxonomy of nested, non-nested, and non-nested but locally nested models. Pesaran, who was presenting joint work with Ron Smith, also emphasised the practical difficulties in attempting to build large macroeconomic models which are useful, satisfy economic theorists and also pass the rigorous technical tests of econometricians. They argued that it is not surprising that compromise and pragmatism are the rule in large-scale modelling. There was general agreement among the participants, however, that notwithstanding these difficulties models must be subjected to rigorous testing.
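
Information criteria give one simple, operational way of comparing non-nested models of the same variable. The sketch below (illustrative only; Gaussian likelihood, simulated data, not drawn from Pesaran's paper) scores two rival single-regressor models of the same series by AIC and BIC.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Two non-nested candidate explanations of y: model A uses x1, model B
# uses x2. Here y is in fact generated from x1.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(scale=0.5, size=n)

def aic_bic(y, X):
    """Gaussian AIC and BIC for an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    k = X.shape[1]
    ll = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # maximised log-likelihood
    return -2 * ll + 2 * k, -2 * ll + k * np.log(n)

aic_a, bic_a = aic_bic(y, np.column_stack([np.ones(n), x1]))
aic_b, bic_b = aic_bic(y, np.column_stack([np.ones(n), x2]))
print(f"Model A: AIC {aic_a:.1f}, BIC {bic_a:.1f}")
print(f"Model B: AIC {aic_b:.1f}, BIC {bic_b:.1f}")
```

Neither model nests the other, yet both can be ranked on a common scale; here the criteria favour model A because it conditions on the variable that actually generates the data.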

The most commonly used methods for comparing and evaluating econometric models are based on dynamic simulation tracking performance, forecast accuracy and economic plausibility. Hendry argued that the first criterion is inadequate. Differences among models in their choices of endogenous and extraneous variables make inter-model comparisons difficult, and this is not resolved by having all model builders agree on a common set of exogenous variables. All that dynamic simulation accuracy reflects, Hendry argued, is the extent to which the explanation of the data is attributed to non-modelled variables, i.e. those which are asserted to be exogenous. Hence dynamic simulation is not a sensible model selection criterion if one wishes to choose models for forecasting, policy analysis and testing economic theories. Neither does the second criterion, forecast accuracy, guarantee model validity, since forecasts are usually generated by a combination of the model and the model builder. Hendry also argued that good one-period forecast accuracy was largely a consequence of choosing models according to goodness-of-fit criteria, so multi-period forecast tests were desirable.
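
The distinction between one-step and multi-step evaluation can be made concrete with a small simulation (purely illustrative, not from Hendry's paper): an AR(1) model is estimated, then its one-step forecasts, which use actual lagged values, are compared with dynamic forecasts, which feed the model its own previous predictions.

```python
import numpy as np

rng = np.random.default_rng(5)

# An AR(1) series: the model is estimated on the first n observations
# and then asked for one-step and dynamic multi-step forecasts over
# the final h observations.
n, h = 250, 50
y = np.zeros(n + h)
for t in range(1, n + h):
    y[t] = 0.9 * y[t - 1] + rng.normal()

rho = (y[1:n] @ y[:n - 1]) / (y[:n - 1] @ y[:n - 1])  # AR(1) estimate

# One-step forecasts: each step uses the actual lagged value.
one_step = rho * y[n - 1:n + h - 1]

# Dynamic forecasts: the model is fed its own previous forecast.
dyn = np.empty(h)
prev = y[n - 1]
for i in range(h):
    prev = rho * prev
    dyn[i] = prev

rmse_one = np.sqrt(np.mean((y[n:] - one_step) ** 2))
rmse_dyn = np.sqrt(np.mean((y[n:] - dyn) ** 2))
print(f"one-step RMSE {rmse_one:.2f} vs dynamic RMSE {rmse_dyn:.2f}")
```

Even for a correctly specified model the dynamic errors accumulate towards the unconditional variance of the series, so the two exercises measure quite different things, which is why good one-step accuracy alone is a weak credential.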

Moreover, even models which have been designed to satisfy best-practice econometric tests should then be evaluated using the new information provided by alternative models. This could be achieved by adopting the encompassing principle, which requires that a model be able to explain the behaviour of competing models. In particular, a model which is to replace previously acceptable models should be able to account for at least as much as its predecessors.
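
One simple variant of this idea (a forecast-encompassing regression, offered here as a hedged illustration rather than as the procedure discussed at the workshop) regresses the data on the predictions of two rival models; if the rival's predictions receive a negligible weight, the first model already accounts for whatever the rival can explain.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Model A conditions on the variable that actually drives y; rival
# model B only sees a noisy proxy of it. All data are simulated.
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.5, size=n)
proxy = x + rng.normal(size=n)

XA = np.column_stack([np.ones(n), x])
XB = np.column_stack([np.ones(n), proxy])
fA = XA @ np.linalg.lstsq(XA, y, rcond=None)[0]   # model A's fitted values
fB = XB @ np.linalg.lstsq(XB, y, rcond=None)[0]   # model B's fitted values

# Forecast-encompassing regression of y on both models' predictions.
# A negligible weight on fB means model A already accounts for
# whatever model B can explain, i.e. A encompasses B.
w, *_ = np.linalg.lstsq(np.column_stack([fA, fB]), y, rcond=None)
print(f"weight on model A: {w[0]:.2f}, weight on model B: {w[1]:.2f}")
```

In this simulation the weight on model A is close to one and the weight on model B close to zero: B's proxy carries no information beyond what A already uses, so A encompasses B but not conversely.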

Economic plausibility, the third criterion, was too weak a basis for model selection, and confirmation of economic theories was an inadequate test of them. Model builders could not impose economic theories on their models and then claim that the models lend support to those same theories. Nor were goodness of fit and "correct" parameter signs in themselves sufficient criteria for model validity.

It was generally agreed that the meeting had been productive, both for the content of the papers presented and for clarifying a research agenda for model evaluation. There was a clear need for further research funding on appropriate methods of model evaluation, which should be firmly focussed on the direct comparison of at least two working econometric models.