A paper by Fritz et al (2012), published last week in the Proceedings of the National Academy of Sciences, shows that professional musicians are unable to distinguish the supposedly superior tone of a violin built by Stradivari (which would cost up to $4 million) from that of a new American instrument (a couple of thousand dollars).
Twenty-one violinists aged 20 to 65 who had been playing the violin for 15 to 61 years were asked to test six instruments (two by Stradivari and one by Guarneri del Gesu dating from the first half of the 18th century, and three contemporary violins). Each violinist could play the six instruments in pairs, each pairing an “old” violin with a “new” one, and was eventually asked to choose the one “she or he would most like to take home”, as well as the “best” and the “worst”.
Conditions were set so that the violins could not be identified. The results are as follows. One of the two instruments by Stradivari was consistently chosen less often than any of the three new ones. In the other old-new pairings, the two violins were chosen with similar frequencies, but one of the three new ones was chosen as “best” more often than any other, new or old. And just eight of the 21 violinists chose an old violin “to take home”.
Is this a sign of mishearing, or of paying too much for instruments that are not worth their price? The results are reminiscent of another musical (in this case, natural) experiment I studied along with Jan van Ours (Ginsburgh and van Ours 2003). In that paper, we considered the final ranking of the 12 finalists (out of some 85 candidates) of a piano competition that ranks among the top five in the world, and whose judges are carefully selected top international soloists.
At each stage of the competition, including the finals, candidates appear in an order drawn at random before the contest starts (the order of those who remain in the competition is the same through all the stages). Though the random order is meant to introduce ex ante fairness, it results in ex post unfairness. Indeed, the final ranking of the 12 pianists is correlated with this order: those who appear during the first six evenings of the finals and those who perform first in each evening have a lower probability of being ranked among the first. Yet a better final rank makes for a better later career. Experts ostensibly decide on quality, but it seems talent hardly matters.
Wine, skating, and films
Likewise, Ashenfelter and Quandt (1999) show that there is a lack of concordance among wine judges. Hodgson’s (2008) result is even stronger, since he finds that only about 10% of the judges are able to replicate their score within a single wine medal group.
In figure skating, evaluation depends on the incentives and the monitoring faced by judges. Lee (2004) points out that they face an “outlier aversion bias” because they may be excluded from further competitions if they cannot explain why their rating is at odds with the mean of other judges. Therefore, they manipulate their ratings to achieve “a targeted level of agreement with the other judges,” which essentially implies that their judgement is based on previous achievements, and not on the one that is unfolding, since they have to cast their votes a couple of seconds after the performance of each skater.
Films and novels are marked by similar judging problems. Singin' in the Rain (1952), Vertigo (1958), North by Northwest (1959), Some Like It Hot (1959), Psycho (1960), 2001: A Space Odyssey (1968) and many other movies that appear in the largest number of so-called “100 best movies” lists had not even been nominated by the Academy of Motion Picture Arts and Sciences, which bestows the Oscars. In 1959, Ben-Hur was given the Oscar for best picture, while the Academy ignored North by Northwest and Some Like It Hot (Ginsburgh and Weyers 2011). In the 1960s, several publishers rejected Toole’s A Confederacy of Dunces, including Simon & Schuster, whose referee claimed that the manuscript “isn’t really about anything”. Toole committed suicide in 1969. The book was finally published in 1980 and was awarded the Pulitzer Prize for best novel.
Properly developed statistical procedures, based on the rating of explicit characteristics, perform much better in diagnosing health conditions than clinical methods, which rest on implicit mental processes (Dawes et al 1991 and Meehl 1996).
The ears of musicians are obviously not perfect, nor are the taste buds of wine tasters, the eyes of those who judge sports competitions or movies, the reading abilities of those who select books, or the diagnoses of those who are supposed to take care of our health.
Though financial expertise has more dramatic consequences, why should the Standard & Poor's, Moody's and Fitches of this world, who gave triple-A ratings to securities that turned out to be junk, be any better?
Ashenfelter, O and R Quandt (1999), “Analyzing wine tasting statistically”, Chance, 12:16-20.
Dawes, Robyn, David Faust and Paul Meehl (1991), “Clinical versus Actuarial Judgment”, Science, 243:1668-1673.
Fritz, C, J Curtin, J Poitevineau, P Morrel-Samuels, and F-C Tao (2012), “Player preferences among new and old violins”, Proceedings of the National Academy of Sciences.
Ginsburgh, V and J van Ours (2003), “Expert opinion and compensation: evidence from a musical competition”, American Economic Review, 93:289-298.
Ginsburgh, V and S Weyers (2011), “De l'(in)efficacité des concours et des prix”, in D Lories and R Dekoninck (eds.), L'art en valeurs, Paris: L'Harmattan.
Hodgson, R (2008), “An examination of judge reliability at a major US wine competition”, Journal of Wine Economics, 3:105-113.
Lee, J (2004), “Outlier aversion in evaluating performance: Evidence from figure skating”, IZA Discussion Paper 1257.
Meehl, P (1954), Clinical versus Statistical Prediction. Northvale, New Jersey and London, Jason Aronson Inc., 1996 (first published in 1954).