One of the most frequently asked questions when people review predictive models of default is this: “Aren’t those explanatory variables correlated, and doesn’t this create problems with multi-collinearity?” Almost every default model has correlated explanatory variables, so the question comes up often. Since I am not an econometrician (although many of my colleagues are), this post answers it by collecting quotes on the issue from nine popular econometrics texts (it was a three-day weekend in the USA).
We consulted the following popular (and intelligent) econometrics texts. In the interests of full disclosure, we receive no commissions if any readers decide to buy them!
- Campbell, John Y., Andrew W. Lo, and A. Craig MacKinlay, The Econometrics of Financial Markets, Princeton University Press, 1997.
- Goldberger, Arthur S. A Course in Econometrics, Harvard University Press, 1991.
- Hamilton, James D. Time Series Analysis, Princeton University Press, 1994.
- Johnston, J. Econometric Methods, McGraw-Hill, 1972.
- Maddala, G. S. Introduction to Econometrics, third edition, John Wiley & Sons, 2005.
- Stock, James H. and Mark W. Watson, Introduction to Econometrics, second edition, Pearson/Addison Wesley, 2007.
- Studenmund, A. H. Using Econometrics: A Practical Guide, Addison-Wesley Educational Publishers, 1997.
- Theil, Henri. Principles of Econometrics, John Wiley & Sons, 1971.
- Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel Data, The MIT Press, 2002.
We’ve selected the following quotes on multi-collinearity from the texts above:
From Goldberger, page 246:
- “The least squares estimate is still the minimum variance linear unbiased estimator, its standard error is still correct and the conventional confidence interval and hypothesis tests are still valid.”
- “So the problem of multicollinearity when estimating a conditional expectation function in a multivariate population is quite parallel to the problem of small sample size when estimating the expectation of a univariate population. But researchers faced with the latter problem do not usually dramatize the situation, as some appear to do when faced with multi-collinearity”
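Goldberger's parallel can be made concrete with the standard textbook expression for the sampling variance of an OLS slope under homoskedastic errors. The notation below is an assumption chosen for illustration, not a quotation from Goldberger: $\sigma^2$ is the error variance, $n$ is the sample size, and $R_j^2$ is the R-squared from regressing explanatory variable $x_j$ on the other explanatory variables.

```latex
\operatorname{Var}\!\left(\hat{\beta}_j \mid X\right)
  = \frac{\sigma^2}{\left(1 - R_j^2\right)\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
```

Correlation among the regressors raises $R_j^2$ and shrinks the denominator. A small sample shrinks the same denominator through the sum over the $n$ observations. Either way the result is a larger variance, not a biased estimate.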
From Johnston, page 164:
- “If multicollinearity proves serious in the sense that estimated parameters have an unsatisfactorily low degree of precision, we are in the statistical position of not being able to make bricks without straw. The remedy lies essentially in the acquisition, if possible, of new data or information, which will break the multicollinearity deadlock.”
From Maddala, page 267:
- “…Multicollinearity is one of the most misunderstood problems in multiple regression…there have been several measures for multicollinearity suggested in the literature (variance-inflation factors VIF, condition numbers, etc.). This chapter argues that all these are useless and misleading. They all depend on the correlation structure of the explanatory variables only…high inter-correlations among the explanatory variables are neither necessary nor sufficient to cause the multicollinearity problem. The best indicators of the problem are the t-ratios of the individual coefficients. This chapter also discusses the solution offered for the multicollinearity problem, such as ridge regression, principal component regression, dropping of variables, and so on, and shows they are ad hoc and do not help. The only solutions are to get more data or to seek prior information.”
From Stock and Watson, page 249:
- “Imperfect multicollinearity means that two or more of the regressors are highly correlated, in the sense that there is a linear function of the regressors that is highly correlated with another regressor. Imperfect multicollinearity does not pose any problems for the theory of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the various regressors when these regressors are potentially correlated.”
From Studenmund, page 264:
- “The major consequences of multicollinearity are
  - Estimates will remain unbiased…
  - The variances and standard errors of the estimates will increase…
  - The computed t-scores will fall…
  - Estimates will become very sensitive to changes in specification…
  - The overall fit of the equation and the estimation of non-multicollinear variables will be largely unaffected…”
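A small simulation can illustrate the first two consequences in Studenmund's list. This is a minimal sketch under assumed settings, none of which come from the text: two regressors, true coefficients of 1.0, standard normal errors, and a correlation `rho` that we choose. The function names are hypothetical, and only the Python standard library is used.

```python
import random
import statistics


def ols_two_regressors(x1, x2, y):
    """OLS slopes of y on x1 and x2 (intercept included), via demeaned normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12  # approaches zero as the regressors become collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(rho, n=200, reps=300, seed=0):
    """Return the mean and standard deviation of the estimated first slope across samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x1, x2, y = [], [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            a = z1
            b = rho * z1 + (1 - rho**2) ** 0.5 * z2  # correlation with a is rho
            x1.append(a)
            x2.append(b)
            y.append(1.0 * a + 1.0 * b + rng.gauss(0, 1))  # assumed true slopes: 1.0 and 1.0
        estimates.append(ols_two_regressors(x1, x2, y)[0])
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for rho in (0.0, 0.95):
        mean_b1, sd_b1 = simulate(rho)
        print(f"rho={rho:4.2f}  mean of b1={mean_b1:.3f}  std dev of b1={sd_b1:.3f}")
```

Under these assumptions the average estimate should stay close to the true value of 1.0 at both correlation levels, while the spread of the estimates should be roughly three times larger at a correlation of 0.95, since the inflation factor is 1/sqrt(1 - 0.95²), or about 3.2.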
From Theil, page 154:
- “The situation of multi-collinearity (both extreme and near-extreme) implies for the analyst that he is asking more than his data are able to answer.”
Our experience in default modeling at Kamakura is that the amount of data overwhelms the number of potential explanatory variables, so multi-collinearity is almost never a problem. We have more than 2 million observations and more than 2,000 defaults in our listed company model. For mortgage models, data is available on more than 70,000,000 mortgages in the United States, and we have no difficulty determining which variables are statistically significant.
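To see why sample size matters so much, the sketch below evaluates the approximate standard error of an OLS slope in a two-regressor model, which is sigma / sqrt(n * (1 - rho²)) for a regressor with unit variance. Everything in it is an illustrative assumption: the correlation of 0.95, the unit error variance, and the use of OLS with independent observations rather than the logistic models and panel data typical of default modeling. None of the outputs are Kamakura estimates; only the observation count of 2 million echoes the figure quoted above.

```python
import math


def approx_slope_se(n, rho, sigma=1.0, sd_x=1.0):
    """Approximate standard error of an OLS slope with two regressors.

    n     : number of observations
    rho   : correlation between the two regressors (hypothetical value)
    sigma : standard deviation of the error term (assumed)
    sd_x  : standard deviation of the regressor of interest (assumed)
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    return sigma / (sd_x * math.sqrt(n * (1.0 - rho**2)))


if __name__ == "__main__":
    for n in (200, 20_000, 2_000_000):
        for rho in (0.0, 0.95):
            se = approx_slope_se(n, rho)
            print(f"n={n:>9,}  rho={rho:4.2f}  approx se={se:.5f}")
```

In this stylized setting, a correlation of 0.95 inflates the standard error by a factor of about 3.2 at any sample size, while moving from 200 to 2 million observations shrinks it by a factor of 100.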
Comments and questions are welcome at firstname.lastname@example.org. I reserve the right to involve my colleagues Professor Robert A. Jarrow, Professor Jens Hilscher, and Sean Klein, senior research fellow, in answering any hard questions. For real-time risk management commentary, follow Kamakura on Twitter at www.twitter.com/dvandeventer.
Donald R. van Deventer
Honolulu, May 26, 2009