One of the most frequently asked questions when people review predictive models of default is this: “Aren’t those explanatory variables correlated, and doesn’t this create problems with multicollinearity?” Almost every default model has correlated explanatory variables, so the question deserves a careful answer. This post collects quotes on the issue from 12 popular econometrics texts, together with comments from Prof. Robert Jarrow and a bank regulatory economist.
We consulted the following texts:
- Angrist, Joshua D. and Jorn-Steffen Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton University Press, Princeton, 2009.
- Campbell, John Y., Andrew W. Lo, and A. Craig MacKinlay, The Econometrics of Financial Markets, Princeton University Press, 1997.
- Goldberger, Arthur S. A Course in Econometrics, Harvard University Press, 1991.
- Hamilton, James D. Time Series Analysis, Princeton University Press, 1994.
- Hansen, Bruce E. Econometrics, University of Wisconsin, January 15, 2015.
- Hastie, Trevor, Robert Tibshirani and Jerome Friedman, Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, second edition, tenth printing, 2013.
- Johnston, J. Econometric Methods, McGraw-Hill, 1972.
- Maddala, G. S. Introduction to Econometrics, third edition, John Wiley & Sons, 2005.
- Stock, James H. and Mark W. Watson, Introduction to Econometrics, third edition, Pearson/Addison Wesley, 2015.
- Studenmund, A. H. Using Econometrics: A Practical Guide, Addison-Wesley Educational Publishers, 1997.
- Theil, Henri. Principles of Econometrics, John Wiley & Sons, 1971.
- Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel Data, The MIT Press, 2002.
We’ve selected the following quotes on multicollinearity from the texts above:
From Angrist and Pischke:
- For the reasons given in the quotes below, the authors of this recent and respected book do not discuss multicollinearity, and the term does not appear in their index.
From Goldberger, page 246:
- “The least squares estimate is still the minimum variance linear unbiased estimator, its standard error is still correct and the conventional confidence interval and hypothesis tests are still valid.”
- “So the problem of multicollinearity when estimating a conditional expectation function in a multivariate population is quite parallel to the problem of small sample size when estimating the expectation of a univariate population. But researchers faced with the latter problem do not usually dramatize the situation, as some appear to do when faced with multicollinearity.”
From Hansen, pages 105-107:
- “The more relevant situation is near multicollinearity, which is often called ‘multicollinearity’ for brevity. This is the situation when the X’X matrix is near singular, when the columns of X are close to linearly dependent. This definition is not precise, because we have not said what it means for a matrix to be ‘near singular.’ This is one difficulty with the definition and interpretation of multicollinearity. One potential complication of near singularity of matrices is that the numerical reliability of the calculations may be reduced. In practice this is rarely an important concern, except when the number of regressors is very large.”
- “A more relevant implication of near multicollinearity is that individual coefficient estimates will be imprecise…Thus the more ‘collinear’ are the regressors, the worse the precision of the individual coefficient estimates. What is happening is that when the regressors are highly dependent, it is statistically difficult to disentangle the impact of β1 from that of β2. As a consequence, the precision of individual estimates are reduced. The imprecision, however, will be reflected by large standard errors, so there is no distortion in inference.”
- “Some earlier textbooks overemphasized a concern about multicollinearity. A very amusing parody of these texts appeared in Chapter 23.3 of Goldberger’s A Course in Econometrics (1991), which is reprinted below.”
- See page 107 of Hansen for the Goldberger parody of concerns about multicollinearity.
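Hansen's point that near multicollinearity reduces precision without distorting inference can be checked with a small simulation. The sketch below is illustrative only: the model y = 1 + 2·x1 + 3·x2 + e, the sample size, the correlation levels and the function name `simulate_ols` are all assumptions chosen for the example, not taken from any of the texts.

```python
import numpy as np

def simulate_ols(rho, n=200, n_sims=2000, seed=0):
    """Monte Carlo for y = 1 + 2*x1 + 3*x2 + e with corr(x1, x2) = rho."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 2.0, 3.0])           # hypothetical true coefficients
    cov = np.array([[1.0, rho], [rho, 1.0]])   # regressor covariance matrix
    estimates = np.empty((n_sims, 3))
    reported_se = np.empty((n_sims, 3))
    for s in range(n_sims):
        x = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X = np.column_stack([np.ones(n), x])
        y = X @ beta + rng.standard_normal(n)
        xtx_inv = np.linalg.inv(X.T @ X)
        b = xtx_inv @ X.T @ y                  # OLS estimate
        resid = y - X @ b
        sigma2 = resid @ resid / (n - 3)       # unbiased error variance estimate
        estimates[s] = b
        reported_se[s] = np.sqrt(sigma2 * np.diag(xtx_inv))
    return {
        "mean_b1": estimates[:, 1].mean(),                   # compare with the true value 2
        "empirical_sd_b1": estimates[:, 1].std(ddof=1),      # actual sampling variability
        "mean_reported_se_b1": reported_se[:, 1].mean(),     # what the regression output reports
    }

low = simulate_ols(rho=0.0)
high = simulate_ols(rho=0.95)
print("rho = 0.00:", low)
print("rho = 0.95:", high)
```

Under these assumptions, the average estimate of the x1 coefficient should stay close to its true value at both correlation levels, the sampling spread should be markedly wider when the regressors are highly correlated, and the conventionally reported standard error should track that wider spread.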
From Hastie, Tibshirani and Friedman:
- The term “multicollinearity” appears neither in the index nor in the text. The entire text of Chapter 3 describes the fact that correlation among explanatory variables is normal and discusses the relative efficiency of alternative linear regression techniques. Page 124 gives a specific example of highly correlated variables using a biomedical regression example.
From Johnston, page 164:
- “If multicollinearity proves serious in the sense that estimated parameters have an unsatisfactorily low degree of precision, we are in the statistical position of not being able to make bricks without straw. The remedy lies essentially in the acquisition, if possible, of new data or information, which will break the multicollinearity deadlock.”
From Maddala, page 267:
- “…Multicollinearity is one of the most misunderstood problems in multiple regression…there have been several measures for multicollinearity suggested in the literature (variance-inflation factors VIF, condition numbers, etc.). This chapter argues that all these are useless and misleading. They all depend on the correlation structure of the explanatory variables only…high inter-correlations among the explanatory variables are neither necessary nor sufficient to cause the multicollinearity problem. The best indicators of the problem are the t-ratios of the individual coefficients. This chapter also discusses the solution offered for the multicollinearity problem, such as ridge regression, principal component regression, dropping of variables, and so on, and shows they are ad hoc and do not help. The only solutions are to get more data or to seek prior information.”
From Stock and Watson, pages 205-206:
- “Imperfect multicollinearity means that two or more of the regressors are highly correlated, in the sense that there is a linear function of the regressors that is highly correlated with another regressor. Imperfect multicollinearity does not pose any problems for the theory of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the various regressors when these regressors are potentially correlated.”
- “Imperfect multicollinearity is not necessarily an error, but rather just a feature of OLS (ordinary least squares), your data, and the question you are trying to answer.”
From Studenmund, page 264:
- “The major consequences of multicollinearity are
1. Estimates will remain unbiased…
2. The variances and standard errors of the estimates will increase…
3. The computed t-scores will fall…
4. Estimates will become very sensitive to changes in specification…
5. The overall fit of the equation and the estimation of non-multicollinear variables will be largely unaffected…”
From Theil, page 154:
- “The situation of multicollinearity (both extreme and near-extreme) implies for the analyst that he is asking more than his data are able to answer.”
Professor Robert Jarrow explains the issue of correlated explanatory variables this way:
“In order to achieve the highest levels of accuracy a model must combine a large amount of information, combining many input variables. In such a setting it is possible that some of the input variables will be correlated; extreme correlation is often referred to as multicollinearity. Nevertheless, even if variables are highly correlated, it is very often still the case that each of them adds information and explanatory power.”
“Multicollinearity is a well-studied issue in econometrics (see Johnston and Maddala for background). The term ‘multicollinearity’ refers to the condition in a regression analysis when a set of independent variables is highly correlated among themselves. This condition only implies that it will be hard to distinguish the individual effect of any of these variables, but their joint inclusion is not a concern. This is not an econometrics ‘problem’ with the regression analysis. Indeed, as long as the independent variables are not perfectly correlated (so that the X’X matrix is still invertible), the estimated coefficients are still BLUE (best linear unbiased estimates).”
“The only concern with multicollinearity in a regression is that the standard errors of the independent variables in the set of correlated variables will be large, so that the independent variables may not appear to be significant, when, in fact, they are. This can lead to the incorrect exclusion of some variables based on t-tests. However, if the set of variables help the fit of the regression (have a significant F-statistic for their inclusion), they should be included.”
“The idea is best explained by considering the simple regression y = a+bx+cz+e. Suppose x and z are highly correlated, but not perfectly correlated. Then, x and z span a 2-dimensional space. Both dimensions are important in explaining y. The inclusion of both x and z in the regression is important. Excluding one will only give a 1-dimensional space, and the explanatory power of the regression will be significantly less. The inclusion of x and z is necessary for the best model, for forecasting purposes. If x and z are highly correlated, the only issue is that both coefficients b and c are not accurately estimated. However, bx+cz, the linear combination’s influence on y is not adversely influenced. In our context, the final default probabilities are simply unaffected by correlated explanatory variables (unlike the individual coefficients).”
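Prof. Jarrow's example y = a+bx+cz+e can be illustrated numerically. The following is a minimal sketch under stated assumptions: the true values a = 0.5, b = 3, c = -3, the correlation of 0.98 between x and z, the sample size and the helper name `fit` are all hypothetical choices for illustration. It compares the sampling spread of the estimate of b with the spread of the fitted value a + bx + cz at a typical point where x and z move together (x = z = 1), and compares the fit with and without z. How much fit is lost by dropping z depends on the true coefficients; the opposite signs assumed here make the loss large.

```python
import numpy as np

def fit(X, y):
    """Return OLS coefficients and R-squared; X must include a constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

rng = np.random.default_rng(1)
n, n_sims, rho = 500, 1000, 0.98            # hypothetical sample size and correlation
a, b, c = 0.5, 3.0, -3.0                    # hypothetical true coefficients
cov = np.array([[1.0, rho], [rho, 1.0]])
point = np.array([1.0, 1.0, 1.0])           # constant, x = 1, z = 1

b_hat = np.empty(n_sims)       # estimate of b in each simulated sample
combo = np.empty(n_sims)       # fitted a + b*x + c*z at the chosen point
r2_both = np.empty(n_sims)     # R-squared with x and z included
r2_x_only = np.empty(n_sims)   # R-squared after dropping z
for s in range(n_sims):
    xz = rng.multivariate_normal(np.zeros(2), cov, size=n)
    X = np.column_stack([np.ones(n), xz])
    y = X @ np.array([a, b, c]) + rng.standard_normal(n)
    coef, r2_both[s] = fit(X, y)
    b_hat[s] = coef[1]
    combo[s] = point @ coef
    _, r2_x_only[s] = fit(X[:, :2], y)      # regression on the constant and x only

print("spread of estimated b:       ", b_hat.std(ddof=1))
print("spread of fitted combination:", combo.std(ddof=1))
print("mean R-squared, x and z:     ", r2_both.mean())
print("mean R-squared, x only:      ", r2_x_only.mean())
```

In this setup the individual coefficient varies far more from sample to sample than the fitted combination does, which is the sense in which the model output is unaffected even though b and c are individually imprecise.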
In a recent note, we decried the common error of dropping explanatory variables because an analyst, a supervisor, or a regulator judged that the signs were “not intuitive.” The fact that the signs of variables differ from their stand-alone sign as an explanatory variable is a common phenomenon when closely related (but different) variables are included as joint explanatory variables in a regression. Prof. Jarrow’s comment explains why neither variable should be dropped. A senior financial economist working for a major bank regulatory body adds this explanation of why “intuition” is an unreliable guide for variable exclusion:
“It’s no easy task to develop ‘intuitive signs’ for 37 variables used in a multiple regression. However, I would argue even this understates the degree of difficulty in determining the intuitive signs for variables in multivariate regression. Consider the regression setup from Prof. Jarrow’s comment above, with a simple regression with two highly correlated explanatory variables (but not so highly correlated that we have a ‘perfect multicollinearity’ problem). Suppose we’re primarily interested in the estimate for b which is the estimated coefficient on x. According to the Frisch-Waugh-Lovell (FWL) Theorem, the estimated coefficient on b can be calculated by regressing the residuals from a regression of y on z on the residuals from a regression of x on z. In other words, the estimated value of b reflects the correlation of the variation in y (around its mean) that is not explained by z with the variation in x (around its mean) that is not explained by z.”
“Even with just two variables, it would be difficult to develop an ‘intuition’ about the sign on b. If the regression instead contains 37 variables, many of which are correlated with x, I suggest it’s essentially impossible to develop any intuition on whether the estimated value for b should be positive or negative. Even if we’re very confident the estimated value for b is positive in a simple regression of y on x, the estimated value for b could be negative in a properly specified multiple regression with correlated explanatory variables.”
“I should note the exact results of the FWL Theorem are based on the linear algebra of least squares regression (with multiple non-orthogonal explanatory variables). However, even for more complicated models, it should still generally be true that the estimated coefficient on any explanatory variable x is net of the impact of all other explanatory variables on y and x.”
“For the FWL Theorem, see, for example, Bruce E. Hansen, Econometrics, 2014, pages 70-71.”
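The Frisch-Waugh-Lovell result described above, and the possibility of a sign change between the simple and the multiple regression, can be verified on simulated data. This is a sketch only: the data-generating process (true b = -0.5, true c = 2, correlation of about 0.9 between x and z) and the helper name `ols` are assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 1000
z = rng.standard_normal(n)
x = 0.9 * z + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)   # corr(x, z) about 0.9
y = 1.0 - 0.5 * x + 2.0 * z + rng.standard_normal(n)         # hypothetical true b = -0.5

ones = np.ones(n)
Z = np.column_stack([ones, z])

# Multiple regression of y on a constant, x and z: coefficient on x
b_full = ols(np.column_stack([ones, x, z]), y)[1]

# FWL: remove the part of y and of x explained by (constant, z),
# then regress the y residuals on the x residuals
y_res = y - Z @ ols(Z, y)
x_res = x - Z @ ols(Z, x)
b_fwl = (x_res @ y_res) / (x_res @ x_res)

# Simple regression of y on a constant and x alone
b_simple = ols(np.column_stack([ones, x]), y)[1]

print("multiple regression b:", b_full)
print("FWL residual-on-residual b:", b_fwl)
print("simple regression b:", b_simple)
```

The first two numbers agree to floating-point precision, as the theorem requires. With these assumed parameters the simple-regression coefficient is positive while the multiple-regression coefficient is negative, because x alone picks up the effect of the omitted, correlated z.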
The Kamakura Corporation experience in default modeling is that the amount of data overwhelms the number of potential explanatory variables, so multicollinearity is almost never a problem. We have more than 2.5 million observations and more than 2,500 defaults in the Kamakura Risk Information Services public firm model and more than 10 million observations in the KRIS non-public firm model. For mortgage models, there are more than 70,000,000 mortgages for which data is available in the United States, and there is no problem in readily determining which variables are statistically significant and economically meaningful.
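The effect of sample size can be made concrete with the textbook variance formula for a linear regression with two regressors: the standard error of a slope is approximately σ / (s·√(n(1 − ρ²))), where σ is the error standard deviation, s the standard deviation of the regressor, ρ the correlation between the two regressors, and n the number of observations. The sketch below applies that formula. It is a linear-regression illustration with hypothetical values of σ, s and ρ, not a description of the KRIS models; the function name `approx_se` is likewise an assumption.

```python
import math

def approx_se(n, rho, sigma=1.0, sd_x=1.0):
    """Approximate OLS standard error of one slope in a two-regressor linear model.

    n: number of observations; rho: correlation between the two regressors;
    sigma: error standard deviation; sd_x: standard deviation of the regressor.
    """
    return sigma / (sd_x * math.sqrt(n * (1.0 - rho**2)))

for n in (250, 2_500, 2_500_000):
    uncorrelated = approx_se(n, rho=0.0)
    highly_correlated = approx_se(n, rho=0.99)
    print(f"n = {n:>9,}: se at rho 0.00 = {uncorrelated:.5f}, se at rho 0.99 = {highly_correlated:.5f}")
```

Under these assumptions, a correlation of 0.99 inflates the standard error by a fixed factor of about seven at any sample size, but with millions of observations the inflated standard error is still far smaller than the standard error from a few hundred observations with uncorrelated regressors. This mirrors Goldberger's parallel between multicollinearity and small sample size.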
Donald R. van Deventer
Honolulu, May 26, 2009, updated April 16, 2015