Multicollinearity
What multicollinearity is. Let H = the set of all the X (independent) variables. Let Gk = the set of all the X variables except Xk. The formula for the standard error of bk is then
s_{b_k} = \sqrt{\frac{1 - R^2_{YH}}{(1 - R^2_{X_k G_k})(N - K - 1)}} \cdot \frac{s_Y}{s_{X_k}}
        = \sqrt{\frac{1 - R^2_{YH}}{\mathrm{Tol}_k \, (N - K - 1)}} \cdot \frac{s_Y}{s_{X_k}}
        = \sqrt{\mathrm{VIF}_k \cdot \frac{1 - R^2_{YH}}{N - K - 1}} \cdot \frac{s_Y}{s_{X_k}}
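As a quick numerical check (a sketch using simulated data, not part of the original notes), the formula above can be verified against the conventional OLS standard error, sqrt(sigma^2 * [(X'X)^-1]_kk). With sample standard deviations (ddof=1), the two agree exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 2  # sample size and number of predictors (illustrative values)

# Two correlated predictors so that multicollinearity is visible
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, x2])

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X includes an intercept)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Pieces of the textbook formula, for b1 (the coefficient on x1)
R2_YH = r_squared(y, X)                                      # full-model R^2
R2_XkGk = r_squared(x1, np.column_stack([np.ones(N), x2]))   # x1 on the other IVs
tol = 1 - R2_XkGk            # Tolerance of x1
vif = 1 / tol                # Variance Inflation Factor of x1
s_y, s_x1 = y.std(ddof=1), x1.std(ddof=1)

se_formula = np.sqrt((1 - R2_YH) / (tol * (N - K - 1))) * s_y / s_x1

# Conventional OLS standard error: sqrt(sigma^2 * [(X'X)^-1]_kk)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sigma2 = resid @ resid / (N - K - 1)
se_ols = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

print(tol, vif, se_formula, se_ols)  # the two standard errors agree
```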
Questions: What happens to the standard errors as R2YH increases? As N increases? As K increases? As the multiple correlation between one IV and the others increases?
From the above formulas, it is apparent that
The bigger R2YH is, the smaller the standard error will be.
The bigger R2XkGk is (i.e. the more highly correlated Xk is with the other IVs in the model), the bigger the standard error will be. Indeed, if Xk is perfectly correlated with the other IVs, the standard error becomes infinite. This is referred to as the problem of multicollinearity.
The problem is that, as the Xs become more highly correlated, it becomes more and more difficult to determine which X is actually producing the effect on Y.
Also, 1 - R2XkGk is referred to as the Tolerance of Xk. A tolerance close to 1 means there is little multicollinearity, whereas a value close to 0 suggests that multicollinearity may be a threat. The reciprocal of the tolerance is known as the Variance Inflation Factor (VIF). The VIF shows us how much the variance of the coefficient estimate is being inflated by multicollinearity; the standard error is inflated by the square root of the VIF. For example, if the VIF for a variable were 9, its standard error would be three times as large (sqrt(9) = 3) as it would be if its VIF were 1. In such a case, the coefficient would have to be 3 times as large to be statistically significant.
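A small illustrative sketch (not from the original notes) makes the inflation concrete. With only two IVs, R2XkGk is simply the squared correlation r^2 between them, so Tol = 1 - r^2 and VIF = 1/(1 - r^2); the standard error is multiplied by sqrt(VIF):

```python
import math

# How the SE multiplier sqrt(VIF) grows as the correlation between
# two predictors increases (two-IV case: R^2_XkGk = r^2)
for r in (0.0, 0.5, 0.8, 0.9, 0.95, 0.99):
    tol = 1 - r**2          # Tolerance
    vif = 1 / tol           # Variance Inflation Factor
    print(f"r={r:.2f}  Tol={tol:.4f}  VIF={vif:7.2f}  "
          f"SE inflated by {math.sqrt(vif):.2f}x")
```

Note how the inflation is mild until the correlation is quite high, then explodes as r approaches 1, which matches the earlier point that a perfectly correlated Xk has an infinite standard error.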
Larger sample sizes decrease standard errors (because the denominator gets bigger). This reflects the fact that larger samples will produce more precise estimates.