DATA PROBLEMS: Multicollinearity

Multicollinearity is one of the most widely taught of the pathological diseases of econometrics. It is also one of the most frequently misunderstood, I believe, because on the surface it is conceptually a very simple idea. When we look a bit further, the symptoms we ascribe to multicollinearity may be the result of something else, and the usual diagnostics may mislead us. When you are done reading this section of the notes, go to "Explorations in Multicollinearity."

We begin with the usual model y = xb + u.  y and u are nx1, x is nxk, and b is kx1. The error term is well behaved, and there is no correlation between the independent variables and the error term.  If l1x1i + l2x2i + ... + lkxki = vi, where vi is a "stochastic" term with finite mean and variance, and not all of the li are zero, then we have a problem of multicollinearity. That is, there is some linear dependence, albeit not exact, between the columns of the design matrix. This definition is already problematic. The classical assumption of regression analysis is that the columns of the design matrix are linearly independent: they span a vector space of dimension k, and the matrix x'x has k non-zero roots. The model and the experiment are designed so that the independent variables have separate and independent effects on the dependent variable. With economic data the experiment usually cannot be reproduced or redesigned, in spite of the common assumption that x is fixed in repeated samples. The data are not always well behaved.

What if the linear dependence, while not exact, is close? What if one of the roots is very small and another very large? What if one of the explanatory variables doesn't have much variation in it? In the sense of linear algebra the columns of x will still span a vector space of dimension k. That is, we can estimate the parameter vector b, but not very precisely.
Our problem is a sample phenomenon. The sample may not be rich enough to allow for all the effects we believe to exist.
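
To see the definition in action, here is a small numpy sketch (my own illustration, not part of the original notes): the second slope variable is built to be almost a multiple of the first, and the roots of x'x are then inspected. The sample size, coefficient, and noise level are all arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(scale=0.05, size=n)   # x2 is "almost" a multiple of x1
    X = np.column_stack([np.ones(n), x1, x2])

    roots = np.linalg.eigvalsh(X.T @ X)              # characteristic roots of x'x
    print("roots of x'x:", roots)
    print("ratio of largest to smallest root:", roots.max() / roots.min())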

Classic Multicollinearity and Imprecision of OLS


Suppose there are two explanatory variables, so that yi = b1x1i + b2x2i + ui, and that the two are nearly linearly dependent: x2i = lx1i + vi, where vi has a small variance. Then we can show

    var(b1) = s2 / [ (1 - r12^2) S11 ]   and   var(b2) = s2 / [ (1 - r12^2) S22 ]

where s2 is the variance of the error term, r12 is the sample correlation between x1 and x2, and Sjj is the sum of squared deviations of xj about its mean. As r12 approaches one, both variances grow without bound.

Clearly, the classic case of multicollinearity affects the precision of our estimator. Of course, in the event that x1 and x2 have an exact linear relationship the situation is even worse. The best that we could hope to do would be to estimate the single regression coefficient in yi = ax1i + ui, where a = b1 + lb2.
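
A quick numerical sketch of how the formula above behaves as r12 approaches one. The error variance, sample size, and noise levels are assumed values chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    s2 = 1.0                                       # assumed error variance
    x1 = rng.normal(size=n)

    for noise in (1.0, 0.3, 0.05):                 # shrinking noise means r12 closer to one
        x2 = x1 + rng.normal(scale=noise, size=n)  # x2 = l*x1 + v with l = 1
        r12 = np.corrcoef(x1, x2)[0, 1]
        S22 = np.sum((x2 - x2.mean()) ** 2)
        var_b2 = s2 / ((1 - r12 ** 2) * S22)       # the formula in the text
        print(f"r12 = {r12:.3f}   var(b2) = {var_b2:.5f}")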

Example
The model is
yt = b1 + b2x2t + b3x3t + b4x4t + ut
yt is import demand for France
x2t is a dummy variable for entry into the EEC in 1960
x3t is GDP
x4t is gross capital formation

The results, with standard errors in parentheses, are

yt = -5.92 + 2.1 x2t + .133 x3t + .55 x4t + et
     (1.27)  (.2)      (.006)     (.11)

All t statistics are significant.
Suppose we include an additional variable, consumption (x5t).

yt = -8.79 + 2.1 x2t - .021 x3t + .559 x4t + .235 x5t + et
     (1.38)  (.2)      (.051)     (.087)      (.077)

 
The inclusion of x5 has sopped up all the variation in y that had been previously explained by x3. The consumption variable is almost indistinguishable from GDP. The correlation between them is .99.
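
The French import data are not reproduced here, but the same pattern is easy to mimic with simulated data. In the sketch below, gdp and cons are invented stand-ins for x3t and x5t with a correlation near .99; the point is only that adding the near-duplicate regressor inflates the standard errors.

    import numpy as np

    def ols_se(X, y):
        # OLS coefficients and their standard errors
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        s2_hat = e @ e / (len(y) - X.shape[1])
        return b, np.sqrt(np.diag(s2_hat * np.linalg.inv(X.T @ X)))

    rng = np.random.default_rng(2)
    n = 40
    gdp = rng.normal(size=n)                      # invented stand-in for x3t
    cons = gdp + rng.normal(scale=0.1, size=n)    # invented stand-in for x5t, correlation near .99
    y = 1.0 + 0.5 * gdp + rng.normal(size=n)

    X_short = np.column_stack([np.ones(n), gdp])
    X_long = np.column_stack([np.ones(n), gdp, cons])
    print("standard errors without cons:", ols_se(X_short, y)[1])
    print("standard errors with cons:   ", ols_se(X_long, y)[1])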

Multicollinearity, Characteristic Vectors and Roots

Suppose y = xb + u. We can find the characteristic vectors of x'x and collect them as the columns of A, so that A'x'xA = L, where L is a diagonal matrix of the characteristic roots of x'x. Since the columns of A are orthonormal, A' = A-1 and we can write (x'x)-1 = AL-1A'. Now we usually write var(b) = s2(x'x)-1, with s2 the error variance, but we can also write it in terms of the characteristic roots and vectors: var(b) = s2AL-1A'.
Now write A = [ a1 a2 ... ak ], where aj is the kx1 column vector in the jth column of A. This allows us to rewrite the variance as

    var(b) = s2 [ a1a1'/l1 + a2a2'/l2 + ... + akak'/lk ]

Looking at a particular product of a pair of vectors we have

    a1a1'/l1

where a1 is the characteristic vector in the first column of A. Since x'xa1 = l1a1, it follows that a1'(x'x)-1a1 = 1/l1, so a1 may be thought of as picking off the piece of the variance associated with the first characteristic root. So for a particular coefficient in the regression model

    var(bj) = s2 [ aj1^2/l1 + aj2^2/l2 + ... + ajk^2/lk ]

If one of the ajl^2 is quite large relative to its characteristic root then we have a problem. Or, if one of the roots is quite small then we have a problem.
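
The decomposition can be checked numerically. The sketch below uses an arbitrary simulated design matrix and an assumed error variance; it verifies that s2 times the sum of ajl^2/ll reproduces the diagonal of s2(x'x)-1.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 3))                     # arbitrary design matrix
    s2 = 2.0                                         # assumed error variance

    lam, A = np.linalg.eigh(X.T @ X)                 # characteristic roots and vectors
    var_direct = s2 * np.diag(np.linalg.inv(X.T @ X))
    var_decomp = s2 * np.sum(A ** 2 / lam, axis=1)   # sum over l of ajl^2 / ll
    print(var_direct)
    print(var_decomp)                                # the two vectors agree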

Using characteristic roots and vectors there is some simple geometry to the problem.
Assume that the number of observations is n = 4 and the number of unknown location parameters is k = 2. The scatter of observations on the independent variables is shown in figure 1.  Each * corresponds to the plot of an observation on x1 and x2,

Figure 1

where e1 and e2 are the basis vectors for the space spanned by the two 4x1 vectors of observations on the independent variables. The characteristic vectors of x'x form a new basis which has the following relation to the original (see figure 2).

Figure 2

For this two variable model we have

    A = [ a11  a12 ]
        [ a21  a22 ]

where a11 = cos(d) is the cosine of the angle between a1 and e1, and a21 = cos(y) is the cosine of the angle between a1 and e2 (recall that the cosine of an angle in a right triangle is adjacent over hypotenuse). In figure 2, cos(d) = a11 is large relative to a12 and cos(y) = a21 is small relative to a22.
Now A'x'xA = L and x'xai = liai, or ai'x'xai = li. Making the transformation zi = xai we can rewrite this as zi'zi = li. So li can be thought of as the variation in the data along the ith axis, since zi'zi is the sum of squares of the projection of the n observations onto ai. In our diagram the data along a1 is quite spread out, but quite compressed along a2, so l1 >> l2. Therefore, the numerator and denominator in a11^2/l1 are of the same order of magnitude. However, a22^2 is large and l2 is small, so the large size of a22^2/l2 implies that b2 is measured with less precision.

Example:

We can give some numerical flesh to the exposition. In our example we are given the following design matrix and observations on the dependent variable:



Working in the real world of empirical analysis this would be all you would know about the data generating process. Since this is an experiment designed to show you the effects of multicollinearity, the following information is also provided



This should enable you to compute the four realizations of the disturbance vector. Can you do it?
Given the data, we can calculate the least squares coefficient estimates as b = (x'x)-1x'y. Now calculate an estimate of the error variance, e'e/(n - k), where e is the vector of least squares residuals.



Now we have enough information to compute the variance-covariance matrix for the coefficient vector and the observed t-statistics.

The critical t for a two-tailed test at the 10% level with two degrees of freedom is 2.92. No coefficient is statistically different from zero, in spite of what we know to be the truth.
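
(As an aside, the critical value quoted above can be reproduced with scipy; this snippet is an addition, not part of the original notes.)

    from scipy import stats
    print(round(stats.t.ppf(0.95, df=2), 3))   # two-tailed 10% critical value with 2 df, about 2.92
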
Recall that the characteristic roots of a matrix of full rank are all nonzero. The roots for the matrix x'x are 45.642 and 1.358. While neither is zero, the larger is 33.6 times the size of the smaller. On the basis of our earlier observations about the variance of the estimator, this is potentially a problem. The simple correlation between the two independent variables is .942. The characteristic vectors corresponding to the two roots, together with the four observations on the independent variables, are plotted in figure 3. The vectors are the short, perpendicular lines at about 45 degrees to the usual axes. These vectors form a new basis for the same vector space. Notice, however, that in the a1 dimension the data on the independent variables is not very spread out. In the a2 dimension it is quite variable. The figure also shows the regression line of x1 on x2 (the dashed line).


Figure 3

As shown above, we can calculate the coefficient variances from the characteristic roots and vectors.

We can also use the matrix of characteristic vectors, A = [ a1 | a2 ], to transform the original data onto the new basis of the vector space. There are four data points, with two coordinates each. This projection is given by z = xA, a 4x2 matrix whose rows are the coordinates of the four observations in the new basis.
If we square each element of z and then sum down each column we get a sum of squares for each of the two dimensions; these turn out to be 45.642 and 1.358. Notice that these sums of squares are the characteristic roots. If the data are spread out along one of the characteristic vectors then we get a large root. If the data do not show a lot of variation along one of the vectors then we get a small root.
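
The same mechanics can be verified with any small design matrix. The 4x2 matrix below is hypothetical (it is not the one used in these notes); the check is that the column sums of squares of z = xA equal the characteristic roots of x'x.

    import numpy as np

    X = np.array([[1.0, 1.2],
                  [2.0, 2.1],
                  [3.0, 2.9],
                  [4.0, 4.2]])                   # hypothetical, highly correlated columns

    lam, A = np.linalg.eigh(X.T @ X)             # characteristic roots and vectors
    Z = X @ A                                    # each observation expressed in the new basis
    print("column sums of squares:", np.sum(Z ** 2, axis=0))
    print("characteristic roots:  ", lam)        # the two lines match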

Returning to the coefficient variances in terms of roots and vectors, we see that a large estimated variance of the coefficient estimator can result from an unfortunately large error variance. There is nothing one can do about this. There could also be a lack of variation in the data along one of the characteristic vectors. This is often characterized by data having an elliptical shape in the plane of the independent variables, rather than being scattered in a spherical or cube shape. The large estimate of the coefficient variance might also be due to unfortunate values for the elements of the characteristic vectors. The sizes of these elements are related to the rotation of the axes of the characteristic vectors relative to the original axes. The rotation is large when the data has a lot of variation in one variable, but not the other, or when the independent variables are highly correlated.
Let us repeat the example with a 'good' design matrix. This design matrix is good in the sense that the simple correlation between the independent variables is zero. The observed data is now

The true coefficient vector is unchanged. There has been a new draw from the N(0,2) distribution for the error vector. Applying least squares we obtain the following results


Neither coefficient is different from zero. Since the correlation between RHS variables is zero, the conventional researcher might conclude that multicollinearity is not the source of his/her bad results. On the other hand, while neither of the two variables is significant, they do explain 29% of the variation in the dependent variable. This is usually taken as evidence that there is collinearity.
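
That symptom (a respectable R2 and overall fit together with individually weak coefficients) is easy to generate with simulated collinear data. The sketch below is illustrative only; the data and parameter values are invented.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 30
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear with x1
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2_hat = e @ e / (n - 3)
    se = np.sqrt(np.diag(s2_hat * np.linalg.inv(X.T @ X)))
    r2 = 1 - e @ e / np.sum((y - y.mean()) ** 2)
    F = (r2 / 2) / ((1 - r2) / (n - 3))          # overall F for the two slopes
    print("R2 =", round(r2, 3), "  F =", round(F, 1))
    print("t statistics:", np.round(b / se, 2))  # overall fit is typically strong, slope t's weak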

The characteristic roots of the matrix x'x are now 6.25 and 10.75. On the basis of the condition number (10.75/6.25 = 1.72) there is no reason to suspect that collinearity is the problem. To sort all of this out, the exercise of figure 3 is repeated for the new design matrix: in figure 4 we again show the regression of x1 on x2, the characteristic vectors, and the data.



Figure 4
It would seem that the poor results stem from the unfortunately large estimate of the error variance, which inflates the estimated variances of the coefficients.

Now repeat the exercise again with another 'bad' design matrix and a new set of observed values of the dependent variable.



The data are plotted in figure 5 as the small boxes. The figure suggests that there is little or no relation between the two independent variables; the regression line is the dashed line. Yet the simple correlation is .70, rather high. The basis vectors corresponding to the characteristic vectors of x'x lie very nearly along the original basis.


Figure 5

The model results are

One coefficient is now different from zero. Why isn't the other also different from zero? The characteristic roots are 49.156 and 4.856. The larger is about ten times the size of the other. This is not a small condition number, although the graph suggests that collinearity is not a problem. The problem here is that the data is rather spread out along one axis, but not along the other. This ill-conditioned design matrix produces a set of results that one might ascribe to multicollinearity.
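
A quick simulated illustration of this last point (the numbers are invented): two essentially uncorrelated variables, one spread out and one compressed, can still produce a sizeable ratio of roots.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 40
    x1 = rng.normal(scale=5.0, size=n)           # spread out along one axis
    x2 = rng.normal(scale=0.5, size=n)           # compressed along the other
    X = np.column_stack([x1, x2])

    lam = np.linalg.eigvalsh(X.T @ X)
    print("correlation:", round(np.corrcoef(x1, x2)[0, 1], 3))
    print("ratio of roots:", round(lam.max() / lam.min(), 1))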

Finally, we turn to the received doctrine about multicollinearity.

Consequences
1. Even in the presence of multicollinearity, OLS is BLUE and consistent.
2. Standard errors of the estimates tend to be large.
3. Large standard errors mean large confidence intervals.
4. Large standard errors mean small observed test statistics. The researcher will fail to reject too many null hypotheses. The probability of a Type II error is large.
5. Estimates of standard errors and parameters tend to be sensitive to changes in the data and the specification of the model.

Detection
1. A high F statistic or R2 leads us to reject the joint hypothesis that all of the coefficients are zero, but the individual t-statistics are low.
2. High simple correlation coefficients are sufficient but not necessary for multicollinearity.
3. Farrar and Glauber suggest regressing each suspect ("culprit") variable on the other explanatory variables. If there is collinearity then the resulting F statistic will be large.
4. One can compute the condition number. That is, the ratio of the largest to the smallest root of the matrix x'x. This may not always be useful as the standard errors of the estimates depend on the ratios of elements of the characteristic vectors to the roots.

5. Leamer suggests using the magnification factor 1/(1 - Rk^2), where Rk^2 is the coefficient of determination from the regression of one of the explanatory variables on all of the others.
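
A sketch of the computation, with the auxiliary regression used to obtain Rk^2. The data and the helper function name (magnification) are my own inventions for illustration.

    import numpy as np

    def magnification(X, k):
        # 1 / (1 - Rk^2): regress column k of X on the other columns plus a constant
        y = X[:, k]
        others = np.column_stack([np.ones(X.shape[0]), np.delete(X, k, axis=1)])
        b, *_ = np.linalg.lstsq(others, y, rcond=None)
        e = y - others @ b
        r2 = 1 - e @ e / np.sum((y - y.mean()) ** 2)
        return 1.0 / (1.0 - r2)

    rng = np.random.default_rng(6)
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.2, size=50)     # strongly related to x1
    x3 = rng.normal(size=50)                     # unrelated
    X = np.column_stack([x1, x2, x3])
    print([round(magnification(X, k), 1) for k in range(3)])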


Remediation
1. Use prior information or restrictions on the coefficients. One clever way to do this was developed by Theil and Goldberger. See, for example, Theil, Principles of Econometrics, Wiley, 1971, pp. 347-352.
2. Use additional data sources. This does not mean more of the same. It means pooling cross section and time series.
3. Transform the data. For example, inversion or differencing.
4. Use a principal components estimator. This involves using a weighted average of the regressors, rather than all of the regressors. The classic in this application is George Pidot, "A Principal Components Analysis of the Determinants of Local Government Fiscal Patterns", Review of Economics and Statistics, Vol. 51, 1969, pp. 176-188.
5. Another alternative regression technique is ridge regression. This involves adding a constant to the main diagonal of x'x so that the estimates are more precise; the result is a biased estimator (see the sketch following this list).
6. Some writers encourage dropping troublesome RHS variables. This begs the question of specification error.
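
As promised in item 5, here is a minimal sketch of the ridge idea. The shrinkage constant k = 1.0 is arbitrary and the simulated data are invented; the point is only the mechanics of adding a constant to the diagonal of x'x.

    import numpy as np

    def ridge(X, y, k):
        # (x'x + kI)^(-1) x'y, where k is the (arbitrary) shrinkage constant
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

    rng = np.random.default_rng(7)
    n = 30
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear pair
    X = np.column_stack([x1, x2])
    y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

    print("OLS:  ", np.linalg.solve(X.T @ X, X.T @ y))
    print("ridge:", ridge(X, y, k=1.0))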

Now that you've finished reading the notes, go to "Explorations in Multicollinearity."