6.3 Estimation
6.3.1 Single Equation Methods
6.3.1.1 Instrumental Variables
We begin with Instrumental Variables (IV) estimation for several reasons. First, it
is easy to show that the estimator is consistent. Second, as we shall see, it is easy to
show that our other systems estimators are equivalent to the IV estimator. Third, the IV
estimator has a number of applications outside of systems models, which makes it the more
generally useful tool.
Suppose that the jth equation is
In an earlier section we showed that δj cannot be consistently estimated by least squares because some of the variables on the RHS are correlated with the error term. To get around this predicament we define a matrix of instruments Wj: T×(mj+kj) to satisfy
The IV estimator is
substituting for yj we get
Applying Slutsky's Theorem we can see
The asymptotic variance of the IV estimator (the variance of the asymptotic
distribution) is
with
a consistent estimator.
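For reference, the usual expressions take the following form (a sketch, assuming the jth equation is written yj = Zjδj + uj with Zj = [Yj Xj] the T×(mj+kj) matrix of RHS variables and homoskedastic errors with variance σj²):

δ̂j,IV = (Wj'Zj)^{-1}Wj'yj = δj + (Wj'Zj)^{-1}Wj'uj,

so consistency follows when plim (1/T)Wj'uj = 0 and plim (1/T)Wj'Zj is nonsingular. The asymptotic variance and a consistent estimator of the error variance are then

Avar(δ̂j,IV) = σj² (Wj'Zj)^{-1}Wj'Wj(Zj'Wj)^{-1},   σ̂j² = (1/T)(yj − Zjδ̂j,IV)'(yj − Zjδ̂j,IV).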
Suppose that the system is exactly identified. Then
The * designates excluded variables. The number of excluded exogenous variables, kj*,
is equal to the number of included endogenous variables, mj. Let the matrix of
instruments be made from the included and excluded exogenous variables, W = [ Xj* Xj ].
Since kj* = mj, this W has exactly the mj + kj columns required of the instrument matrix,
so the IV estimator can be applied directly. Or, expanding the matrices
We can show the equivalence of the IV estimator and the Indirect Least Squares
estimator. Begin by going back and estimating the reduced form coefficients for the
endogenous variables in the jth equation.
Premultiply by (X'X)
(a)
From our work on identification we know that the following restrictions must hold
Now rewrite (a) as
Postmultiply the above result by to
get
The product of the second and third matrices on the left hand side of the equation we
recognize as . Making this substitution
yields
Doing the multiplication and rearranging terms gives the following
Now factor out the estimated parameter vector
(i)
We can see that (a) and (i) are the same thing!
When the equation is overidentified ILS and IV are not equivalent. ILS produces more than one estimate of the
overidentified parameters, while IV uses the overidentifying restrictions in producing its
one estimate of each parameter.
What Does IV Do?
The following three figures help us understand what IV does. The first
figure shows the normal case of the classical regression model in which no RHS variable is
correlated with the error term. The second figure illustrates what happens when the zero
correlation assumption is violated but we apply OLS anyway. The third figure shows the IV
correction.
Figure 1: The Classical Regression Model
In the above figure Y is the dependent variable, X is the set of k independent
variables. The circles represent a contour of the normal distribution in the space in
which the dependent variable lies. The idea behind the contour is similar to the use of
indifference curves in micro theory. All the points on a contour are equally likely to
occur. Points on a smaller circle are more likely to occur; that is, they are higher on
the probability density function. e is a particular realization of the error term. The
particular realization of the error term is added to E(Y) to give the realized value of
the dependent variable. OLS then projects the realized dependent variable onto the space
spanned by the columns of X. Since the mean of the error term is zero, we are adding zero
on average to E(Y). On average, then, the OLS projection recovers E(Y) = Xβ.
Figure 2: OLS When E(X'U) ≠ 0
The principal difference between the first and second figures is that the mean of the
error term, conditional on X, is not zero. We have drawn it so that there is a positive
correlation between X and the error term. That is, the center of the error term contours
is no longer at the origin, nor does the axis along which it lies meet the space spanned
by X at a right angle. Thus we are always adding a positive term to Xβ.
E(Y) is now located to the northeast of Xβ. We will, on
average, be dropping a perpendicular from E(Y) onto the space spanned by X. As you can see
from the diagram we will overshoot Xβ by some amount, even on
average. Our conclusion is that OLS produces biased estimates of the model parameters.
Increasing the sample size won't correct things since all we accomplish is to make the
contours tighter. Hence, OLS is also not consistent.
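The geometry matches the familiar algebra (a sketch, writing the model as y = Xβ + u and b for the OLS estimator):

b = (X'X)^{-1}X'y = β + (X'X)^{-1}X'u,   so   plim b = β + plim[(1/T)X'X]^{-1} plim[(1/T)X'u] ≠ β

whenever plim (1/T)X'u ≠ 0. A larger sample shrinks the sampling variance around this limit but does not remove the second term.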
Figure 3: The Instrumental Variables Projection
In figure 3 the error contours have been suppressed to make the figure easier to
follow. The realization of the dependent variable, Y, is shown. As is the space spanned by
the 'independent' variables, X. The non-zero error term, correlated with the RHS
variables, is shown as before. The figure shows that E(Y), when projected orthogonally onto
the space spanned by the RHS variables, causes us to overestimate Xβ.
Added to the picture is the space spanned by the set of instruments.
IV first projects the RHS variables onto the space spanned by the instruments, W. Then the
dependent variable is projected onto this set of fitted values. As you can see from the
picture, when the dependent variable is projected onto W it results in an oblique
projection onto the original set of RHS variables. The result is that, at least in large
samples, we are much more likely to estimate Xβ correctly.
The algebra is straightforward. We do the general case in which an equation is
overidentified because it provides a lead-in to the next section. We suppress the
subscript for the equation number.
Recall that Z is the full set of RHS variables, y is the dependent variable, and W is the
set of instruments. In the first step project the RHS variables onto the set of
instruments
Replace the original RHS variables by their fitted values and estimate the original
set of parameters
From the last line we can see that in IV we are really using a set of fitted values of
the dependent variable that have been created by projecting the dependent variable onto
the space spanned by the instrumental variables.
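In symbols, the two-step construction just described is (a sketch, writing PW = W(W'W)^{-1}W' for the orthogonal projection onto the space spanned by the instruments):

Ẑ = PW Z,   δ̂IV = (Ẑ'Ẑ)^{-1}Ẑ'y = (Z'PW Z)^{-1}Z'PW y.

Because PW is idempotent, replacing y by its fitted value PW y leaves the estimator unchanged, which is the sense of the remark above.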
6.3.1.2 Two Stage Least Squares
We have the structural model
(i)
which can be written in reduced form as
(ii)
Our problem is that the system is overidentified. The consequence is that we cannot
use ILS, nor the simple IV estimator as it was set out in the sections before the
discussion of what IV does. We need to devise a method that uses all of the information in
the exogenous variables.
We will consider the jth equation
of (i)
(iii)
The usual problem is that Yj
is correlated with the error. IV solves this problem by choosing as instruments a set of
variables which are correlated with Yj,
but not with the error term. What we plan to do is use linear combinations of all the
exogenous variables in place of the endogenous variables which are included on the RHS of
the equation. To construct the linear combinations of the set of all exogenous variables
we run the regression
The fitted values of the included endogenous variables are given by
Note that (X'X)^{-1}X'Yj is part of the matrix of reduced form
coefficients. With this result we can rewrite the model in (iii) in terms of the fitted
values. After factoring out the coefficients we can write this in matrix form as
Now apply least squares
Now X(X'X)^{-1}X' is idempotent, so
Since the exogenous variables are independent of the error term and the fitted
endogenous variables are orthogonal to the residual we can also write
Making these substitutions, the Two Stage Least Squares estimator is
Suppose the model is exactly identified so that kj* = mj.
In this case Xj* would work equally well as the set of instruments. If you go back to equation (i) on page
21 and make this substitution you will see that ILS, IV and 2SLS are all the same in the
exactly identified case.
Finally, since the IV estimator is known to be consistent, we also know 2SLS to be
consistent.
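To make the two-stage recipe concrete, here is a minimal numerical sketch in Python using only NumPy. The function and variable names (two_sls, y, Z, X) and the simulated data are illustrative assumptions, not notation taken from the text.

import numpy as np

def two_sls(y, Z, X):
    # y: (T,) dependent variable of the equation
    # Z: (T, m+k) RHS variables, included endogenous columns first, then included exogenous
    # X: (T, K) every exogenous variable in the system, used as instruments
    Z_hat = X @ np.linalg.solve(X.T @ X, X.T @ Z)     # first stage: project Z onto the instrument space
    return np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y)  # second stage: regress y on the fitted values

# Exactly identified example: one endogenous regressor, one excluded instrument.
rng = np.random.default_rng(0)
T = 5000
x_inc = rng.normal(size=T)                   # included exogenous variable
x_exc = rng.normal(size=T)                   # excluded exogenous variable
u = rng.normal(size=T)                       # structural error
y2 = x_exc + 0.5 * u + rng.normal(size=T)    # endogenous RHS variable, correlated with u
y = 1.0 * y2 + 2.0 * x_inc + u               # structural equation with coefficients 1 and 2
Z = np.column_stack([y2, x_inc])
X = np.column_stack([x_inc, x_exc])
print(two_sls(y, Z, X))                      # close to [1.0, 2.0]; OLS of y on Z would not be

The second stage uses Ẑ'Z rather than Ẑ'Ẑ; the two coincide because the projection is idempotent.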
6.3.1.3 Limited Information Maximum Likelihood
We are interested in the jth structural equation once again. We begin by
looking at a partitioning of the reduced form that corresponds to the jth
structural equation
To economize on notation a little bit we will write this as
The error covariance matrix is written as
The log of the likelihood function for the joint density is
The idea is that we will maximize the likelihood for the equation of interest subject
to the constraints that we developed in the section on identification. Namely,
In the exactly identified case we can maximize the likelihood then go to the
constraints to solve for the structural coefficients. We conclude also that LIML is,
therefore, equivalent to ILS, IV, and 2SLS in the exactly identified case. Also observe
that the likelihood function is equivalent to that for the SUR model.
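For concreteness, this likelihood has the familiar multivariate regression form (a sketch, assuming the block of reduced form equations relevant to equation j is written Y⁺ = XΠ + V, with the mj+1 columns of Y⁺ being the endogenous variables appearing in equation j and the rows of V independent N(0, Ω)):

ln L(Π, Ω) = −[T(mj+1)/2] ln 2π − (T/2) ln|Ω| − (1/2) tr[Ω^{-1}(Y⁺ − XΠ)'(Y⁺ − XΠ)].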
Returning to the overidentified case, construct the concentrated likelihood function.
First, find
Solve for and substitute back into the log likelihood; this gives us the likelihood in terms of only the reduced form coefficients.
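Under the same assumed notation, the concentration step amounts to

Ω̂(Π) = (1/T)(Y⁺ − XΠ)'(Y⁺ − XΠ),   ln Lc(Π) = const − (T/2) ln|(1/T)(Y⁺ − XΠ)'(Y⁺ − XΠ)|.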
Now form the Lagrangian
and determine
THEOREM
The LIML estimates for βj and the exogenous coefficients are, respectively,
a) endogenous coefficients
where the normalization rule is and g
is the solution to . In order for there to
be a solution the part in brackets must be singular. Therefore, λ*
is found as the maximum root of the determinantal equation . So we conclude that g is the characteristic vector corresponding
to λ*. Note that the root can be solved from .
Q is an (mj+1) × (mj+1) matrix of residual sums of squares and cross products (written
out in symbols below) from the regression of all the endogenous variables of equation j on
all of the exogenous variables of the system. Q1 is an (mj+1) × (mj+1) matrix of residual
sums of squares and cross products from the regression of all the endogenous variables of
equation j on only those exogenous variables included in the jth equation.
b) exogenous coefficients
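In symbols, the two residual moment matrices used in the theorem can be written as follows (a sketch, with Yj⁺ = [yj Yj] collecting the mj+1 endogenous variables of equation j and MA = I − A(A'A)^{-1}A' denoting the residual-maker matrix for a set of regressors A):

Q = Yj⁺' MX Yj⁺,   Q1 = Yj⁺' MXj Yj⁺.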
LEMMA
(i)
(ii) λ* ≤ 1
Proof:
(i) Recall from linear algebra
where λ1 is the largest root of
(ii) We can write Q1 = Q + D where D is a positive semidefinite matrix. So
The Lemma leads us to a minimum variance ratio interpretation of the LIML. Consider
the jth structural equation in its original form and a version which includes
the entire set of exogenous variables
Incidentally, comparing these two specifications of the jth equation could
form the basis of a test for the exclusion restrictions of the structural model.
Define two sets of residual sums of squares corresponding to the two specifications of the
jth structural equation.
Now construct the ratio
The trick is to pick the coefficients on the endogenous variables so that λ is
maximized or, equivalently, its reciprocal is minimized. Call this a
minimum variance ratio estimator.
If λ* is close to one we conclude that the exclusion
restrictions are correct. If λ* is well below one we
conclude that the model is incorrectly specified. For any model, λ* ≤ 1.
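One compact way to write the ratio, consistent with the Q and Q1 defined above (an interpretive sketch rather than a quotation of the notes):

λ(g) = g'Qg / g'Q1g,

so that maximizing λ over the normalized coefficient vector g reproduces the characteristic root problem of the theorem: λ* is the largest root of |Q − λQ1| = 0 and the maximizing g is the corresponding characteristic vector.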
THEOREM
The LIML estimator minimizes the correlation coefficient between and Wβj*
where Vj = Yj − Xj(Xj'Xj)^{-1}Xj'Yj
is the residual from the regression of Yj on the included exogenous variables
and W = Xj* − Xj(Xj'Xj)^{-1}Xj'Xj*
is the residual from the regression of the excluded exogenous variables on those which are
included. r is known as the canonical correlation coefficient, so LIML is also known as
the minimum correlation coefficient estimator.
6.3.1.4 k-Class Estimators
Consider first the 2SLS estimator
Which we can rewrite as
If we had applied OLS to the model we would get
If we construct an estimator
then one can see that when k=0 we get OLS and when k=1 we get 2SLS.
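One standard way of writing the whole family, consistent with these two limiting cases (a sketch, with MX = I − X(X'X)^{-1}X' and Zj = [Yj Xj] as before):

δ̂j(k) = [Zj'(I − k MX)Zj]^{-1} Zj'(I − k MX)yj.

Setting k = 0 drops the correction entirely and gives OLS; setting k = 1 turns the cross products into Zj'PX Zj and Zj'PX yj, which is 2SLS.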
THEOREM
Recall that we had previously said that LIML was the minimum variance ratio
estimator
If we let k = in our k-class estimator
then we get the LIML.
Proof:
Recall
(1)
To solve for we will partition Q1 and Q as Q1 = [q11 Q12]
and
Q = [q1 Q2], where q11 and q1 each
have one column corresponding to the endogenous variable on the left hand side of the
structural equation, and Q12 and Q2 each have the
remaining mj columns for the other included endogenous variables of the
equation. This allows us to rewrite (1) as
There are mj + 1 equations in this system; we have one too many, so we need to
discard one. We can partition further as
Given our definition , we can use our
partitioning to disregard the first equation in the determinantal equation and use
Rearranging, we can derive the LIML estimator
Now suppose that there are no included exogenous variables. That is, kj = 0
so
Note that we have Xj* in both Q22 and q21 because there is no difference between the
set of all exogenous variables and the set of all excluded exogenous variables for this
particular example. Substituting into we
get
We conclude then that the LIML estimator is also a k-class estimator.
There is a problem with the k-class estimator: for some value of k the inverse explodes,
so we get the following relationship between the different estimators. We know that the
k for LIML is always greater than 1. Therefore, the diagrams show us the ordering of the
estimates, always!
6.3.2 System Methods
6.3.2.1 Three Stage Least Squares
The first equation of the structural model is
Or, using the notation we adopted for our explorations of 2SLS
In general we can write the jth equation in the system as
Now we will put the equations together as we did for SURE. Toward that end define the
following vectors and matrices
So the whole system can be written as
(1)
Premultiply (1) by , where X' is the k×T matrix of all observations on all of the exogenous variables.
(2)
Rewrite (2) as
(3)
Estimate the coefficients of (3) by GLS
(4)
Estimate Σ as follows
(5)
where is the 2SLS estimate. Substitute
(5) into (4) to get the 3SLS estimate.
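Putting steps (1) through (5) together in a single expression (a sketch, assuming the stacked y, Z, and δ defined above, with PX = X(X'X)^{-1}X' and Σ̂ = [σ̂ij]):

δ̂3SLS = [Z'(Σ̂^{-1} ⊗ PX)Z]^{-1} Z'(Σ̂^{-1} ⊗ PX)y,   σ̂ij = (1/T)(yi − Ziδ̂i,2SLS)'(yj − Zjδ̂j,2SLS).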
6.3.2.2 Full Information Maximum Likelihood
The complete system of m structural equations is written
Using all of the observations
where the subscripts show the dimensions of the matrices. The likelihood function is . Using a theorem from distribution theory we
can rewrite the likelihood as
The term at the right end of the expression is the absolute value of the determinant of
the matrix of partial derivatives. You will recognize this as the Jacobian of the
transformation. In this instance it is given by so the likelihood is
Substituting in for the normal density
As usual it is easier to work with the log likelihood. We will also make use of the
fact that the exponent is a scalar. As such, taking its trace has no effect.
where
We have the following factoids from linear algebra
First consider
Substituting back into L* gives us the concentrated likelihood function.
The likelihood function is concentrated in the sense that we have already maximized over
the G(G+1)/2 distinct elements of the covariance matrix.
The result of substituting back in for AMA' gives
One would proceed by differentiating this with respect to the unknown coefficients,
setting those partials to zero and solving. Needless to say, the first order conditions
would be highly nonlinear. For years this was the stopping place in Full Information
Maximum Likelihood. With the advent of cheap computing power, it became possible to write
hill-climbing algorithms based on the Gauss-Newton principle that could solve these
equations numerically. That is, in fact, how LIMDEP, RATS and TSP do their maximum
likelihood estimation.
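For reference, in one common notation, which differs from the A and M used above (an assumption made for illustration only), the system is written YΓ + XB = U with the rows of U independent N(0, Σ); the concentrated log likelihood that such hill-climbing routines maximize is then

ln Lc(Γ, B) = const + T ln|det Γ| − (T/2) ln|(1/T)(YΓ + XB)'(YΓ + XB)|.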
Before moving on, it is worth considering the exactly identified case. Go back to the
original likelihood and substitute in the reduced form to get
Define
and write the likelihood function as
The first order condition for the maximum is
From the FOCs we can derive estimates of the reduced form parameters . Therefore, we conclude that in the exactly
identified case, OLS provides us with the maximum likelihood estimates of the reduced form
parameters. We could then solve the reduced form parameter equations for the structural
parameter estimates.
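Concretely, with an unrestricted reduced form Y = XΠ + V (an assumed notation), the first order condition is the familiar set of normal equations, whose solution is equation-by-equation OLS:

X'(Y − XΠ̂) = 0   ⟹   Π̂ = (X'X)^{-1}X'Y.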
To picture the relationship between estimators I have provided the following flowchart.