DATA PROBLEMS: Missing Data
By way of introducing the area of concern and organizing our thinking, consider the
following table. The first row describes how it might come to pass that some of the data
is missing in both cross-section and time-series data. If the data is missing for a reason
that has nothing to do with the behavioral content of the model, then the only consequence
is that our estimators are less efficient. This is solely a result of sample size. In a
cross section, for example, knowledge about the respondent pool would not let us predict
which observations will be missing.
There is a class of problems in which observations are missing as a result of the process
being modeled. This is the third row of the table. For example, we might be modeling the
purchase of refrigerators. If the purchase price of a refrigerator is never below $200,
then in our sample we will never see a household that would have liked to purchase a
refrigerator for $175.
Nature of Problem                                Cross Section                          Time Series

How the data come to be missing                  Respondent fails to answer question    Model needs monthly, we have quarterly

Missing data unrelated to complete observations  Lowers efficiency                      Lowers efficiency

Gaps in data systematically related to           See any text on censored and/or truncated data.
behavior to be modeled                           The classic estimator for this genre is Tobit.
In the next table we categorize the missing data problems we can deal with at this
point. Basically, you can be missing data for either the dependent variable or for some of
the independent variables. The problem, or question, is how to leverage the data that one
does have in order to increase the efficiency of the estimator.
         y_{a}, x_{a}    n_{a} complete observations

Case 1   ___, x_{b}      n_{b} missing observations on y_{b}

Case 2   y_{c}, ___      n_{c} missing observations on x_{c}
CASE 1 Missing observations for the dependent variable

Let ŷ_{b} be a predictor for y_{b}, so the least squares estimator for the coefficient
vector in this filled data set is

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'y_{a} + x_{b}'ŷ_{b}).

In general we can write the normal equations as x'y = x'xb, so make this substitution for
the data-on-hand estimator, b_{a}, and the missing-data estimator, b_{b}:

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'x_{a}b_{a} + x_{b}'x_{b}b_{b}).

Define F = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{a}'x_{a},
so I - F = I - [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{a}'x_{a},
or I - F = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{b}'x_{b},
since I = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}[x_{a}'x_{a} + x_{b}'x_{b}].
Then the filled data estimator is b_{f} = Fb_{a} + (I - F)b_{b}, with
Eb_{f} = Fβ + (I - F)Eb_{b}.
Unbiasedness is seen to depend on Eb_{b}.
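The weighting identity can be confirmed numerically. Below is a minimal sketch in numpy; the design matrices, seed, and the predictor used for the missing y_{b} are all hypothetical, chosen only to exercise the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical blocks: 8 complete observations (a), 5 with y missing (b)
Xa = np.column_stack([np.ones(8), rng.normal(size=8)])
Xb = np.column_stack([np.ones(5), rng.normal(size=5)])
ya = Xa @ np.array([1.0, 0.5]) + rng.normal(size=8)
yb_hat = rng.normal(size=5)              # any predictor for the missing y_b

# Least squares on the filled data set
XtX = Xa.T @ Xa + Xb.T @ Xb
b_f = np.linalg.solve(XtX, Xa.T @ ya + Xb.T @ yb_hat)

# Block estimators and the weight matrix F = (Xa'Xa + Xb'Xb)^{-1} Xa'Xa
b_a = np.linalg.solve(Xa.T @ Xa, Xa.T @ ya)
b_b = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb_hat)
F = np.linalg.solve(XtX, Xa.T @ Xa)

# b_f is the matrix-weighted average F b_a + (I - F) b_b
assert np.allclose(b_f, F @ b_a + (np.eye(2) - F) @ b_b)
```

Whatever predictor is slotted in for y_{b}, the filled estimator is this matrix-weighted average of b_{a} and b_{b}.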
Method 1
Let i be a column vector of ones. For the missing data estimator we will replace each
missing observation on y by the mean of the observations on y_{a}. Then ŷ_{b} = iȳ_{a},
where ȳ_{a} = i'y_{a}/n_{a}, so b_{b} = (x_{b}'x_{b})^{-1}x_{b}'iȳ_{a}. Note
x_{b}'i = n_{b}x̄_{b}, where x̄_{b} is a
column vector of variable means. Making the obvious substitution we get
b_{b} = n_{b}ȳ_{a}(x_{b}'x_{b})^{-1}x̄_{b}.
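As a quick numerical check on Method 1, here is a sketch that uses the small data set from the example further below, treating the last four y's as missing; the variable names are ours:

```python
import numpy as np

# Data from the example table later in this section
y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])
X = np.column_stack([np.ones(10), x])

# Method 1: replace the last four (treated-as-missing) y's with the
# mean of the six observed y's.
y_fill = np.concatenate([y[:6], np.full(4, y[:6].mean())])

b_a = np.linalg.solve(X[:6].T @ X[:6], X[:6].T @ y[:6])   # data-on-hand
b_f = np.linalg.solve(X.T @ X, X.T @ y_fill)              # mean-filled

# Unlike Method 2 below, mean filling changes the estimate: b_f != b_a.
assert not np.allclose(b_f, b_a)
```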
Method 2
Get b_{a} = (x_{a}'x_{a})^{-1}x_{a}'y_{a}, then build ŷ_{b} = x_{b}b_{a} and

b_{b} = (x_{b}'x_{b})^{-1}x_{b}'ŷ_{b} = (x_{b}'x_{b})^{-1}x_{b}'x_{b}b_{a} = b_{a}.

Also, b_{f} = Fb_{a} + (I - F)b_{b} = Fb_{a} + (I - F)b_{a} = b_{a}.

Now for the variance. We have

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'y_{a} + x_{b}'x_{b}b_{a}).

Substitute in for b_{a} and factor out x_{a}'y_{a}:

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}[I + x_{b}'x_{b}(x_{a}'x_{a})^{-1}]x_{a}'y_{a}.

If you multiply x_{a}'x_{a} through to the right then the last pair of square brackets
becomes [x_{a}'x_{a} + x_{b}'x_{b}](x_{a}'x_{a})^{-1}, so it cancels against the leading
inverse. Now factor (x_{a}'x_{a})^{-1} out of the remaining terms and one is left with
b_{f} = (x_{a}'x_{a})^{-1}x_{a}'y_{a} = b_{a}. Substituting in y_{a} = x_{a}β + u_{a},
postmultiplying b_{f} - β by its transpose, and taking expectations gives the variance of
our pooled estimator:

Var(b_{f}) = E[(x_{a}'x_{a})^{-1}x_{a}'u_{a}u_{a}'x_{a}(x_{a}'x_{a})^{-1}] = s^{2}(x_{a}'x_{a})^{-1}.

There is no gain in efficiency from using the filled estimator. In practice, because one
must estimate s^{2}, it will appear that the estimated standard errors of the filled
estimator are smaller. Do you see why this is so? It depends only on the number of
observations the computer uses in the calculated estimate of s^{2}.
EXAMPLE: Left Hand Side Missing Data
Suppose that the entire correct data set is shown in the following table.
Y  x 
0  0 
2  1 
1  2 
3  1 
1  0 
3  3 
4  4 
2  2 
1  2 
2  1 
If we use all of the observations to estimate a simple linear model then the result, with
t-statistics in parentheses, is

y = .8333 + .666x + e        n = 10
   (1.75)   (2.81)
But suppose that for some reason we do not observe the last four data points for y. Using the abbreviated data set our regression result is

y = .9268 + .6341x + e       n = 6
   (1.43)   (1.55)
Using the coefficient estimates from this abbreviated data set we construct fitted values for the last four observations on y, bringing our total number of observations back up to ten. Running the regression with the six observed values of y and the four fitted values gives us the following result

y = .9268 + .6341x + e       n = 10
   (2.32)   (3.179)
Notice how much higher the t-statistics are than those in the regression
when we had all of the correct data. This is contrary to the derived result: we had
asserted above that our filled estimator was no more efficient. This is another case of
the computer not really knowing what is going on. It must compute an estimate of the error
variance from the observed data for use in an estimate of the variance of the coefficient
estimator. In our filled data set the last four observations contribute nothing to the sum
of squared errors, but the divisor in the error-variance estimate is based on all ten
observations. The consequence is that the estimate of the error variance is understated.
The implication is that the careless researcher will use this filled estimator, which is
no more efficient than the estimator based on the abbreviated data set, fail to question
the very high observed t-statistics, and reject too many null hypotheses.
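The arithmetic of this example is easy to reproduce. Here is a sketch in numpy, assuming only the data table above; the `ols` helper is our own, not a library routine:

```python
import numpy as np

# Data from the table above
y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])
X = np.column_stack([np.ones(10), x])

def ols(y, X):
    """OLS coefficients and conventional standard errors."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])      # SSE / (n - k)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

b_a, se_a = ols(y[:6], X[:6])               # abbreviated data set, n = 6
y_fill = y.copy()
y_fill[6:] = X[6:] @ b_a                    # fitted values replace the missing y's
b_f, se_f = ols(y_fill, X)                  # filled data set, n = 10

# Both fits give slope .6341; the coefficients are identical, but the
# filled regression reports smaller standard errors because the fitted
# observations add degrees of freedom without adding squared error.
assert np.allclose(b_f, b_a)
assert np.all(se_f < se_a)
```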
CASE 2 Missing observations on the right hand side
Simple Regression
We have the simple regression model y_{a} = a + bx_{a} + u_{a} and the additional n_{c}
observations on y_{c}, but the observations on x_{c} are missing. One
possibility is to use the mean of the observed x's, x̄_{a}, to fill in the blanks. This
won't work: the filled observations have zero deviation from the sample mean of x, so
these terms get zero weight in the least squares slope estimator. We would gain nothing
for our effort.
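The zero-weight point can be checked with the data from the earlier example, treating the last four observations on x as missing; this is a sketch, and the helper name is ours:

```python
import numpy as np

y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])

# Suppose the last four observations on x are missing.
x_a, y_a = x[:6], y[:6]

def slope(x, y):
    """Least squares slope of y on x."""
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()

# Fill the missing x's with the mean of the observed x's.
x_fill = np.concatenate([x_a, np.full(4, x_a.mean())])

# The filled observations have zero deviation from the mean of x,
# so they get zero weight: the slope is unchanged from the n = 6 fit.
assert np.isclose(slope(x_fill, y), slope(x_a, y_a))
```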
Another possibility is to use the reverse regression x_{a} = c + dy_{a} + v_{a} to
construct estimates ĉ and d̂, then build a set of fitted values for x_{c} according to
x̂_{c} = ĉ + d̂y_{c}. Now apply OLS to the filled model, using the n_{a} observed and
n_{c} fitted values of x. The sampling distribution of the OLS estimator for the
parameter vector is unknown in this scheme, so we can say nothing about unbiasedness and
efficiency.
EXAMPLE: Missing data on the right hand side
For this example we use the same data as in the previous example. If we run the simple regression with y on the right hand side, x on the left hand side, and using only the first six observations, then we get the following result

x = .7633 + .3817y + e       n = 6
   (.24)    (1.55)
Using these results to fill in the missing values of x and applying OLS we get

y = .7670 + .8685x + e       n = 10
   (1.59)   (2.90)

Notice that the slope estimate differs from the one based on the complete data, as it did
in the prior example. The direction of the change is coincidence: it depends on the
realization of the data which is available to us. That is, it is sample specific. Also,
the t-statistic on the slope estimate is higher. This is not coincidence, for the same
reason given in the earlier example.
Multiple Regression
Given the available data we have the following model

y_{a} = β_{1} + β_{2}x_{a} + β_{3}z_{a} + u_{a}    (1)

for which we have n_{a} observations. We have another n_{c} observations on y_{c} and z_{c}.
From the additional observations we might fill in for the missing x_{c}. We could use the first n_{a} observations on x and z in the following equation

x_{a} = γ_{1} + γ_{2}z_{a} + v_{a}    (2)

In the last n_{c} observations we would then substitute in for x

y_{c} = β_{1} + β_{2}x̂_{c} + β_{3}z_{c} + u_{c},  where x̂_{c} = γ̂_{1} + γ̂_{2}z_{c}    (3)

The model then consists of the equations (1), (2), and (3). To operationalize it,
estimate (2) by OLS on the first n_{a} observations, build the fitted values x̂_{c}, and
apply OLS to the stacked sample made up of (1) and (3). This method allows us to use more
information about the dependent variable and z in our estimation of the β's. This should
provide us with a more efficient estimator.
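In outline, the two-step scheme of equations (1)-(3) might look like this on simulated data; all numbers, seeds, and names here are hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_c = 12, 3
n = n_a + n_c

# Simulated data: z is always observed, x is missing for the last n_c rows.
z = rng.normal(size=n)
x = 0.5 + 0.8 * z + rng.normal(scale=0.2, size=n)
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(scale=0.3, size=n)

# Equation (2): regress x on z over the n_a complete observations.
G = np.column_stack([np.ones(n_a), z[:n_a]])
gamma, *_ = np.linalg.lstsq(G, x[:n_a], rcond=None)

# Equation (3): fitted values stand in for the missing x's.
x_fill = np.concatenate([x[:n_a], gamma[0] + gamma[1] * z[n_a:]])

# Equation (1) estimated on the filled sample of n observations.
X = np.column_stack([np.ones(n), x_fill, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```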
EXAMPLE: Right Hand Side Missing Data: Multiple
Regression
We are given the following aggregate data for Taiwan. One could estimate a production function using these data.
YEAR   CAPITAL  LABOR   OUTPUT
1958  17804.  275.50  16608. 
1959  18097.  274.40  17511. 
1960  18272.  269.70  20171. 
1961  19167.  267.00  20933. 
1962  19648.  267.80  20406. 
1963  20804.  275.00  20832. 
1964  22077.  283.00  24806. 
1965  23445.  300.70  26466. 
1966  24939.  307.50  27403. 
1967  26714.  303.70  28629. 
1968  29958.  304.70  29905. 
1969  31586.  298.60  27508. 
1970  33475.  295.50  29036. 
1971  34822.  299.00  29282. 
1972  41794.  288.10  31536. 
Using all of the data one obtains

Q = 1.6900 + 2.454L + .4326K + e
   (2.97)    (4.06)   (5.48)
But suppose the last three observations on capital are missing. Using just the first twelve data points we observe

Q = 1.763 + 2.1193L + .5406K + e
   (1.76)   (2.06)    (2.39)
If we use the first 12 observations on capital and labor to estimate a linear regression
the result is

K = 2.926 + 3.8273L + e
   (3.35)   (4.95)
The R^{2} is .71 and both coefficients are significant; L does quite a good job
explaining capital. Use these coefficients to construct a fitted series for capital so
that we have a total of 15 observations and run the original model again.
Q = 1.7610 + 2.4428L + .5406K + e
   (1.40)    (1.58)    (1.57)
The coefficient on labor does not change a great deal. Compared to the complete correct
data set, the coefficient on K has changed by 25%; this is a big change. Compared to
either the full or abbreviated data set, the estimated coefficient standard errors have
apparently gotten larger and the t-statistics have fallen.