DATA PROBLEMS: Missing Data
By way of introducing the area of concern and organizing our thinking, consider the
following table. The first row describes how it might come to pass that some of the data
is missing in both cross-section and time-series data. If the data is missing for a reason
that has nothing to do with the behavioral content of the model, then the only consequence
is that our estimators are less efficient. This is solely a result of sample size. In a
cross section, for example, knowledge about the respondent pool would not let us predict
which observations will be missing.
There is a class of problems in which observations are missing as a result of the process
being modeled. This is the third row of the table. For example, we might be modeling the
purchase of refrigerators. If the purchase price of a refrigerator is never below $200,
then in our sample we will never see a household that would have liked to purchase a
refrigerator for $175.
Nature of Problem                                Cross Section                          Time Series

How the data come to be missing                  Respondent fails to answer question    Model needs monthly, we have quarterly

Missing data unrelated to complete observations  Lowers efficiency                      Lowers efficiency

Gaps in data systematically related to           See any text on censored and/or truncated data.
behavior to be modeled                           The classic estimator for this genre is Tobit.
In the next table we categorize the missing data problems we can deal with at this
point. Basically, you can be missing data for either the dependent variable or for some of
the independent variables. The problem, or question, is how to leverage the data that one
does have in order to increase the efficiency of the estimator.
         y_{a}, x_{a}    n_{a} complete observations

Case 1   ___, x_{b}      n_{b} missing observations on y_{b}

Case 2   y_{c}, ___      n_{c} missing observations on x_{c}
CASE 1 Missing observations for the dependent variable

Let ŷ_{b} be a predictor for y_{b}, so the least squares estimator for the coefficient
vector in this filled data set is

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'y_{a} + x_{b}'ŷ_{b}).

In general we can write the normal equations as x'y = x'xb, so make this substitution for
the data-on-hand estimator, b_{a}, and the missing-data estimator, b_{b}:

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'x_{a}b_{a} + x_{b}'x_{b}b_{b}).

Define F = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{a}'x_{a},
so I - F = I - [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{a}'x_{a},
or I - F = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}x_{b}'x_{b},
since I = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}[x_{a}'x_{a} + x_{b}'x_{b}].
Then the filled data estimator is b_{f} = Fb_{a} + (I - F)b_{b}, with
Eb_{f} = Fβ + (I - F)Eb_{b}.
Unbiasedness is seen to depend on Eb_{b}.
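The weighting identity can be confirmed numerically. Below is a minimal sketch in numpy; the design matrices, seed, and the predictor used for the missing y_{b} are all hypothetical, chosen only to exercise the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical blocks: 8 complete observations (a), 5 with y missing (b)
Xa = np.column_stack([np.ones(8), rng.normal(size=8)])
Xb = np.column_stack([np.ones(5), rng.normal(size=5)])
ya = Xa @ np.array([1.0, 0.5]) + rng.normal(size=8)
yb_hat = rng.normal(size=5)              # any predictor for the missing y_b

# Least squares on the filled data set
XtX = Xa.T @ Xa + Xb.T @ Xb
b_f = np.linalg.solve(XtX, Xa.T @ ya + Xb.T @ yb_hat)

# Block estimators and the weight matrix F = (Xa'Xa + Xb'Xb)^{-1} Xa'Xa
b_a = np.linalg.solve(Xa.T @ Xa, Xa.T @ ya)
b_b = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb_hat)
F = np.linalg.solve(XtX, Xa.T @ Xa)

# b_f is the matrix-weighted average F b_a + (I - F) b_b
assert np.allclose(b_f, F @ b_a + (np.eye(2) - F) @ b_b)
```

Whatever predictor is slotted in for y_{b}, the filled estimator is this matrix-weighted average of b_{a} and b_{b}.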
Method 1
Let i be a column vector of ones. For the missing data estimator we will replace each
missing observation on y by the mean of the observations on y_{a}. Then ŷ_{b} = iȳ_{a},
where ȳ_{a} = i'y_{a}/n_{a}, so b_{b} = (x_{b}'x_{b})^{-1}x_{b}'iȳ_{a}. Note
x_{b}'i = n_{b}x̄_{b}, where x̄_{b} is a
column vector of variable means. Making the obvious substitution we get
b_{b} = n_{b}ȳ_{a}(x_{b}'x_{b})^{-1}x̄_{b}.
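As a quick numerical check on Method 1, here is a sketch that uses the small data set from the example further below, treating the last four y's as missing; the variable names are ours:

```python
import numpy as np

# Data from the example table later in this section
y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])
X = np.column_stack([np.ones(10), x])

# Method 1: replace the last four (treated-as-missing) y's with the
# mean of the six observed y's.
y_fill = np.concatenate([y[:6], np.full(4, y[:6].mean())])

b_a = np.linalg.solve(X[:6].T @ X[:6], X[:6].T @ y[:6])   # data-on-hand
b_f = np.linalg.solve(X.T @ X, X.T @ y_fill)              # mean-filled

# Unlike Method 2 below, mean filling changes the estimate: b_f != b_a.
assert not np.allclose(b_f, b_a)
```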
Method 2
Get b_{a} = (x_{a}'x_{a})^{-1}x_{a}'y_{a}, then build ŷ_{b} = x_{b}b_{a} and

b_{b} = (x_{b}'x_{b})^{-1}x_{b}'ŷ_{b} = (x_{b}'x_{b})^{-1}x_{b}'x_{b}b_{a} = b_{a}.

Also, b_{f} = Fb_{a} + (I - F)b_{b} = Fb_{a} + (I - F)b_{a} = b_{a}.

Now for the variance. We have

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}(x_{a}'y_{a} + x_{b}'x_{b}b_{a}).

Substitute in for b_{a} and factor out x_{a}'y_{a}:

b_{f} = [x_{a}'x_{a} + x_{b}'x_{b}]^{-1}[I + x_{b}'x_{b}(x_{a}'x_{a})^{-1}]x_{a}'y_{a}.

If you multiply x_{a}'x_{a} through to the right then the last pair of square brackets
becomes [x_{a}'x_{a} + x_{b}'x_{b}](x_{a}'x_{a})^{-1}, so it cancels against the leading
inverse. Now factor (x_{a}'x_{a})^{-1} out of the remaining terms and one is left with
b_{f} = (x_{a}'x_{a})^{-1}x_{a}'y_{a} = b_{a}. Substituting in y_{a} = x_{a}β + u_{a},
postmultiplying b_{f} - β by its transpose, and taking expectations gives the variance of
our pooled estimator:

Var(b_{f}) = E[(x_{a}'x_{a})^{-1}x_{a}'u_{a}u_{a}'x_{a}(x_{a}'x_{a})^{-1}] = s^{2}(x_{a}'x_{a})^{-1}.

There is no gain in efficiency from using the filled estimator. In practice, because one
must estimate s^{2}, it will appear that the estimated standard errors of the filled
estimator are smaller. Do you see why this is so? It depends only on the number of
observations the computer uses in the calculated estimate of s^{2}.
EXAMPLE: Left Hand Side Missing Data
Suppose that the entire correct data set is shown in the following table.
Y  x 
0  0 
2  1 
1  2 
3  1 
1  0 
3  3 
4  4 
2  2 
1  2 
2  1 
If we use all of the observations to estimate a simple linear model then the result, with
t-statistics in parentheses, is

y = .8333 + .666x + e        n = 10
   (1.75)   (2.81)
But suppose that for some reason we do not observe the last four data points for y. Using the abbreviated data set our regression result is

y = .9268 + .6341x + e       n = 6
   (1.43)   (1.55)
Using the coefficient estimates from this abbreviated data set we construct fitted values for the last four observations on y, bringing our total number of observations back up to ten. Running the regression with the six observed values of y and the four fitted values gives us the following result

y = .9268 + .6341x + e       n = 10
   (2.32)   (3.179)
Notice how much higher the t-statistics are than those in the regression
when we had all of the correct data. This is contrary to the derived result: we had
asserted above that our filled estimator was no more efficient. This is another case of
the computer not really knowing what is going on. It must compute an estimate of the error
variance from the observed data for use in an estimate of the variance of the coefficient
estimator. In our filled data set the last four observations contribute nothing to the sum
of squared errors, but the divisor in the error-variance estimate is based on all ten
observations. The consequence is that the estimate of the error variance is understated.
The implication is that the careless researcher will use this filled estimator, which is
no more efficient than the estimator based on the abbreviated data set, fail to question
the very high observed t-statistics, and reject too many null hypotheses.
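The arithmetic of this example is easy to reproduce. Here is a sketch in numpy, assuming only the data table above; the `ols` helper is our own, not a library routine:

```python
import numpy as np

# Data from the table above
y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])
X = np.column_stack([np.ones(10), x])

def ols(y, X):
    """OLS coefficients and conventional standard errors."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])      # SSE / (n - k)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

b_a, se_a = ols(y[:6], X[:6])               # abbreviated data set, n = 6
y_fill = y.copy()
y_fill[6:] = X[6:] @ b_a                    # fitted values replace the missing y's
b_f, se_f = ols(y_fill, X)                  # filled data set, n = 10

# Both fits give slope .6341; the coefficients are identical, but the
# filled regression reports smaller standard errors because the fitted
# observations add degrees of freedom without adding squared error.
assert np.allclose(b_f, b_a)
assert np.all(se_f < se_a)
```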
CASE 2 Missing observations on the right hand side
Simple Regression
We have the simple regression model y_{a} = a + bx_{a} + u_{a} and the additional n_{c}
observations on y_{c}, but the observations on x_{c} are missing. One
possibility is to use the mean of the observed x's, x̄_{a}, to fill in the blanks. This
won't work: the filled observations have zero deviation from the sample mean of x, so
these terms get zero weight in the least squares slope estimator. We would gain nothing
for our effort.
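The zero-weight point can be checked with the data from the earlier example, treating the last four observations on x as missing; this is a sketch, and the helper name is ours:

```python
import numpy as np

y = np.array([0., 2, 1, 3, 1, 3, 4, 2, 1, 2])
x = np.array([0., 1, 2, 1, 0, 3, 4, 2, 2, 1])

# Suppose the last four observations on x are missing.
x_a, y_a = x[:6], y[:6]

def slope(x, y):
    """Least squares slope of y on x."""
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()

# Fill the missing x's with the mean of the observed x's.
x_fill = np.concatenate([x_a, np.full(4, x_a.mean())])

# The filled observations have zero deviation from the mean of x,
# so they get zero weight: the slope is unchanged from the n = 6 fit.
assert np.isclose(slope(x_fill, y), slope(x_a, y_a))
```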
Another possibility is to use the reverse regression x_{a} = c + dy_{a} + v_{a} to
construct estimates ĉ and d̂, then build a set of fitted values for x_{c} according to
x̂_{c} = ĉ + d̂y_{c}. Now apply OLS to the filled model, using the n_{a} observed and
n_{c} fitted values of x. The sampling distribution of the OLS estimator for the
parameter vector is unknown in this scheme, so we can say nothing about unbiasedness and
efficiency.
EXAMPLE: Missing data on the right hand side
For this example we use the same data as in the previous example. If we run the simple regression with y on the right hand side, x on the left hand side, and using only the first six observations, then we get the following result

x = .7633 + .3817y + e       n = 6
   (.24)    (1.55)
Using these results to fill in the missing values of x and applying OLS we get

y = .7670 + .8685x + e       n = 10
   (1.59)   (2.90)

Notice that the slope estimate differs from the one based on the complete data, as it did
in the prior example. The direction of the change is coincidence: it depends on the
realization of the data which is available to us. That is, it is sample specific. Also,
the t-statistic on the slope estimate is higher. This is not coincidence, for the same
reason given in the earlier example.
Multiple Regression
Given the available data we have the following model

y_{a} = β_{1} + β_{2}x_{a} + β_{3}z_{a} + u_{a}    (1)

for which we have n_{a} observations. We have another n_{c} observations on y_{c} and z_{c}.
From the additional observations we might fill in for the missing x_{c}. We could use the first n_{a} observations on x and z in the following equation

x_{a} = γ_{1} + γ_{2}z_{a} + v_{a}    (2)

In the last n_{c} observations we would then substitute in for x

y_{c} = β_{1} + β_{2}x̂_{c} + β_{3}z_{c} + u_{c},  where x̂_{c} = γ̂_{1} + γ̂_{2}z_{c}    (3)

The model then consists of the equations (1), (2), and (3). To operationalize it,
estimate (2) by OLS on the first n_{a} observations, build the fitted values x̂_{c}, and
apply OLS to the stacked sample made up of (1) and (3). This method allows us to use more
information about the dependent variable and z in our estimation of the β's. This should
provide us with a more efficient estimator.
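In outline, the two-step scheme of equations (1)-(3) might look like this on simulated data; all numbers, seeds, and names here are hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_c = 12, 3
n = n_a + n_c

# Simulated data: z is always observed, x is missing for the last n_c rows.
z = rng.normal(size=n)
x = 0.5 + 0.8 * z + rng.normal(scale=0.2, size=n)
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(scale=0.3, size=n)

# Equation (2): regress x on z over the n_a complete observations.
G = np.column_stack([np.ones(n_a), z[:n_a]])
gamma, *_ = np.linalg.lstsq(G, x[:n_a], rcond=None)

# Equation (3): fitted values stand in for the missing x's.
x_fill = np.concatenate([x[:n_a], gamma[0] + gamma[1] * z[n_a:]])

# Equation (1) estimated on the filled sample of n observations.
X = np.column_stack([np.ones(n), x_fill, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```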
EXAMPLE: Right Hand Side Missing Data: Multiple
Regression
We are given the following aggregate data for Taiwan. One could estimate a production function using these data.
YEAR   CAPITAL  LABOR   OUTPUT
1958  17804.  275.50  16608. 
1959  18097.  274.40  17511. 
1960  18272.  269.70  20171. 
1961  19167.  267.00  20933. 
1962  19648.  267.80  20406. 
1963  20804.  275.00  20832. 
1964  22077.  283.00  24806. 
1965  23445.  300.70  26466. 
1966  24939.  307.50  27403. 
1967  26714.  303.70  28629. 
1968  29958.  304.70  29905. 
1969  31586.  298.60  27508. 
1970  33475.  295.50  29036. 
1971  34822.  299.00  29282. 
1972  41794.  288.10  31536. 
Using all of the data one obtains

Q = 1.6900 + 2.454L + .4326K + e
   (2.97)    (4.06)   (5.48)
But suppose the last three observations on capital are missing. Using just the first twelve data points we observe

Q = 1.763 + 2.1193L + .5406K + e
   (1.76)   (2.06)    (2.39)
If we use the first 12 observations on capital and labor to estimate a linear regression
the result is

K = 2.926 + 3.8273L + e
   (3.35)   (4.95)
The R^{2} is .71 and both coefficients are significant; L does quite a good job
explaining capital. Use these coefficients to construct a fitted series for capital so
that we have a total of 15 observations and run the original model again.
Q = 1.7610 + 2.4428L + .5406K + e
   (1.40)    (1.58)    (1.57)
The coefficient on labor does not change a great deal. Compared to the complete correct
data set, the coefficient on K has changed by 25%; this is a big change. Compared to
either the full or abbreviated data set, the estimated coefficient standard errors have
apparently gotten larger and the t-statistics have fallen.