## Chapter 7 Notes

Data comes to us in several different forms. Can you explain the differences?

Cardinal

Ordinal

Categorical

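| wage | educ | exper | tenure | nonwhite | female | married |
|---|---|---|---|---|---|---|
| 3.1 | 11 | 2 | 0 | 0 | 1 | 0 |
| 3.24 | 12 | 22 | 2 | 0 | 1 | 1 |
| 3 | 11 | 2 | 0 | 0 | 0 | 0 |
| 6 | 8 | 44 | 28 | 0 | 0 | 1 |
| 5.3 | 12 | 7 | 2 | 0 | 0 | 1 |
| 8.75 | 16 | 9 | 8 | 0 | 0 | 1 |
| 11.25 | 18 | 15 | 7 | 0 | 0 | 0 |
| 5 | 12 | 5 | 3 | 0 | 1 | 0 |
| 3.6 | 12 | 26 | 4 | 0 | 1 | 0 |
| 18.18 | 17 | 22 | 21 | 0 | 0 | 1 |
| 6.25 | 16 | 8 | 2 | 0 | 1 | 0 |
| 8.13 | 13 | 3 | 0 | 0 | 1 | 0 |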

Note the coding of nonwhite, female and married. Although these are categorical, they are all 'either/or' propositions.

On the other hand, a survey might ask you for the region of the country in which you reside. This cannot be answered directly in a binary fashion. The responses are inherently categorical, but not binary. Instead, we can turn the multiple categories into a set of binary responses.

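| wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | south | west |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.1 | 11 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 |
| 3.24 | 12 | 22 | 2 | 0 | 1 | 1 | 3 | 1 | 0 | 0 | 1 |
| 3 | 11 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
| 6 | 8 | 44 | 28 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 5.3 | 12 | 7 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 8.75 | 16 | 9 | 8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 11.25 | 18 | 15 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 12 | 5 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3.6 | 12 | 26 | 4 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 |
| 18.18 | 17 | 22 | 21 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 6.25 | 16 | 8 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |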

A few columns have been added to the data set. The columns northcen, south and west show place of residence. They are mutually exclusive. Individually, each of them is binary. Are they collectively exhaustive of geographic region?
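The conversion from one multi-category answer to a set of binary columns can be sketched as follows. This is a minimal illustration, not the procedure used to build the data set: the function name `region_to_dummies` and the label `"northeast"` for a region with no column of its own are assumptions made for the example. Whether such a region exists in this data is the question posed above.

```python
# Column names taken from the table above.
REGION_DUMMIES = ("northcen", "south", "west")


def region_to_dummies(region, dummies=REGION_DUMMIES):
    """Map one region label to a dict of 0/1 indicators.

    Any label that has no column of its own (here the hypothetical
    "northeast") maps to all zeros, which makes it the benchmark.
    """
    return {name: int(region == name) for name in dummies}


# "northeast" is an assumed label, used only to show the all-zero case.
for region in ("west", "south", "northeast"):
    print(region, region_to_dummies(region))
```

At most one indicator equals 1 for any respondent, which is the mutual exclusivity noted above.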

### One Dummy Independent Variable

In this country we believe that there are positive returns to education. Matt Riculate is a student interested in investigating this proposition. He plans to estimate the coefficients of the following model

His results are

Good scholars that we are, we point out to him the following facts about the American workplace:

Overall average wage = \$5.89, male average wage = \$7.10 , female average wage = \$4.58

Matt's response is 'whatever.' What do you explain to him about the whatever? After a long explanation that includes making male the benchmark, you provide the following result

wage = 0.6228 - 2.2733 female + 0.5064 educ + resid

What do you tell him about the (dis)advantage of being a female? If you were to draw a graph to illustrate the regression result, what would it look like?
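One way to approach the graph is to evaluate the fitted equation for men and women at a few education levels. The sketch below uses only the coefficients reported above. The function name `predicted_wage` and the chosen education levels are illustrative assumptions.

```python
def predicted_wage(educ, female):
    """Fitted hourly wage from: wage = 0.6228 - 2.2733 female + 0.5064 educ."""
    return 0.6228 - 2.2733 * female + 0.5064 * educ


# Education levels chosen only for illustration.
for educ in (8, 12, 16):
    male = predicted_wage(educ, female=0)
    fem = predicted_wage(educ, female=1)
    print(f"educ={educ:2d}  male={male:.2f}  female={fem:.2f}  gap={fem - male:.2f}")
```

The gap is the same at every education level, so the two fitted lines have the same slope and differ only in their intercepts.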

Suppose that the specification was changed so that the wage variable is in natural-log form. The new result is

ln(wage) = 0.8262 - 0.3608 female + 0.0772 educ + resid
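With a log dependent variable, a coefficient is only approximately a percentage change. The exact figure comes from exponentiating it. The sketch below applies that standard conversion to the two coefficients above. The function name `pct_change_from_log_coef` is an assumption made for the example.

```python
import math


def pct_change_from_log_coef(beta):
    """Exact percentage change in wage implied by a coefficient in a ln(wage) equation."""
    return 100.0 * (math.exp(beta) - 1.0)


female_gap = pct_change_from_log_coef(-0.3608)   # effect of female = 1
educ_return = pct_change_from_log_coef(0.0772)   # effect of one more year of educ

print(f"female: {female_gap:.1f}%  per year of education: {educ_return:.1f}%")
```

The approximation is close for the small education coefficient and noticeably off for the larger female coefficient.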

### Multiple Categories

One could also ask if it is the simple fact that one is female that matters, or whether being married carries a further penalty. With this in mind, there are now four states of the world that need consideration. They can be summarized in the following table:

| | married | not married |
|---|---|---|
| male | married male | single male |
| female | married female | single female |

Keeping in mind your discussion with Matt about benchmarks, how many binary dummy variables do you need in order to account for all four categories?

To see if it is even worth going down this path we can set up the average wage of each group:

| | married | single |
|---|---|---|
| male | 7.97 | 5.16 |
| female | 4.56 | 4.61 |

There is a difference between men and women regardless of marital status. The difference between married and single women is \$0.05/hour, \$0.40/day, or \$100/year. This doesn't seem like much, but as WalMart is learning, there is the principle of the thing, and it can also be indicative of wider problems. The regression model with all marital-gender groups accounted for is
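The hourly, daily, and yearly figures for married versus single women can be reproduced from the table of group averages. The 8-hour day and 250-day working year below are assumptions inferred from the ratios of the quoted figures. The notes do not state them.

```python
HOURS_PER_DAY = 8     # assumption: implied by $0.40 per day / $0.05 per hour
DAYS_PER_YEAR = 250   # assumption: implied by $100 per year / $0.40 per day

single_female_wage = 4.61
married_female_wage = 4.56

hourly_gap = single_female_wage - married_female_wage
daily_gap = hourly_gap * HOURS_PER_DAY
yearly_gap = daily_gap * DAYS_PER_YEAR

print(f"hourly={hourly_gap:.2f}  daily={daily_gap:.2f}  yearly={yearly_gap:.0f}")
```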

wage = -1.024 - 0.5567 marrfem - 0.3689 singfem + 2.6411 marrmale + 0.4935 educ + resid

The only statistically significant coefficients are educ and marrmale. What does this mean about pay differences between men and women, and their marital status?
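To make the group comparison concrete, the sketch below builds the three group dummies from the underlying female and married indicators and evaluates the fitted equation for each group. Single men have no dummy in the equation, so they are treated as the benchmark. The function names and the choice of 12 years of education are assumptions made for the example.

```python
COEF = {
    "const": -1.024,
    "marrfem": -0.5567,
    "singfem": -0.3689,
    "marrmale": 2.6411,
    "educ": 0.4935,
}


def group_dummies(female, married):
    """Build the three group indicators from two 0/1 inputs.

    Single males get all zeros, so they serve as the benchmark group.
    """
    return {
        "marrfem": female * married,
        "singfem": female * (1 - married),
        "marrmale": (1 - female) * married,
    }


def predicted_wage_by_group(female, married, educ):
    """Fitted wage from the four-group regression reported above."""
    dummies = group_dummies(female, married)
    shift = sum(COEF[name] * value for name, value in dummies.items())
    return COEF["const"] + shift + COEF["educ"] * educ


# educ = 12 is an arbitrary illustration value.
for female in (0, 1):
    for married in (0, 1):
        wage = predicted_wage_by_group(female, married, educ=12)
        print(f"female={female} married={married} wage={wage:.2f}")
```

Each coefficient on a group dummy is that group's difference from single men at the same level of education.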

### Ordinal Data

This kind of data provides a rank order. There are many banners in the halls of the Fox School of Business and Management proclaiming its rank in one survey or another. If the Fox School is ranked 23rd on some survey do we know how much better it is than the one ranked 24th, or how much worse it is than the one ranked 22nd? We can, perhaps, see if rankings matter in a measurable fashion.

| | Entire Sample | Separate Intercepts | Female | Male |
|---|---|---|---|---|
| Constant | 0.2719 (3.2) | 0.5589 (7.0) | 0.0407 (.32) | 0.5375 (5.42) |
| female | | -0.4532 (15.51) | | |
| belavg | -0.1800 (-3.9) | -0.1542 (-3.6) | -0.1257 (-1.8) | -0.1693 (-3.2) |
| abvavg | -0.0129 (-0.3) | -0.0066 (-0.2) | 0.0433 (.86) | -0.0390 (-1.0) |
| exper | 0.0483 (10.13) | 0.0408 (9.3) | 0.0298 (4.1) | 0.0504 (8.9) |
| expersq | -0.0006 (-6.5) | -0.0006 (-6.4) | -0.0004 (-2.7) | -0.0008 (-6.5) |
| educ | 0.0687 (0.0687) | 0.0663 (12.5) | 0.0786 (8.7) | 0.0609 (9.4) |
| RSS | 339.61 | 284.88 | 94.17 | 186.84 |
| Observations | 1259 | 1259 | 435 | 823 |

On the basis of the table, is it appearance, gender, or both that matters?

### Interactions between Dummies and Measured Variables

We covered interactions between dummies when we looked at wages, marital status, and gender. Such interactions permitted the creation of different intercepts for different groups, while keeping the slope constant across the groups.

Let's return to women and wages and see if the return to education is the same for men and women. We'll allow the intercept to differ for the two groups and we'll allow the slope coefficient on education to differ between the two groups, but all other slopes will remain equal. This is accomplished by creating an interaction between educ and female and including that in the regression. The result is

ln(wage) = 0.6114 - 0.7161 female + 0.0205 femeduc + 0.0611 educ + 0.0403 exper - 0.0006245 expersq + resid
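The interaction term changes both the slope on education and the size of the gender gap. The sketch below reads both off the reported coefficients. The function names are assumptions made for the example, and the calculation says nothing about whether the interaction coefficient is statistically significant.

```python
def return_to_educ(female):
    """Slope of ln(wage) with respect to educ: 0.0611 + 0.0205 * female."""
    return 0.0611 + 0.0205 * female


def log_wage_gap(educ):
    """Female minus male difference in fitted ln(wage) at a given educ,
    holding exper fixed: -0.7161 + 0.0205 * educ."""
    return -0.7161 + 0.0205 * educ


# Education level at which the fitted gap would reach zero.
breakeven_educ = 0.7161 / 0.0205

print(f"return to educ: men={return_to_educ(0):.4f}  women={return_to_educ(1):.4f}")
print(f"gap at educ=12: {log_wage_gap(12):.4f}")
print(f"gap closes at educ={breakeven_educ:.1f}")
```

The break-even education level is far outside the range of the data shown earlier, so the fitted gap stays negative at every realistic level of schooling.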

### Testing for Differences Across Groups

Can you use the information in the table of results about beauty to see if intercepts, or slopes or both are different across men and women?
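One possible route is an F test built from the residual sums of squares in the beauty table, in the style of a Chow test. The sketch below rests on several assumptions not stated in the notes: the final row of the table holds sample sizes, each single-group regression has 6 parameters (constant, belavg, abvavg, exper, expersq, educ), and the pooled sample size of 1259 is used even though the two group sizes listed sum to 1258. The helper name `f_stat` is hypothetical.

```python
def f_stat(rss_restricted, rss_unrestricted, q, df_unrestricted):
    """F statistic for q restrictions, computed from residual sums of squares."""
    numerator = (rss_restricted - rss_unrestricted) / q
    denominator = rss_unrestricted / df_unrestricted
    return numerator / denominator


RSS_POOLED = 339.61           # entire sample, one intercept, common slopes
RSS_SEP_INTERCEPTS = 284.88   # female dummy added, common slopes
RSS_FEMALE, RSS_MALE = 94.17, 186.84
N = 1259                      # assumption: pooled sample size from the table
K = 6                         # assumption: parameters per group regression

# Unrestricted model: separate regressions for women and men.
rss_separate = RSS_FEMALE + RSS_MALE
df_unrestricted = N - 2 * K

# H0: same intercept and same slopes for both groups (K restrictions).
f_all = f_stat(RSS_POOLED, rss_separate, K, df_unrestricted)

# H0: same slopes, intercepts allowed to differ (K - 1 restrictions).
f_slopes = f_stat(RSS_SEP_INTERCEPTS, rss_separate, K - 1, df_unrestricted)

print(f"F (intercept and slopes) = {f_all:.2f}")
print(f"F (slopes only)          = {f_slopes:.2f}")
```

Each statistic is then compared with a critical value from the F distribution with the matching degrees of freedom. That lookup is left out here.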

### Qualitative Dependent Variable

This is an enormous topic. A discussion of the linear probability model only scratches the surface.

Suppose that y = 1 with probability p and y = 0 with probability 1 - p. Furthermore, we think that the variable x can be used to explain y, and we have posited the following simple regression model:

$$y = \beta_0 + \beta_1 x + u$$

What can be said about the expected value of y?

| y | probability | prob × y |
|---|---|---|
| 0 | 1 - p | 0 |
| 1 | p | p |

From the table, E(y) = p. If the mean of the error is 0, then the implication is that E(y) = p = $\beta_0 + \beta_1 x$ and 1 - p = $1 - \beta_0 - \beta_1 x$, so we can write

What is the implication of this result?
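A related calculation from the same two-outcome setup is offered here as a sketch. It is not necessarily the result these notes had in mind. Because y takes only the values 0 and 1, y² = y, so the variance of y follows directly from p:

$$\operatorname{Var}(y) = E(y^2) - [E(y)]^2 = p - p^2 = p(1 - p)$$

If p changes with x, this variance changes with x as well, so the error in a linear probability model cannot have constant variance.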

Another small problem is that the linear probability model is almost surely wrong. First, the cumulative distribution function is almost certainly sigmoidal in shape. Relatedly, for small or large values of x the predicted probability can fall below zero or rise above one.
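The out-of-range problem is easy to see numerically. The sketch below uses made-up coefficients (`B0` and `B1` are hypothetical and not estimated from any data in these notes) and compares a linear prediction with the same index passed through the logistic function, one common sigmoidal choice.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the problem.
B0, B1 = 0.2, 0.1


def lpm_probability(x):
    """Linear probability model prediction: B0 + B1 * x, unbounded."""
    return B0 + B1 * x


def logistic(z):
    """Sigmoidal function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for x in (-5, 0, 5, 10):
    linear = lpm_probability(x)
    bounded = logistic(linear)
    flag = "" if 0.0 <= linear <= 1.0 else "  <- outside [0, 1]"
    print(f"x={x:3d}  linear={linear:5.2f}  logistic={bounded:.3f}{flag}")
```

In practice a logit model would estimate its own coefficients instead of reusing the linear ones. The reuse here only shows how a sigmoid keeps predictions between zero and one.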