Asymptotic Theory, Order in Probability
and Laws of Large Numbers
Notation
$\Omega$ is the set of all possible outcomes, or the sample space. For example, in flipping two coins the sample space consists of $\{H_1 H_2,\; H_1 T_2,\; T_1 H_2,\; T_1 T_2\}$,
where the subscripts index the two distinct coins.
$\omega$ is a particular outcome. For example, $H_1 H_2$ in the coin toss experiment.
$x(\omega)$ is a function that assigns a numerical result to the outcome $\omega$. For example, it might be the number of heads in two coin tosses.
$\{x_i(\omega)\}_{i=1,2,\ldots,T}$ is a sequence of random variables which assigns, e.g., the proportion of heads in i tosses of a fair coin. Other examples include the sample mean or variance, or a regression coefficient.
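As a concrete numerical sketch of this notation (a fair coin simulated with NumPy; the seed and number of tosses are arbitrary choices, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
    T = 10
    omega = rng.integers(0, 2, size=T)      # one outcome w: a record of T coin flips (1 = heads)

    # x_i(w): the proportion of heads in the first i tosses, i = 1, ..., T
    x = np.cumsum(omega) / np.arange(1, T + 1)
    print(x)                                # the realized sequence {x_i(w)}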
Convergence Almost Surely
Consider the sequence $x_1(\omega), x_2(\omega), \ldots, x_j(\omega), \ldots, x_T(\omega)$, where $x_j(\omega)$ is the sample mean computed from a sample of size j.
Suppose $E\,x_j(\omega) = \mu$. We know intuitively that for large enough T the sample mean approaches the population mean.
That is, by choosing a large T we can make $|x_T(\omega) - \mu|$ arbitrarily small, say less than $\varepsilon$. This property of the random sequence can be written
$|x_T(\omega) - \mu| < \varepsilon \quad \text{for all } T \ge T_0.$
There are a large number of outcomes, or sequences, for which this inequality remains true. Let us denote them
$A = \left\{ \omega : |x_T(\omega) - \mu| < \varepsilon \text{ for all } T \ge T_0 \right\}.$
To illustrate how the set A is constructed we offer the following thought experiment. Line
up a sequence of barrels from here to the North Pole. The barrel closest to us is filled
with random sequences of size one from which we can construct a sample mean. The next
barrel is filled with random sequences of size two from which we can construct sample
means. The next barrel is filled with random sequences of size three from which we can
construct sample means. And so on all the way to the North Pole, and beyond. If you draw a
random sequence from the barrel closest to you it is quite likely that the mean will
differ from the population mean by more than $\varepsilon$. On the other
hand, if you draw a sequence from the barrel closest to the North Pole, a very large
sequence, it is very likely that the sample mean will be within $\varepsilon$
of the population mean. If the sequence you drew and its associated random variable, the
sample mean, falls close to the population mean then it belongs to the set A.
Now ask yourself, what is the probability of drawing a sequence from the barrel closest to
the North Pole, or beyond, and getting a sample mean that qualifies the sequence for
membership in A? Or, similarly, what proportion of the sample space belongs to A? Or, what
is the probability of the intersection of events that defines A? The answer is
$P(A) = P\!\left( \bigcap_{T \ge T_0} \left\{ \omega : |x_T(\omega) - \mu| < \varepsilon \right\} \right) = 1 - P\!\left( \bigcup_{T \ge T_0} \left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\} \right).$
The evaluation of this probability provides our first definition.
DEFINITION Convergence Almost Surely
$x_T(\omega)$ converges almost surely to $\mu$ if, for every $\varepsilon > 0$,
$\lim_{T_0 \to \infty} P\!\left( \bigcup_{T \ge T_0} \left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\} \right) = 0;$
we write this as $x_T(\omega) \xrightarrow{a.s.} \mu$.
If $x_T(\omega)$ is a vector random variable then we change the norm to the Euclidean distance $\lVert x_T(\omega) - \mu \rVert$.
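The definition can be illustrated by simulation; the following sketch approximates the tail supremum over a finite horizon only, so it is an illustration rather than a proof (the fair-coin setup, tolerance, and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n_sequences, horizon, eps = 2000, 5000, 0.05
    mu = 0.5                                   # population mean of a fair-coin indicator

    flips = rng.integers(0, 2, size=(n_sequences, horizon))
    means = np.cumsum(flips, axis=1) / np.arange(1, horizon + 1)

    for T0 in (10, 100, 1000):
        # fraction of sequences whose sample mean strays more than eps from mu at some T >= T0
        tail_dev = np.abs(means[:, T0 - 1:] - mu).max(axis=1)
        print(T0, np.mean(tail_dev > eps))     # should fall toward zero as T0 grows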
Convergence in Probability
Now let us consider a single one of these events, $\left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\}$, and pose
the question: how does $P\!\left( |x_T(\omega) - \mu| > \varepsilon \right)$ behave as T grows?
Suppose we consider the sample mean problem again and look at $P\!\left( |\bar{x}_T - \mu| > \varepsilon \right)$.
The normally distributed random variable x has a mean of 10 and a variance of 36, so the sample mean of T observations is distributed $N(10, 36/T)$. Suppose we initially use a sample of size nine and choose $\varepsilon$ to be 2. Then
$P\!\left( |\bar{x}_9 - 10| > 2 \right) = P\!\left( |z| > 2/\sqrt{36/9} \right) = P(|z| > 1) \approx 0.317.$
Suppose now we increase the sample size to 81. Then
$P\!\left( |\bar{x}_{81} - 10| > 2 \right) = P\!\left( |z| > 2/\sqrt{36/81} \right) = P(|z| > 3) \approx 0.003.$
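These two probabilities can be checked directly; a minimal sketch using scipy.stats (assuming SciPy is available):

    from scipy.stats import norm

    mu, sigma2, eps = 10.0, 36.0, 2.0
    for n in (9, 81):
        sd = (sigma2 / n) ** 0.5               # standard deviation of the sample mean
        p = 2 * (1 - norm.cdf(eps / sd))       # P(|sample mean - mu| > eps)
        print(n, round(p, 4))                  # approx. 0.3173 for n = 9, 0.0027 for n = 81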
We see that as the sample size is allowed to get large, the probability that the sample mean differs from the population mean gets smaller. Therefore, we arrive at the following definition.
DEFINITION Convergence in Probability
A sequence $x_T(\omega)$ converges in probability to $\mu$ if the probability of the event $\left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\}$ can be made arbitrarily small for a large enough T. That is, for every $\varepsilon > 0$,
$\lim_{T \to \infty} P\!\left( |x_T(\omega) - \mu| > \varepsilon \right) = 0;$
we write $x_T(\omega) \xrightarrow{p} \mu$.
Note the important difference from the concept of convergence almost surely. For
convergence in probability, each of the sets of events $\left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\}$
has arbitrarily small probability, whereas in convergence almost surely it was the union of these sets of events that had to have arbitrarily small probability.
THEOREM
If $x_T(\omega) \xrightarrow{a.s.} \mu$ then $x_T(\omega) \xrightarrow{p} \mu$.
Proof:
Define $B_T = \left\{ \omega : |x_T(\omega) - \mu| > \varepsilon \right\}$. If $x_T(\omega) \xrightarrow{a.s.} \mu$ then $P\!\left( \bigcup_{T \ge T_0} B_T \right) \to 0$ as $T_0 \to \infty$, and we can say that the probability of one set of events is less than the
probability of the union of all such sets:
$P(B_T) \le P\!\left( \bigcup_{T \ge T_0} B_T \right).$
Therefore $P\!\left( |x_T(\omega) - \mu| > \varepsilon \right) \to 0$ as $T \to \infty$, which is convergence in probability.
The converse is not necessarily true. However, any sequence which converges in probability contains a subsequence that converges almost surely.
DEFINITION
If $x_T(\omega) \xrightarrow{p} k$, where k is a constant, then k is
termed the probability limit of $x_T(\omega)$ and we denote
it as $\operatorname{plim}(x_T(\omega)) = k$.
SLUTSKY'S THEOREM
If g(x) is a continuous function then plim(g(xT)) = g(plim(xT)).
Although we state Slutsky's Theorem without proof, we note that it is used to prove that
estimators are consistent.
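A small simulation sketch of Slutsky's Theorem, using the continuous function g(x) = exp(x) applied to a sample mean (the function, parameters, and seed are chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    mu = 1.0
    for T in (10, 1000, 100_000):
        xbar = rng.normal(mu, 1.0, size=T).mean()
        # g(xbar) settles near g(plim xbar) = exp(mu) as T grows
        print(T, np.exp(xbar), np.exp(mu))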
DEFINITION
Suppose $x_T(\omega)$ is used to estimate the
parameter $\theta$. If $\operatorname{plim}(x_T(\omega)) = \theta$ then the estimator is said to be consistent.
The notion of consistency can be seen in the following diagram:
[Diagram: sampling distributions of the sample mean, becoming more tightly concentrated about the true mean as the sample size increases.]
As can be seen in the diagram, increasing the sample size tightens the distribution of
the sample mean about the true mean. By choosing a large enough sample size we can make
the probability that $\bar{x}_T$ differs from $\mu$ by more than some arbitrarily small amount $\varepsilon$ as small as we like.
DEFINITION
$\{x_t(\omega)\}$ converges in rth moment to
the random variable x if
(1) $E\left[ |x_t|^r \right]$ exists for all t,
(2) $E\left[ |x|^r \right]$ exists,
and if $\lim_{t \to \infty} E\!\left[ |x_t(\omega) - x|^r \right] = 0$.
DEFINITION
When r = 2 we refer to the above limit as convergence in quadratic mean. It is
written as $x_t \xrightarrow{q.m.} x$.
THEOREM
If $x_t \xrightarrow{q.m.} x$ then $x_t \xrightarrow{p} x$.
This theorem is proved using Chebyshev's Inequality. When x is a constant this is a specific form of the weak law of large numbers, and convergence in quadratic mean is a sufficient condition to prove consistency.
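For the case where x is a constant $\mu$, the Chebyshev step can be sketched as follows. Chebyshev's Inequality gives
$P\!\left( |x_t - \mu| \ge \varepsilon \right) \le \frac{E(x_t - \mu)^2}{\varepsilon^2},$
so if $E(x_t - \mu)^2 \to 0$ (convergence in quadratic mean) then the right-hand side goes to zero for every $\varepsilon > 0$, which is convergence in probability. For the sample mean of iid observations, $E(\bar{x}_T - \mu)^2 = \sigma^2/T \to 0$, which delivers the weak law of large numbers.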
DEFINITION
The sequence of functions $\{F_T\}$ converges to the function F if and only
if for all x in the domain of F and for every $\varepsilon > 0$ there
exists a $T_0$ such that $|F_T(x) - F(x)| < \varepsilon$ for all $T \ge T_0$. This is denoted $F_T \to F$.
DEFINITION
The sequence of random variables $\{x_T\}$ with corresponding distribution
functions $\{F_T(x)\}$ is said to converge in distribution (converge in law) to the
random variable x with distribution function F(x) if and only if $F_T \to F$ at all continuity points of F. We write either $x_T \xrightarrow{d} x$ or $F_T \to F$.
DEFINITION: Characteristic Function
Suppose x is a random variable with density function f(x); then its characteristic
function is
$\phi_x(t) = E\!\left[ e^{itx} \right] = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx.$
The relationship between the characteristic function and moments about the origin is
$\left. \frac{d^j \phi_x(t)}{dt^j} \right|_{t=0} = i^j\, E\!\left[ x^j \right].$
We will use this definition in a later proof of a central limit theorem.
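As a worked example (the standard normal case, which is the one that appears in the proof below): for $x \sim N(0,1)$, completing the square in the integral gives
$\phi_x(t) = \int_{-\infty}^{\infty} e^{itx}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = e^{-t^2/2}.$
Differentiating at the origin, $\phi_x'(0) = 0 = i\,E[x]$ and $\phi_x''(0) = -1 = i^2 E[x^2]$, so the moments $E[x] = 0$ and $E[x^2] = 1$ are recovered from the derivatives of the characteristic function.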
THEOREM
Suppose that $\phi(t)$, continuous at 0, is the
characteristic function of F(x). Let $\{x_T\}$ be a sequence of random variables
with characteristic functions $\phi_T(t)$. Then $F_T \to F$ at all continuity points of F (that is, $x_T \xrightarrow{d} x$) if and only if $\phi_T(t) \to \phi(t)$ for every t.
Summary of Convergence Concepts
Convergence almost surely implies convergence in probability, which in turn implies convergence in distribution; convergence in quadratic mean also implies convergence in probability. The reverse implications do not hold in general.
Review of Order in Probability: O and o
We will need to do some expansions in order to proceed with our asymptotic work. You may recall from your calculus course that when doing a Taylor series expansion there is always a remainder. In our work the remainder might turn out to be a nuisance. In order to justify dropping the remainder term we have to have a rule for determining and indicating how large the term is that has been dropped.
DEFINITION
Let $\{a_t\}$ and $\{b_t\}$ be sequences of real variables and positive
real variables, respectively. Then $a_t$
is of smaller order than $b_t$, denoted
by $a_t = o(b_t)$,
if $\lim_{t \to \infty} \dfrac{a_t}{b_t} = 0$.
DEFINITION
Let $\{a_t\}$ and $\{b_t\}$ be sequences of real variables and positive
real variables, respectively. Then $a_t$ is at most of order $b_t$,
denoted by $a_t = O(b_t)$, if there exists a positive number M
such that $\dfrac{|a_t|}{b_t} \le M$
for all t.
Examples
(1) Consider the sequence $a_n = 4 + n - 3n^2$ and the sequence $b_n = n^2$.
The ratio $|a_n|/b_n = |4/n^2 + 1/n - 3|$ is bounded by $M = 3$ for all n, so $a_n = O(n^2)$, or $a_n$ is at most of order $n^2$.
(2) Again $a_n = 4 + n - 3n^2$ and now $b_n = n^3$.
The ratio $a_n/b_n$ goes to zero as $n \to \infty$, so $a_n = o(n^3)$; $a_n$ is of smaller order than $n^3$.
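Both examples can be checked numerically; a short sketch:

    import numpy as np

    n = np.arange(1, 10_001, dtype=float)
    a = 4 + n - 3 * n**2

    print(np.abs(a / n**2).max())   # stays below M = 3 for every n, so a_n = O(n^2)
    print(np.abs(a / n**3)[-1])     # ratio a_n / n^3 is already tiny by n = 10,000, so a_n = o(n^3)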
DEFINITION
Let $\{y_t\}$ be a sequence of random variables and $\{a_t\}$ be a
sequence of nonstochastic, positive real numbers. Then $y_t$ is of smaller order
in probability than $a_t$, denoted by $y_t = o_p(a_t)$,
if $\operatorname{plim}\left( y_t / a_t \right) = 0$.
DEFINITION
Let $\{y_t\}$ be a sequence of random variables and $\{a_t\}$ be a
sequence of nonstochastic, positive real numbers. Then $y_t$ is at most of order
in probability $a_t$, denoted by $y_t = O_p(a_t)$,
if for every $\varepsilon > 0$ there is a positive M such that $P\!\left( \left| y_t / a_t \right| > M \right) \le \varepsilon$ for all t.
Examples
(1) Assume $x_t \sim N(0, \sigma^2)$ and consider
the sequence $y_t = x_t/t$. It should be obvious that $E\,y_t = 0$
and $\operatorname{Var}(y_t) = \sigma^2/t^2$, and we
can see that $\operatorname{Var}(y_t) \to 0$ as $t \to \infty$, so $\operatorname{plim}(y_t) = 0$.
Therefore $x_t$ is of smaller order in probability than t; that is, $x_t = o_p(t)$.
(2) Assume $x_t \sim N(\mu, \sigma^2)$
and consider the sequence $\bar{x}_T - \mu$, where $\bar{x}_T$ is the sample mean of the first T observations. From a version
of the central limit theorem we know
$\sqrt{T}\,(\bar{x}_T - \mu) \xrightarrow{d} N(0, \sigma^2),$
from which we conclude that $\bar{x}_T - \mu = O_p\!\left( T^{-1/2} \right)$.
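A simulation sketch of the $O_p(T^{-1/2})$ rate (the mean, variance, and seed are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma = 5.0, 2.0
    for T in (100, 10_000, 1_000_000):
        xbar = rng.normal(mu, sigma, size=T).mean()
        # |xbar - mu| shrinks roughly like 1/sqrt(T), while sqrt(T)*(xbar - mu) stays of moderate size
        print(T, abs(xbar - mu), np.sqrt(T) * (xbar - mu))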
Central Limit Theorems
THEOREM
Let $\{x_t\}$ be a sequence of iid rv's with $E(x_t) = \mu$ and $\operatorname{Var}(x_t) = \sigma^2$.
Then
$z_T = \frac{\sqrt{T}\,(\bar{x}_T - \mu)}{\sigma} \xrightarrow{d} N(0,1), \qquad \text{where } \bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t.$
Proof:
Let $y_t = (x_t - \mu)/\sigma$; then $z_T = T^{-1/2} \sum_{t=1}^{T} y_t$. The characteristic function of $z_T$ will be given by
$\phi_{z_T}(t) = E\!\left[ \exp\!\left( i t\, T^{-1/2} \sum_{s=1}^{T} y_s \right) \right] = \prod_{s=1}^{T} E\!\left[ \exp\!\left( i t\, y_s / \sqrt{T} \right) \right] = \left[ \phi_y\!\left( t/\sqrt{T} \right) \right]^{T}.$
The bracketed portion is nothing more than the characteristic function of y evaluated at $t/\sqrt{T}$. Continuing to substitute in from the definition:
NOTE: A Taylor series approximation at t = 0 gives
$\phi_y(h) = 1 + i h\, E(y) - \frac{h^2}{2} E(y^2) + o(h^2).$
Making use of this observation, and $E(x - \mu) = 0$ and $E(x - \mu)^2 = \sigma^2$ (so that $E(y) = 0$ and $E(y^2) = 1$),
we have
$\phi_{z_T}(t) = \left[ 1 - \frac{t^2}{2T} + o\!\left( T^{-1} \right) \right]^{T};$
expanding by the binomial theorem we get
$\phi_{z_T}(t) = \sum_{j=0}^{T} \binom{T}{j} \left( -\frac{t^2}{2T} + o\!\left( T^{-1} \right) \right)^{j}.$
The remainder terms drop out as $T \to \infty$, since $\binom{T}{j}$ is $O(T^j)$ while the $o(T^{-1})$ pieces vanish faster, and each leading term satisfies $\binom{T}{j} \left( -\tfrac{t^2}{2T} \right)^{j} \to \frac{1}{j!} \left( -\tfrac{t^2}{2} \right)^{j}$. So
$\phi_{z_T}(t) \to \sum_{j=0}^{\infty} \frac{1}{j!} \left( -\frac{t^2}{2} \right)^{j} = e^{-t^2/2},$
which we all know to be the characteristic function of the N(0,1)!
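The convergence of the characteristic function can also be seen numerically. The sketch below uses iid Exp(1) observations, whose sample mean of T draws is Gamma(T, 1/T), so standardized means can be simulated directly (the distribution, seed, and argument t are illustrative only):

    import numpy as np

    rng = np.random.default_rng(5)
    t = 1.0                                                   # argument of the characteristic function
    n_rep = 200_000
    for T in (2, 10, 100, 1000):
        xbar = rng.gamma(shape=T, scale=1.0 / T, size=n_rep)  # sample means of T iid Exp(1) draws
        z = np.sqrt(T) * (xbar - 1.0)                         # standardized: mu = sigma = 1
        phi_hat = np.exp(1j * t * z).mean()                   # Monte Carlo estimate of E[exp(itz)]
        print(T, np.round(phi_hat, 3), round(np.exp(-t**2 / 2), 3))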
LINDEBERG-FELLER THEOREM
Let $\{x_k\}$ be a sequence of independent rv's with finite means $\mu_k$ and
variances $\sigma_k^2$. Let $\{a_k\}$ be a sequence of constants. Define
$c_T^2 = \sum_{k=1}^{T} a_k^2 \sigma_k^2 \qquad \text{and} \qquad z_T = \frac{1}{c_T} \sum_{k=1}^{T} a_k (x_k - \mu_k);$
then $z_T \xrightarrow{d} N(0,1)$ and $\max_{k \le T} a_k^2 \sigma_k^2 / c_T^2 \to 0$ if and only if, for every $\varepsilon > 0$,
$\lim_{T \to \infty} \frac{1}{c_T^2} \sum_{k=1}^{T} E\!\left[ a_k^2 (x_k - \mu_k)^2\, \mathbf{1}\{ |a_k (x_k - \mu_k)| > \varepsilon c_T \} \right] = 0.$
This condition states that no single term in the sequence is very important; each is asymptotically negligible relative to the sum.
Uses of Lindeberg-Feller:
(1) It can be used to show that in large samples the distribution of sample moments is
approximately normal. That is,
$\sqrt{T}\,\left( m_h' - \mu_h' \right) \xrightarrow{d} N\!\left( 0,\; \mu_{2h}' - (\mu_h')^2 \right).$
(2) It can also be used to show that functions of the sample moments converge in
distribution to the normal:
$\sqrt{T}\,\left( g(m_h') - g(\mu_h') \right) \xrightarrow{d} N\!\left( 0,\; \left[ g'(\mu_h') \right]^2 \left( \mu_{2h}' - (\mu_h')^2 \right) \right),$
where $\mu_j'$ is the jth population moment about the origin,
$m_h'$ is the hth sample moment about the origin,
and $g(\cdot)$ is a continuous, differentiable function.
THEOREM: Asymptotic Results for the Standard Linear Model
Consider $y = X\beta + u$ with $E(u|X) = 0$, $\operatorname{Var}(u|X) = \sigma^2 I$, and $n^{-1}X'X \to Q$, a finite, positive definite $k \times k$ matrix, and let $\hat{\beta} = (X'X)^{-1}X'y$. Then
1) $\operatorname{plim}\, \hat{\beta} = \beta$, so the least squares estimator is consistent;
2) if the $u_i$ are iid then $\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow{d} N\!\left( 0, \sigma^2 Q^{-1} \right)$;
3) in large samples $\hat{\beta}$ is approximately distributed as $N\!\left( \beta, \tfrac{\sigma^2}{n} Q^{-1} \right)$.
Proof:
1) We know $E(\hat{\beta} - \beta) = 0$ and $\operatorname{Var}(\hat{\beta} - \beta) = \sigma^2 (X'X)^{-1}$. We recognize the variance as a quadratic
form. The matrix in the quadratic form, $(X'X)^{-1}$, converges to $n^{-1}Q^{-1}$, and the ratio $\sigma^2/n$ goes to zero. So $\operatorname{Var}(\hat{\beta} - \beta) \to 0$ and, by
Chebyshev's inequality, $\operatorname{plim}\, \hat{\beta} = \beta$; we showed in a
previous theorem that convergence in quadratic mean is sufficient for convergence in probability.
2) Write $\sqrt{n}\,(\hat{\beta} - \beta) = (n^{-1}X'X)^{-1}\, n^{-1/2} X'u$ and note that $(n^{-1}X'X)^{-1} \to Q^{-1}$; now premultiply $n^{-1/2}X'u$ by $Q^{-1}$. For the rest of the proof consider each row of
$X'u$, which is a weighted sum of the iid $u_i$, and proceed as in our previous proof of a special C.L.T. (or apply the Lindeberg-Feller theorem) to obtain $n^{-1/2}X'u \xrightarrow{d} N(0, \sigma^2 Q)$ and hence $\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$.
3) follows from the above, since dividing the limiting result in 2) by $\sqrt{n}$ and re-centering at $\beta$ gives the approximate large-sample distribution of $\hat{\beta}$.
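A simulation sketch of results 1) and 2) for a simple two-regressor design (the design, parameter values, and seed are illustrative only, not taken from the notes):

    import numpy as np

    rng = np.random.default_rng(4)
    beta, sigma = np.array([1.0, 2.0]), 3.0

    for n in (50, 500, 50_000):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        u = rng.normal(0.0, sigma, size=n)            # iid errors with E(u|X) = 0
        y = X @ beta + u
        b = np.linalg.solve(X.T @ X, X.T @ y)         # least squares estimator
        print(n, b, np.sqrt(n) * (b - beta))          # b approaches beta; sqrt(n)(b - beta) stays O_p(1)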