Asymptotic Theory, Order in Probability
and Laws of Large Numbers


Notation
Ω is the set of all possible outcomes, or the sample space. For example, in flipping two coins the sample space consists of {H1H2, H1T2, T1H2, T1T2}, where the subscripts index the two distinguishable coins.

ω is a particular outcome. For example, H1H2 in the coin-toss experiment.

x(ω) is a function that assigns a numerical result to the outcome ω. For example, it might be the number of heads in two coin tosses.

{x_i(ω)}, i = 1, 2, ..., T, is a sequence of random variables; for example, x_i(ω) might be the proportion of heads in i tosses of a fair coin. Other examples include the sample mean or variance, or a regression coefficient.
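
To make the last item concrete, the following short simulation generates one such sequence, the running proportion of heads in i tosses of a fair coin. This is a minimal sketch assuming numpy is available; the seed and horizon are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)                  # fixes one outcome ω for reproducibility
T = 10_000
tosses = rng.integers(0, 2, size=T)             # 1 = head, 0 = tail, fair coin
x = np.cumsum(tosses) / np.arange(1, T + 1)     # x_i(ω) = proportion of heads in i tosses
print(x[9], x[99], x[999], x[9999])             # the sequence drifts toward 0.5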


Convergence Almost Surely

Consider the sequence x_1(ω), x_2(ω), ..., x_j(ω), ..., x_T(ω), where x_j(ω) is the sample mean computed from a sample of size j. Suppose E x_j(ω) = μ. We know intuitively that for large enough T the sample mean approaches the population mean. That is, by choosing a large T we can make |x_T(ω) - μ| arbitrarily small, say less than ε. This property of the random sequence can be written

        |x_T(ω) - μ| < ε   for all T ≥ T₀.

There are a large number of outcomes or sequences for which this inequality remains true. Let us denote this set

        A = { ω : |x_T(ω) - μ| < ε for all T ≥ T₀ }.

To illustrate how the set A is constructed we offer the following thought experiment. Line up a sequence of barrels from here to the North Pole. The barrel closest to us is filled with random sequences of size one, from which we can construct a sample mean. The next barrel is filled with random sequences of size two, from which we can construct sample means. The next barrel is filled with random sequences of size three, from which we can construct sample means. And so on, all the way to the North Pole and beyond. If you draw a random sequence from the barrel closest to you, it is quite likely that its mean will differ from the population mean by more than ε. On the other hand, if you draw a sequence from the barrel closest to the North Pole, a very long sequence, it is very likely that the sample mean will be within ε of the population mean. If the sequence you drew, and its associated random variable the sample mean, stays within ε of the population mean, then it belongs to the set A.
Now ask yourself, what is the probability of drawing a sequence from the barrel closest to the North Pole, or beyond, and getting a sample mean that qualifies the sequence for membership in A? Or, similarly, what proportion of the sample space belongs to A? Or, what is the probability of the union of the sets of "bad" outcomes? The answer is

        lim_{T₀→∞} P( |x_T(ω) - μ| < ε for all T ≥ T₀ ) = 1

or

        lim_{T₀→∞} P( ∪_{T≥T₀} { ω : |x_T(ω) - μ| ≥ ε } ) = 0.

The evaluation of this probability provides our first definition.


DEFINITION Convergence Almost Surely

x_T(ω) converges almost surely to μ if

        P( ω : lim_{T→∞} x_T(ω) = μ ) = 1.

We write this as

        x_T(ω) →a.s. μ.

If x_T(ω) is a vector random variable then we change the norm to Euclidean distance,

        ||x_T(ω) - μ|| = [ (x_T(ω) - μ)'(x_T(ω) - μ) ]^(1/2),

and require P( lim_{T→∞} ||x_T(ω) - μ|| = 0 ) = 1.
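
A simple way to see the definition at work is to simulate many sample-mean paths and check what fraction of them stay within ε of μ from some point T₀ onward. The following is a minimal sketch assuming numpy and fair-coin data (μ = 0.5); ε, T₀, and the number of paths are illustrative.

import numpy as np

rng = np.random.default_rng(1)
mu, eps, T, T0, n_paths = 0.5, 0.05, 5_000, 1_000, 200
tosses = rng.integers(0, 2, size=(n_paths, T))
means = np.cumsum(tosses, axis=1) / np.arange(1, T + 1)            # x_T(ω) for each path ω
inside = np.all(np.abs(means[:, T0 - 1:] - mu) < eps, axis=1)      # |x_T - μ| < ε for all T ≥ T0?
print(inside.mean())                                               # fraction of paths in the set A; approaches 1 as T0 grows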



Convergence in Probability


Now let us consider a single member of the sequence, x_T(ω), and pose the question: what happens to

        P( |x_T(ω) - μ| ≥ ε )

as T grows?

Suppose we consider the sample mean problem again and look at

        P( |x̄_T - μ| ≥ ε ).

Let the underlying random variable be normally distributed with mean 10 and variance 36, so that x̄_T ~ N(10, 36/T). Suppose we initially use a sample of size nine and choose ε to be 2. Then x̄₉ ~ N(10, 4), the standard deviation of the sample mean is 2, and

        P( |x̄₉ - 10| ≥ 2 ) = P( |z| ≥ 2/2 ) = P( |z| ≥ 1 ) ≈ 0.3173.
Suppose now we increase the sample size to 81. Then x̄₈₁ ~ N(10, 36/81) = N(10, 4/9), the standard deviation of the sample mean is 2/3, and

        P( |x̄₈₁ - 10| ≥ 2 ) = P( |z| ≥ 2/(2/3) ) = P( |z| ≥ 3 ) ≈ 0.0027.
We see that as the sample size is allowed to get large, the probability that the sample mean differs from the population mean by more than ε gets smaller. Therefore, we arrive at the following definition.
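
These two probabilities are easy to verify numerically. The following is a small sketch assuming numpy and scipy are available.

import numpy as np
from scipy.stats import norm

mu, sigma2, eps = 10.0, 36.0, 2.0
for T in (9, 81):
    sd = np.sqrt(sigma2 / T)              # standard deviation of the sample mean
    prob = 2 * norm.sf(eps / sd)          # P(|x̄_T - μ| ≥ ε) when the sample mean is normal
    print(T, round(prob, 4))              # ≈ 0.3173 for T = 9, ≈ 0.0027 for T = 81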


DEFINITION Convergence in Probability

A sequence x_T(ω) converges in probability to μ if the probability of the event |x_T(ω) - μ| ≥ ε can be made arbitrarily small for a large enough T. That is, for every ε > 0,

        lim_{T→∞} P( |x_T(ω) - μ| ≥ ε ) = 0.

We write this as x_T(ω) →p μ.

Note the important difference from the concept of convergence almost surely. For convergence in probability, each individual set of events { ω : |x_T(ω) - μ| ≥ ε } has arbitrarily small probability, whereas for convergence almost surely it is the union of all such sets beyond T₀ that has arbitrarily small probability.


Theorem

If x_T(ω) →a.s. μ then x_T(ω) →p μ.

Proof:
Define A_T = { ω : |x_T(ω) - μ| ≥ ε }. If x_T(ω) →a.s. μ then P( ∪_{T≥T₀} A_T ) → 0 as T₀ → ∞, and the probability of one set of events is no greater than the probability of the union of all such sets. Therefore

        P(A_T) ≤ P( ∪_{T≥T₀} A_T ) → 0,

so x_T(ω) →p μ.

The converse is not necessarily true. However, any sequence that converges in probability contains a subsequence that converges almost surely.


DEFINITION
If x_T(ω) →p k, where k is a constant, then k is termed the probability limit of x_T(ω) and we denote it as plim(x_T(ω)) = k.

SLUTSKY'S THEOREM

If g(x) is a continuous function then plim(g(xT)) = g(plim(xT)).
Although we state Slutsky's Theorem without proof, we note that it is used to prove that estimators are consistent.
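
As an informal check of the theorem, the following sketch (assuming numpy) evaluates g(x̄_T) = 1/x̄_T for increasing T when the data have population mean 2; the estimates settle down near g(plim x̄_T) = g(2) = 0.5. The exponential population and seed are illustrative.

import numpy as np

rng = np.random.default_rng(2)
g = lambda x: 1.0 / x                       # a continuous function of the sample mean
for T in (10, 1_000, 100_000):
    x = rng.exponential(scale=2.0, size=T)  # population mean μ = 2
    print(T, g(x.mean()))                   # g(x̄_T) → g(μ) = 0.5 as T grows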


DEFINITION
Suppose x_T(ω) is used to estimate the parameter θ. If plim(x_T(ω)) = θ then the estimator is said to be consistent.
The notion of consistency can be pictured by plotting the sampling distribution of the sample mean for successively larger samples: increasing the sample size tightens the distribution of the sample mean about the true mean. By choosing a large enough sample size we can make the probability that x̄_T differs from μ by more than any given small amount arbitrarily small, as the simulation below illustrates.
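
A minimal sketch (assuming numpy) that makes the picture concrete: for each sample size it draws many samples from a N(10, 36) population and reports the spread of the resulting sample means and the probability of missing μ by more than ε = 2. The sample sizes and replication count are illustrative.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, eps, reps = 10.0, 6.0, 2.0, 10_000
for T in (9, 81, 729):
    means = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1)
    print(T, means.std(), np.mean(np.abs(means - mu) > eps))
    # the spread of x̄_T shrinks like σ/√T and P(|x̄_T - μ| > ε) heads to zero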

DEFINITION
{x_t(ω)} converges in rth moment to the random variable x if
(1) E|x_t|^r exists for all t,
(2) E|x|^r exists,
and if

        lim_{t→∞} E| x_t(ω) - x |^r = 0.


DEFINITION
When r = 2 we refer to the above limit as convergence in quadratic mean. It is written as x_T →q.m. x.


THEOREM

If x_T(ω) →q.m. x then x_T(ω) →p x. This theorem is proved using Chebyshev's Inequality: for any ε > 0, P( |x_T - x| ≥ ε ) ≤ E( x_T - x )² / ε², and the right-hand side goes to zero under convergence in quadratic mean. When x is a constant this is a specific form of the weak law of large numbers, and convergence in quadratic mean is a sufficient condition to prove consistency.


DEFINITION
The sequence of functions {F_T} converges to the function F if and only if, for all x in the domain of F and for every ε > 0, there exists a T₀ such that |F_T(x) - F(x)| < ε for all T > T₀. This is denoted F_T → F.


DEFINITION
The sequence of random variables {x_T} with corresponding distribution functions {F_T(x)} is said to converge in distribution (converge in law) to the random variable x with distribution function F(x) if and only if F_T → F at all continuity points of F. We write either x_T →d x or x_T →L x.

DEFINITION: Characteristic Function
Suppose x is a random variable with density function f(x); then its characteristic function is

        φ_x(t) = E[ exp(itx) ] = ∫ exp(itx) f(x) dx.

The relationship between the characteristic function and moments about the origin is

        (d^r/dt^r) φ_x(t) evaluated at t = 0 equals i^r E( x^r ).
We will use this definition in a later proof of a law of large numbers.
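
As an informal numerical check (a sketch assuming numpy), the sample average of exp(itx) over many draws approximates the characteristic function; for x ~ N(0, 1) it should be close to exp(-t²/2). The sample size and evaluation points are illustrative.

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 2.0):
    phi_hat = np.mean(np.exp(1j * t * x))                    # Monte Carlo estimate of E[exp(itx)]
    print(t, phi_hat.real.round(4), np.exp(-t**2 / 2).round(4))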

THEOREM
Suppose that φ(t), continuous at 0, is the characteristic function of F(x). Let {x_T} be a sequence of random variables with characteristic functions φ_T(t). Then x_T →d x if and only if φ_T(t) → φ(t) for every t.

Summary of Convergence Concepts

The concepts are related as follows: convergence almost surely implies convergence in probability; convergence in quadratic mean implies convergence in probability (via Chebyshev's Inequality); and convergence in probability implies convergence in distribution. None of the reverse implications holds in general, although convergence in distribution to a constant does imply convergence in probability to that constant.

Review of Order in Probability: O and o

We will need to do some expansions in order to proceed with our asymptotic work. You may recall from your calculus course that when doing a Taylor series expansion there is always a remainder. In our work the remainder might turn out to be a nuisance. To justify dropping the remainder term we need a rule for determining, and indicating, how large the dropped term is.


DEFINITION
Let {a_t} and {b_t} be sequences of real numbers and positive real numbers, respectively. Then a_t is of smaller order than b_t, denoted a_t = o(b_t), if

        lim_{t→∞} a_t / b_t = 0.

DEFINITION
Let {a_t} and {b_t} be sequences of real numbers and positive real numbers, respectively. Then a_t is at most of order b_t, denoted a_t = O(b_t), if there exists a positive number M such that

        |a_t| / b_t ≤ M   for all t.


Examples

(1) Consider the sequence a_n = 4 + n - 3n² and the sequence b_n = n².
The ratio |a_n| / b_n is bounded by M = 3, so a_n = O(n²), or a_n is at most of order n².
(2) Again a_n = 4 + n - 3n², and now b_n = n³. The ratio a_n / b_n goes to zero as n → ∞, so a_n is of smaller order than n³, that is, a_n = o(n³).
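
A quick numerical check of both claims, as a sketch assuming numpy:

import numpy as np

n = np.arange(1, 1_000_001, dtype=float)
a = 4 + n - 3 * n**2
print(np.max(np.abs(a) / n**2))    # stays below M = 3, consistent with a_n = O(n^2)
print(np.abs(a[-1]) / n[-1]**3)    # a_n / n^3 is essentially zero for large n, consistent with a_n = o(n^3)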


DEFINITION
Let {y_t} be a sequence of random variables and {a_t} be a sequence of nonstochastic, positive real numbers. Then y_t is of smaller order in probability than a_t, denoted y_t = o_p(a_t), if plim( y_t / a_t ) = 0.


DEFINITION
Let {y_t} be a sequence of random variables and {a_t} be a sequence of nonstochastic, positive real numbers. Then y_t is at most of order in probability a_t, denoted y_t = O_p(a_t), if for every ε > 0 there is a positive M such that P( |y_t| / a_t > M ) ≤ ε for all t.

Examples
(1) Assume x_t ~ N(0, σ²) and consider the sequence y_t = x_t / t. It should be obvious that E y_t = 0 and Var(y_t) = σ²/t², and since the variance goes to zero we have plim(y_t) = 0. Therefore x_t is of smaller order in probability than t, that is, x_t = o_p(t).
(2) Assume x_t ~ N(μ, σ²) and consider the sequence x̄_T = (1/T) Σ_{t=1}^T x_t. From a version of the central limit theorem we know

        √T ( x̄_T - μ ) →d N(0, σ²),

from which we conclude that x̄_T - μ = O_p(T^(-1/2)).
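
The following sketch (assuming numpy) illustrates example (2): the spread of x̄_T - μ shrinks with T, but after scaling by √T it settles at σ, which is exactly what x̄_T - μ = O_p(T^(-1/2)) means. The parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, reps = 3.0, 2.0, 2_000
for T in (25, 400, 6_400):
    dev = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1) - mu   # x̄_T - μ
    print(T, dev.std(), (np.sqrt(T) * dev).std())                   # first shrinks, second stays near σ = 2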

Central Limit Theorems

THEOREM
Let {x_t} be a sequence of iid rv's with E(x_t) = μ and Var(x_t) = σ². Then

        z_T = √T ( x̄_T - μ ) / σ →d N(0, 1),

where x̄_T = (1/T) Σ_{t=1}^T x_t.
Proof:
Let z_T = √T ( x̄_T - μ ) / σ; then z_T = (1/(σ√T)) Σ_{t=1}^T (x_t - μ). The characteristic function of z_T, call it φ_T(t), will be given by

        φ_T(t) = E[ exp(it z_T) ] = E[ Π_{j=1}^T exp( it (x_j - μ)/(σ√T) ) ] = Π_{j=1}^T E[ exp( it (x_j - μ)/(σ√T) ) ],

where the last equality uses independence. Each factor is nothing more than the characteristic function of (x_j - μ) evaluated at the point t/(σ√T). Since the x_t are identically distributed, continuing to substitute in from the definition gives

        φ_T(t) = [ φ_{x-μ}( t/(σ√T) ) ]^T.

NOTE: for a random variable u with E(u) and E(u²) finite, a Taylor series approximation of its characteristic function at s = 0 gives

        φ_u(s) = 1 + is E(u) - (s²/2) E(u²) + o(s²).

Making use of this observation, and E(x - μ) = 0 and E(x - μ)² = σ², we have

        φ_T(t) = [ 1 - t²/(2T) + o(t²/T) ]^T.

Expanding by the binomial theorem we get

        φ_T(t) = Σ_{j=0}^{T} C(T, j) [ -t²/(2T) + o(t²/T) ]^j.

As T → ∞ the terms involving the o(t²/T) remainder drop out, since C(T, j) is O(T^j) while those terms are of smaller order, and C(T, j)/T^j → 1/j!, so each surviving term converges to (-t²/2)^j / j!. So

        lim_{T→∞} φ_T(t) = Σ_{j=0}^{∞} (-t²/2)^j / j! = exp( -t²/2 ),

which we all know to be the characteristic function for the N(0,1)!
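
A simulation makes the theorem tangible. The following sketch (assuming numpy) uses a skewed, non-normal population, the exponential with mean and standard deviation 1; the sample size and replication count are illustrative.

import numpy as np

rng = np.random.default_rng(6)
mu, sigma, T, reps = 1.0, 1.0, 200, 20_000       # exponential(1) has μ = σ = 1
x = rng.exponential(scale=1.0, size=(reps, T))
z = np.sqrt(T) * (x.mean(axis=1) - mu) / sigma   # z_T = √T (x̄_T - μ)/σ
print(z.mean(), z.std())                         # ≈ 0 and ≈ 1
print(np.mean(z < 1.645))                        # ≈ 0.95, matching the N(0,1) cdf at 1.645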


LINDEBERG-FELLER THEOREM
Let {x_k} be a sequence of independent rv's with finite means μ_k and variances σ_k². Let {a_k} be a sequence of constants. Define

        c_T² = Σ_{k=1}^T a_k² σ_k²   and   z_T = (1/c_T) Σ_{k=1}^T a_k ( x_k - μ_k ).

Then z_T →d N(0, 1) and max_{k≤T} a_k² σ_k² / c_T² → 0 if and only if, for every ε > 0,

        lim_{T→∞} (1/c_T²) Σ_{k=1}^T E[ a_k² (x_k - μ_k)² · 1{ |a_k (x_k - μ_k)| > ε c_T } ] = 0.

This condition states that no term in the sequence is very important: no single a_k(x_k - μ_k) contributes a non-negligible share of the total variance c_T².

Uses of the Lindeberg-Feller theorem:
(1) It can be used to show that in large samples the distribution of sample moments is approximately normal. That is,

        √T ( m_h' - μ_h' ) →d N( 0, μ_{2h}' - (μ_h')² ).

(2) It can also be used to show that functions of the sample moments converge in distribution to the normal,

        √T ( g(m_h') - g(μ_h') ) →d N( 0, [g'(μ_h')]² ( μ_{2h}' - (μ_h')² ) ),

where μ_j' is the jth population moment about the origin,
m_h' is the hth sample moment about the origin,
g(.) is a continuous, differentiable function.
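
A small simulation of use (2), as a sketch assuming numpy, with g(m) = m² applied to the first sample moment (the sample mean): √T( g(x̄_T) - g(μ) ) should be approximately normal with standard deviation |g'(μ)| σ = 2μσ. The parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(7)
mu, sigma, T, reps = 2.0, 1.0, 500, 10_000
xbar = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1)   # first sample moment
stat = np.sqrt(T) * (xbar**2 - mu**2)                       # √T ( g(m_1') - g(μ_1') ) with g(m) = m²
print(stat.std(), 2 * mu * sigma)                           # simulated sd ≈ |g'(μ)| σ = 4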

THEOREM: Asymptotic Results for the Standard Linear Model
Consider y = xβ + u with E(u|x) = 0, Var(u|x) = σ²I, and (1/n) x'x → Q, a finite, positive definite k×k matrix, and let b = (x'x)⁻¹x'y denote the least squares estimator. Then
1) b →p β, so least squares is consistent;
2) if the u_i are iid, then √n ( b - β ) →d N( 0, σ² Q⁻¹ );
3) consequently, in large samples b is approximately distributed N( β, (σ²/n) Q⁻¹ ).
Proof:
1) We know E(b - β) = 0 and

        Var(b - β) = σ² (x'x)⁻¹ = (σ²/n) [ (1/n) x'x ]⁻¹.

We recognize the variance as a quadratic form. The matrix in the quadratic form converges to Q⁻¹ and the ratio σ²/n → 0. So Var(b - β) → 0, which is convergence in quadratic mean, and by Chebyshev's inequality b →p β; we showed in a previous theorem that convergence in probability is sufficient for convergence in law.
2) Write √n ( b - β ) = [ (1/n) x'x ]⁻¹ (1/√n) x'u; since (1/n) x'x → Q, premultiplying by Q⁻¹ rather than [ (1/n) x'x ]⁻¹ does not change the limiting distribution. For the rest of the proof consider each row of x'u and proceed as in our previous proof of a special C.L.T.
3) follows from the above.
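
The following sketch (assuming numpy) illustrates results 1) and 2) for a one-regressor model: the least squares slope b settles at β as n grows, and the spread of √n(b - β) stays close to the theoretical value σ/√Q, where Q = plim (1/n) Σ x_i². The regressor distribution, β, σ, and sample sizes are illustrative.

import numpy as np

rng = np.random.default_rng(8)
beta, sigma = 1.5, 2.0
for n in (50, 5_000):
    b_draws = []
    for _ in range(2_000):
        x = rng.normal(1.0, 1.0, size=n)        # so Q = E(x²) = 2
        u = rng.normal(0.0, sigma, size=n)
        y = beta * x + u
        b_draws.append((x @ y) / (x @ x))       # least squares with a single regressor
    b_draws = np.array(b_draws)
    print(n, b_draws.mean(), (np.sqrt(n) * (b_draws - beta)).std())
    # the mean approaches β = 1.5 and the scaled spread approaches σ/√Q = 2/√2 ≈ 1.414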