28 October, 2015

Measure Theoretic Approach to Linear Regression

In this post we discuss the basic theory of linear regression, mostly from the perspective of probability theory.  At the end we devote some time to statistical applications.  The reason for focusing on the probabilistic point of view is to make rigorous the actual use of linear regression in applied settings; furthermore, the approach clarifies what linear regression actually is (and isn't). Indeed, there seem to be wide gaps in understanding among practitioners, probably due to the ubiquitous use of the methodology and the need for less mathematically inclined people to use it.

Regardless of the perspective, we have a probability (sample) space $(\Omega,\mathcal{F},\mathbb{P})$ and real-valued $\mathcal{F}$-measurable random variables $X$ and $Y$ defined on $\Omega$.  What differentiates the probabilistic and statistical points of view is what we know about $X$ and $Y$.  In the former, we know their distributions, i.e. all of the statistical properties of $X$ and $Y$ over their (continuous) range of values.  In the latter case we have only samples $\{(x_{n}, y_{n})\}_{n\geq0}$ of $X$ and $Y$.  In regression theory, the goal is to express a relationship between $X$ and $Y$ and associated properties, a much easier task from the probabilistic point of view since we have parametric (or at least fully determined) distributions.  From the statistical point of view, due to the lack of a complete distribution, we have to make certain modeling, distributional, and even sampling assumptions in order to make useful inferences (without these assumptions, the inferences we would be able to make would be far too weak to serve any useful purpose).  We will later discuss some of the assumptions that are typical in most regression applications.  For now, we focus on $X$ and $Y$ with known distribution measures $\mu_{X}$ and $\mu_{Y}$.




A fundamental concept in probability theory is that of conditional expectation.  To set the stage for a definition that will be appropriate for our discussion, we consider the function space $L^{2}(\mathcal{F})$ consisting of all $\mathcal{F}$-measurable random variables $Z:\Omega\mapsto\mathbb{R}$ such that
$$||Z||^{2}_{2}:=\mathbb{E}[Z^{2}]=\int_{\Omega}|Z(\omega)|^{2}\;d\mathbb{P}(\omega)<\infty.$$ We emphasize the $\sigma$-algebra $\mathcal{F}$ because, of the three components of $L^{2}(\Omega,\mathcal{F},\mathbb{P})$, it is the one of main interest in regression theory; the sample space $\Omega$ and the probability measure $\mathbb{P}$ will remain the same throughout. An important property of $L^{2}$ spaces is that they are monotonic with respect to $\sigma$-algebras:
Proposition 1 (Information Monotonicity) Let $\mathcal{G}\subset\mathcal{F}$ be a sub-$\sigma$-algebra of $\mathcal{F}$ and $L^{2}(\mathcal{G})$ be the space of all $\mathcal{G}$-measurable random variables $Z:\Omega\mapsto\mathbb{R}$ such that $||Z||_{2}<\infty$.  Then $$L^{2}(\mathcal{G})\subset L^{2}(\mathcal{F})$$ is a closed linear subspace (i.e., it is closed under linear combinations and $\overline{L^{2}(\mathcal{G})}=L^{2}(\mathcal{G})$).
Proof.  As we will see below, the crucial point is that $L^{2}(\mathcal{G})$ is closed and that it is a subspace (and not merely a subset) of $L^{2}(\mathcal{F})$.  It is clear that if $Z$ is $\mathcal{G}$-measurable, then it is $\mathcal{F}$-measurable and $||Z||_{L^{2}(\mathcal{F})}=||Z||_{L^{2}(\mathcal{G})}$, so $L^{2}(\mathcal{G})\subset L^{2}(\mathcal{F})$.  Furthermore, if $a,b\in\mathbb{R}$ and $Z_{1},Z_{2}\in L^{2}(\mathcal{G})$, then $Z:=aZ_{1}+bZ_{2}$ satisfies $Z^{-1}(A)\in\mathcal{G}$ for every Borel set $A$ and $||Z||_{2}\leq |a|\,||Z_{1}||_{2}+|b|\,||Z_{2}||_{2}<\infty$, so that $L^{2}(\mathcal{G})$ is a linear subspace. Moreover, from well-known closure theorems in measure theory, if $\{Z_{n}\}_{n}$ is a sequence of $\mathcal{G}$-measurable functions and the pointwise limit $Z=\lim_{n}Z_{n}$ exists, then $Z$ is $\mathcal{G}$-measurable.  Hence $L^{2}(\mathcal{G})$ is closed.

Remark. In the proof we did not specify in which topology the limit $\lim_{n} Z_{n}$ is to be taken. The closure theorem from measure theory referenced is typically proved for a.s. pointwise convergence (i.e., $\lim_{n} Z_{n}(\omega)=Z(\omega)$ exists for almost every fixed $\omega\in\Omega$).  However, for the proposition we clearly need closedness under $L^{2}$ convergence (i.e., $||Z_{n}-Z||_{2}\to0$ as $n\to\infty$).  It is a standard fact that if $Z_{n}\to Z$ in $L^{2}(\mathcal{F})$, then there is a subsequence $\{Z_{n_{k}}\}_{k}$ that converges a.s. pointwise to the $L^{2}(\mathcal{F})$ limit $Z$, and this, together with the closure of $\mathcal{G}$-measurability under pointwise limits, is enough to show that in fact $Z\in L^{2}(\mathcal{G})$.

The reason for our focus on $\sigma$-algebras is that we can interpret them as information, and when this information is available to us, it can be used to help us estimate a random variable (in particular, revise its distribution by assigning probability $0$ to those events excluded by the known information, and redistributing it over the remaining possible events).  The additional focus on $L^{2}$ spaces is a matter of convenience, since the set of random variables belonging to this space has a very rich structure that is easy to exploit in estimation problems (in particular, $L^{2}$ is a Hilbert space).  In any case, since $\mathbb{P}(\Omega)=1<\infty$ we have $L^{\infty}\subset L^{2}\subset L^{1}$, with $L^{2}$ dense in $L^{1}$, so the restriction to random variables in $L^{2}$ does not usually cause problems in practice (note that these are exactly the random variables with finite variance).

Our primary goal in regression is to make precise the following problem, and then find an appropriate solution:
(Estimation Problem - Imprecise Statement) For $Y\in L^{2}(\mathcal{F})$ with distribution $\mu_{Y}$, find the best estimate $\hat{Y}$ of $Y$ with distribution $\mu_{\hat{Y}}$ given a $\sigma$-algebra $\mathcal{G}\subset\mathcal{F}$.
To precisely formulate this problem, we make a definition that will occupy our attention for much of these notes:
Definition 1 (Conditional Expectation) Let $X\in L^{2}(\mathcal{F})$ and let $\mathcal{G}=\sigma(X)$, the $\sigma$-algebra generated by $X$ (i.e., $\sigma(X)=\{X^{-1}(A): A\subset\mathbb{R}\text{ Borel}\}$). Consider $Y\in L^{2}(\mathcal{F})$.  Then we define the conditional expectation of $Y$ with respect to $X$ by $$\mathbb{E}[Y|X](\omega):=(\mathrm{proj}_{L^{2}(\mathcal{G})}Y)(\omega).$$
Remark.  We make several remarks about this definition (proofs of various statements can be found in any graduate probability or analysis text).

First, the notation $\mathrm{proj}_{L^{2}(\sigma(X))}$ denotes an idempotent linear operator $P$ ($P^{k}Y=PY$ for all $k\geq1$) on the Hilbert space $L^{2}(\mathcal{F})$.  Such a linear operator is called an orthogonal projection operator, because it maps $L^{2}(\mathcal{F})$ surjectively onto the closed subspace $L^{2}(\sigma(X))$ in a linear way, and every $Y$ has the unique decomposition $Y=PY+(I-P)Y$ with $(PY,(I-P)Y)=0$ (we will return to this important decomposition into orthogonal components later).  Such an operator always exists and is unique whenever the target subspace is closed under the particular Hilbert space norm. We will discuss more properties of orthogonal projection operators and explicit computation methods later, but the reader should already be familiar with this topic from linear algebra/functional analysis.

Second, our definition is equivalent to the usual measure-theoretic definition: $\mathbb{E}[Y|X]$ is the unique (up to a set of measure zero) $\sigma(X)$-measurable random variable such that $\mathbb{E}[\mathbb{E}[Y|X]; A]=\mathbb{E}[Y; A]$ holds for every $A\in\sigma(X)$ (here $\mathbb{E}[Z;A]:=\mathbb{E}[Z\,1_{A}]$).
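
As a sanity check of this defining property, here is a minimal Monte Carlo sketch on a hypothetical model (the model is an illustrative assumption, not part of the development): take $Y=X^{2}+\varepsilon$ with $X$ standard normal and $\varepsilon$ independent mean-zero noise, so that $\mathbb{E}[Y|X]=X^{2}$, and verify numerically that $\mathbb{E}[\mathbb{E}[Y|X];A]\approx\mathbb{E}[Y;A]$ for, say, $A=\{X>1\}\in\sigma(X)$.

```python
import numpy as np

# Illustrative model (an assumption): Y = X^2 + eps, with eps independent of X
# and mean zero, so that E[Y|X] = X^2.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)
y = x**2 + rng.standard_normal(n)

cond_exp = x**2               # E[Y|X] evaluated pathwise
A = x > 1.0                   # an event in sigma(X)

# Partial-averaging (defining) property: E[E[Y|X]; A] = E[Y; A], i.e. E[. * 1_A].
print(np.mean(cond_exp * A))  # approximates E[E[Y|X] 1_A]
print(np.mean(y * A))         # approximates E[Y 1_A]; the two should nearly agree
```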

Third, we will typically be in the situation where $\sigma(X)\subset\sigma(Y)$ (for instance, this corresponds to a situation where we have a process $\{X_{t}\}_{t}$ with a corresponding filtration of sub-$\sigma$-algebras $\{\sigma(X_{t})\}_{t}$, and for $t>s$ we want to estimate $X_{t}$ given $X_{s}$, i.e., with $\sigma(X_{s})$ resolved).  In this case, $X(\omega)$ can be determined by the resolution of fewer sets than $Y(\omega)$ (knowing whether $\omega\in A$ or $\omega\notin A$ for every $A\in\sigma(X)\subset\sigma(Y)$ is enough to know the value of $X(\omega)$).  Conditional on this information, the distribution of $X$ degenerates to a Dirac measure (unit point mass) concentrated on $\{X(\omega)\}$, whereas the distribution of $Y$ changes to reflect the new information but does not necessarily degenerate to a unit point mass (indeed, if $\sigma(X)\subsetneq\sigma(Y)$, it does not).  This brings up the important point about the "probabilistic way of thinking": we view operations on and properties of random variables solely in terms of their distributions (technically, those properties invariant under probability space extensions, something we will not go into here), not their domains, and in particular not the structure of the sample space $\Omega$ or the particular arrangement of the $\sigma$-algebras involved (indeed, this is usually impossible to track once one starts studying complex random models involving many sources of randomness [random variables]).  Another typical situation is $\sigma(X)\not\subset\sigma(Y)$, $\sigma(X)\not\supset\sigma(Y)$, and $\sigma(X)\cap\sigma(Y)\neq\{\Omega,\emptyset\}$ (i.e., the intersection is not the trivial $\sigma$-algebra). This corresponds to the case where $X$ contains some information relevant to estimating $Y$ and some information that is not relevant (note that we will not generally pay much attention to $\sigma(Y)$ and will just assume $Y$ is $\mathcal{F}$-measurable, in which case we are in the $\sigma(X)\subset\mathcal{F}$ situation).  At the extreme ends we have $\sigma(X)=\sigma(Y)$ and $\sigma(X)$, $\sigma(Y)$ independent; in the former, the resolution of $\sigma(X)$ determines both $X$ and $Y$ and so there is no estimation problem, and in the latter, the resolution of $\sigma(X)$ does nothing to determine $Y$ and so no progress can be made on the estimation problem.

Fourth, we emphasize that $\mathbb{E}[Y|X]$ is defined as an orthogonal projection of $Y$ onto the closed subspace $L^{2}(\sigma(X))$, and in particular it is not defined to be the projection of $Y$ onto (the span of) $X$ (indeed, the residual from such a projection is not orthogonal to all of $L^{2}(\sigma(X))$ unless that projection coincides with $\mathbb{E}[Y|X]$, as we shall see later). Nevertheless, the projection of $Y$ onto $X$ will be an important concept in studying regression, but as we shall see, it yields (based on to-be-defined criteria) an inferior estimate of $Y$ based on the information $\sigma(X)$ generated by $X$.
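
To make the distinction concrete, here is a minimal numerical sketch on a hypothetical example (chosen for illustration, not from the text): take $X$ standard normal and $Y=X^{2}$, so that $\mathbb{E}[Y|X]=X^{2}$ exactly, while the projection of $Y$ onto the span of $X$ alone has coefficient $(X,Y)_{L^{2}}/||X||^{2}_{2}=\mathbb{E}[X^{3}]/\mathbb{E}[X^{2}]=0$ and so misses $Y$ entirely.

```python
import numpy as np

# Illustrative example: X ~ N(0,1) and Y = X^2, so E[Y|X] = X^2 exactly,
# while the projection of Y onto span{X} is (E[XY]/E[X^2]) X = E[X^3] X = 0.
rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x**2

coef = np.mean(x * y) / np.mean(x**2)  # (X,Y)/||X||^2, approximately 0 here
proj_onto_x = coef * x                 # projection of Y onto the line spanned by X
cond_exp = x**2                        # E[Y|X]

print(np.mean((y - proj_onto_x)**2))   # approx E[X^4] = 3: span{X} alone explains nothing
print(np.mean((y - cond_exp)**2))      # exactly 0: E[Y|X] recovers Y here
```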

With the definition of conditional expectation out of the way, we can formalize the estimation problem stated above, and then state and prove a solution to it, which is the main theorem of linear regression theory.
(Estimation Problem - Precise Statement) Let $X, Y\in L^{2}(\mathcal{F})$ and $\mathcal{G}=\sigma(X)$ be the $\sigma$-algebra generated by $X$ (formally, the information resolved in the determination of $X$ for any outcome $\omega\in\Omega$). Now find random variables $\hat{Y}$ and $Y'$ such that the following conditions hold: $$||Y-\hat{Y}||_{2}=\inf_{Z=a+bX;\; a,b\in\mathbb{R}}||Y-Z||_{2}$$ and $$||Y-Y'||_{2}=\inf_{Z\in L^{2}(\sigma(X))}||Y-Z||_{2}.$$
We call $\hat{Y}$ the best (affine) linear unbiased estimator of $Y$ in terms of $X$ (often called a BLUE estimator in statistics) and $Y'$ the best unbiased estimator of $Y$ in terms of $X$.  Note that the quantitative conditions defining our estimators show that the qualifier "best" in estimating $Y$ given $X$ is quantified using the $L^{2}$ norm ("least squares").  Note also that while we have two estimates of $Y$ given $X$, neither estimate is allowed to use information beyond what is resolved by $X$ (i.e., our estimates of $Y$ based on $X$ must be $\sigma(X)$-measurable).  Moreover, the criterion for $Y'$ says that $Y'$ is the best least-squares estimate of $Y$ given all of the information generated by $X$ (in particular, among all random variables that are $\sigma(X)$-measurable), whereas $\hat{Y}$ is merely the best least-squares estimate of $Y$ among affine functions of $X$ (or, seen another way, among all random variables in the span of $\{1_{\Omega}, X\}$, a point we will return to in the proof of the main theorem below).

The term linear used in describing $\hat{Y}$ refers to the fact that, as a function of $X$, $\hat{Y}$ is an (affine) linear function (i.e., a linear function with an intercept).  As we shall see, the solution to the above problem is given by a linear Hilbert space operator, where the term linear is now being used in a completely different sense (in particular, $Y'$ is very unlikely to be a linear function of $X$, and in any case it is not even clear a priori from the defining criterion that $Y'$ is a function of $X$ at all).
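
The difference between $\hat{Y}$ and $Y'$ can be seen numerically.  The sketch below uses a hypothetical toy model (chosen purely for illustration), $Y=\sin(X)+X^{2}+\varepsilon$ with independent noise; it computes $\hat{Y}$ by least squares over $\mathrm{span}\{1_{\Omega},X\}$ and approximates $Y'=\mathbb{E}[Y|X]$ by averaging $Y$ within narrow bins of $X$.  The (approximate) conditional expectation attains a much smaller $L^{2}$ error and is clearly not an affine function of $X$.

```python
import numpy as np

# Toy model (an illustrative assumption): Y = sin(X) + X^2 + eps.
rng = np.random.default_rng(2)
n = 200_000
x = rng.standard_normal(n)
y = np.sin(x) + x**2 + 0.1 * rng.standard_normal(n)

# Best affine estimator Y_hat: least squares over span{1, X}.
A = np.column_stack([np.ones(n), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = a + b * x

# Crude approximation of Y' = E[Y|X]: average Y within narrow bins of X.
bins = np.linspace(x.min(), x.max(), 200)
idx = np.digitize(x, bins)
bin_means = np.array([y[idx == k].mean() if np.any(idx == k) else 0.0
                      for k in range(len(bins) + 1)])
y_prime = bin_means[idx]

print(np.mean((y - y_hat)**2))    # L^2 error of the best affine estimate
print(np.mean((y - y_prime)**2))  # much smaller error for the approximate E[Y|X]
```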

As a final note, we may just as well define other estimators $\bar{Y}$ using different criteria.  For instance, for any Borel function $g_{\alpha}$ with parameters $\alpha=(\alpha_{1},\ldots,\alpha_{n})$ we could define $\hat{Y}_{g_{\alpha}}$ by the condition
$$||Y-\hat{Y}_{g_{\alpha}}||_{2}=\inf_{Z=g_{\alpha}(X);\;\alpha\in\mathbb{R}^{n}}||Y-Z||_{2}.$$ In the case of $\hat{Y}$, we have $\hat{Y}=\hat{Y}_{g_{\alpha}}$ with $g_{\alpha}(X)=a+bX$ and $\alpha=(a,b)$, where $(a,b)$ attains the infimum in the criterion defining $\hat{Y}$.  Moreover, we could use other norms or quantification criteria in order to obtain estimates with specific properties that may be deemed to lead to better estimators (usually at the expense of introducing bias or other undesirable influences); an example of this is MLE (Maximum Likelihood Estimation).  Such estimation techniques are not usually necessary from the perspective of probability theory, but arise in statistical applications due to lack of control of, or data on, the distributions of the random variables in play (for example, in estimating parameters for some time-series models).  Moreover, the introduction of different norms eliminates many of the advantages obtained from Hilbert space theory; in particular, the straightforward representation of the solution to the estimation problem as a linear Hilbert space operator is lost.  In any event, we shall focus almost exclusively on $\hat{Y}$ and $Y'$ defined above for the remainder of these notes, since they represent in a certain sense the "endpoints" of effectiveness in estimating $Y$ in terms of $X$.
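
For instance, with the hypothetical family $g_{\alpha}(X)=\alpha_{1}+\alpha_{2}X+\alpha_{3}X^{2}$ (chosen purely for illustration), minimizing $||Y-g_{\alpha}(X)||_{2}$ over $\alpha\in\mathbb{R}^{3}$ is again a least-squares problem, now over the span of $\{1_{\Omega},X,X^{2}\}$; a minimal sketch on the same toy model as above:

```python
import numpy as np

# Sketch: fit the hypothetical family g_alpha(X) = a1 + a2*X + a3*X^2 by
# minimizing the empirical L^2 norm ||Y - g_alpha(X)||_2 over alpha in R^3.
rng = np.random.default_rng(3)
n = 100_000
x = rng.standard_normal(n)
y = np.sin(x) + x**2 + 0.1 * rng.standard_normal(n)  # same toy model as above

basis = np.column_stack([np.ones(n), x, x**2])        # span{1, X, X^2}
alpha, *_ = np.linalg.lstsq(basis, y, rcond=None)
y_g = basis @ alpha

print(alpha)                  # fitted (alpha_1, alpha_2, alpha_3)
print(np.mean((y - y_g)**2))  # smaller L^2 error than the affine fit, larger than E[Y|X]'s
```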

We now state the solution to the estimation problem in the following theorem, which one could probably have already guessed based on the optimization criteria for $\hat{Y}$ and $Y'$ and their relation to Hilbert space theory:
Theorem 1 (Solution to Estimation Problem).  In the estimation problem above, $\hat{Y}$ and $Y'$ both exist, are unique, and lie in the closed subspace $L^{2}(\sigma(X))$.  Furthermore, they have explicit representations in terms of linear $L^{2}(\mathcal{F})$ operators (or more precisely, $L^{2}(\mathcal{F})\mapsto L^{2}(\mathcal{G})$ linear transformations) given by $$\hat{Y}(\omega)=(\mathrm{proj}_{\mathrm{span}\{1_{\Omega},X\}}Y)(\omega)$$ and $$Y'(\omega)=(\mathrm{proj}_{L^{2}(\sigma(X))}Y)(\omega).$$ Moreover, $$Y'(\omega)=\mathbb{E}[Y|X](\omega)$$ and the optimization criterion for $\hat{Y}$ is equivalent to $$||\mathbb{E}[Y|X]-\hat{Y}||_{2}=\inf_{Z=a+bX;\; a,b\in\mathbb{R}}||\mathbb{E}[Y|X]-Z||_{2}.$$  Finally, the projection onto the span of a single random variable $X$ is given by $$(\mathrm{proj}_{X}Y)(\omega)=\frac{(X,Y)_{L^{2}}}{||X||^{2}_{2}}X(\omega),$$ and applying this to the orthogonal basis $\{1_{\Omega},\, X-\mathbb{E}[X]\}$ of $\mathrm{span}\{1_{\Omega},X\}$ yields $$\hat{Y}(\omega)=\mathbb{E}[Y]+\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\big(X(\omega)-\mathbb{E}[X]\big).$$ More generally, $$(\mathrm{proj}_{L^{2}(\sigma(X))}Y)(\omega)=\sum_{n}(\mathrm{proj}_{X_{n}}Y)(\omega)=\sum_{n}\frac{(X_{n},Y)_{L^{2}}}{||X_{n}||^{2}_{2}}X_{n}(\omega)$$ where $\{X_{n}\}_{n}$ is a (complete, in the case of infinitely many terms) orthogonal basis for $L^{2}(\sigma(X))$ (such a basis exists because $L^{2}(\sigma(X))$ is closed and hence itself a Hilbert space; it can be taken countable when $L^{2}(\sigma(X))$ is separable).
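
As a numerical check of the explicit representation of $\hat{Y}$ (again only a sketch on the hypothetical toy model used above), the coefficients obtained from sample moments via $\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ and $\mathbb{E}[Y]-\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\mathbb{E}[X]$ should agree with the slope and intercept returned by a direct least-squares projection onto $\mathrm{span}\{1_{\Omega},X\}$:

```python
import numpy as np

# Check: E[Y] + (Cov(X,Y)/Var(X)) (X - E[X]) matches the direct least-squares
# projection of Y onto span{1, X}.  Same toy model as above.
rng = np.random.default_rng(4)
n = 500_000
x = rng.standard_normal(n)
y = np.sin(x) + x**2 + 0.1 * rng.standard_normal(n)

slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # Cov(X,Y) / Var(X)
intercept = np.mean(y) - slope * np.mean(x)        # E[Y] - slope * E[X]

A = np.column_stack([np.ones(n), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)     # direct projection onto span{1, X}

print(intercept, slope)  # from the explicit projection formula
print(a, b)              # from least squares; the two pairs should agree
```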