The space L^{2}(\mathcal{F}) is a Hilbert space with inner product
(X,Y)=\mathbb{E}[XY]=\int_{\Omega}X(\omega)Y(\omega)\;d\mathbb{P}(\omega) and norm ||X||_{2}=(X,X)^{1/2}. The integral defining this inner product can be computed as a Lebesgue integral over the range of the random vector (X,Y) (in particular, as an integral over \mathbb{R}^{2}) in the usual way, by changing variables and using the push-forward (distribution) measure of (X,Y) (and its density if the distribution is absolutely continuous with respect to Lebesgue measure).
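For instance, if the joint distribution \mathbb{P}_{(X,Y)} of the pair (X,Y) has a density f_{X,Y} with respect to Lebesgue measure on \mathbb{R}^{2} (the symbols \mathbb{P}_{(X,Y)} and f_{X,Y} are just notation for that push-forward and its density), the change of variables takes the concrete form
(X,Y)=\mathbb{E}[XY]=\int_{\mathbb{R}^{2}}xy\;d\mathbb{P}_{(X,Y)}(x,y)=\int_{\mathbb{R}^{2}}xy\,f_{X,Y}(x,y)\;dx\,dy.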
This inner product can be related to many of the common statistical measures of X and Y. Let us use the notation
\left\{\begin{array}{l} \mu_{X}=\mathbb{E}(X),\quad\mu_{Y}=\mathbb{E}(Y)\\ \sigma_{XY}=\mathbb{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]\\ \sigma^{2}_{X}=\mathbb{Var}(X)=\mathbb{Cov}(X,X)=\mathbb{E}[(X-\mu_{X})^{2}]\end{array}\right. Then in terms of the inner product we have
\left\{\begin{array}{l} \sigma_{X}^{2}=(X-\mu_{X},X-\mu_{X})=||X-\mu_{X}||_{2}^{2}\\ \sigma_{XY}=(X-\mu_{X},Y-\mu_{Y}) \end{array}\right.
This correspondence suggests that covariance between random variables can be treated with the geometry of orthogonal projections. The only complication is the centering about the means \mu: if our random variables have mean zero, the centering is vacuous and covariance coincides with the inner product. In general this is not the case; moreover, we cannot simply absorb the centering into a single redefined inner product, since each variable must be centered by its own mean and \mu_{X}\neq\mu_{Y} in general. However, a very useful identity will partially resolve this, namely
(Y,X-\mu_{X})=(Y-\mu_{Y},X)=(Y-\mu_{Y},X-\mu_{X})=\mathbb{Cov}(X,Y).
Indeed, using the fact that (1_{\Omega},X)=\mathbb{E}X=\mu_{X} and the bilinearity of the inner product, we have
(Y-\mu_{Y},X-\mu_{X})=(X,Y)-\mu_{X}(Y,1_{\Omega})-\mu_{Y}(1_{\Omega},X)+\mu_{X}\mu_{Y}(1_{\Omega},1_{\Omega})=(X,Y)-\mu_{X}\mu_{Y}-\mu_{Y}\mu_{X}+\mu_{X}\mu_{Y}=(X,Y)-\mu_{X}\mu_{Y},
while
(Y,X-\mu_{X})=(X,Y)-\mu_{X}(Y,1_{\Omega})=(X,Y)-\mu_{X}\mu_{Y}, and the claim follows by symmetry. Thus, we can define in the usual way the orthogonal projection of Y in the direction of X-\mu_{X} by
(\mathbb{proj}_{X-\mu_{X}}Y)(\omega)=\frac{(Y,X-\mu_{X})}{||X-\mu_{X}||_{2}^{2}}(X(\omega)-\mu_{X})=\frac{\sigma_{XY}}{\sigma_{X}^{2}}(X(\omega)-\mu_{X})=\frac{\mathbb{Cov}(X,Y)}{\mathbb{Var}(X)}(X(\omega)-\mu_{X})
One might recognize the coefficient \sigma_{XY}/\sigma_{X}^{2} as the minimum-variance hedge ratio of Y with respect to X (i.e., the quantity h such that the random variable (portfolio) P:=Y-hX has minimal variance).
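To see this numerically, here is a minimal sketch using simulated data (the distributions, sample size, and the factor 2 below are arbitrary illustrative choices, not taken from the text): the sample estimate of h=\sigma_{XY}/\sigma_{X}^{2} does at least as well as nearby ratios at reducing the variance of Y-hX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated square-integrable variables (arbitrary illustrative choice:
# X standard normal, Y = 2X + independent noise).
n = 100_000
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(scale=0.5, size=n)

# Projection coefficient / minimum-variance hedge ratio: Cov(X, Y) / Var(X).
h = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Variance of the hedged position Y - h X at the optimum and at nearby ratios.
var_opt = np.var(Y - h * X)
var_near = [np.var(Y - (h + d) * X) for d in (-0.1, 0.1)]

print(f"h ~ {h:.3f}")                # close to 2.0 by construction
print(var_opt <= min(var_near))      # True: the optimum beats nearby ratios
```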
With the above in mind, we make some clarifying definitions:
Definition. Let X_{1},\ldots,X_{n} be a collection of random variables in L^{2}(\mathcal{F}). Then \{X_{n}\}_{n} are called:
1. Linearly independent if \alpha_{1}X_{1}(\omega)+\ldots+\alpha_{n}X_{n}(\omega)=0 for all \omega\in\Omega implies \alpha_{j}=0 for 1\leq j\leq n. Otherwise, \{X_{n}\}_{n} are linearly dependent.
2. Pairwise orthogonal if (X_{i},X_{j})=0 for all i\neq j and 1\leq i,j\leq n.
3. Pairwise uncorrelated if (X_{i}-\mu_{i},X_{j}-\mu_{j})=0 for all i\neq j and 1\leq i,j\leq n.
Note again that the quantity in (3) is \mathbb{Cov}(X_{i},X_{j})=(X_{i},X_{j}-\mu_{j})=(X_{i}-\mu_{i},X_{j})=(X_{i},X_{j})-\mu_{i}\mu_{j}. Consequently, (2) and (3) are equivalent if either \mu_{i}=0 or \mu_{j}=0, and mutually exclusive if \mu_{i}\neq0 and \mu_{j}\neq0. Also note that (2) implies (1), but not conversely.
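To make the distinction between (2) and (3) concrete, here is a small numerical sketch (the specific variables below are an arbitrary illustrative choice): two independent variables with nonzero means are pairwise uncorrelated but not pairwise orthogonal, since (X_{1},X_{2})=\mathbb{Cov}(X_{1},X_{2})+\mu_{1}\mu_{2}=\mu_{1}\mu_{2}\neq0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent variables with nonzero means (arbitrary illustrative choice).
n = 200_000
X1 = 1.0 + rng.normal(size=n)   # mean 1
X2 = 2.0 + rng.normal(size=n)   # mean 2

inner = np.mean(X1 * X2)                              # (X1, X2) = E[X1 X2]
cov = np.mean((X1 - X1.mean()) * (X2 - X2.mean()))    # Cov(X1, X2)

print(f"(X1, X2)   ~ {inner:.3f}")   # ~ mu1 * mu2 = 2, not 0: not orthogonal
print(f"Cov(X1,X2) ~ {cov:.3f}")     # ~ 0: uncorrelated
```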
We are now ready for the main business of this post. Let Y\in L^{2}(\mathcal{F}), let \mathcal{G}\subset\mathcal{F} be a sub-\sigma-algebra, and let \{X_{n}\}_{n} be a collection of \mathcal{G}-measurable random variables in L^{2}(\mathcal{F}). We set X_{0}:=1_{\Omega} (the constant function) and, after removing any linearly dependent X_{i}, let \mathcal{S}:=\mathbb{span}(X_{0},X_{1},\ldots,X_{n}). We wish to estimate Y from the variables \{X_{n}\}_{n\geq0}. This can be done optimally (in the sense of minimal L^{2} norm of the error) by orthogonally projecting Y onto \mathcal{S}. In order to carry out this procedure, we must first orthogonalize the collection \{X_{n}\}_{n}. This can be done with the Gram-Schmidt procedure, which generates a new family \{\hat{X}_{n}\}_{n} that is pairwise orthogonal and satisfies \hat{\mathcal{S}}:=\mathbb{span}(\hat{X}_{0},\ldots,\hat{X}_{n})=\mathcal{S}. The procedure is very simple: begin with \hat{X}_{0}=X_{0}. Then define \hat{X}_{1} to be the difference between X_{1} and its orthogonal projection onto \hat{X}_{0}; in other words, \hat{X}_{1} is the error in estimating X_{1} from \hat{X}_{0}. The procedure is then repeated for X_{2}: \hat{X}_{2} is the difference between X_{2} and its projection onto (the space spanned by) \hat{X}_{0} and \hat{X}_{1}, and so on.
Let us carry out this procedure for the case n=3. We note again that the projection of Y in the direction of X is defined as
\mathbb{proj}_{X}Y=\frac{(X,Y)}{(X,X)}X.
Note that this means
\mathbb{proj}_{X-\mu_{X}}Y=\frac{\mathbb{Cov}(X,Y)}{\mathbb{Var}(X)}(X-\mu_{X})=\frac{\sigma_{XY}}{\sigma^{2}_{X}}(X-\mu_{X}).
We have \hat{X}_{0}=1_{\Omega};
\begin{align*} \hat{X}_{1} &=X_{1}-\mathbb{proj}_{\hat{X}_{0}}X_{1} \\&=X_{1}-\frac{\left(X_{1},1_{\Omega}\right)}{||1_{\Omega}||_{2}^{2}}1_{\Omega} \\&=X_{1}-\mu_{1} ;\end{align*}
\begin{align*} \hat{X}_{2} &=X_{2}-\left(\mathbb{proj}_{\hat{X}_{0}}+\mathbb{proj}_{\hat{X}_{1}}\right)X_{2} \\&=X_{2}-\frac{\left(X_{2}, 1_{\Omega}\right)}{||1_{\Omega}||_{2}^{2}}1_{\Omega}-\frac{\left(X_{2},X_{1}-\mu_{1}\right)}{||X_{1}-\mu_{1}||_{2}^{2}}\left(X_{1}-\mu_{1}\right) \\&=\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right) \\&=X_{2}-\frac{\sigma_{12}}{\sigma^{2}_{1}}X_{1}-\left(\mu_{2}-\frac{\sigma_{12}}{\sigma^{2}_{1}}\mu_{1}\right) ;\end{align*}
\begin{align*} \hat{X}_{3} &=X_{3}-\left(\mathbb{proj}_{\hat{X}_{0}}+\mathbb{proj}_{\hat{X}_{1}}+\mathbb{proj}_{\hat{X}_{2}}\right)X_{3} \\&=X_{3}-\frac{\left(X_{3}, 1_{\Omega}\right)}{||1_{\Omega}||_{2}^{2}}1_{\Omega}-\frac{\left(X_{3},X_{1}-\mu_{1}\right)}{||X_{1}-\mu_{1}||_{2}^{2}}\left(X_{1}-\mu_{1}\right)-\frac{\left(X_{3},\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)\right)}{||\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)||^{2}_{2}}\left(\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)\right) \\&=\left(X_{3}-\mu_{3}\right)-\frac{\sigma_{13}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)-\frac{\sigma_{23}-\frac{\sigma_{12}}{\sigma_{1}^{2}}\sigma_{13}}{\sigma_{2}^{2}-2\frac{\sigma_{12}}{\sigma_{1}^{2}}\sigma_{12}+\frac{\sigma_{12}^{2}}{\sigma_{1}^{4}}\sigma_{1}^{2}}\left(\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)\right) \\&=\left(X_{3}-\mu_{3}\right)-\frac{\sigma_{13}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)-\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\left(\left(X_{2}-\mu_{2}\right)-\frac{\sigma_{12}}{\sigma^{2}_{1}}\left(X_{1}-\mu_{1}\right)\right) \\&=\left(X_{3}-\mu_{3}\right)-\left(\frac{\sigma_{13}}{\sigma^{2}_{1}}-\frac{\sigma_{12}}{\sigma_{1}^{2}}\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)\left(X_{1}-\mu_{1}\right)-\left(\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)\left(X_{2}-\mu_{2}\right) \\&=X_{3}-\left(\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)X_{2}-\left(\frac{\sigma_{13}}{\sigma^{2}_{1}}-\frac{\sigma_{12}}{\sigma_{1}^{2}}\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)X_{1}-\left(\mu_{3}-\left(\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)\mu_{2}-\left(\frac{\sigma_{13}}{\sigma^{2}_{1}}-\frac{\sigma_{12}}{\sigma_{1}^{2}}\frac{\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}}{\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}}\right)\mu_{1}\right) ;\end{align*}
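As a sanity check on the algebra above, here is a minimal numerical sketch (the mean vector and covariance matrix below are arbitrary illustrative choices): running the same Gram-Schmidt steps on simulated data produces pairwise orthogonal \hat{X}_{i}, and the coefficient of X_{2} in \hat{X}_{3} matches the closed-form expression -\left(\sigma_{1}^{2}\sigma_{23}-\sigma_{12}\sigma_{13}\right)/\left(\sigma_{1}^{2}\sigma_{2}^{2}-\sigma_{12}^{2}\right) evaluated with sample covariances.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate X1, X2, X3 with an arbitrary (illustrative) mean and covariance.
n = 200_000
mean = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.6, 0.3],
                [0.6, 1.5, 0.4],
                [0.3, 0.4, 1.0]])
X1, X2, X3 = rng.multivariate_normal(mean, cov, size=n).T
ones = np.ones(n)

def proj(u, v):
    # Sample analogue of proj_u v = ((u, v)/(u, u)) u with (u, v) = E[uv].
    return (np.mean(u * v) / np.mean(u * u)) * u

# Gram-Schmidt on (X0, X1, X2, X3) with X0 = 1.
h0 = ones
h1 = X1 - proj(h0, X1)
h2 = X2 - proj(h0, X2) - proj(h1, X2)
h3 = X3 - proj(h0, X3) - proj(h1, X3) - proj(h2, X3)

# The hatted variables are (numerically) pairwise orthogonal.
print(np.mean(h1 * h2), np.mean(h1 * h3), np.mean(h2 * h3))   # all ~ 0

# Recover the coefficients of h3 as a combination of 1, X1, X2, X3 ...
A = np.column_stack([ones, X1, X2, X3])
coefs, *_ = np.linalg.lstsq(A, h3, rcond=None)

# ... and compare the X2 coefficient with the closed-form
# -(s11*s23 - s12*s13) / (s11*s22 - s12**2) built from sample covariances.
s = np.cov(np.vstack([X1, X2, X3]), bias=True)
beta2 = (s[0, 0] * s[1, 2] - s[0, 1] * s[0, 2]) / (s[0, 0] * s[1, 1] - s[0, 1] ** 2)
print(coefs[2], -beta2)   # agree up to floating-point error
```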
Now let \mathcal{G}:=\sigma(X)\subset\mathcal{F} be the sub-\sigma-algebra of \mathcal{F} generated by a random variable X (i.e., the \sigma-algebra generated by the preimages \{X^{-1}(B)\}_{B\in\mathcal{B}(\mathbb{R})}).