In this post we discuss the basic theory of linear regression, mostly from the perspective of probability theory. At the end we devote some time to statistical applications. The reason for focusing on the probabilistic point of view is to make rigorous the actual use of linear regression in applied settings; furthermore, this approach clarifies what linear regression actually is (and isn't). Indeed, there seem to be wide gaps in understanding among practitioners, probably due to the ubiquitous use of the methodology and the need for less mathematically inclined people to use it.
Regardless of the perspective, we have a probability (sample) space $(\Omega,\mathcal{F},\mathbb{P})$ and real-valued $\mathcal{F}$-measurable random variables $X$ and $Y$ defined on $\Omega$. What differentiates the probabilistic and statistical points of view is what we know about $X$ and $Y$. In the former, we know their distributions, i.e. all of the statistical properties of $X$ and $Y$ on their (continuous) range of values. In the latter case we have only samples $\{(x_{n}, y_{n})\}_{n\geq0}$ of $X$ and $Y$. In regression theory, the goal is to express a relationship between $X$ and $Y$ and its associated properties, a much easier task from the probabilistic point of view since we have parametric (or at least fully determined) distributions. From the statistical point of view, due to the lack of a complete distribution, we have to make certain modeling, distributional, and even sampling assumptions in order to make useful inferences (without these assumptions, the inferences we could make would be far too weak to serve any useful purpose). We will discuss some of the assumptions typical of most regression applications later. For now, we focus on $X$ and $Y$ with known distribution measures $\mu_{X}$ and $\mu_{Y}$.
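To make the contrast concrete, here is a minimal numerical sketch (not part of the original discussion, and with illustrative parameter values chosen here) contrasting the two viewpoints: when the joint distribution of $(X,Y)$ is known, the best linear predictor of $Y$ from $X$ has slope $\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ and intercept $\mathbb{E}[Y]-\beta\,\mathbb{E}[X]$, computed directly from the distribution; with only samples, we must estimate these quantities.

```python
# Sketch: probabilistic vs. statistical viewpoints on linear regression.
import numpy as np

rng = np.random.default_rng(0)

# Assume (X, Y) is bivariate normal with known parameters (an illustrative choice).
mu_X, mu_Y = 1.0, 2.0
var_X, var_Y, cov_XY = 4.0, 3.0, 1.5

# Probabilistic viewpoint: coefficients follow directly from the known distribution.
beta_true = cov_XY / var_X
alpha_true = mu_Y - beta_true * mu_X

# Statistical viewpoint: only samples (x_n, y_n) are available, so we estimate
# the same quantities from the data (here via sample covariance / variance).
cov = [[var_X, cov_XY], [cov_XY, var_Y]]
x, y = rng.multivariate_normal([mu_X, mu_Y], cov, size=1000).T
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

print(f"distributional (alpha, beta) = ({alpha_true:.3f}, {beta_true:.3f})")
print(f"estimated      (alpha, beta) = ({alpha_hat:.3f}, {beta_hat:.3f})")
```

With enough samples the estimated coefficients approach the distributional ones, which is exactly the gap the statistical assumptions discussed later are meant to bridge.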
Analyzing the Definition of Independence
One of the most fundamental concepts of probability theory is that of independence. The concept is intuitive and captures the idea that two experiments are independent if the outcome of one does not affect the outcome of the other. What we mean here by an experiment is a measurable space $(\Omega,\mathcal{F})$: a sample space $\Omega$ whose points $\omega\in\Omega$ represent all the possible outcomes of our experiment, together with a $\sigma$-algebra $\mathcal{F}$ consisting of all possible combinations of outcomes (events), represented by subsets $A\subseteq\Omega$. Additionally, there is a probability measure $\mathbb{P}$ that assigns measure $1$ to the entire sample space $\Omega$ and that is countably additive on $\mathcal{F}$ in the sense that whenever $\{A_{n}\}_{n\geq0}$ is an at most countable collection of pairwise disjoint events in $\mathcal{F}$ we have $\mathbb{P}(\cup_{n} A_{n})=\sum_{n}\mathbb{P}(A_{n})$. We then have the following formal definition of independence: