\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\newcommand{\var}{\mathrm{Var}}\)
Repository containing a summary of notes for the Statistical Methods 1 course in the Computational Statistics and Data Science (COMPASS) Center for Doctoral Training (CDT) PhD at the University of Bristol.
Given a training dataset \[D_0:= \left\{(\boldsymbol{\mathbf{x}}_i, y_i)\right\}_{i=1}^n\] of \(n\) pairs of inputs \[\boldsymbol{\mathbf{x}}_i = (x_{1i}, \ldots, x_{di})^\top\in\mathbb{R}^{d\times 1}\] and outputs \(y_i\in \mathbb{R}\), we want to predict the output \(\widehat{y}\) for a new input \(\widehat{\boldsymbol{\mathbf{x}}}\) by inferring the parameters \(\boldsymbol{\mathbf{w}}\in \mathbb{R}^{(d+1)\times 1}\) of a prediction function \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\).
We want \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\) to perform well on unseen testing data \(D_1\). In other words, we want our model to generalize well. We define the error on a dataset \(D'\) given the inferred parameters \(\boldsymbol{\mathbf{w}}^*\) as
\[E(D', \boldsymbol{\mathbf{w}}^*):=\sum_{i\in D'} \left(y_i - f(\boldsymbol{\mathbf{x}}_i, \boldsymbol{\mathbf{w}}^*)\right)^2\]
This gives two performance metrics: the training error \(E(D_0, \boldsymbol{\mathbf{w}}^*)\) and the testing error \(E(D_1, \boldsymbol{\mathbf{w}}^*)\).
We could split our dataset into a training set \(D_0\) and a testing set \(D_1\); however, data is often scarce, so we want to use as much of it as possible. To fully exploit our dataset we can use Cross Validation: split the data into \(K\) folds, train on \(K-1\) of them, evaluate the error on the held-out fold, and repeat so that every fold is used once for validation, averaging the resulting errors.
Cross Validation can be used to choose between competing models. Thus, it is often used for model selection when a model has one or more hyperparameters that need to be estimated (see the sketch below). The main problems with Cross Validation are that it is computationally expensive, since the model must be re-trained once per fold, and that the number of training runs grows rapidly when several hyperparameters have to be explored jointly.
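Below is a minimal sketch of \(K\)-fold Cross Validation in Python, assuming a hypothetical `fit`/`predict` interface for the model and the column-per-observation convention used in these notes:

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, K=5, seed=0):
    """Estimate the generalization error by K-fold cross validation.

    X : (d, n) array, one observation per column.
    y : (n,) array of outputs.
    fit(X_train, y_train) -> w and predict(X_val, w) -> y_hat are
    user-supplied callables (hypothetical interface, for illustration).
    """
    n = y.shape[0]
    idx = np.random.default_rng(seed).permutation(n)   # shuffle observations
    folds = np.array_split(idx, K)                     # K roughly equal folds
    errors = []
    for k in range(K):
        val = folds[k]                                           # held-out fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[:, train], y[train])                           # train on K-1 folds
        errors.append(np.sum((y[val] - predict(X[:, val], w)) ** 2))
    return np.mean(errors)                             # average validation error
```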
We adopt the following notation: each column of the matrix \(X\) contains a single observation of all the explanatory features, and each row contains all \(n\) observations of a single feature.
\[X = \begin{pmatrix} \boldsymbol{\mathbf{x}}_1 & \boldsymbol{\mathbf{x}}_2 & \cdots & \boldsymbol{\mathbf{x}}_n \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{d1} & x_{d2} & \cdots & x_{dn} \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(d+1)\times n} \]
The response vector is a row vector containing all \(n\) observations of the target variable. \[\boldsymbol{\mathbf{y}}= (y_1, \ldots, y_n)\in \mathbb{R}^{1\times n}\] We call the function
\[\phi:\mathbb{R}^d\to\mathbb{R}^b\] a feature transform; it will be used to make inference more flexible. Sometimes we will write \(\phi(X)\): this is short-hand notation for the following matrix \[ \phi(X) = \begin{pmatrix} \phi(\boldsymbol{\mathbf{x}}_1) & \phi(\boldsymbol{\mathbf{x}}_2) & \cdots & \phi(\boldsymbol{\mathbf{x}}_n) \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(b+1)\times n} \] There are \(2\) main non-Bayesian inference methods that we look at:
Linear Least Squares: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{LS}}\in\mathbb{R}^{(b+1)\times 1}\) that minimize the sum of squared residuals
\[\boldsymbol{\mathbf{w}}_{\text{LS}}:= \argmin_{\boldsymbol{\mathbf{w}}} \sum_{i\in D_0} \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\right)^2 = \left(\phi(X)\phi(X)^\top\right)^{-1}\phi(X)\boldsymbol{\mathbf{y}}^\top\] where \(f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\) is the dot product between the \(i^{\text{th}}\) column of \(\phi(X)\) and \(\boldsymbol{\mathbf{w}}\). It is important to understand that it’s called linear because it is linear in the unknown parameters \(\boldsymbol{\mathbf{w}}\), but can be non-linear in the inputs if we apply a non-linear feature transform \(\phi\).
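As a concrete illustration, here is a minimal numpy sketch of the closed-form solution above. The polynomial feature transform is only an example choice of \(\phi\):

```python
import numpy as np

def phi(X, degree=3):
    """Polynomial feature transform of a scalar input: X is (1, n),
    the output is ((degree + 1), n), i.e. the (b + 1) x n convention."""
    rows = [X ** p for p in range(degree, 0, -1)]      # x^degree, ..., x^1
    rows.append(np.ones((1, X.shape[1])))              # row of ones (bias)
    return np.vstack(rows)

def least_squares(PhiX, y):
    """w_LS = (phi(X) phi(X)^T)^{-1} phi(X) y^T, without forming the inverse."""
    return np.linalg.solve(PhiX @ PhiX.T, PhiX @ y)

# toy data: y = sin(x) + noise (purely for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(1, 50))
y = np.sin(X[0]) + 0.1 * rng.standard_normal(50)

PhiX = phi(X)
w_ls = least_squares(PhiX, y)    # shape (degree + 1,)
y_hat = w_ls @ PhiX              # fitted values f(phi(x_i), w_LS)
```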
Maximum Likelihood Estimation: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) describing a joint probability density function for \(\boldsymbol{\mathbf{y}}\) that has maximum density values at our data points \[ \boldsymbol{\mathbf{w}}_{\text{ML}} = \argmax_{\boldsymbol{\mathbf{w}}} \,\,\,p(\boldsymbol{\mathbf{y}}\mid \phi(\boldsymbol{\mathbf{x}}_1), \ldots, \phi(\boldsymbol{\mathbf{x}}_n), \boldsymbol{\mathbf{w}}, \sigma) \] Assuming \(y_i\) are IID and normally distributed, Maximum likelihood and Least Squares lead to the same solution \(\boldsymbol{\mathbf{w}}_{\text{LS}} = \boldsymbol{\mathbf{w}}_{\text{ML}}\). The advantage of Maximum Likelihood over Least Squares is that by finding \(\sigma^2_{\text{ML}}\) we obtain a measure of uncertainty on our prediction.
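For completeness, under this Gaussian model the maximum-likelihood noise variance is the average squared residual at the maximum-likelihood parameters (a standard result, stated here without derivation): \[\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}}_{\text{ML}})\right)^2\]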
Here we assume the \(y_i\) are IID and normally distributed. Moreover, we assign a normal prior to the parameters, which leads to the Maximum A Posteriori (MAP) estimate \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\).
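For instance, assuming a zero-mean isotropic prior \(\boldsymbol{\mathbf{w}}\sim N(\boldsymbol{\mathbf{0}}, \alpha^{-1}I)\) and noise variance \(\sigma^2\) (this particular parameterisation is an assumption, not necessarily the one used in the course), the MAP estimate is the regularised least-squares solution \[\boldsymbol{\mathbf{w}}_{\text{MAP}} = \left(\phi(X)\phi(X)^\top + \sigma^2\alpha I\right)^{-1}\phi(X)\boldsymbol{\mathbf{y}}^\top\] which shrinks the least-squares solution towards zero and helps prevent overfitting.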
Now we consider \(y\in \left\{+1, -1\right\}\). We are looking for a decision function \(f(\boldsymbol{\mathbf{x}})\) whose level set \(f(\boldsymbol{\mathbf{x}}) = 0\), called the decision boundary, separates feature space into two regions: \(R_{+}\), where we predict \(y = +1\), and \(R_{-}\), where we predict \(y = -1\).
The optimal Bayes classifier \[f(\boldsymbol{\mathbf{x}}) = p(\boldsymbol{\mathbf{x}}, y=+1) - p(\boldsymbol{\mathbf{x}}, y= -1)\] minimizes the probability of making a mistake given the decision function \(f\), that is, it minimizes \[p(\text{mistake} \mid f) = \int_{R_{+}} p(\boldsymbol{\mathbf{x}}, y=-1)d\boldsymbol{\mathbf{x}}+ \int_{R_{-}} p(\boldsymbol{\mathbf{x}}, y=+1)d\boldsymbol{\mathbf{x}}\] However, this classifier is impractical because in reality the joint density \(p(\boldsymbol{\mathbf{x}}, y)\) is unknown and not all mistakes are equally costly.
Therefore, we introduce a loss function \(L(y, y_0)\), which is described by a loss matrix whose entries give the loss of predicting \(y_0\) when the actual value is \(y\). Our goal then becomes to minimize the expected loss \[\widehat{y}= \argmin_{y_0} \mathbb{E}_{p(y\mid \boldsymbol{\mathbf{x}}, D)}\left[L(y, y_0) \mid \boldsymbol{\mathbf{x}}\right]\] where \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) can be found either with the fully Bayesian approach, or by finding \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) or \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\) first and then plugging them in to obtain the distribution.
Notice that in general there are two strategies to find \(p(y \mid \boldsymbol{\mathbf{x}}, D)\): the generative approach, which models the class-conditional densities \(p(\boldsymbol{\mathbf{x}}\mid y)\) and the priors \(p(y)\) and then applies Bayes' theorem, and the discriminative approach, which models \(p(y \mid \boldsymbol{\mathbf{x}})\) directly.
Once we have \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) we can start making decisions. Sometimes, we might be uncertain about our decision. In such cases, we can reject a data point \(\boldsymbol{\mathbf{x}}\) when both posterior probabilities are lower than a threshold \(\theta\) \[\max\left\{ p(y = +1\mid \boldsymbol{\mathbf{x}}), \, p(y=-1\mid \boldsymbol{\mathbf{x}})\right\} < \theta\]
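Here is a minimal sketch of this decision rule with a reject option, assuming the posterior probability \(p(y=+1\mid\boldsymbol{\mathbf{x}})\) has already been computed by one of the strategies above; the loss matrix values are purely illustrative:

```python
import numpy as np

# LOSS[i, j] = loss of predicting class j when the true class is i,
# with index 0 <-> y = +1 and index 1 <-> y = -1 (illustrative values)
LOSS = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def decide(p_plus, loss=LOSS, reject_threshold=None):
    """Return +1 or -1 minimizing the expected loss, or None to reject."""
    posterior = np.array([p_plus, 1.0 - p_plus])          # [p(+1|x), p(-1|x)]
    if reject_threshold is not None and posterior.max() < reject_threshold:
        return None                                        # too uncertain: reject x
    expected_loss = posterior @ loss                       # expected loss of each prediction
    return +1 if np.argmin(expected_loss) == 0 else -1

decide(0.55)                          # decision minimizing expected loss
decide(0.55, reject_threshold=0.8)    # returns None: the point is rejected
```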
\[\widehat{y}\approx \begin{cases} \text{Predictive mean of } p(y \mid \boldsymbol{\mathbf{x}}, D) & \text{if } L=(y-y_0)^2 \\ \text{Predictive median of } p(y \mid \boldsymbol{\mathbf{x}}, D) & \text{if } L=|y-y_0| \end{cases}\] Basically, if we use a squared loss our prediction \(\widehat{y}\) will be the same as the one found by using the fully-Bayesian approach or the MAP approach.
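A quick numerical check of this fact (the skewed predictive distribution below is just an example): minimizing a Monte Carlo estimate of the expected loss over candidate predictions recovers the predictive mean under squared loss and the predictive median under absolute loss.

```python
import numpy as np

# samples standing in for a (skewed) predictive distribution p(y | x, D)
rng = np.random.default_rng(3)
y_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)

candidates = np.linspace(0.0, 10.0, 2001)
sq_loss = [np.mean((y_samples - c) ** 2) for c in candidates]
abs_loss = [np.mean(np.abs(y_samples - c)) for c in candidates]

print(candidates[np.argmin(sq_loss)], y_samples.mean())       # squared loss -> mean
print(candidates[np.argmin(abs_loss)], np.median(y_samples))  # absolute loss -> median
```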
We say that a random variable \(X\sim N_x(\mu, \sigma^2)\) has a univariate Gaussian distribution if \[p(x) = N_x(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\] We say that a random vector \(\boldsymbol{\mathbf{X}}= (X_1, \ldots, X_d)^\top\) has a multivariate Gaussian distribution if \[p(\boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\boldsymbol{\mathbf{x}}-\boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1} (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\right)\] where \(\boldsymbol{\mathbf{\mu}}\in\mathbb{R}^{d}\), \(\Sigma\in \mathbb{R}^{d\times d}\), and the variance-covariance matrix needs to be positive definite \(\Sigma \succ 0\). The main issues with the MVN are that the number of free parameters grows quadratically with \(d\), making estimation and inversion of \(\Sigma\) expensive in high dimensions, and that, being unimodal and symmetric, it cannot capture multimodal densities.
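As a sanity check of the density formula above, it can be evaluated directly and compared against `scipy.stats.multivariate_normal` (the numbers below are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])     # positive definite
x = np.array([0.3, 0.7])

# density evaluated directly from the formula above
d = mu.shape[0]
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
p_manual = np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))

# reference value from scipy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
assert np.isclose(p_manual, p_scipy)
```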
A very useful formula, used to derive the results below, concerns the kernel (the exponent) of the MVN \[-\frac{1}{2}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}}) = -\frac{1}{2}\boldsymbol{\mathbf{x}}^\top \Sigma^{-1}\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{x}}^\top \Sigma^{-1} \boldsymbol{\mathbf{\mu}}+ \text{const}\]
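This identity is typically used for "completing the square": if an exponent can be rearranged into the form \[-\frac{1}{2}\boldsymbol{\mathbf{x}}^\top A \boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{x}}^\top \boldsymbol{\mathbf{b}}+ \text{const}\] for some symmetric positive definite matrix \(A\) and vector \(\boldsymbol{\mathbf{b}}\), then matching terms with the kernel above identifies the distribution as Gaussian with \(\Sigma = A^{-1}\) and \(\boldsymbol{\mathbf{\mu}}= A^{-1}\boldsymbol{\mathbf{b}}\).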
Eigendecomposing \(\Sigma\) and using the change of variable \(\boldsymbol{\mathbf{y}}= U^\top (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\), corresponding to centering the observations and rotating them using the eigenvector matrix \(U\), we find that under this linear coordinate transform every MVN becomes a product of \(d\) univariate normals: \[p(\boldsymbol{\mathbf{y}}) = \prod_{i=1}^d N_{y_i}(0, \lambda_i) \qquad \text{where } \lambda_i \text{ are the eigenvalues of } \Sigma\]
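A quick numerical check of this factorization (with an arbitrary covariance matrix): after centering the samples and rotating them by the eigenvector matrix \(U\), the coordinates become uncorrelated with variances equal to the eigenvalues of \(\Sigma\).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                            # positive definite

lam, U = np.linalg.eigh(Sigma)                            # Sigma = U diag(lam) U^T
X = rng.multivariate_normal(mu, Sigma, size=100_000)      # here rows are samples
Y = (X - mu) @ U                                          # y = U^T (x - mu), per sample

print(np.cov(Y.T).round(2))    # approximately diag(lam): uncorrelated coordinates
print(lam.round(2))            # eigenvalues of Sigma = variances of the y_i
```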
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. That is, if we partition \[ \boldsymbol{\mathbf{x}}= \begin{pmatrix} \boldsymbol{\mathbf{x}}_a \\ \boldsymbol{\mathbf{x}}_b \end{pmatrix} \qquad \boldsymbol{\mathbf{\mu}}= \begin{pmatrix} \boldsymbol{\mathbf{\mu}}_a \\ \boldsymbol{\mathbf{\mu}}_b \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \qquad \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \qquad \text{where } \Lambda \text{ is the $\textbf{precision}$ matrix} \] then the mean and covariance matrix of \(p(\boldsymbol{\mathbf{x}}_a\mid\boldsymbol{\mathbf{x}}_b)\) are given by \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{a\mid b} &= \boldsymbol{\mathbf{\mu}}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) = \boldsymbol{\mathbf{\mu}}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) \qquad \textbf{linear} \text{ function of } \boldsymbol{\mathbf{x}}_b\\ \Sigma_{a\mid b} &= \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1} \qquad \textbf{independent} \text{ of } \boldsymbol{\mathbf{x}}_b \end{align} \] The general form \(p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}})\), with mean being a linear function of \(\boldsymbol{\mathbf{x}}\) and covariance independent of \(\boldsymbol{\mathbf{x}}\), is called a linear Gaussian model.
The marginal of a MVN is again a Gaussian: \[ p(\boldsymbol{\mathbf{x}}_a) = \int p(\boldsymbol{\mathbf{x}}_a, \boldsymbol{\mathbf{x}}_b) d\boldsymbol{\mathbf{x}}_b = N(\boldsymbol{\mathbf{\mu}}_a, \Sigma_{aa}) \]
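A small numpy sketch of the conditioning and marginalisation formulas above, for an arbitrary partition (dimensions and numbers chosen only for illustration):

```python
import numpy as np

def condition_gaussian(mu, Sigma, a, b, x_b):
    """Mean and covariance of p(x_a | x_b) for index sets a and b."""
    S_aa = Sigma[np.ix_(a, a)]
    S_ab = Sigma[np.ix_(a, b)]
    S_bb = Sigma[np.ix_(b, b)]
    K = S_ab @ np.linalg.inv(S_bb)              # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu[a] + K @ (x_b - mu[b])         # linear in x_b
    Sigma_cond = S_aa - K @ S_ab.T              # independent of x_b
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# the marginal of x_a is simply N(mu_a, Sigma_aa); the conditional needs the formula
mu_c, Sigma_c = condition_gaussian(mu, Sigma, a=[0, 1], b=[2], x_b=np.array([1.0]))
```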
Suppose we have a marginal Gaussian distribution for \(\boldsymbol{\mathbf{x}}\) and a conditional (linear Gaussian) distribution for \(\boldsymbol{\mathbf{y}}\) given \(\boldsymbol{\mathbf{x}}\): \[ p(\boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Lambda^{-1}) \qquad p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{b}}, L^{-1}) \]
Then we can find the marginal distribution of \(\boldsymbol{\mathbf{y}}\) and the conditional distribution of \(\boldsymbol{\mathbf{x}}\) given \(\boldsymbol{\mathbf{y}}\), both of which are again Gaussian: \[ \begin{align} p(\boldsymbol{\mathbf{y}}) &= N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{\mu}}+ \boldsymbol{\mathbf{b}}, \, L^{-1} + A\Lambda^{-1}A^\top) \\ p(\boldsymbol{\mathbf{x}}\mid \boldsymbol{\mathbf{y}}) &= N_{\boldsymbol{\mathbf{x}}}\left(\Sigma\left[A^\top L(\boldsymbol{\mathbf{y}}- \boldsymbol{\mathbf{b}}) + \Lambda\boldsymbol{\mathbf{\mu}}\right], \, \Sigma\right) \qquad \text{where } \Sigma = (\Lambda + A^\top L A)^{-1} \end{align} \]
Suppose we have \(N\) IID multivariate Gaussian observations \(\left\{\boldsymbol{\mathbf{x}}_1, \ldots, \boldsymbol{\mathbf{x}}_N\right\}\), where \(\boldsymbol{\mathbf{x}}_i\sim N(\boldsymbol{\mathbf{\mu}}, \Sigma)\). Then \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N \boldsymbol{\mathbf{x}}_i \\ \Sigma_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N (\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})(\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})^\top \end{align} \] Notice that while the maximum likelihood estimate of the mean is unbiased \(\mathbb{E}[\boldsymbol{\mathbf{\mu}}_{\text{ML}}] = \boldsymbol{\mathbf{\mu}}\) the maximum likelihood estimate of the covariance matrix is biased \[\mathbb{E}\left[\Sigma_{\text{ML}}\right] = \frac{N - 1}{N}\Sigma\]
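A minimal numpy sketch of these estimators, including the usual bias correction \(\frac{N}{N-1}\Sigma_{\text{ML}}\):

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates of a multivariate Gaussian from X of shape (d, N),
    one observation per column."""
    N = X.shape[1]
    mu_ml = X.mean(axis=1)                        # (1/N) sum_i x_i
    centred = X - mu_ml[:, None]
    Sigma_ml = (centred @ centred.T) / N          # biased: E[Sigma_ml] = (N-1)/N Sigma
    Sigma_unbiased = Sigma_ml * N / (N - 1)       # bias-corrected estimate
    return mu_ml, Sigma_ml, Sigma_unbiased
```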
The training error \(E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\) is evaluated only on the specific dataset \(D\). We call in-sample error the expected training error over all possible datasets \[\mathbb{E}_D\left[E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\right] = \sum_{i=1}^n \mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right]\] where \(\mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right]\) is called the expected error. Writing \(y_i = g(\boldsymbol{\mathbf{x}}_i) + \epsilon_i\), where \(g\) is the true regression function and \(\epsilon_i\) is the noise, it can be decomposed as follows \[ \mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right] = \underbrace{\text{Var}_{\epsilon}[\epsilon_i]}_{\text{irreducible error}} + \underbrace{(g(\boldsymbol{\mathbf{x}}_i) - \mathbb{E}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})])^2}_{\text{bias}^2} + \underbrace{\text{Var}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})]}_{\text{variance}} \] This cannot be calculated because we don't know the distribution of \(\epsilon\) and we don't know \(g(\boldsymbol{\mathbf{x}}_i)\). Thus we actually use the out-sample error: \[ \mathbb{E}_{\boldsymbol{\mathbf{x}}}\left[\mathbb{E}_{\epsilon}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\right] = \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y-f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right] \] If we have a new batch of data \(D':=\left\{(y_i', \boldsymbol{\mathbf{x}}_i')\right\}_{i=1}^{n'}\), the out-sample error can be approximated by the testing error thanks to the Law of Large Numbers: \[\frac{1}{n'}\sum_{i=1}^{n'} (y_i' - f(\boldsymbol{\mathbf{x}}_i'; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2 \quad \overset{\text{LLN}}{\longrightarrow} \quad \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\]
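A small simulation sketch of this approximation: fit a straight line by least squares on a training batch drawn from a toy generative model (chosen purely for illustration), then use a large fresh batch so that the testing error approximates the out-sample error by the Law of Large Numbers.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):                             # toy "true" regression function
    return np.sin(x)

def make_batch(n):                    # draw (y, x) pairs from the assumed p(y, x)
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return g(x) + 0.1 * rng.standard_normal(n), x

# least-squares fit of a straight line on a small training batch
y_tr, x_tr = make_batch(30)
Phi_tr = np.vstack([x_tr, np.ones_like(x_tr)])             # (2, n): feature row + ones
w_ls = np.linalg.solve(Phi_tr @ Phi_tr.T, Phi_tr @ y_tr)

# testing error on a large fresh batch approximates the out-sample error (LLN)
y_te, x_te = make_batch(100_000)
Phi_te = np.vstack([x_te, np.ones_like(x_te)])
test_error = np.mean((y_te - w_ls @ Phi_te) ** 2)
```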