\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\newcommand{\var}{\mathrm{Var}}\)

Course Description

Repository containing a summary of notes for the Statistical Methods 1 course in the Computational Statistics and Data Science (COMPASS) Center for Doctoral Training (CDT) PhD at the University of Bristol.

Regression Problems

Problem Statement and Assumptions

Given a training dataset \[D_0:= \left\{(\boldsymbol{\mathbf{x}}_i, y_i)\right\}_{i=1}^n\] of \(n\) pairs of inputs \[\boldsymbol{\mathbf{x}}_i\ = (x_{1i}, \ldots, x_{di})\in\mathbb{R}^{d\times 1}\] and outputs \(y_i\in \mathbb{R}\) we want to predict the output \(\widehat{y}\) for a new input \(\widehat{\boldsymbol{\mathbf{x}}}\) by inferring the parameters \(\boldsymbol{\mathbf{w}}\in \mathbb{R}^{(d+1)\times 1}\) of a prediction function \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\).

Measuring Performance and Model Selection

We want \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\) to perform well on unseen testing data \(D_1\). In other words, we want our model to generalize well. We define the error on a dataset \(D'\) given the inferred parameters \(\boldsymbol{\mathbf{w}}^*\) as

\[E(D', \boldsymbol{\mathbf{w}}^*):=\sum_{i\in D'} \left(y_i - f(\boldsymbol{\mathbf{x}}_i, \boldsymbol{\mathbf{w}}^*)\right)^2\]

This gives two performance metrics:

  • Training Error : \(E(D_0, \boldsymbol{\mathbf{w}}^*)\)
  • Testing Error : \(E(D_1, \boldsymbol{\mathbf{w}}^*)\)

We could split our dataset into a training set \(D_0\) and a testing set \(D_1\); however, data is often scarce, so we want to use as much of it as possible. To fully exploit our dataset we can use Cross Validation:

  • Split the dataset into \(s\) groups \(D_1, \ldots, D_s\) called folds. We then call it s-fold cross validation. In the extreme case in which \(s=n\), i.e. each group consists of one data point, we call it leave-one-out cross validation.
  • Use \(s-1\) groups to infer \(\boldsymbol{\mathbf{w}}^*\). Test performance of \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}}^*)\) on the remaining unseen group, and calculate \(E(D_i, \boldsymbol{\mathbf{w}}^{*^{(i)}})\). Repeat this until all groups have been used for testing exactly once.
  • At the end, calculate the mean test error \[E_{\text{cv}} = \frac{1}{s}\sum_{i=1}^s E(D_i, \boldsymbol{\mathbf{w}}^{*^{(i)}})\]

Cross validation can be used to choose between competing models. Thus, it is often used for model selection when a model has one or more hyperparameters that need to be estimated (a small sketch of the procedure is given after the list below). The main problems with Cross Validation are:

  • The dataset needs to be IID, but this is often not the case, for instance with time-dependent data.
  • The computational complexity of the Cross Validation algorithm increases by a factor of \(s\) for each hyperparameter.
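
As a concrete illustration, here is a minimal NumPy sketch of \(s\)-fold cross validation used to select the degree of a polynomial feature transform. The toy data, the degree grid and the helper names (`polynomial_features`, `fit_least_squares`) are illustrative assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

def polynomial_features(x, degree):
    """Map scalar inputs to a (degree+1, n) matrix of powers 1, x, ..., x^degree."""
    return np.vstack([x**p for p in range(degree + 1)])

def fit_least_squares(Phi, y):
    """Solve the normal equations (Phi Phi^T) w = Phi y^T for w."""
    return np.linalg.solve(Phi @ Phi.T, Phi @ y)

def squared_error(Phi, y, w):
    """E(D, w) = sum_i (y_i - f(x_i, w))^2."""
    return np.sum((y - w @ Phi) ** 2)

def cross_validation_error(x, y, degree, s):
    """Mean test error over s folds (leave-one-out when s == len(x))."""
    folds = np.array_split(rng.permutation(len(x)), s)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), test_idx)
        w = fit_least_squares(polynomial_features(x[train_idx], degree), y[train_idx])
        errors.append(squared_error(polynomial_features(x[test_idx], degree),
                                    y[test_idx], w))
    return np.mean(errors)

# Toy data: a noisy sine curve; pick the degree with the lowest CV error.
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
errors = {d: cross_validation_error(x, y, d, s=5) for d in range(1, 8)}
print(min(errors, key=errors.get))  # degree selected by 5-fold CV
```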

Non-Bayesian Inference Methods

We adopt the following notation: each column of the matrix \(X\) represents a single observation of all the explanatory variable features, and each row represents all observations of a single feature.

\[X = \begin{pmatrix} \boldsymbol{\mathbf{x}}_1 & \boldsymbol{\mathbf{x}}_2 & \cdots & \boldsymbol{\mathbf{x}}_n \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{d1} & x_{d2} & \cdots & x_{dn} \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(d+1)\times n} \]

The response vector is a row vector containing all \(n\) observations of the target variable. \[\boldsymbol{\mathbf{y}}= (y_1, \ldots, y_n)\in \mathbb{R}^{1\times n}\] We call the function

\[\phi:\mathbb{R}^d\to\mathbb{R}^b\] a feature transform; it will be used to make inference more flexible. Sometimes we will write \(\phi(X)\); this is short-hand notation for the following matrix \[ \phi(X) = \begin{pmatrix} \phi(\boldsymbol{\mathbf{x}}_1) & \phi(\boldsymbol{\mathbf{x}}_2) & \cdots & \phi(\boldsymbol{\mathbf{x}}_n) \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(b+1)\times n} \] There are two main non-Bayesian inference methods that we look at:

  • Linear Least Squares: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{LS}}\in\mathbb{R}^{(b+1)\times 1}\) that minimize the sum of squared residuals

    \[\boldsymbol{\mathbf{w}}_{\text{LS}}:= \argmin_{\boldsymbol{\mathbf{w}}} \sum_{i\in D_0} \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\right)^2 = \left(\phi(X)\phi(X)^\top\right)^{-1}\phi(X)\boldsymbol{\mathbf{y}}^\top\] where \(f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\) is the dot product between the \(i^{\text{th}}\) column of \(\phi(X)\) and \(\boldsymbol{\mathbf{w}}\). It is important to understand that it’s called linear because it is linear in the unknown parameters \(\boldsymbol{\mathbf{w}}\), but can be non-linear in the inputs if we apply a non-linear feature transform \(\phi\).

  • Maximum Likelihood Estimation: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) describing a joint probability density function for \(\boldsymbol{\mathbf{y}}\) that has maximum density values at our data points \[ \boldsymbol{\mathbf{w}}_{\text{ML}} = \argmax_{\boldsymbol{\mathbf{w}}} \,\,\,p(\boldsymbol{\mathbf{y}}\mid \phi(\boldsymbol{\mathbf{x}}_1), \ldots, \phi(\boldsymbol{\mathbf{x}}_n), \boldsymbol{\mathbf{w}}, \sigma) \] Assuming \(y_i\) are IID and normally distributed, Maximum likelihood and Least Squares lead to the same solution \(\boldsymbol{\mathbf{w}}_{\text{LS}} = \boldsymbol{\mathbf{w}}_{\text{ML}}\). The advantage of Maximum Likelihood over Least Squares is that by finding \(\sigma^2_{\text{ML}}\) we obtain a measure of uncertainty on our prediction.
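
The following is a minimal NumPy sketch of both estimators under the Gaussian-noise assumption above; the quadratic feature transform, the toy data and the seed are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data generated from a quadratic with Gaussian noise.
n = 50
x = rng.uniform(-1, 1, n)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.2 * rng.standard_normal(n)

# Feature transform phi(x) = (x, x^2) plus the constant row of ones,
# stacked column-wise as in the matrix phi(X) above: shape (b+1, n).
Phi = np.vstack([x, x**2, np.ones(n)])

# Least squares: w_LS = (phi(X) phi(X)^T)^{-1} phi(X) y^T.
w_ls = np.linalg.solve(Phi @ Phi.T, Phi @ y)

# Maximum likelihood under IID Gaussian noise gives the same weights,
# plus a noise-variance estimate sigma^2_ML = mean squared residual.
residuals = y - w_ls @ Phi
sigma2_ml = np.mean(residuals**2)

print(w_ls)       # close to (-2, 3, 1)
print(sigma2_ml)  # close to 0.2**2
```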

Issues with Non-Bayesian Inference Methods

  • They provide point estimates for our predictions, but we want a predictive distribution instead.
  • Using the feature transform \(\phi\) in the model exposes both inference methods to overfitting, i.e. learning the noise rather than the structure in the data, as the complexity of the model, governed by \(b\), increases. Cross Validation can be used to select the output dimensionality \(b\) of \(\phi\) to combat overfitting, but the dataset is often not IID. A better choice is regularized least squares (see the sketch after this list): \[ \boldsymbol{\mathbf{w}}_{\text{LS-R}}:= \argmin_{\boldsymbol{\mathbf{w}}} \sum_{i\in D_0} \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\right)^2 + \lambda || \boldsymbol{\mathbf{w}}||_2^2 = \left(\phi(X)\phi(X)^\top + \lambda I \right)^{-1} \phi(X)\boldsymbol{\mathbf{y}}^\top \] Regularization combats overfitting because overfitting typically comes with a large increase in the magnitude of the parameters, and regularization penalizes exactly that. To choose \(\lambda\) we can use cross-validation or a probabilistic perspective, as we'll see later. Notice that above the penalty is the \(L^2\) norm, but it can be any \(p\)-norm.
  • They incur the curse of dimensionality because, as we increase the complexity \(b\), we include more features, whose number can grow exponentially even just in the case of \(\phi\) being a polynomial transform. This requires a dataset size that is exponential in \(b\).
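
A small sketch of regularized least squares, illustrating how the penalty keeps the parameter magnitudes under control; the polynomial features, the toy data and the value of \(\lambda\) are arbitrary choices for illustration.

```python
import numpy as np

def fit_regularized_least_squares(Phi, y, lam):
    """w_LS-R = (phi(X) phi(X)^T + lambda I)^{-1} phi(X) y^T."""
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(Phi.shape[0]), Phi @ y)

# Few noisy points fitted with a degree-8 polynomial (plus bias): without the
# penalty the weights blow up, with it they stay small.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
Phi = np.vstack([x**p for p in range(1, 9)] + [np.ones(10)])  # shape (9, 10)

w_unreg = fit_regularized_least_squares(Phi, y, lam=0.0)
w_reg = fit_regularized_least_squares(Phi, y, lam=1e-3)
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))  # regularized norm is much smaller
```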

Semi and Fully Bayesian Inference Methods

Here we assume \(y_i\) are IID and normally distributed. Moreover, we assign a normal prior to the parameters.

  • Maximum-A-Posteriori (MAP): Finds the parameter \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\) that maximizes the posterior parameter distribution given our data \(D\): \[ \boldsymbol{\mathbf{w}}_{\text{MAP}} := \argmax_{\boldsymbol{\mathbf{w}}} p(\boldsymbol{\mathbf{w}}\mid D) = \argmax_{\boldsymbol{\mathbf{w}}} \prod_{i\in D} N_{y_i}\left(f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}}), \sigma^2 \right)N_{\boldsymbol{\mathbf{w}}}(\boldsymbol{\mathbf{0}}, \sigma^2_{\boldsymbol{\mathbf{w}}}I) = \boldsymbol{\mathbf{w}}_{\text{LS-R}}\left(\frac{\sigma^2}{\sigma^2_{\boldsymbol{\mathbf{w}}}}\right) \] that is, regularized least squares with \(\lambda = \sigma^2/\sigma^2_{\boldsymbol{\mathbf{w}}}\).
  • Fully Bayesian: Finds a predictive distribution for \(\widehat{y}\) given the new \(\widehat{\boldsymbol{\mathbf{x}}}\) \[ p(\widehat{y}\mid \widehat{\boldsymbol{\mathbf{x}}}, D) = \int p(\widehat{y}\mid \widehat{\boldsymbol{\mathbf{x}}}, \boldsymbol{\mathbf{w}})p(\boldsymbol{\mathbf{w}}\mid D)d\boldsymbol{\mathbf{w}}= N_{\widehat{y}}\left(f\left(\phi(\widehat{\boldsymbol{\mathbf{x}}}), \boldsymbol{\mathbf{w}}_{\text{LS-R}}\left(\frac{\sigma^2}{\sigma^2_{\boldsymbol{\mathbf{w}}}}\right)\right), \sigma^2 + \phi(\widehat{\boldsymbol{\mathbf{x}}})^\top \sigma^2\left(\phi(X)\phi(X)^\top + \frac{\sigma^2}{\sigma^2_\boldsymbol{\mathbf{w}}} I \right)^{-1}\phi(\widehat{\boldsymbol{\mathbf{x}}})\right) \] which is a normal distribution whose mean is the prediction given by regularized least squares.
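
A minimal sketch of the resulting predictive distribution for a linear feature transform, assuming \(\sigma\) and \(\sigma_{\boldsymbol{\mathbf{w}}}\) are known; the data and the chosen values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear-Gaussian data; sigma (noise std) and sigma_w (prior std) are
# assumed known here purely for illustration.
sigma, sigma_w = 0.1, 1.0
n = 30
x = rng.uniform(-1, 1, n)
y = 0.5 + 2.0 * x + sigma * rng.standard_normal(n)

# phi(x) = (x, 1): feature row plus the constant row, stacked as columns.
Phi = np.vstack([x, np.ones(n)])

# Regularized least squares with lambda = sigma^2 / sigma_w^2 gives the
# predictive mean weights.
lam = sigma**2 / sigma_w**2
A = Phi @ Phi.T + lam * np.eye(2)
w = np.linalg.solve(A, Phi @ y)

# Predictive distribution at a new input x_hat.
x_hat = 0.3
phi_hat = np.array([x_hat, 1.0])
pred_mean = w @ phi_hat
pred_var = sigma**2 + sigma**2 * phi_hat @ np.linalg.solve(A, phi_hat)

print(pred_mean, np.sqrt(pred_var))  # mean prediction and its standard deviation
```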

Classification

Problem Statement and Optimal Bayes Classifier

Now we consider \(y\in \left\{+1, -1\right\}\). We are looking for a decision function \(f(\boldsymbol{\mathbf{x}})\) whose level set \(f(\boldsymbol{\mathbf{x}}) = 0\), called the decision boundary, separates the feature space into two regions:

  • \(R_{+}\), the region where \(f(\boldsymbol{\mathbf{x}})\geq 0\)
  • \(R_{-}\), the region where \(f(\boldsymbol{\mathbf{x}})\leq 0\)

The optimal Bayes classifier \[f(\boldsymbol{\mathbf{x}}) = p(\boldsymbol{\mathbf{x}}, y=+1) - p(\boldsymbol{\mathbf{x}}, y= -1)\] minimizes the probability of making a mistake given the decision function \(f\), that is, it minimizes \[p(\text{mistake} \mid f) = \int_{R_{+}} p(\boldsymbol{\mathbf{x}}, y=-1)d\boldsymbol{\mathbf{x}}+ \int_{R_{-}} p(\boldsymbol{\mathbf{x}}, y=+1)d\boldsymbol{\mathbf{x}}\] However, this classifier is impractical because in reality

  • We don’t have access to \(p(\boldsymbol{\mathbf{x}}, y)\)
  • Different mistakes bear different losses.

Expected Loss Minimization

Therefore, we introduce a loss function \(L(y, y_0)\) which is described by a loss matrix whose entries give the loss of predicting \(y_0\) when the actual value is \(y\). Our goal then becomes to minimize expected loss \[\widehat{y}= \argmin_{y_0} \mathbb{E}_{p(y\mid \boldsymbol{\mathbf{x}}, D)}\left[L(y, y_0) \mid \boldsymbol{\mathbf{x}}\right]\] where \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) can be found with either the Fully Bayesian approach, or we can find \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) or \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\) first, and then plug them in to obtain the distribution.

Notice that in general there are two strategies to find \(p(y \mid \boldsymbol{\mathbf{x}}, D)\):

  • Discriminative Approach: Infer \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) directly using \(D\). This method is best used when our only goal is prediction.
  • Generative Approach: Infer \[ p(y \mid \boldsymbol{\mathbf{x}}, D) \propto p(\boldsymbol{\mathbf{x}}\mid y, D) p(y) \] using Bayes rule, where \(p(y)\) is simply the proportion of \(+\) and \(-\) in \(D\). This gives us a way of generating new data points, but at an extra cost because \(p(\boldsymbol{\mathbf{x}}\mid y , D)\) can be very high-dimensional.

Once we have \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) we can start making decisions. Sometimes, we might be uncertain about our decision. In such cases, we can reject a data point \(\boldsymbol{\mathbf{x}}\) when both class probabilities are lower than a threshold \(\theta\) \[\max\left\{ p(y = +1\mid \boldsymbol{\mathbf{x}}), \, p(y=-1\mid \boldsymbol{\mathbf{x}})\right\} < \theta\]
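
A small sketch of expected-loss minimization with a reject option; the loss matrix, the threshold \(\theta\) and the class probabilities below are illustrative assumptions.

```python
import numpy as np

# Loss matrix L[y, y0]: rows index the true label, columns the predicted label,
# ordered as (+1, -1). Predicting -1 when the truth is +1 is assumed 5x more costly.
labels = [+1, -1]
L = np.array([[0.0, 5.0],
              [1.0, 0.0]])

def decide(p_plus, threshold=0.6):
    """Pick the prediction with minimum expected loss, rejecting when both
    class probabilities fall below the threshold."""
    p = np.array([p_plus, 1.0 - p_plus])   # (p(y=+1|x), p(y=-1|x))
    if p.max() < threshold:
        return "reject"
    expected_loss = p @ L                  # expected loss of predicting +1 and -1
    return labels[int(np.argmin(expected_loss))]

print(decide(0.95))  # predicts +1
print(decide(0.30))  # still predicts +1: the asymmetric loss outweighs p(+1|x) < 0.5
print(decide(0.55))  # too uncertain: reject
```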

Decision Making in Regression Tasks

\[\widehat{y}\approx \begin{cases} \mathbb{E}_{p(y\mid \boldsymbol{\mathbf{x}}, D)}\left[y\mid \boldsymbol{\mathbf{x}}\right] \text{ (predictive mean)} && \text{if } L=(y-y_0)^2 \\ \text{Predictive median of } p(y \mid \boldsymbol{\mathbf{x}}, D) && \text{if } L=|y-y_0| \end{cases}\] Basically, if we use a squared loss our prediction \(\widehat{y}\) is the predictive mean, the same prediction found by the fully-Bayesian approach or the MAP approach.
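
A quick Monte Carlo illustration of this fact, assuming we can sample from the predictive distribution (the skewed gamma distribution below is an arbitrary stand-in for \(p(y\mid\boldsymbol{\mathbf{x}}, D)\)):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assume these are samples from a (skewed) predictive distribution p(y | x, D).
samples = rng.gamma(shape=2.0, scale=1.5, size=20_000)

y0_grid = np.linspace(0.0, 10.0, 401)
squared_loss = [np.mean((samples - y0) ** 2) for y0 in y0_grid]
absolute_loss = [np.mean(np.abs(samples - y0)) for y0 in y0_grid]

# Minimizing the Monte Carlo expected loss recovers the mean / median.
print(y0_grid[np.argmin(squared_loss)], samples.mean())       # approximately equal
print(y0_grid[np.argmin(absolute_loss)], np.median(samples))  # approximately equal
```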

Probability Review: Gaussian Identities

Multivariate Gaussian Definition

We say that a random variable \(X\sim N_x(\mu, \sigma^2)\) has a univariate Gaussian distribution if \[p(x) = N_x(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\] We say that a random vector \(\boldsymbol{\mathbf{X}}= (X_1, \ldots, X_d)^\top\) has a multivariate Gaussian distribution if \[p(\boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\boldsymbol{\mathbf{x}}-\boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1} (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\right)\] where \(\boldsymbol{\mathbf{\mu}}\in\mathbb{R}^{d}\) and \(\Sigma\in \mathbb{R}^{d\times d}\), and the variance-covariance matrix needs to be positive definite, \(\Sigma \succ 0\). Issues with the MVN:

  • Number of parameters grows quadratically with dimensionality \(d\).
  • Unimodal

A very useful formula that is used to derive the formulae below regards the kernel of the MVN \[-\frac{1}{2}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}}) = -\frac{1}{2}\boldsymbol{\mathbf{x}}^\top \Sigma^{-1}\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{x}}^\top \Sigma^{-1} \boldsymbol{\mathbf{\mu}}+ \text{const}\]

Geometry of the Multivariate Normal Distribution

Eigendecomposing \(\Sigma\) and using the change of variable \(\boldsymbol{\mathbf{y}}= U^\top (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\), corresponding to centering the observations and rotating them with the eigenvector matrix \(U\), we find that every MVN under this linear coordinate transform is a product of \(d\) univariate normals: \[p(\boldsymbol{\mathbf{y}}) = \prod_{i=1}^d N_{y_i}(0, \lambda_i) \qquad \text{where the } \lambda_i \text{ are the eigenvalues of } \Sigma\]
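
A numerical check of this change of variables for an arbitrary 2-D covariance matrix: after rotating by the eigenvector matrix, the empirical covariance of the transformed samples is (approximately) diagonal with the eigenvalues of \(\Sigma\) on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigendecomposition Sigma = U diag(lambda) U^T.
lambdas, U = np.linalg.eigh(Sigma)

# Draw samples and apply y = U^T (x - mu) to each one.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = (x - mu) @ U

# The empirical covariance of y is approximately diag(lambda): the components
# are independent univariate normals with variances equal to the eigenvalues.
print(np.cov(y.T).round(2))
print(lambdas.round(2))
```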

Conditional Gaussian Distributions

If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. That is, if we partition \[ \boldsymbol{\mathbf{x}}= \begin{pmatrix} \boldsymbol{\mathbf{x}}_a \\ \boldsymbol{\mathbf{x}}_b \end{pmatrix} \qquad \boldsymbol{\mathbf{\mu}}= \begin{pmatrix} \boldsymbol{\mathbf{\mu}}_a \\ \boldsymbol{\mathbf{\mu}}_b \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \qquad \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \qquad \text{where } \Lambda \text{ is the $\textbf{precision}$ matrix} \] then the mean and covariance matrix of \(p(\boldsymbol{\mathbf{x}}_a\mid\boldsymbol{\mathbf{x}}_b)\) are given by \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{a\mid b} &= \boldsymbol{\mathbf{\mu}}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) = \boldsymbol{\mathbf{\mu}}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) \qquad \textbf{linear} \text{ function of } \boldsymbol{\mathbf{x}}_b\\ \Sigma_{a\mid b} &= \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1} \qquad \textbf{independent} \text{ of } \boldsymbol{\mathbf{x}}_b \end{align} \] The general form \(p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}})\) with mean being a linear function of \(\boldsymbol{\mathbf{x}}\) and covariance independent of \(\boldsymbol{\mathbf{x}}\) is called a linear Gaussian model.
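
A numerical check that the covariance-based and precision-based expressions agree, on an arbitrary 3-D example with \(\boldsymbol{\mathbf{x}}_a = (x_1)\) and \(\boldsymbol{\mathbf{x}}_b = (x_2, x_3)\) (the particular numbers are illustrative):

```python
import numpy as np

# A 3-D Gaussian partitioned into x_a = (x_1,) and x_b = (x_2, x_3).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
Lambda = np.linalg.inv(Sigma)  # precision matrix

a, b = [0], [1, 2]
S_aa, S_ab = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)]
S_ba, S_bb = Sigma[np.ix_(b, a)], Sigma[np.ix_(b, b)]
L_aa, L_ab = Lambda[np.ix_(a, a)], Lambda[np.ix_(a, b)]

x_b = np.array([2.0, 0.5])  # observed value of x_b

# Conditional mean: covariance form vs precision form.
mu_cond_cov = mu[a] + S_ab @ np.linalg.solve(S_bb, x_b - mu[b])
mu_cond_prec = mu[a] - np.linalg.solve(L_aa, L_ab @ (x_b - mu[b]))

# Conditional covariance: Schur complement vs inverse of Lambda_aa.
cov_cond_schur = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
cov_cond_prec = np.linalg.inv(L_aa)

print(mu_cond_cov, mu_cond_prec)      # identical up to floating point
print(cov_cond_schur, cov_cond_prec)  # identical up to floating point
```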

Marginal Distribution

The marginal of a MVN is again a Gaussian: \[ p(\boldsymbol{\mathbf{x}}_a) = \int p(\boldsymbol{\mathbf{x}}_a, \boldsymbol{\mathbf{x}}_b) d\boldsymbol{\mathbf{x}}_b = N(\boldsymbol{\mathbf{\mu}}_a, \Sigma_{aa}) \]

Gaussian Linear Model

Suppose we have:

  • Prior: \(p(\boldsymbol{\mathbf{x}}) =N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Lambda^{-1})\)
  • Likelihood: \(p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{b}}, L^{-1})\)

Then we can find:

  • Marginal: \(p(\boldsymbol{\mathbf{y}}) = N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{\mu}}+ \boldsymbol{\mathbf{b}}, L^{-1} + A\Lambda^{-1}A^\top)\)
  • Posterior: \(p(\boldsymbol{\mathbf{x}}\mid \boldsymbol{\mathbf{y}}) = N_{\boldsymbol{\mathbf{x}}}(\Sigma[A^\top L(\boldsymbol{\mathbf{y}}- \boldsymbol{\mathbf{b}}) + \Lambda \boldsymbol{\mathbf{\mu}}], \Sigma)\) where \(\Sigma=(\Lambda + A^\top L A)^{-1}\)
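
A sketch that checks the marginal formula by Monte Carlo and evaluates the posterior in closed form; the particular choices of \(A\), \(\boldsymbol{\mathbf{b}}\), \(\Lambda\) and \(L\) below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Prior p(x) = N(mu, Lambda^{-1}) and likelihood p(y | x) = N(Ax + b, L^{-1}).
mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 0.3],
                   [0.3, 1.0]])
A = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [0.0, 1.0]])
b = np.array([0.1, 0.2, 0.3])
L = np.diag([4.0, 4.0, 4.0])
Lambda_inv, L_inv = np.linalg.inv(Lambda), np.linalg.inv(L)

# Closed-form marginal p(y) = N(A mu + b, L^{-1} + A Lambda^{-1} A^T).
marg_mean = A @ mu + b
marg_cov = L_inv + A @ Lambda_inv @ A.T

# Monte Carlo check: sample x from the prior, then y from the likelihood.
x = rng.multivariate_normal(mu, Lambda_inv, size=200_000)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(3), L_inv, size=200_000)
print(np.abs(y.mean(axis=0) - marg_mean).max())  # small
print(np.abs(np.cov(y.T) - marg_cov).max())      # small

# Closed-form posterior p(x | y): Sigma = (Lambda + A^T L A)^{-1},
# mean = Sigma [A^T L (y - b) + Lambda mu].
y_obs = np.array([1.0, 0.5, -0.5])
Sigma_post = np.linalg.inv(Lambda + A.T @ L @ A)
mean_post = Sigma_post @ (A.T @ L @ (y_obs - b) + Lambda @ mu)
print(mean_post)
print(Sigma_post)
```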

Maximum Likelihood Estimation of Multivariate Gaussian

Suppose we have \(N\) IID multivariate Gaussian observations \(\left\{\boldsymbol{\mathbf{x}}_1, \ldots, \boldsymbol{\mathbf{x}}_N\right\}\), where \(\boldsymbol{\mathbf{x}}_i\sim N(\boldsymbol{\mathbf{\mu}}, \Sigma)\). Then \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N \boldsymbol{\mathbf{x}}_i \\ \Sigma_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N (\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})(\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})^\top \end{align} \] Notice that while the maximum likelihood estimate of the mean is unbiased \(\mathbb{E}[\boldsymbol{\mathbf{\mu}}_{\text{ML}}] = \boldsymbol{\mathbf{\mu}}\) the maximum likelihood estimate of the covariance matrix is biased \[\mathbb{E}\left[\Sigma_{\text{ML}}\right] = \frac{N - 1}{N}\Sigma\]
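
A short simulation illustrating the bias factor \(\frac{N-1}{N}\); the values of \(N\), \(\boldsymbol{\mathbf{\mu}}\) and \(\Sigma\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

mu = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
N, n_repeats = 5, 50_000

cov_estimates = np.zeros((n_repeats, 2, 2))
for r in range(n_repeats):
    x = rng.multivariate_normal(mu, Sigma, size=N)
    mu_ml = x.mean(axis=0)                       # mu_ML
    centred = x - mu_ml
    cov_estimates[r] = centred.T @ centred / N   # Sigma_ML (divides by N, not N-1)

# Averaging over many datasets, the ML estimate shrinks Sigma by (N-1)/N.
print(cov_estimates.mean(axis=0).round(3))
print(((N - 1) / N * Sigma).round(3))
```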

Bias-Variance Decomposition

The training error \(E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\) is evaluated only on the specific dataset \(D\). We call in-sample error the training error averaged over all possible datasets \[\mathbb{E}_D\left[E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\right] = \sum_{i=1}^n \mathbb{E}_D\left[\left(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})\right)^2\mid\boldsymbol{\mathbf{x}}_i\right]\] where \(\mathbb{E}_D\left[\left(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})\right)^2\mid\boldsymbol{\mathbf{x}}_i\right]\) is called the expected error and can be decomposed as follows \[ \mathbb{E}_D\left[\left(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})\right)^2\mid\boldsymbol{\mathbf{x}}_i\right] = \underbrace{\text{Var}_{\epsilon}[\epsilon_i]}_{\text{irreducible error}} + \underbrace{(g(\boldsymbol{\mathbf{x}}_i) - \mathbb{E}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})])^2}_{\text{bias}^2} + \underbrace{\text{Var}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})]}_{\text{variance}} \] where \(g\) is the true regression function and \(\epsilon_i\) the observation noise. This cannot be calculated because we don't know the distribution of \(\epsilon\) and we don't know \(g(\boldsymbol{\mathbf{x}}_i)\). Thus we actually use the out-of-sample error: \[ \mathbb{E}_{\boldsymbol{\mathbf{x}}}\left[\mathbb{E}_{\epsilon}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\right] = \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y-f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right] \] If we have a new batch of data \(D':=\left\{(y_i', \boldsymbol{\mathbf{x}}_i')\right\}_{i=1}^{n'}\) we see that the out-of-sample error can be approximated by the testing error thanks to the Law of Large Numbers: \[\frac{1}{n'}\sum_{i=1}^{n'} (y_i' - f(\boldsymbol{\mathbf{x}}_i'; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2 \quad \overset{\text{LLN}}{\longrightarrow} \quad \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\]
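
A simulation of the decomposition for least squares polynomial regression, where (unlike in practice) the true function \(g\) and the noise distribution are assumed known so that all three terms can be computed; the design points, noise level and polynomial degree are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)

def g(x):                        # the (normally unknown) true function
    return np.sin(2 * np.pi * x)

def fit_poly(x, y, degree):      # least squares with polynomial features
    Phi = np.vstack([x**p for p in range(degree + 1)])
    return np.linalg.solve(Phi @ Phi.T, Phi @ y)

def predict(x, w):
    return w @ np.vstack([x**p for p in range(len(w))])

x_train = np.linspace(0, 1, 20)          # fixed design points
x0, noise_std, degree = 0.35, 0.3, 3     # evaluation point and model complexity
n_datasets = 5_000

preds = np.empty(n_datasets)
errors = np.empty(n_datasets)
for r in range(n_datasets):
    eps = noise_std * rng.standard_normal(len(x_train))   # fresh noise realization
    w = fit_poly(x_train, g(x_train) + eps, degree)
    preds[r] = predict(np.array([x0]), w)[0]
    y0 = g(x0) + noise_std * rng.standard_normal()         # unseen noisy target at x0
    errors[r] = (y0 - preds[r]) ** 2

irreducible = noise_std**2
bias_sq = (g(x0) - preds.mean()) ** 2
variance = preds.var()
print(errors.mean())                     # Monte Carlo expected error at x0
print(irreducible + bias_sq + variance)  # approximately the same
```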