\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\newcommand{\var}{\mathrm{Var}}\)
Repository containing a summary of notes for the Statistical Methods 1 course in the Computational Statistics and Data Science (COMPASS) Center for Doctoral Training (CDT) PhD at the University of Bristol.
Given a training dataset \[D_0:= \left\{(\boldsymbol{\mathbf{x}}_i, y_i)\right\}_{i=1}^n\] of \(n\) pairs of inputs \[\boldsymbol{\mathbf{x}}_i = (x_{1i}, \ldots, x_{di})^\top\in\mathbb{R}^{d\times 1}\] and outputs \(y_i\in \mathbb{R}\), we want to predict the output \(\widehat{y}\) for a new input \(\widehat{\boldsymbol{\mathbf{x}}}\) by inferring the parameters \(\boldsymbol{\mathbf{w}}\in \mathbb{R}^{(d+1)\times 1}\) of a prediction function \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\).
We want \(f(\boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{w}})\) to perform well on unseen testing data \(D_1\). In other words, we want our model to generalize well. We define the error on a dataset \(D'\) given the inferred parameters \(\boldsymbol{\mathbf{w}}^*\) as
\[E(D', \boldsymbol{\mathbf{w}}^*):=\sum_{i\in D'} \left(y_i - f(\boldsymbol{\mathbf{x}}_i, \boldsymbol{\mathbf{w}}^*)\right)^2\]
This gives two performance metrics: the training error \(E(D_0, \boldsymbol{\mathbf{w}}^*)\) and the testing error \(E(D_1, \boldsymbol{\mathbf{w}}^*)\).
We could split our dataset into a training set \(D_0\) and a testing set \(D_1\); however, data is often scarce, so we want to use as much of it as possible. To fully exploit our dataset we can use Cross Validation: split the data into \(K\) folds, train on \(K-1\) of them, evaluate the error on the held-out fold, and repeat so that every fold is used once for validation, averaging the resulting errors.
Cross Validation can be used to choose between competing models. Thus, it is often used for model selection when a model has one or more hyperparameters that need to be estimated (see the sketch below). The main problems with Cross Validation are that it is computationally expensive, since the model must be re-trained once per fold, and that the number of training runs grows rapidly when several hyperparameters have to be explored jointly.
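Below is a minimal sketch of \(K\)-fold Cross Validation in Python, assuming a hypothetical `fit`/`predict` interface for the model and the column-per-observation convention used in these notes:

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, K=5, seed=0):
    """Estimate the generalization error by K-fold cross validation.

    X : (d, n) array, one observation per column.
    y : (n,) array of outputs.
    fit(X_train, y_train) -> w and predict(X_val, w) -> y_hat are
    user-supplied callables (hypothetical interface, for illustration).
    """
    n = y.shape[0]
    idx = np.random.default_rng(seed).permutation(n)   # shuffle observations
    folds = np.array_split(idx, K)                     # K roughly equal folds
    errors = []
    for k in range(K):
        val = folds[k]                                           # held-out fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[:, train], y[train])                           # train on K-1 folds
        errors.append(np.sum((y[val] - predict(X[:, val], w)) ** 2))
    return np.mean(errors)                             # average validation error
```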
We adopt the following notation: each column of the matrix \(X\) contains a single observation of all the explanatory features, and each row contains all \(n\) observations of a single feature.
\[X = \begin{pmatrix} \boldsymbol{\mathbf{x}}_1 & \boldsymbol{\mathbf{x}}_2 & \cdots & \boldsymbol{\mathbf{x}}_n \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{d1} & x_{d2} & \cdots & x_{dn} \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(d+1)\times n} \]
The response vector is a row vector containing all \(n\) observations of the target variable. \[\boldsymbol{\mathbf{y}}= (y_1, \ldots, y_n)\in \mathbb{R}^{1\times n}\] We call the function
\[\phi:\mathbb{R}^d\to\mathbb{R}^b\] a feature transform; it will be used to make inference more flexible. Sometimes we will write \(\phi(X)\): this is short-hand notation for the following matrix \[ \phi(X) = \begin{pmatrix} \phi(\boldsymbol{\mathbf{x}}_1) & \phi(\boldsymbol{\mathbf{x}}_2) & \cdots & \phi(\boldsymbol{\mathbf{x}}_n) \\ 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(b+1)\times n} \] There are \(2\) main non-Bayesian inference methods that we look at:
Linear Least Squares: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{LS}}\in\mathbb{R}^{(b+1)\times 1}\) that minimize the sum of squared residuals
\[\boldsymbol{\mathbf{w}}_{\text{LS}}:= \argmin_{\boldsymbol{\mathbf{w}}} \sum_{i\in D_0} \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\right)^2 = \left(\phi(X)\phi(X)^\top\right)^{-1}\phi(X)\boldsymbol{\mathbf{y}}^\top\] where \(f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}})\) is the dot product between the \(i^{\text{th}}\) column of \(\phi(X)\) and \(\boldsymbol{\mathbf{w}}\). It is important to understand that it’s called linear because it is linear in the unknown parameters \(\boldsymbol{\mathbf{w}}\), but can be non-linear in the inputs if we apply a non-linear feature transform \(\phi\).
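As a concrete illustration, here is a minimal numpy sketch of the closed-form solution above. The polynomial feature transform is only an example choice of \(\phi\):

```python
import numpy as np

def phi(X, degree=3):
    """Polynomial feature transform of a scalar input: X is (1, n),
    the output is ((degree + 1), n), i.e. the (b + 1) x n convention."""
    rows = [X ** p for p in range(degree, 0, -1)]      # x^degree, ..., x^1
    rows.append(np.ones((1, X.shape[1])))              # row of ones (bias)
    return np.vstack(rows)

def least_squares(PhiX, y):
    """w_LS = (phi(X) phi(X)^T)^{-1} phi(X) y^T, without forming the inverse."""
    return np.linalg.solve(PhiX @ PhiX.T, PhiX @ y)

# toy data: y = sin(x) + noise (purely for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(1, 50))
y = np.sin(X[0]) + 0.1 * rng.standard_normal(50)

PhiX = phi(X)
w_ls = least_squares(PhiX, y)    # shape (degree + 1,)
y_hat = w_ls @ PhiX              # fitted values f(phi(x_i), w_LS)
```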
Maximum Likelihood Estimation: Finds parameters \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) describing a joint probability density function for \(\boldsymbol{\mathbf{y}}\) that has maximum density values at our data points \[ \boldsymbol{\mathbf{w}}_{\text{ML}} = \argmax_{\boldsymbol{\mathbf{w}}} \,\,\,p(\boldsymbol{\mathbf{y}}\mid \phi(\boldsymbol{\mathbf{x}}_1), \ldots, \phi(\boldsymbol{\mathbf{x}}_n), \boldsymbol{\mathbf{w}}, \sigma) \] Assuming \(y_i\) are IID and normally distributed, Maximum likelihood and Least Squares lead to the same solution \(\boldsymbol{\mathbf{w}}_{\text{LS}} = \boldsymbol{\mathbf{w}}_{\text{ML}}\). The advantage of Maximum Likelihood over Least Squares is that by finding \(\sigma^2_{\text{ML}}\) we obtain a measure of uncertainty on our prediction.
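For completeness, under this Gaussian model the maximum-likelihood noise variance is the average squared residual at the maximum-likelihood parameters (a standard result, stated here without derivation): \[\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n \left(y_i - f(\phi(\boldsymbol{\mathbf{x}}_i), \boldsymbol{\mathbf{w}}_{\text{ML}})\right)^2\]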
Here we assume the \(y_i\) are IID and normally distributed. Moreover, we assign a normal prior to the parameters, which leads to the Maximum A Posteriori (MAP) estimate \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\).
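For instance, assuming a zero-mean isotropic prior \(\boldsymbol{\mathbf{w}}\sim N(\boldsymbol{\mathbf{0}}, \alpha^{-1}I)\) and noise variance \(\sigma^2\) (this particular parameterisation is an assumption, not necessarily the one used in the course), the MAP estimate is the regularised least-squares solution \[\boldsymbol{\mathbf{w}}_{\text{MAP}} = \left(\phi(X)\phi(X)^\top + \sigma^2\alpha I\right)^{-1}\phi(X)\boldsymbol{\mathbf{y}}^\top\] which shrinks the least-squares solution towards zero and helps prevent overfitting.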
Now we consider \(y\in \left\{+1, -1\right\}\). We are looking for a decision function \(f(\boldsymbol{\mathbf{x}})\) whose level set \(f(\boldsymbol{\mathbf{x}}) = 0\), called the decision boundary, separates feature space into two regions: \(R_{+}\), where we predict \(y = +1\), and \(R_{-}\), where we predict \(y = -1\).
The optimal Bayes classifier \[f(\boldsymbol{\mathbf{x}}) = p(\boldsymbol{\mathbf{x}}, y=+1) - p(\boldsymbol{\mathbf{x}}, y= -1)\] minimizes the probability of making a mistake given the decision function \(f\), that is, it minimizes \[p(\text{mistake} \mid f) = \int_{R_{+}} p(\boldsymbol{\mathbf{x}}, y=-1)d\boldsymbol{\mathbf{x}}+ \int_{R_{-}} p(\boldsymbol{\mathbf{x}}, y=+1)d\boldsymbol{\mathbf{x}}\] However, this classifier is impractical because in reality the joint density \(p(\boldsymbol{\mathbf{x}}, y)\) is unknown and not all mistakes are equally costly.
Therefore, we introduce a loss function \(L(y, y_0)\), which is described by a loss matrix whose entries give the loss of predicting \(y_0\) when the actual value is \(y\). Our goal then becomes to minimize the expected loss \[\widehat{y}= \argmin_{y_0} \mathbb{E}_{p(y\mid \boldsymbol{\mathbf{x}}, D)}\left[L(y, y_0) \mid \boldsymbol{\mathbf{x}}\right]\] where \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) can be found either with the fully Bayesian approach, or by finding \(\boldsymbol{\mathbf{w}}_{\text{ML}}\) or \(\boldsymbol{\mathbf{w}}_{\text{MAP}}\) first and then plugging them in to obtain the distribution.
Notice that in general there are two strategies to find \(p(y \mid \boldsymbol{\mathbf{x}}, D)\): the generative approach, which models the class-conditional densities \(p(\boldsymbol{\mathbf{x}}\mid y)\) and the priors \(p(y)\) and then applies Bayes' theorem, and the discriminative approach, which models \(p(y \mid \boldsymbol{\mathbf{x}})\) directly.
Once we have \(p(y \mid \boldsymbol{\mathbf{x}}, D)\) we can start making decisions. Sometimes, we might be uncertain about our decision. In such cases, we can reject a data point \(\boldsymbol{\mathbf{x}}\) when both posterior probabilities are lower than a threshold \(\theta\) \[\max\left\{ p(y = +1\mid \boldsymbol{\mathbf{x}}), \, p(y=-1\mid \boldsymbol{\mathbf{x}})\right\} < \theta\]
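Here is a minimal sketch of this decision rule with a reject option, assuming the posterior probability \(p(y=+1\mid\boldsymbol{\mathbf{x}})\) has already been computed by one of the strategies above; the loss matrix values are purely illustrative:

```python
import numpy as np

# LOSS[i, j] = loss of predicting class j when the true class is i,
# with index 0 <-> y = +1 and index 1 <-> y = -1 (illustrative values)
LOSS = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def decide(p_plus, loss=LOSS, reject_threshold=None):
    """Return +1 or -1 minimizing the expected loss, or None to reject."""
    posterior = np.array([p_plus, 1.0 - p_plus])          # [p(+1|x), p(-1|x)]
    if reject_threshold is not None and posterior.max() < reject_threshold:
        return None                                        # too uncertain: reject x
    expected_loss = posterior @ loss                       # expected loss of each prediction
    return +1 if np.argmin(expected_loss) == 0 else -1

decide(0.55)                          # decision minimizing expected loss
decide(0.55, reject_threshold=0.8)    # returns None: the point is rejected
```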
\[\widehat{y}\approx \begin{cases} \text{Predictive mean of } p(y \mid \boldsymbol{\mathbf{x}}, D) & \text{if } L=(y-y_0)^2 \\ \text{Predictive median of } p(y \mid \boldsymbol{\mathbf{x}}, D) & \text{if } L=|y-y_0| \end{cases}\] Basically, if we use a squared loss our prediction \(\widehat{y}\) will be the same as the one found by using the fully-Bayesian approach or the MAP approach.
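A quick numerical check of this fact (the skewed predictive distribution below is just an example): minimizing a Monte Carlo estimate of the expected loss over candidate predictions recovers the predictive mean under squared loss and the predictive median under absolute loss.

```python
import numpy as np

# samples standing in for a (skewed) predictive distribution p(y | x, D)
rng = np.random.default_rng(3)
y_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)

candidates = np.linspace(0.0, 10.0, 2001)
sq_loss = [np.mean((y_samples - c) ** 2) for c in candidates]
abs_loss = [np.mean(np.abs(y_samples - c)) for c in candidates]

print(candidates[np.argmin(sq_loss)], y_samples.mean())       # squared loss -> mean
print(candidates[np.argmin(abs_loss)], np.median(y_samples))  # absolute loss -> median
```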
We say that a random variable \(X\sim N_x(\mu, \sigma^2)\) has a univariate Gaussian distribution if \[p(x) = N_x(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\] We say that a random vector \(\boldsymbol{\mathbf{X}}= (X_1, \ldots, X_d)^\top\) has a multivariate Gaussian distribution if \[p(\boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\boldsymbol{\mathbf{x}}-\boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1} (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\right)\] where \(\boldsymbol{\mathbf{\mu}}\in\mathbb{R}^{d}\), \(\Sigma\in \mathbb{R}^{d\times d}\), and the variance-covariance matrix needs to be positive definite \(\Sigma \succ 0\). The main issues with the MVN are that the number of free parameters grows quadratically with \(d\), making estimation and inversion of \(\Sigma\) expensive in high dimensions, and that, being unimodal and symmetric, it cannot capture multimodal densities.
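As a sanity check of the density formula above, it can be evaluated directly and compared against `scipy.stats.multivariate_normal` (the numbers below are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])     # positive definite
x = np.array([0.3, 0.7])

# density evaluated directly from the formula above
d = mu.shape[0]
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
p_manual = np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))

# reference value from scipy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
assert np.isclose(p_manual, p_scipy)
```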
A very useful formula, used to derive the results below, concerns the kernel (the exponent) of the MVN \[-\frac{1}{2}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})^\top \Sigma^{-1}(\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}}) = -\frac{1}{2}\boldsymbol{\mathbf{x}}^\top \Sigma^{-1}\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{x}}^\top \Sigma^{-1} \boldsymbol{\mathbf{\mu}}+ \text{const}\]
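This identity is typically used for "completing the square": if an exponent can be rearranged into the form \[-\frac{1}{2}\boldsymbol{\mathbf{x}}^\top A \boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{x}}^\top \boldsymbol{\mathbf{b}}+ \text{const}\] for some symmetric positive definite matrix \(A\) and vector \(\boldsymbol{\mathbf{b}}\), then matching terms with the kernel above identifies the distribution as Gaussian with \(\Sigma = A^{-1}\) and \(\boldsymbol{\mathbf{\mu}}= A^{-1}\boldsymbol{\mathbf{b}}\).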
Eigendecomposing \(\Sigma\) and using the change of variable \(\boldsymbol{\mathbf{y}}= U^\top (\boldsymbol{\mathbf{x}}- \boldsymbol{\mathbf{\mu}})\), corresponding to centering the observations and rotating them using the eigenvector matrix \(U\), we find that under this linear coordinate transform every MVN becomes a product of \(d\) univariate normals: \[p(\boldsymbol{\mathbf{y}}) = \prod_{i=1}^d N_{y_i}(0, \lambda_i) \qquad \text{where } \lambda_i \text{ are the eigenvalues of } \Sigma\]
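A quick numerical check of this factorization (with an arbitrary covariance matrix): after centering the samples and rotating them by the eigenvector matrix \(U\), the coordinates become uncorrelated with variances equal to the eigenvalues of \(\Sigma\).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                            # positive definite

lam, U = np.linalg.eigh(Sigma)                            # Sigma = U diag(lam) U^T
X = rng.multivariate_normal(mu, Sigma, size=100_000)      # here rows are samples
Y = (X - mu) @ U                                          # y = U^T (x - mu), per sample

print(np.cov(Y.T).round(2))    # approximately diag(lam): uncorrelated coordinates
print(lam.round(2))            # eigenvalues of Sigma = variances of the y_i
```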
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. That is, if we partition \[ \boldsymbol{\mathbf{x}}= \begin{pmatrix} \boldsymbol{\mathbf{x}}_a \\ \boldsymbol{\mathbf{x}}_b \end{pmatrix} \qquad \boldsymbol{\mathbf{\mu}}= \begin{pmatrix} \boldsymbol{\mathbf{\mu}}_a \\ \boldsymbol{\mathbf{\mu}}_b \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \qquad \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \qquad \text{where } \Lambda \text{ is the $\textbf{precision}$ matrix} \] then the mean and covariance matrix of \(p(\boldsymbol{\mathbf{x}}_a\mid\boldsymbol{\mathbf{x}}_b)\) are given by \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{a\mid b} &= \boldsymbol{\mathbf{\mu}}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) = \boldsymbol{\mathbf{\mu}}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\boldsymbol{\mathbf{x}}_b - \boldsymbol{\mathbf{\mu}}_b) \qquad \textbf{linear} \text{ function of } \boldsymbol{\mathbf{x}}_b\\ \Sigma_{a\mid b} &= \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1} \qquad \textbf{independent} \text{ of } \boldsymbol{\mathbf{x}}_b \end{align} \] The general form \(p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}})\), with mean being a linear function of \(\boldsymbol{\mathbf{x}}\) and covariance independent of \(\boldsymbol{\mathbf{x}}\), is called a linear Gaussian model.
The marginal of a MVN is again a Gaussian: \[ p(\boldsymbol{\mathbf{x}}_a) = \int p(\boldsymbol{\mathbf{x}}_a, \boldsymbol{\mathbf{x}}_b) d\boldsymbol{\mathbf{x}}_b = N(\boldsymbol{\mathbf{\mu}}_a, \Sigma_{aa}) \]
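A small numpy sketch of the conditioning and marginalisation formulas above, for an arbitrary partition (dimensions and numbers chosen only for illustration):

```python
import numpy as np

def condition_gaussian(mu, Sigma, a, b, x_b):
    """Mean and covariance of p(x_a | x_b) for index sets a and b."""
    S_aa = Sigma[np.ix_(a, a)]
    S_ab = Sigma[np.ix_(a, b)]
    S_bb = Sigma[np.ix_(b, b)]
    K = S_ab @ np.linalg.inv(S_bb)              # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu[a] + K @ (x_b - mu[b])         # linear in x_b
    Sigma_cond = S_aa - K @ S_ab.T              # independent of x_b
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# the marginal of x_a is simply N(mu_a, Sigma_aa); the conditional needs the formula
mu_c, Sigma_c = condition_gaussian(mu, Sigma, a=[0, 1], b=[2], x_b=np.array([1.0]))
```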
Suppose we have a marginal Gaussian distribution for \(\boldsymbol{\mathbf{x}}\) and a conditional (linear Gaussian) distribution for \(\boldsymbol{\mathbf{y}}\) given \(\boldsymbol{\mathbf{x}}\): \[ p(\boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{x}}}(\boldsymbol{\mathbf{\mu}}, \Lambda^{-1}) \qquad p(\boldsymbol{\mathbf{y}}\mid \boldsymbol{\mathbf{x}}) = N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{x}}+ \boldsymbol{\mathbf{b}}, L^{-1}) \]
Then we can find the marginal distribution of \(\boldsymbol{\mathbf{y}}\) and the conditional distribution of \(\boldsymbol{\mathbf{x}}\) given \(\boldsymbol{\mathbf{y}}\), both of which are again Gaussian: \[ \begin{align} p(\boldsymbol{\mathbf{y}}) &= N_{\boldsymbol{\mathbf{y}}}(A\boldsymbol{\mathbf{\mu}}+ \boldsymbol{\mathbf{b}}, \, L^{-1} + A\Lambda^{-1}A^\top) \\ p(\boldsymbol{\mathbf{x}}\mid \boldsymbol{\mathbf{y}}) &= N_{\boldsymbol{\mathbf{x}}}\left(\Sigma\left[A^\top L(\boldsymbol{\mathbf{y}}- \boldsymbol{\mathbf{b}}) + \Lambda\boldsymbol{\mathbf{\mu}}\right], \, \Sigma\right) \qquad \text{where } \Sigma = (\Lambda + A^\top L A)^{-1} \end{align} \]
Suppose we have \(N\) IID multivariate Gaussian observations \(\left\{\boldsymbol{\mathbf{x}}_1, \ldots, \boldsymbol{\mathbf{x}}_N\right\}\), where \(\boldsymbol{\mathbf{x}}_i\sim N(\boldsymbol{\mathbf{\mu}}, \Sigma)\). Then \[ \begin{align} \boldsymbol{\mathbf{\mu}}_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N \boldsymbol{\mathbf{x}}_i \\ \Sigma_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N (\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})(\boldsymbol{\mathbf{x}}_i - \boldsymbol{\mathbf{\mu}}_{\text{ML}})^\top \end{align} \] Notice that while the maximum likelihood estimate of the mean is unbiased \(\mathbb{E}[\boldsymbol{\mathbf{\mu}}_{\text{ML}}] = \boldsymbol{\mathbf{\mu}}\) the maximum likelihood estimate of the covariance matrix is biased \[\mathbb{E}\left[\Sigma_{\text{ML}}\right] = \frac{N - 1}{N}\Sigma\]
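A minimal numpy sketch of these estimators, including the usual bias correction \(\frac{N}{N-1}\Sigma_{\text{ML}}\):

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates of a multivariate Gaussian from X of shape (d, N),
    one observation per column."""
    N = X.shape[1]
    mu_ml = X.mean(axis=1)                        # (1/N) sum_i x_i
    centred = X - mu_ml[:, None]
    Sigma_ml = (centred @ centred.T) / N          # biased: E[Sigma_ml] = (N-1)/N Sigma
    Sigma_unbiased = Sigma_ml * N / (N - 1)       # bias-corrected estimate
    return mu_ml, Sigma_ml, Sigma_unbiased
```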
The training error \(E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\) is evaluated only on the specific dataset \(D\). We call in-sample error the expected training error over all possible datasets \[\mathbb{E}_D\left[E(D, \boldsymbol{\mathbf{w}}_{\text{LS}})\right] = \sum_{i=1}^n \mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right]\] where \(\mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right]\) is called the expected error. Writing \(y_i = g(\boldsymbol{\mathbf{x}}_i) + \epsilon_i\), where \(g\) is the true regression function and \(\epsilon_i\) is the noise, it can be decomposed as follows \[ \mathbb{E}_D\left[(y_i - f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\mid\boldsymbol{\mathbf{x}}_i\right] = \underbrace{\text{Var}_{\epsilon}[\epsilon_i]}_{\text{irreducible error}} + \underbrace{(g(\boldsymbol{\mathbf{x}}_i) - \mathbb{E}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})])^2}_{\text{bias}^2} + \underbrace{\text{Var}_{\epsilon}[f(\boldsymbol{\mathbf{x}}_i; \boldsymbol{\mathbf{w}}_{\text{LS}})]}_{\text{variance}} \] This cannot be calculated because we don't know the distribution of \(\epsilon\) and we don't know \(g(\boldsymbol{\mathbf{x}}_i)\). Thus we actually use the out-sample error: \[ \mathbb{E}_{\boldsymbol{\mathbf{x}}}\left[\mathbb{E}_{\epsilon}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\right] = \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y-f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right] \] If we have a new batch of data \(D':=\left\{(y_i', \boldsymbol{\mathbf{x}}_i')\right\}_{i=1}^{n'}\), the out-sample error can be approximated by the testing error thanks to the Law of Large Numbers: \[\frac{1}{n'}\sum_{i=1}^{n'} (y_i' - f(\boldsymbol{\mathbf{x}}_i'; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2 \quad \overset{\text{LLN}}{\longrightarrow} \quad \mathbb{E}_{p(y, \boldsymbol{\mathbf{x}})}\left[(y - f(\boldsymbol{\mathbf{x}}; \boldsymbol{\mathbf{w}}_{\text{LS}}))^2\right]\]
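A small simulation sketch of this approximation: fit a straight line by least squares on a training batch drawn from a toy generative model (chosen purely for illustration), then use a large fresh batch so that the testing error approximates the out-sample error by the Law of Large Numbers.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):                             # toy "true" regression function
    return np.sin(x)

def make_batch(n):                    # draw (y, x) pairs from the assumed p(y, x)
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return g(x) + 0.1 * rng.standard_normal(n), x

# least-squares fit of a straight line on a small training batch
y_tr, x_tr = make_batch(30)
Phi_tr = np.vstack([x_tr, np.ones_like(x_tr)])             # (2, n): feature row + ones
w_ls = np.linalg.solve(Phi_tr @ Phi_tr.T, Phi_tr @ y_tr)

# testing error on a large fresh batch approximates the out-sample error (LLN)
y_te, x_te = make_batch(100_000)
Phi_te = np.vstack([x_te, np.ones_like(x_te)])
test_error = np.mean((y_te - w_ls @ Phi_te) ** 2)
```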