A Peek Behind the ML Curtain

Basic principles you need to know.
In the previous articles about Machine Learning (ML), I touched on advanced topics such as Computer Vision, Tensorflow Cats Vs Dogs I, Chat Bots, and many more. However, I never got into the nitty-gritty details of how Data Scientists and Machine Learning Engineers deal with model training and the maths behind the curtain. Let's get to it now:
Often it is impossible or impractical to compute certain values exactly, down to the last digit. This might be because it would take too long computationally or because we do not have enough information. Instead, these values can be estimated with the rigor of our choosing. In statistics, this concept is formalized in estimation theory. This article shows how it can be applied to ML and how we quantify the quality of a model. Since ML models generally cannot be perfect for complex problems, it is useful to describe how good they are.
Estimation theory is also useful for understanding different versions of gradient descent. Typically, the gradient is only estimated, using methods like (1) mini-batch or (2) stochastic gradient descent, and estimation theory explains the ideas behind these techniques.
Estimator Properties
An estimator is a function (for example one provided by your ML library) that estimates a value based on other observations. This process can involve randomness, for example because the function itself is randomized or because there is random noise in the observations it uses.
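To make this concrete, here is a minimal sketch in plain NumPy (not tied to any particular ML library; the true value and noise level are made up for illustration). The sample mean acts as an estimator of a quantity we can only observe through noisy measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 5.0  # the unknown quantity we would like to know
# Noisy observations: the truth plus random measurement noise.
observations = true_value + rng.normal(0.0, 2.0, size=100)

def sample_mean_estimator(obs):
    """A simple estimator: average the noisy observations."""
    return obs.mean()

print(f"estimate: {sample_mean_estimator(observations):.3f} (true value: {true_value})")
```

Running the estimator on a different random draw of observations gives a different estimate; this randomness is exactly what the bias and variance below describe.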
Bias
One measure for the quality of an estimator $\hat{\theta}$ of a value $\theta$ is its bias,

$$\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta,$$

where the expected value is taken over the randomness involved in $\hat{\theta}$.

If the bias of an estimator is zero, i.e. $E[\hat{\theta}] = \theta$, the estimator is called unbiased: on average it neither over- nor underestimates the true value.

Theorem: If $\hat{\theta}_1, \dots, \hat{\theta}_m$ are unbiased estimators of $\theta$, then every weighted combination

$$\hat{\theta} = \sum_{i=1}^{m} w_i \hat{\theta}_i,$$

where the sum of weights $\sum_{i=1}^{m} w_i = 1$, is again an unbiased estimator of $\theta$.

Proof: The unbiasedness is due to the linearity of expectation:

$$E[\hat{\theta}] = E\left[\sum_{i=1}^{m} w_i \hat{\theta}_i\right] = \sum_{i=1}^{m} w_i E[\hat{\theta}_i] = \sum_{i=1}^{m} w_i \theta = \theta.$$
This theorem about unbiased estimators is going to prove to be useful later.
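As a quick sanity check of the theorem, here is a small Monte Carlo sketch (the noise scales and weights are chosen arbitrarily): three unbiased estimators of the same value are combined, and the combination is still centered on the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 5.0

# Three unbiased estimators: the true value plus zero-mean noise of different scale.
scales = np.array([1.0, 2.0, 4.0])
weights = np.array([0.5, 0.3, 0.2])  # weights sum to 1

n_trials = 100_000
estimates = true_value + rng.normal(0.0, scales, size=(n_trials, 3))
combined = estimates @ weights       # weighted combination in each trial

# The empirical mean should be close to the true value of 5.0.
print(f"empirical mean of the combined estimator: {combined.mean():.3f}")
```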
Variance
However, even if we have an unbiased estimator, its individual estimates can still be far off from the true value. To quantify how consistently an estimator is close to the value it targets, another statistic is required. Commonly, the variance of the estimator is considered here:

$$\operatorname{Var}(\hat{\theta}) = E\left[\big(\hat{\theta} - E[\hat{\theta}]\big)^2\right].$$

It is defined as the expected squared distance between the estimate and its own expected value, i.e. it measures how much the individual estimates scatter around their average.
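The difference between bias and variance shows up in a small simulation (again just a sketch with made-up numbers): both estimators below are unbiased, but averaging more observations gives a much lower variance.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 5.0

def repeated_estimates(n_obs, n_trials):
    """Run the sample-mean estimator on n_obs noisy observations, n_trials times."""
    obs = true_value + rng.normal(0.0, 2.0, size=(n_trials, n_obs))
    return obs.mean(axis=1)

single = repeated_estimates(n_obs=1, n_trials=100_000)     # uses one observation
averaged = repeated_estimates(n_obs=50, n_trials=100_000)  # averages 50 observations

# Both means are ~5.0 (unbiased), but the variances differ roughly by a factor of 50.
print(f"single:   mean={single.mean():.3f}, variance={single.var():.3f}")
print(f"averaged: mean={averaged.mean():.3f}, variance={averaged.var():.3f}")
```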
Bias-Variance Tradeoff
Many different things can be analyzed using estimators. For example, statistical and ML models can be seen as estimators. They use observations (samples, data…) to make predictions. These predictions are generally not perfect because randomness is involved and only a limited amount of information is available. Thus, it makes sense to analyze statistical models in terms of bias and variance.
A central problem when building models is balancing underfitting and overfitting. If the training data is just memorized, the model does not generalize well to new data; this is a case of overfitting. The opposite issue, capturing the pattern in the training data only very coarsely, is called underfitting.
This problem is also known as the bias-variance tradeoff:
If the model has a high bias, its predictions are off, which corresponds to underfitting.
If overfitting occurs, i.e., the training data is learned too well, the estimates have a high variance: resampling the data that the model was built on produces totally different estimates, because the model now fits different random noise.
Generally, it is not possible to perfectly optimize both bias and variance, so they need to be balanced. In other words, we accept a certain bias of the model to keep its variance low, or vice versa. A good tradeoff between the two needs to be achieved.
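The tradeoff can be illustrated with a small polynomial fitting experiment (a sketch using NumPy's polyfit on synthetic data; the underlying function, noise level, and degrees are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy training samples from a smooth underlying function.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=x_train.size)

# A noise-free test set from the same function.
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # fit a polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Typically, the straight line underfits (high bias: large error on both sets), while the high-degree polynomial chases the training noise (high variance: small training error, larger test error). The middle degree balances the two.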
Gradient Descent
In supervised machine learning, we compare our model's predictions to the true labels. This is done using a loss function. If a set of data points $(x_1, y_1), \dots, (x_n, y_n)$ is given, the total loss can be written as

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big),$$

where $f_\theta$ is the model with parameters $\theta$ and $\ell$ is the per-sample loss, for example the squared error.
Gradient descent optimizes the parameters $\theta$ used in $f_\theta$ by repeatedly stepping in the direction of the negative gradient of the loss,

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta),$$

where $\eta$ is the learning rate.
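The update rule itself is easy to demonstrate on a toy loss (purely hypothetical, not tied to any dataset): minimizing $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$.

```python
# Gradient descent on the toy loss L(theta) = (theta - 3)^2.
theta = 0.0           # initial parameter value
learning_rate = 0.1   # step size eta

for step in range(50):
    gradient = 2.0 * (theta - 3.0)             # exact gradient of the toy loss
    theta = theta - learning_rate * gradient   # step against the gradient

print(round(theta, 4))  # close to the minimum at theta = 3
```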
Full-Batch Gradient Descent
To compute the gradient of the total loss exactly, every sample in the training set has to be evaluated:

$$\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big).$$

The method that uses the gradient given above is sometimes referred to as full-batch gradient descent because it fully uses the available training data in each iteration. Sometimes it is also just called "Batch Gradient Descent", but that can be imprecise and confusing. In many cases, the full set of samples is so large that evaluating this sum in every single iteration becomes prohibitively expensive.
Computing the gradient as shown above is especially inefficient if there is duplicated training data. If the training set consists of 10 copies of a smaller dataset, then the evaluation of the formula above is unnecessarily expensive: every required calculation is repeated 10 times. While this is an extreme example, it does happen in practice that much of the training data is similar. To save time, it often makes sense to use only a part of the data to estimate the gradient.
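For concreteness, here is a sketch of full-batch gradient descent for a simple linear model with a squared-error loss (the synthetic data, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: y = 2x + 1 plus a little noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0   # parameters of the model f(x) = w*x + b
lr = 0.5          # learning rate

for epoch in range(200):
    error = w * x + b - y
    # Full-batch gradient of the mean squared error: every sample is used in every iteration.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```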
Stochastic Gradient Descent
In stochastic gradient descent (SGD), one single data point $(x_i, y_i)$ is drawn uniformly at random from the training set in each iteration, and its gradient is used as an estimate of the full gradient:

$$\hat{g} = \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big).$$

It is easy to see that this estimator is unbiased: since every sample is drawn with probability $1/n$,

$$E[\hat{g}] = \sum_{i=1}^{n} \frac{1}{n} \, \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big) = \nabla_\theta L(\theta).$$
The computations for SGD can be performed very quickly but still give us an unbiased estimate of the true gradient. This property is the reason why optima can be found using this algorithm. While individual estimates are off, the randomness averages out over iterations and the parameters still move in a sensible direction overall. Since iterations are much cheaper, many more of them can be performed, which is a major improvement over computing the full gradient.
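The same toy problem can be optimized with SGD by replacing the full-batch gradient with the gradient of a single randomly drawn sample (again only a sketch; the smaller learning rate and the number of steps are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Same kind of synthetic data: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0
lr = 0.05

for step in range(20_000):
    i = rng.integers(len(x))   # draw one sample uniformly at random
    error = w * x[i] + b - y[i]
    # Noisy but unbiased estimate of the full gradient from a single data point.
    w -= lr * 2.0 * error * x[i]
    b -= lr * 2.0 * error

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```

Each step touches only one data point, so individual iterations are very cheap, but the updates jump around noticeably from step to step.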
Mini-Batch Gradient Descent
However, these individual SGD estimates can have a large variance, leading to noisy, jumpy gradients that show up as zig-zagging training curves. A further improvement over this method is mini-batch gradient descent. Instead of taking just one sample, we sample a small batch of $m$ data points $(x_{i_1}, y_{i_1}), \dots, (x_{i_m}, y_{i_m})$ and average their gradients:

$$\hat{g} = \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta \, \ell\big(f_\theta(x_{i_j}), y_{i_j}\big).$$
Each of these individual per-sample estimators is unbiased, since SGD itself is unbiased. As shown in the theorem earlier, a weighted combination of them (here with equal weights $1/m$) remains an unbiased estimator. Thus, mini-batch gradient descent is also an unbiased way of computing gradient estimates.
Mini-batch gradient descent does, however, have much less variance, because more data is used to compute each estimate. This makes the optimization process smoother and less noisy than SGD.
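A mini-batch version of the same sketch averages the gradients over a small random batch (batch size and learning rate again chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)

# Same kind of synthetic data: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0
lr, batch_size = 0.2, 32

for step in range(2_000):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # sample a mini-batch
    error = w * x[idx] + b - y[idx]
    # Averaging per-sample gradients keeps the estimate unbiased but reduces its variance.
    w -= lr * 2.0 * np.mean(error * x[idx])
    b -= lr * 2.0 * np.mean(error)

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```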
Most gradient computations can be formulated as linear algebra operations. These calculations can be vectorized and parallelized very well on graphics cards, e.g. with CUDA, so you do not have to wait for hours. With an appropriate graphics card in your computer, there is no significant per-iteration performance penalty for using mini-batch gradient descent to train the model. Thus, mini-batch gradient descent is usually the preferred method and leads to a more stable optimization process.
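To see why the per-iteration cost is not a problem, compare a naive per-sample loop with the same gradient expressed as matrix operations (a CPU-only NumPy sketch with made-up sizes; on a GPU framework the gap is even larger):

```python
import time
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100_000, 50))   # a batch of 100,000 samples with 50 features
y = rng.normal(size=100_000)
w = np.zeros(50)

# Naive loop over individual samples.
start = time.perf_counter()
grad_loop = np.zeros(50)
for i in range(len(X)):
    grad_loop += 2.0 * (X[i] @ w - y[i]) * X[i]
grad_loop /= len(X)
print(f"loop:       {time.perf_counter() - start:.3f}s")

# The same gradient as two matrix operations.
start = time.perf_counter()
grad_vec = 2.0 * X.T @ (X @ w - y) / len(X)
print(f"vectorized: {time.perf_counter() - start:.3f}s")

print("max difference:", np.abs(grad_loop - grad_vec).max())
```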
Conclusion
- Estimators provide an elegant way of analyzing the quality of estimates. In ML, estimates play an important role because data contains a lot of random noise and because it is often more practical to only estimate values (i.e., predictions).
- The quality of statistical models can be described in terms of bias and variance. Too much bias corresponds to underfitting, while too much variance is equivalent to overfitting. The training process needs to find a tradeoff between these two.
- Computing the gradient for the optimization process with all samples of a large dataset can be expensive. By randomly sampling the data, we can compute unbiased estimates in a much faster way (i.e., Stochastic Gradient Descent). If this is done using a large enough sample of the total dataset (Mini-Batch Gradient Descent), the variance of these estimates does not have to be large. By properly choosing the sample size, the optimization process can thus be accelerated significantly.
If you liked the mathematical representation of the formulas in this article and would like to adopt it, check out https://katex.org/, a LaTeX implementation for the web. On this blog it is integrated via the gem "kramdown-math-katex".