A Peek Behind the ML Curtain

Basic principles you need to know.
In the previous articles about Machine Learning (ML), I touched on advanced topics such as Computer Vision, Tensorflow Cats Vs Dogs I, Chat Bots, and many more. However, I never got into the nitty-gritty details of how Data Scientists and Machine Learning Engineers deal with model training and the maths behind the curtain. Let's get to it now:
Often it is impossible or impractical to compute certain values exactly, down to the last digit. This might be because it would take too long computationally or because we do not have enough information. Instead, these values can be estimated with the rigor of our choosing. In statistics, this concept is formalized in estimation theory. This article shows how it can be applied to ML and how we quantify the quality of a model. Since ML models generally cannot be perfect for complex problems, it is useful to describe how good they are.
Estimation theory is also useful for understanding different versions of gradient descent. Typically, the gradient is only estimated, using methods like (1) mini-batch or (2) stochastic gradient descent, and estimation theory explains the ideas behind these techniques.
Estimator Properties
An estimator is a function (for example one provided by your ML library) that estimates a value based on other observations. This process can involve randomness, for example because the function itself is randomized or because there is random noise in the observations it uses.
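To make this concrete, here is a minimal sketch in plain NumPy (not tied to any particular ML library; the true value and noise level are made up for illustration). The sample mean acts as an estimator of a quantity we can only observe through noisy measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 5.0  # the unknown quantity we would like to know
# Noisy observations: the truth plus random measurement noise.
observations = true_value + rng.normal(0.0, 2.0, size=100)

def sample_mean_estimator(obs):
    """A simple estimator: average the noisy observations."""
    return obs.mean()

print(f"estimate: {sample_mean_estimator(observations):.3f} (true value: {true_value})")
```

Running the estimator on a different random draw of observations gives a different estimate; this randomness is exactly what the bias and variance below describe.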
Bias
One measure for the quality of an estimator $\hat{\theta}$ of a value $\theta$ is its bias,

$$\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta,$$

where the expected value is taken over the randomness involved in $\hat{\theta}$.

If the bias of an estimator is zero, i.e. $E[\hat{\theta}] = \theta$, the estimator is called unbiased: on average it neither over- nor underestimates the true value.

Theorem: If $\hat{\theta}_1, \dots, \hat{\theta}_m$ are unbiased estimators of $\theta$, then every weighted combination

$$\hat{\theta} = \sum_{i=1}^{m} w_i \hat{\theta}_i,$$

where the sum of weights $\sum_{i=1}^{m} w_i = 1$, is again an unbiased estimator of $\theta$.

Proof: The unbiasedness is due to the linearity of expectation:

$$E[\hat{\theta}] = E\left[\sum_{i=1}^{m} w_i \hat{\theta}_i\right] = \sum_{i=1}^{m} w_i E[\hat{\theta}_i] = \sum_{i=1}^{m} w_i \theta = \theta.$$
This theorem about unbiased estimators is going to prove to be useful later.
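As a quick sanity check of the theorem, here is a small Monte Carlo sketch (the noise scales and weights are chosen arbitrarily): three unbiased estimators of the same value are combined, and the combination is still centered on the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 5.0

# Three unbiased estimators: the true value plus zero-mean noise of different scale.
scales = np.array([1.0, 2.0, 4.0])
weights = np.array([0.5, 0.3, 0.2])  # weights sum to 1

n_trials = 100_000
estimates = true_value + rng.normal(0.0, scales, size=(n_trials, 3))
combined = estimates @ weights       # weighted combination in each trial

# The empirical mean should be close to the true value of 5.0.
print(f"empirical mean of the combined estimator: {combined.mean():.3f}")
```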
Variance
However, even if we have an unbiased estimator, its individual estimates can still be far off from the true value. To quantify how consistently an estimator is close to the value it targets, another statistic is required. Commonly, the variance of the estimator is considered here:

$$\operatorname{Var}(\hat{\theta}) = E\left[\big(\hat{\theta} - E[\hat{\theta}]\big)^2\right].$$

It is defined as the expected squared distance between the estimate and its own expected value, i.e. it measures how much the individual estimates scatter around their average.
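The difference between bias and variance shows up in a small simulation (again just a sketch with made-up numbers): both estimators below are unbiased, but averaging more observations gives a much lower variance.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 5.0

def repeated_estimates(n_obs, n_trials):
    """Run the sample-mean estimator on n_obs noisy observations, n_trials times."""
    obs = true_value + rng.normal(0.0, 2.0, size=(n_trials, n_obs))
    return obs.mean(axis=1)

single = repeated_estimates(n_obs=1, n_trials=100_000)     # uses one observation
averaged = repeated_estimates(n_obs=50, n_trials=100_000)  # averages 50 observations

# Both means are ~5.0 (unbiased), but the variances differ roughly by a factor of 50.
print(f"single:   mean={single.mean():.3f}, variance={single.var():.3f}")
print(f"averaged: mean={averaged.mean():.3f}, variance={averaged.var():.3f}")
```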
Bias-Variance Tradeoff
Many different things can be analyzed using estimators. For example, statistical and ML models can be seen as estimators. They use observations (samples, data…) to make predictions. These predictions are generally not perfect because randomness is involved and only a limited amount of information is available. Thus, it makes sense to analyze statistical models in terms of bias and variance.
A central problem when building models is balancing underfitting and overfitting. If the training data is just memorized, the model does not generalize well to new data; this is a case of overfitting. The opposite issue, capturing the pattern in the training data only very coarsely, is called underfitting.
This problem is also known as the bias-variance tradeoff:
If the model has a high bias, its predictions are off, which corresponds to underfitting.
If overfitting occurs, i.e., the training data is learned too well, the estimates have a high variance: resampling the data that the model was built on produces totally different estimates, because the model now fits different random noise.
Generally, it is not possible to perfectly optimize both bias and variance, so they need to be balanced. In other words, we accept a certain bias of the model to keep its variance low, or vice versa. A good tradeoff between the two needs to be achieved.
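The tradeoff can be illustrated with a small polynomial fitting experiment (a sketch using NumPy's polyfit on synthetic data; the underlying function, noise level, and degrees are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy training samples from a smooth underlying function.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=x_train.size)

# A noise-free test set from the same function.
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # fit a polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Typically, the straight line underfits (high bias: large error on both sets), while the high-degree polynomial chases the training noise (high variance: small training error, larger test error). The middle degree balances the two.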
Gradient Descent
In supervised machine learning, we compare our model's predictions to the true labels. This is done using a loss function. If a set of data points $(x_1, y_1), \dots, (x_n, y_n)$ is given, the total loss can be written as

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big),$$

where $f_\theta$ is the model with parameters $\theta$ and $\ell$ is the per-sample loss, for example the squared error.
Gradient descent optimizes the parameters $\theta$ used in $f_\theta$ by repeatedly stepping in the direction of the negative gradient of the loss,

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta),$$

where $\eta$ is the learning rate.
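The update rule itself is easy to demonstrate on a toy loss (purely hypothetical, not tied to any dataset): minimizing $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$.

```python
# Gradient descent on the toy loss L(theta) = (theta - 3)^2.
theta = 0.0           # initial parameter value
learning_rate = 0.1   # step size eta

for step in range(50):
    gradient = 2.0 * (theta - 3.0)             # exact gradient of the toy loss
    theta = theta - learning_rate * gradient   # step against the gradient

print(round(theta, 4))  # close to the minimum at theta = 3
```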
Full-Batch Gradient Descent
To compute the gradient of the total loss exactly, every sample in the training set has to be evaluated:

$$\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big).$$

The method that uses the gradient given above is sometimes referred to as full-batch gradient descent because it fully uses the available training data in each iteration. Sometimes it is also just called "Batch Gradient Descent", but that can be imprecise and confusing. In many cases, the full set of samples is so large that evaluating this sum in every single iteration becomes prohibitively expensive.
Computing the gradient as shown above is especially inefficient if there is duplicated training data. If the training set consists of 10 copies of a smaller dataset, then the evaluation of the formula above is unnecessarily expensive: every required calculation is repeated 10 times. While this is an extreme example, it does happen in practice that much of the training data is similar. To save time, it often makes sense to use only a part of the data to estimate the gradient.
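For concreteness, here is a sketch of full-batch gradient descent for a simple linear model with a squared-error loss (the synthetic data, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: y = 2x + 1 plus a little noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0   # parameters of the model f(x) = w*x + b
lr = 0.5          # learning rate

for epoch in range(200):
    error = w * x + b - y
    # Full-batch gradient of the mean squared error: every sample is used in every iteration.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```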
Stochastic Gradient Descent
In stochastic gradient descent (SGD), one single data point $(x_i, y_i)$ is drawn uniformly at random from the training set in each iteration, and its gradient is used as an estimate of the full gradient:

$$\hat{g} = \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big).$$

It is easy to see that this estimator is unbiased: since every sample is drawn with probability $1/n$,

$$E[\hat{g}] = \sum_{i=1}^{n} \frac{1}{n} \, \nabla_\theta \, \ell\big(f_\theta(x_i), y_i\big) = \nabla_\theta L(\theta).$$
The computations for SGD can be performed very quickly but still give us an unbiased estimate of the true gradient. This property is the reason why optima can be found using this algorithm. While individual estimates are off, the randomness averages out over iterations and the parameters still move in a sensible direction overall. Since iterations are much cheaper, many more of them can be performed, which is a major improvement over computing the full gradient.
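The same toy problem can be optimized with SGD by replacing the full-batch gradient with the gradient of a single randomly drawn sample (again only a sketch; the smaller learning rate and the number of steps are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Same kind of synthetic data: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0
lr = 0.05

for step in range(20_000):
    i = rng.integers(len(x))   # draw one sample uniformly at random
    error = w * x[i] + b - y[i]
    # Noisy but unbiased estimate of the full gradient from a single data point.
    w -= lr * 2.0 * error * x[i]
    b -= lr * 2.0 * error

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```

Each step touches only one data point, so individual iterations are very cheap, but the updates jump around noticeably from step to step.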
Mini-Batch Gradient Descent
However, these individual SGD estimates can have a large variance, leading to noisy, jumpy gradients that show up as zig-zagging training curves. A further improvement over this method is mini-batch gradient descent. Instead of taking just one sample, we sample a small batch of $m$ data points $(x_{i_1}, y_{i_1}), \dots, (x_{i_m}, y_{i_m})$ and average their gradients:

$$\hat{g} = \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta \, \ell\big(f_\theta(x_{i_j}), y_{i_j}\big).$$
Each of these individual per-sample estimators is unbiased, since SGD itself is unbiased. As shown in the theorem earlier, a weighted combination of them (here with equal weights $1/m$) remains an unbiased estimator. Thus, mini-batch gradient descent is also an unbiased way of computing gradient estimates.
Mini-batch gradient descent does, however, have much less variance, because more data is used to compute each estimate. This makes the optimization process smoother and less noisy than SGD.
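A mini-batch version of the same sketch averages the gradients over a small random batch (batch size and learning rate again chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)

# Same kind of synthetic data: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0
lr, batch_size = 0.2, 32

for step in range(2_000):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # sample a mini-batch
    error = w * x[idx] + b - y[idx]
    # Averaging per-sample gradients keeps the estimate unbiased but reduces its variance.
    w -= lr * 2.0 * np.mean(error * x[idx])
    b -= lr * 2.0 * np.mean(error)

print(f"w = {w:.2f} (true 2.0), b = {b:.2f} (true 1.0)")
```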
Most gradient computations can be formulated as linear algebra operations. These calculations can be vectorized and parallelized very well on graphics cards, e.g. with CUDA, so you do not have to wait for hours. With an appropriate graphics card in your computer, there is no significant per-iteration performance penalty for using mini-batch gradient descent to train the model. Thus, mini-batch gradient descent is usually the preferred method and leads to a more stable optimization process.
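To see why the per-iteration cost is not a problem, compare a naive per-sample loop with the same gradient expressed as matrix operations (a CPU-only NumPy sketch with made-up sizes; on a GPU framework the gap is even larger):

```python
import time
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100_000, 50))   # a batch of 100,000 samples with 50 features
y = rng.normal(size=100_000)
w = np.zeros(50)

# Naive loop over individual samples.
start = time.perf_counter()
grad_loop = np.zeros(50)
for i in range(len(X)):
    grad_loop += 2.0 * (X[i] @ w - y[i]) * X[i]
grad_loop /= len(X)
print(f"loop:       {time.perf_counter() - start:.3f}s")

# The same gradient as two matrix operations.
start = time.perf_counter()
grad_vec = 2.0 * X.T @ (X @ w - y) / len(X)
print(f"vectorized: {time.perf_counter() - start:.3f}s")

print("max difference:", np.abs(grad_loop - grad_vec).max())
```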
Conclusion
- Estimators provide an elegant way of analyzing the quality of estimates. In ML, estimates play an important role because data contains a lot of random noise and because it is often more practical to only estimate values (i.e., predictions).
- The quality of statistical models can be described in terms of bias and variance. Too much bias corresponds to underfitting, while too much variance is equivalent to overfitting. The training process needs to find a tradeoff between these two.
- Computing the gradient for the optimization process with all samples of a large dataset can be expensive. By randomly sampling the data, we can compute unbiased estimates in a much faster way (i.e., Stochastic Gradient Descent). If this is done using a large enough sample of the total dataset (Mini-Batch Gradient Descent), the variance of these estimates does not have to be large. By properly choosing the sample size, the optimization process can thus be accelerated significantly.
If you liked the mathematical representation of the formulas in this article and would like to adopt it, check out https://katex.org/, a LaTeX implementation for the web. On this blog it is integrated via the gem "kramdown-math-katex".