These are nice animations. However I've always hesitated to get too enamored with these simple 2D visualizations of gradient descent, because one of the strange takeaways from deep learning is that behavior in high dimensions is very different from behavior in low dimensions.
In a 2D problem with many local minima, like the "eggholder function" [1], gradient descent will be hopeless. But neural net optimization in high dimensions really is a similar situation with many local minima, except there gradient descent does great.
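To make the 2D failure concrete, here's a throwaway sketch (plain gradient descent on the standard Eggholder test function; the step size and starting point are arbitrary choices of mine):

```python
import numpy as np

def eggholder(p):
    # Standard Eggholder test function: riddled with local minima on
    # [-512, 512]^2, global minimum near (512, 404.23).
    x, y = p
    return (-(y + 47) * np.sin(np.sqrt(abs(x / 2 + y + 47)))
            - x * np.sin(np.sqrt(abs(x - (y + 47)))))

def num_grad(f, p, eps=1e-5):
    # Finite-difference gradient, just to keep the sketch dependency-free.
    g = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = eps
        g[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return g

p = np.array([0.0, 0.0])               # arbitrary starting point
for _ in range(2000):
    p -= 0.2 * num_grad(eggholder, p)  # plain gradient descent

# Ends up in (or bouncing around) some nearby local basin, nowhere near
# the global minimum at roughly (512, 404).
print(p, eggholder(p))
```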
Gradient descent in high dimensions also seems to have the ability to "step over" areas of high loss, which you can see by looking at the loss of a linear interpolation between weights at successive steps of gradient descent. This, again, seems like extremely strange behavior with no low-dimensional analogue.
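The interpolation measurement itself is easy to set up; roughly something like this (a toy PyTorch sketch with a made-up model and random data, just to show the mechanics; in a real run you'd measure this between steps of actual training):

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)   # stand-in data
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One optimizer step, keeping the flattened parameters before and after.
theta_prev = parameters_to_vector(model.parameters()).detach().clone()
opt.zero_grad()
loss_fn(model(X), y).backward()
opt.step()
theta_next = parameters_to_vector(model.parameters()).detach().clone()

# Loss along the straight line between the two successive iterates.
for alpha in torch.linspace(0, 1, 11):
    vector_to_parameters((1 - alpha) * theta_prev + alpha * theta_next,
                         model.parameters())
    with torch.no_grad():
        print(round(float(alpha), 1), loss_fn(model(X), y).item())
```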
Does gradient descent really do well for deep learning when the gradient is computed with respect to the whole dataset? I assumed that the noise in SGD played an important role in escaping local minima.
There aren't really local minima in most deep networks. When you get into millions/billions of parameters, there will essentially always be some directions that point downwards. You have to get really really close to the true minimum for there to be no direction to go that improves the loss.
Incidentally this same phenomenon is IMO how evolution is able to build things like the eye. Naively you'd think that since you need so many parts arranged so well, it's impossible to find a step by step path where fitness goes up at every step, i.e. if you just have a retina with no lens or vice-versa, it doesn't work. However, due to the high dimensionality of DNA, there is essentially guaranteed to be a path with monotonically increasing fitness just because there are so many different possible paths from A to B in the high dimensional space that at least one is bound to work.
Now this isn't strictly true for every high dimensional system. You need to have a degree of symmetry or redundancy in the encoding for it to work. For example, in convolutional neural networks, you see this phenomenon where some filters get "stuck" in training, and those are local minima for that subspace. What happens, though, is that if one filter gets stuck, the network will just use another one that had a better initialization. This is why things like pruning and lottery tickets work. Residual connections enhance this effect, since you can even be stuck in a whole layer and the training process can just bypass it.
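For the residual-connection point, a tiny toy sketch (mine, nothing canonical): even if a block's weights are stuck at zero, the skip path keeps the signal and the gradient flowing, so the rest of the network can route around it.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)

    def forward(self, x):
        # Skip connection: if self.f is "stuck" (weights ~ 0), the block
        # degenerates to the identity instead of killing the signal.
        return x + self.f(x)

block = ResBlock(8)
nn.init.zeros_(block.f.weight)   # simulate a block stuck at zero
nn.init.zeros_(block.f.bias)

x = torch.randn(4, 8, requires_grad=True)
block(x).sum().backward()
print(torch.allclose(block(x), x))   # True: the block just passes x through
print(x.grad.abs().sum() > 0)        # gradient still reaches earlier layers
```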
You see the same thing with life, where you could put a sequence for the same protein in different parts of the genome and it could still be produced, regardless of the position. There are also many different ways to encode the exact same protein, and many different possible proteins that will have the same shape in the critical areas. Life finds a way.
If that were the case, we would be finding globally optimal solutions for complicated non-convex optimization problems.
The reality is different: you need to really explore the space to find the true global optimum.
A better explanation is that for ML you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit unseen data.
That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the number of parameters exceeds the number of training examples, SGD reliably finds parameters that interpolate the training data with zero loss. Surprisingly, most of these generalize well and overfitting is not a big problem.
In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.
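A linear toy version of the first half of that claim (my own sketch, so just illustrative): with more parameters than data points there are infinitely many zero-loss solutions, and plain gradient descent from zero init reliably finds one of them, and in fact a specific one (the minimum-norm interpolator), which is the kind of implicit bias people invoke when explaining why the interpolating solutions deep nets find can still generalize.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                       # far more parameters than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                      # zero init
for _ in range(20000):
    w -= 0.001 * X.T @ (X @ w - y) / n   # plain (full-batch) gradient descent

w_min_norm = np.linalg.pinv(X) @ y   # the minimum-norm interpolator
print("train MSE:", np.mean((X @ w - y) ** 2))                      # ~0
print("gap to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```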
I have read this in several places and want to learn more. Do you have a reference handy?
This is something I saw a talk about a while ago. There are probably more recent papers on this topic; you might want to browse the citations of this one:
https://arxiv.org/abs/2003.00307
[1] Was an empirical paper that inspired much theoretical follow-up.
[2] Is one such follow-up, and the references therein should point to many of the other key works in the years between.
[3] Introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on NTK is the right way towards long-term theoretical progress.)
[4] Is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.
---
[1] Understanding deep learning requires rethinking generalization. Zhang et al., arXiv, 2016. https://arxiv.org/abs/1611.03530
[2] Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization. Azizan et al., arXiv, 2019. https://arxiv.org/abs/1906.03830
[3] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al., NeurIPS, 2018. https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62...
[4] A Universal Law of Robustness via Isoperimetry. Bubeck et al., NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/f197002b9a085...
Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that try to argue one way or another why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minima (with bad generalization) and ended up in a "good" local minimum (that generalized well). Their interpretation was that, due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).
Another paper showed that networks trained on roughly the same dataset, but from different random initializations, have a symmetry relation in the loss landscape under a permutation of the weights. You can always find a permutation that lets you linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting, even if it isn't directly related to generalization performance. It has potential applications in network merging, I guess.
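The permutation symmetry itself is easy to verify on a toy MLP (my sketch, nothing to do with the actual paper's matching procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 5, 16, 3
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(o, h)), rng.normal(size=o)

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2    # one-hidden-layer ReLU MLP

perm = rng.permutation(h)
W1p, b1p = W1[perm], b1[perm]   # permute the hidden units...
W2p = W2[:, perm]               # ...and the matching output columns

x = rng.normal(size=d)
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2)))  # True
```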
How so?
If there are no local minima other than the global one, there are convex optimization methods that are far faster than SGD or Adam. The only reason these methods exist is that training deep networks is a non-convex optimization problem.
There are many nonconvex functions where every local minimum is global, and even many nonconvex functions with a unique local minimum (which is de facto global). Convex methods can fail on those.
The reason why GD and friends are a good choice in deep learning is that computing the gradient is cheap (and approximating it even more so). Every descent method relies on solving a subproblem of sorts, typically projecting the current iterate onto a sublevel set of an approximation of the function, for some definition of projection. With GD, it's as cheap as it gets: just subtract a scaled-down version of the gradient. Subproblems in other algorithms are a lot more expensive computationally, particularly in high dimensions. So more efficient as in requiring fewer function evaluations, yes, but at the cost of doing a lot more work at each step.
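As a rough illustration of cheap subproblem vs expensive subproblem (a toy quadratic of my own; dimensions and step size are arbitrary): the GD update is a matrix-vector product plus a subtraction, whereas e.g. a Newton step means solving a linear system in the parameter dimension, which is hopeless at deep-learning scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                                   # "parameter" dimension (tiny here;
A = rng.normal(size=(n, n))                #  think 1e6..1e12 for a deep net)
A = A @ A.T + n * np.eye(n)                # SPD Hessian of a toy quadratic
b = rng.normal(size=n)
x = np.zeros(n)

grad = A @ x - b                           # gradient of 0.5*x'Ax - b'x

x_gd = x - 1e-4 * grad                     # GD step: subtract a scaled gradient
x_newton = x - np.linalg.solve(A, grad)    # Newton step: solve an n x n system
# The Newton-style subproblem needs far fewer iterations overall, but each one
# costs roughly O(n^3) here, and at deep-learning scale you can't even store A.
```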
My understanding of this phenomenon in DL is that it's not due to anything intrinsic to gradient descent; the same principles and understanding apply.
Rather, it is that with very large dimensionality the probability that you spuriously get all derivatives to be zero is vanishingly small. That is, local minima are less likely because you need a lot of dimensions to agree that ∂f/∂x_i = 0.
I may be wrong though!
If the probability that you get a derivative close to 0 is small, say only 10%, then you need just 3 dimensions to get that multiplicative probability down to a tenth of a percent. 3 is hardly “very large dimensionality”.
You can assign different numbers, but you will still find you don't need more than, say, 10 dimensions for this effect to happen.
Behavior in high dimensions is even weirder than you might think, because you're nearly always close to a saddle point. There are way more saddle points than minima or maxima (consider modeling the signs of the Hessian's eigenvalues as a binomial distribution -- how likely is it that they're all positive or all negative?).
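The back-of-the-envelope version of that coin-flip argument (assuming, crudely, independent 50/50 signs for each eigenvalue):

```python
# Crude coin-flip model: each of d Hessian eigenvalues at a critical point is
# positive or negative with probability 1/2, independently.
# P(all same sign, i.e. a minimum or a maximum) = 2 * 0.5**d
for d in (2, 10, 100, 1000):
    print(d, 2 * 0.5 ** d)
# 2 -> 0.5, 10 -> ~0.002, 100 -> ~1.6e-30, 1000 -> ~0: almost every critical
# point is a saddle.
```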
However, usually only a relatively small subset of dimensions is important (read: represents a stable optimum); see the 'lottery ticket' hypothesis for more.
What's also weird is that "convergence" in these cases isn't really convergence; it's just slowly floating between saddle points as you begin to overfit. Hence early stopping.
This has much less to do with gradients per se and way more to do with step size. It's an artifact of the discretization, not of the algorithm itself.
Might I suggest an insightful seminar for anyone curious about why gradient descent is even tractable in such high dimensions:
Stanford Seminar - A Picture of the Prediction Space of Deep Networks by Pratik Chaudhari:
https://youtu.be/ZD2cL-QoI5g?si=hVFU5_4CeoLYtyuB
Not super strange if you think of it going the other way. Think of a saddle-like construct: in one dimension you get stuck easily, while in another there is a way forward, and over time the previously 'stuck' dimension is freed up again. In higher dimensions you have a much higher chance of having such alternative pathways. During training this looks like brief plateaus in loss reduction followed by a sudden rapid decrease again.
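You can see that plateau-then-drop pattern even in a 2D toy with a single saddle (my own made-up function, with a saddle at the origin and minima at y = ±1):

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + (y**2 - 1)**2          # saddle at (0, 0), minima at y = +/-1

def grad(p):
    x, y = p
    return np.array([2 * x, 4 * y * (y**2 - 1)])

p = np.array([1.0, 1e-6])                # start almost on the saddle's axis
losses = []
for _ in range(300):
    p -= 0.05 * grad(p)
    losses.append(f(p))

# Loss falls to ~1 quickly, plateaus while the tiny y component slowly grows,
# then suddenly drops to ~0 once it escapes the saddle toward y = 1.
print([round(l, 3) for l in losses[::30]])
```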
Keeping that in mind, it is still useful. Maybe one useful addition would be to reduce dimensionality: show that if we pick some subset of dimensions and there is no local minimum when restricted to them (i.e. there is still a downhill direction), then there is no local minimum in the higher-dimensional space we started with either.