# Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

@article{Oymak2020TowardMO, title={Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks}, author={Samet Oymak and Mahdi Soltanolkotabi}, journal={IEEE Journal on Selected Areas in Information Theory}, year={2020}, volume={1}, pages={84-105} }

Many modern neural network architectures are trained in an overparameterized regime where the number of model parameters exceeds the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that…

#### 151 Citations

On the Convergence of Deep Networks with Sample Quadratic Overparameterization

- Computer Science
- ArXiv
- 2021

A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, and it can be used to study generalization as well.

Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

- Computer Science, Mathematics
- AISTATS
- 2020

Under a rich dataset model, it is shown that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization, which sheds light on the empirical robustness of deep networks as well as on commonly adopted heuristics to prevent overfitting.

Nearly Minimal Over-Parametrization of Shallow Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2019

It is established that linear overparametrization is sufficient to fit the training data, using a simple variant of (stochastic) gradient descent.

Benefits of Jointly Training Autoencoders: An Improved Neural Tangent Kernel Analysis

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2021

This paper rigorously proves the linear convergence of gradient descent in two weakly-trained and jointly-trained regimes and indicates the considerable benefits of joint training over weak training in finding global optima, achieving a dramatic decrease in the required level of over-parameterization.

An Improved Analysis of Training Over-parameterized Deep Neural Networks

- Computer Science, Mathematics
- NeurIPS
- 2019

An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, which requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.

Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks

- Computer Science, Mathematics
- AAAI
- 2021

The theory presented addresses the following core question: "should one train a small model from the beginning, or first train a large model and then prune?", and analytically identifies regimes in which, even if the location of the most informative features is known, one is better off fitting a large model and then pruning rather than simply training with the known informative features.

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve

- Mathematics
- 2019

Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they…

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

- Computer Science, Mathematics
- NeurIPS
- 2019

The expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which is called a neural tangent random feature (NTRF) model.

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping

- Computer Science, Mathematics
- COLT
- 2021

This work explores the ability of overparameterized shallow neural networks to learn Lipschitz regression functions, with and without label noise, when trained by gradient descent, and proposes an early stopping rule that allows the authors to show optimal rates.

Overparameterized Nonlinear Optimization with Applications to Neural Nets

- Computer Science
- 2019 13th International conference on Sampling Theory and Applications (SampTA)
- 2019

This talk shows that the solution found by first-order methods, such as gradient descent, has nearly the shortest distance to the initialization of the algorithm among all solutions, and advocates that this shortest-distance property can be a good proxy for the simplest explanation.

#### References

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics
- ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

- Computer Science, Mathematics
- ICLR
- 2018

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

An Improved Analysis of Training Over-parameterized Deep Neural Networks

- Computer Science, Mathematics
- NeurIPS
- 2019

An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, which requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.

Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2019

It is shown that, with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

- Computer Science, Mathematics
- ICML
- 2019

This paper demonstrates the utility of the general theory of (stochastic) gradient descent for a variety of problem domains spanning low-rank matrix recovery to neural network training and develops novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates.

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

- Computer Science, Mathematics
- NeurIPS
- 2019

It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, using SGD (stochastic gradient descent) or its variants, in polynomial time and with polynomially many samples.

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

- Computer Science, Mathematics
- NeurIPS
- 2018

It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.

Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian

- Computer Science, Mathematics
- ArXiv
- 2019

A data-dependent optimization and generalization theory is developed that leverages the low-rank structure of the Jacobian matrix associated with the network and shows that even constant-width neural nets can provably generalize for sufficiently nice datasets.

Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression

- Mathematics, Computer Science
- ArXiv
- 2018

This work proves that under Gaussian input, the empirical risk function employing quadratic loss exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, for a class of smooth activation functions satisfying certain properties, including sigmoid and tanh, as soon as the sample complexity is sufficiently large.