1781. Unveiling the Power of Adaptive Methods Over SGD: A Parameter-Agnostic Perspective
Invited abstract in session WA-32: Adaptive and Polyak step-size methods, stream Advances in large scale nonlinear optimization.
Wednesday, 8:30-10:00, Room: 41 (building: 303A)
Authors (first author is the speaker)
1. Xiang Li, Department of Computer Science, ETH Zurich
Abstract
Adaptive gradient methods are popular for optimizing modern machine learning models, yet their theoretical benefits over vanilla Stochastic Gradient Descent (SGD) remain unclear. This presentation examines the convergence of SGD and adaptive gradient methods when optimizing stochastic nonconvex functions without requiring algorithm hyperparameters to be set using problem-specific knowledge. First, we consider smooth functions and compare SGD to well-known adaptive methods such as AdaGrad, Normalized SGD with Momentum (NSGD-M), and AMSGrad. Our findings reveal that while untuned SGD can reach the optimal convergence rate, it does so at the expense of an unavoidable catastrophic exponential dependence on the smoothness constant. Adaptive methods, on the other hand, eliminate this dependence without needing to know the smoothness constant in advance. We then turn to a broader class of functions characterized by (L0, L1)-smoothness, where SGD is shown to fail without proper tuning. We present the first instance of tuning-free convergence with adaptive methods in this setting, specifically with NSGD-M, achieving a near-optimal rate despite an exponential dependence on the L1 constant. We also demonstrate that this dependence is unavoidable for a family of normalized momentum methods.
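For context, NSGD-M normalizes a momentum average of stochastic gradients before each step, so the step length is controlled by the learning rate alone rather than by the gradient scale; (L0, L1)-smoothness is commonly understood as allowing the local smoothness constant to grow with the gradient norm, roughly on the order of L0 + L1 times the gradient norm. The sketch below is a minimal, hedged illustration of the NSGD-M update only; the function, hyperparameter names (lr, beta), and the toy noisy quadratic are illustrative assumptions, not the authors' implementation or tuned settings.

```python
import numpy as np

def nsgd_m(grad_fn, x0, lr=0.05, beta=0.9, steps=1000, seed=0):
    """Normalized SGD with Momentum (NSGD-M) -- a minimal sketch.

    grad_fn(x, rng) returns a stochastic gradient at x.
    lr and beta are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)
        m = beta * m + (1.0 - beta) * g      # momentum: moving average of gradients
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x = x - lr * m / norm            # normalized step: direction from m, length lr
    return x

# Usage: noisy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise (assumed test problem).
if __name__ == "__main__":
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    print(nsgd_m(noisy_grad, x0=np.ones(5), lr=0.05, beta=0.9, steps=2000))
```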
Keywords
- Stochastic Optimization
- Continuous Optimization
- Machine Learning
Status: accepted