1781. Unveiling the Power of Adaptive Methods Over SGD: A Parameter-Agnostic Perspective
Invited abstract in session WA-32: Adaptive and Polyak step-size methods, stream Advances in large scale nonlinear optimization.
Wednesday, 8:30-10:00, Room: 41 (building: 303A)
Authors (first author is the speaker)
1. Xiang Li, Department of Computer Science, ETH Zurich
Abstract
Adaptive gradient methods are popular for optimizing modern machine learning models, yet their theoretical benefits over vanilla Stochastic Gradient Descent (SGD) remain unclear. This presentation examines the convergence of SGD and adaptive gradient methods when optimizing stochastic nonconvex functions without requiring algorithm hyperparameters to be set using problem-specific knowledge. First, we consider smooth functions and compare SGD to well-known adaptive methods such as AdaGrad, Normalized SGD with Momentum (NSGD-M), and AMSGrad. Our findings reveal that while untuned SGD can reach the optimal convergence rate, it does so at the expense of an unavoidable catastrophic exponential dependence on the smoothness constant. Adaptive methods, on the other hand, eliminate this dependence without needing to know the smoothness constant in advance. We then turn to a broader class of functions characterized by (L0, L1)-smoothness, where SGD is shown to fail without proper tuning. We present the first instance of tuning-free convergence with adaptive methods in this setting, specifically with NSGD-M, achieving a near-optimal rate despite an exponential dependence on the L1 constant. We also demonstrate that this dependence is unavoidable for a family of normalized momentum methods.
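For context, NSGD-M normalizes a momentum average of stochastic gradients before each step, so the step length is controlled by the learning rate alone rather than by the gradient scale; (L0, L1)-smoothness is commonly understood as allowing the local smoothness constant to grow with the gradient norm, roughly on the order of L0 + L1 times the gradient norm. The sketch below is a minimal, hedged illustration of the NSGD-M update only; the function, hyperparameter names (lr, beta), and the toy noisy quadratic are illustrative assumptions, not the authors' implementation or tuned settings.

```python
import numpy as np

def nsgd_m(grad_fn, x0, lr=0.05, beta=0.9, steps=1000, seed=0):
    """Normalized SGD with Momentum (NSGD-M) -- a minimal sketch.

    grad_fn(x, rng) returns a stochastic gradient at x.
    lr and beta are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)
        m = beta * m + (1.0 - beta) * g      # momentum: moving average of gradients
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x = x - lr * m / norm            # normalized step: direction from m, length lr
    return x

# Usage: noisy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise (assumed test problem).
if __name__ == "__main__":
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    print(nsgd_m(noisy_grad, x0=np.ones(5), lr=0.05, beta=0.9, steps=2000))
```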
Keywords
- Stochastic Optimization
- Continuous Optimization
- Machine Learning
Status: accepted