Nonuniform smoothness: The curvature that vanishes at the finish line

This blog post is about our recent ICML accepted paper: Convergence of Steepest Descent and Adam under Non-Uniform Smoothness, by Sharan Vaswani, Yifan Sun, and Reza Babanezhad Harikandeh 

Why does it always feel like key, fundamental optimization results in machine learning don’t seem that predictive in practice? For example, our rates on Nesterov-accelerated gradient descent have always been thought of as “optimal” over the class of convex functions, yet in practice its oversensitivity to step size choice often leads practitioners to prefer simpler methods. For nonconvex functions, it’s not fundamentally and theoretically obvious that Adam should outshine SGD — yet, in practice it does. So, those of us in the business of deriving convergence rates have to consistently sit on our hands, and admit that all that hard work we did the past year shaving off exponents didn’t actually leave us with a theoretical framework that can actually act as a guide to practitioners as to what method should actually be used.

The glass is half empty — so there’s an opportunity to fill it!

One promising direction in bridging this gap is to take three steps back and seriously question the fundamental problem assumptions themselves. While many variations of assumptions exist in optimization theory, in machine learning and first order methods, we’ve largely in the past been concerned with two types of assumptions, which largely tell us how close to a quadratic a function $f$ is:

  1. $f$ is $L$-smooth if $\nabla^2 f(\theta) \preceq LI$ for all $\theta$ (e.g. the gradient doesn’t change too fast)
  2. $f$ is $\mu$-strongly convex if $\nabla^2 f(\theta) \succeq \mu I$ for all $\theta$ (e.g. the curvature guides the gradient well)

Then, many convergence rates depend on $\kappa = L/\mu$ the problem’s condition number. Note that this comes from linear systems, e.g. for a linear ODE $\dot x = Ax$, then $\kappa$ is the condition number of $A$ and indicates how fast $x$ stabilizes (assuming it does at all). So, because we humans are inherently lazy, we’ve built our entire optimization empire on our inability to move past inherently linear systems.

Nonuniform smooth: when $\kappa$ changes over the training trajectory.

The idea of nonuniformly smooth has now been studied for some time [1-10]. For me personally, I find it easiest to grasp the main idea by focusing my attention on the logistic loss, which is widely used in so many applications, despite being very not quadratic.  Let’s take a journey down first-year-machine-learning lane, where we studied logistic regression

$$
f(\theta) = \sum_{i=1}^m \log(1+\exp(y_ix_i^T\theta)) \tag{logistic regression}
$$

and multiclass logistic regression

\[
f(\theta_1,…,\theta_K) = \sum_{i=1}^m \theta_{y_i}^Tx_i – \log \sum_{j=1}^K \exp(x_i^T\theta_j) \tag{multiclass log reg}
\]

Several things of interest when first studying these functions:

  • These functions are convex.
  • They are both $L$-smooth, and it is not hard to identify what $L$ is.
  • However, the minimizer is not always attained. This usually corresponds between two cases: the data is separable (a 0% training error solution exists) or it is not separable.
  • Using only smoothness and convexity assumptions, the best rates we can obtain is $O(1/t)$ for gradient descent, and $O(1/t^2)$ using the optimal gradient-based  method.

However, these functions are really not your typical convex function. What I mean is, the methods designed to optimally solve convex functions are not that optimal for log. reg. and multiclass log.reg.

This sets the stage for our three assumptions

Specifically, we impose:

  1. $f$ is twice-differentiable and nonnegative, i.e. for all $\theta$, $f(\theta) \geq 0$. This is true of most loss functions of interest in machine learning.
  2.   $f$ is $(H_0,H_1)$ nonuniformly smooth if $\|\nabla^2 f(\theta)\|_{p\to q} \leq H_0 + H_1 f(\theta)$
  3. $f$ is non-uniform Lojasiewicz  if $\|\nabla f(\theta)\|_q \geq \mu\, [f(\theta)-f^*]$

Assumptions 2. and 3. say that the curvature and smoothness of $f$ diminish  as $f(\theta)$ decreases, or as $\theta \to \theta^*$. We can compare this to other variations, like   $\|\nabla^2 f(\theta)\|_2 \leq L_c + L_g \|\nabla f(\theta)\|_2$ which use the gradient norm as the proxy. In particular…

  • The exponential loss function is nonuniformly smooth and nonuniformly Lojasiewicz. Intuitively, we can just think about the $f:\mathbb  R\to \mathbb R$ 1-D case, and
    $$
    f”(\theta) = \exp(-\theta) = f(\theta)
    $$
    so it is unsurprising that curvature is upper and lower bounded by function value.
  • Logistic regression, and multiclass logistic regression, are nonuniformly smooth, and nonuniformly Lojasiewicz when $\theta$ is such that the data is separated. Again, the proof is in the paper, but for intuition, we can just use a graphing calculator — logistic loss is always upper bounded by exponential loss, and the curvature of logistic loss decays faster than that of exponential loss (while asymptotically matching). On the other hand, when the loss is high (e.g. when data is not yet separated) the logistic loss is very close to its linearization, and has barely any curvature.

    Compare logistic and exponential loss
  • The quadratic loss is uniformly smooth, but not nonuniformly smooth. Since quadratic losses have constant curvature, this is not surprising. But it does illustrate an important point: methods with optimized convergence rates for quadratic losses are not suited for non-uniform-smooth functions, and vice versa.
  • In the paper, we also show that the multi-armed bandit with softmax loss, and the two-layer NN with a frozen layer, are examples of nonuniform smooth losses as well.

Since most machine learning loss functions are nonuniformly smooth, this illustrates why studying nonuniform smooth functions is essential.

Ok, you made me care, but now what?

In our ICML paper, we then use this construct to analyze Adam, to achieve improved convergence rates under nonuniform smoothness — and therefore, explain why Adam does better on logistic regression than SGD. We start by studying steepest descent methods

$$
\theta_{t+1} = \theta-\eta_t d_t, \qquad d_t := \arg\max_{\|d\|_p\leq 1}\, \langle d,\nabla f(\theta_t)\rangle.
$$

Specifically, when $p = 1$, this is signed greedy coordinate descent; when $p = 2$, it is normalized gradient descent; and when $p = \infty$ it is Sign GD. Why do we care about these methods? Well, if you take the Adam method (and its non-momentum version RMS-Prop), tilt your head, and squint a little, it actually looks a lot like Sign GD

RMSProp (operations applied coordinatewise)

$v_t = \beta v_{t-1} + (1-\beta)\nabla f(\theta_t)^2$

$\theta_{t+1} = \theta_t – \frac{\eta}{\sqrt{v_t}+\epsilon}\nabla f(\theta_t)$

Adam (operations applied coordinatewise)

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_t)$
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_t)^2$
$\theta_{t+1} = \theta_t – \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$

And if we take $\beta, \beta_1 \to 0$ and $\epsilon = 0$ then RMSProp is exactly SignGD, and Adam its version with momentum.

Then for all methods (SignGD, RMSProp, and Adam) we show:

Theorem. If $f$ is $(H_0, H_1)$-nonuniformly smooth and nonuniformly Łojasiewicz, then the method converges to a suboptimal $f(\theta_t) \leq \epsilon = O(H_0/H_1)$ linearly, and $O(1/\epsilon^s + \ln(H_1/H_0))$ thereafter, where $s = 1$ for steepest descent, and $s = 3$ for RMSProp and Adam.

In specific cases of (binary and multiclass) logistic regression (as well as softmax policy gradient and a version of the 2-layer neural network) the results are even stronger, since in those cases $H_0 = 0$.

Corollary. If $f$ is $(0, H_1)$-nonuniformly smooth and nonuniformly Łojasiewicz, then the method converges linearly to any $\epsilon$.

In particular, when $H_0 = 0$, this implies all methods achieve linear convergence. This establishes that Adam and steepest descent methods linearly converge on logistic regression — which does not contradict prior results showing they cannot do so on quadratic functions, since the two problem classes have fundamentally different geometry.

The rest of the paper shows detailed results for each method, over individual problems. I just want to highlight specifically the breakdown of logistic regression:

Theorem. For logistic regression on separable data with margin $\gamma_p$, NSD with $\theta_1 = 0$, constant stepsize $\eta_t = \eta = O(1)$, and $T = O\!\left(\frac{1}{H_1 \gamma_p^2}\left(n^2 + \ln\frac{n}{\epsilon}\right)^2\right)$ iterations guarantees $f(\theta_{T+1}) \leq \epsilon$.

The extra $\frac{n^2}{L_1\gamma_p^2}$ overhead term comes from the initial chase toward data separation — note that term disappears if we replace this with exponential loss.

We also talk a bit about the situation when  data is not separable. Here, the loss is still nonuniformly smooth with $L_0=0$, but it is not nonuniformly Łojasiewicz — so fast convergence rates will require diminishing step sizes. Actually, in a funny way, this showcases the weakness of forcing better rates  for methods like Adam using only uniformly smooth assumptions — the curvature and unique attained optima is working against you, when your step size is approximately normalized.

Clear superiority

Finally, we demonstrate the performance separation between Adam and steepest descent,  over SGD, Adagrad, AMSGrad,, and  Heavy ball momentum. Specifically, in the paper, we show that the second class of methods (and any method of the form)
$$
\theta_{t+1} = \theta_t – \eta_t m_t, \qquad m_t = (1-\beta)\sum_{s=1}^t \beta^{t-s}\nabla f(\theta_s)
$$

has a lower bound of $\Omega(1/T)$ iteration complexity over a simple 1-D nonuniform smooth function. After staring at pictures and thinking about landscape geometry, this shouldn’t be so surprising — when curvature and smoothness dies out as you get close to the optimum, then these non-normalizing methods with constant step size just can’t make a lot of progress after a while. (Adagrad is especially bad in practice because the denominator also blows up so fast.)  By establishing separation in both upper and lower bounds, we’ve closed the loop on an important piece of intuition: that when we characterize loss functions in terms of nonuniform smoothness, we can clearly see the benefit of methods in both theory and practice.

Check out our paper for more details!

References

[1] Alimisis, F., Islamov, R., and Lucchi, A. Why do we need warm-up? A theoretical perspective. arXiv:2510.03164, 2025.[2] Chen, Z., Zhou, Y., Liang, Y., and Lu, Z. Generalized smooth nonconvex optimization is as efficient as smooth nonconvex optimization. ICML, pp. 5396–5427. PMLR, 2023.

[3] Gorbunov, E., Tupitsa, N., Choudhury, S., Aliev, A., Richtárik, P., Horváth, S., and Takáč, M. Methods for convex (L0, L1)-smooth optimization: Clipping, acceleration, and adaptivity. arXiv:2409.14989, 2024.

[4] Vankov, D., Rodomanov, A., Nedich, A., Sankar, L., and Stich, S. U. Optimizing (L0, L1)-smooth functions by gradient methods. arXiv:2410.10800, 2024.
[5] Vaswani, S. and Harikandeh, R. B. Armijo line-search can make (stochastic) gradient descent provably faster. ICML, 2025.

[6] Wang, B., Zhang, H., Meng, Q., Sun, R., Ma, Z. M., and Chen, W. On the convergence of Adam under non-uniform smoothness: Separability from SGDM and beyond.

[7] Yang, Y., Tripp, E. E., Sun, Y., Zou, S., and Zhou, Y. Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization.

[8] Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. 2019.

[9] Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511–15521, 2020.

[10] Zhang, R., Wu, J., Lin, L., and Bartlett, P. L. Minimax optimal convergence of gradient descent in logistic regression via large and adaptive stepsizes. arXiv:2504.04105, 2025.

 

Leave a Reply