Post by Yifan Sun, Jan 8, 2021
Convergence proofs are the bane of every optimizer’s existence. They are tricky, extremely long, and mostly unintuitive. Additionally, they tend to be tribal; there are “families” of convergence proof techniques that seem to cover a particular type of method under a particular class of functions, and while efforts are made to make them more “inclusive”, long story short it rarely works.
In the next couple posts, I will investigate some major convergence proof techniques on three simple methods:
- Vanilla gradient descent (VGD)
$$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)})$$
- Polyak’s heavy ball method (PHB)
$$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)}) +\beta (x^{(t)}-x^{(t-1)})$$
- Nesterov’s accelerated method (NAG)
$$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)}) +\beta (x^{(t)}-x^{(t-1)}) - \beta \alpha (\nabla f (x^{(t)})-\nabla f(x^{(t-1)}))$$
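A quick way to keep myself honest throughout is to just run these updates. Here is a minimal numpy sketch of the three iterations on a toy quadratic; the quadratic, the seed, and the function names are my own choices, not anything canonical.

```python
import numpy as np

def vgd_step(x, grad, alpha):
    # x^{t+1} = x^t - alpha * grad f(x^t)
    return x - alpha * grad(x)

def phb_step(x, x_prev, grad, alpha, beta):
    # heavy ball: add a momentum term beta * (x^t - x^{t-1})
    return x - alpha * grad(x) + beta * (x - x_prev)

def nag_step(x, x_prev, grad, alpha, beta):
    # NAG: momentum on both the iterates and the gradients
    return (x - alpha * grad(x) + beta * (x - x_prev)
            - beta * alpha * (grad(x) - grad(x_prev)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 10))
    Q = A.T @ A + 0.1 * np.eye(10)        # symmetric positive definite toy problem
    grad = lambda x: Q @ x                # gradient of f(x) = 0.5 x^T Q x, so x* = 0
    mu, L = np.linalg.eigvalsh(Q)[[0, -1]]
    alpha = 1.0 / L
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x_prev = x = rng.standard_normal(10)
    for _ in range(100):
        x, x_prev = nag_step(x, x_prev, grad, alpha, beta), x
    print("NAG distance to x* = 0 after 100 steps:", np.linalg.norm(x))
```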
In this post, I'm restricting my attention to just constant step size versions, i.e., where $\alpha$ and $\beta$ are held constant. When the function $f$ is convex but not strongly convex, varying the step size can often improve rates, but for strongly convex objectives, at least in the case of VGD and NAG, optimal convergence is already reached with a clever choice of constant step size.
tl;dr: While the quadratic approximation approach is elegant and beautiful, it does not extend to general $L$-smooth and strongly convex functions beyond gradient descent.
Quadratic minimization
In the quadratic approximation framework, we restrict our attention to the problem $$ \min_{x} \quad f(x):=\tfrac{1}{2}x^TQx \qquad \text{(QUAD)} $$ where $Q$ is symmetric positive semidefinite and has smallest nonzero eigenvalue $\lambda_{\min} = \mu$ and largest eigenvalue $\lambda_{\max} = L$. Then, knowing that $x^* = 0$, our problem is reduced to that of bounding $\|x^{(t)}\|_2$.
Gradient descent
To show linear convergence of VGD, we simply unroll the contraction
$$
\|x^{(t)}\|_2 = \|(I-\alpha Q)x^{(t-1)}\|_2 = \|(I-\alpha Q)^t x^{(0)}\|_2
$$ which gives an overall rate of $$
\|x^{(t)}\|_2 \leq \rho(I-\alpha Q)^t\|x^{(0)}\|_2 $$ where $\rho(M)$ is the spectral radius of the symmetric matrix $M$. Note that the eigenvalues of $M = I-\alpha Q$ fall in the range $[-\delta, \delta]$ where $\delta = \max\{|1-\alpha \mu|, |1-\alpha L|\}$. Therefore, $$\rho(M) = \max\{|1-\alpha \mu|, |1-\alpha L|\}$$ and is minimized when $$|1-\alpha \mu| = |1-\alpha L| \iff 1-\alpha \mu = \alpha L -1\iff \alpha = \frac{2}{L+\mu}.$$
This results in a linear convergence rate $$\|x^{(t)}-x^*\|_2 \leq \rho^t \|x^{(0)}-x^*\|_2, \qquad \rho = \frac{L-\mu}{L+\mu}.$$
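Here is a quick numerical sanity check of this rate (my own toy example; the random quadratic is arbitrary): the observed per-step contraction of VGD with $\alpha = 2/(L+\mu)$ should sit close to, and never exceed, $(L-\mu)/(L+\mu)$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
Q = A.T @ A + 0.5 * np.eye(10)                 # random positive definite quadratic
mu, L = np.linalg.eigvalsh(Q)[[0, -1]]
alpha = 2.0 / (L + mu)
rho = (L - mu) / (L + mu)

x0 = rng.standard_normal(10)
x = x0.copy()
T = 200
for _ in range(T):
    x = x - alpha * (Q @ x)                    # vanilla gradient descent

observed = (np.linalg.norm(x) / np.linalg.norm(x0)) ** (1.0 / T)
print("observed per-step contraction :", observed)
print("predicted rho = (L-mu)/(L+mu) :", rho)
```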
Stability
It is worth asking what happens if $\alpha$ is chosen badly. There are several ways this can go, but let's look specifically at the case $\rho > 1$, where the contraction argument fails. This happens if either $|1-\alpha \mu| >1 $ or $|1-\alpha L| > 1$, corresponding to $$ \alpha < 0\quad \text{ or }\quad \alpha \mu > 2 \quad\text{ or } \quad\alpha L > 2.$$ Since $L > \mu$, this implies that we require $$
0 < \alpha < \frac{2}{L} $$ to guarantee stability.
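A tiny illustration of this threshold (again a toy example of my own): with $\alpha$ just below $2/L$ the iterates shrink, and just above, they blow up.

```python
import numpy as np

Q = np.diag([1.0, 10.0])                 # mu = 1, L = 10, so the threshold is 2/L = 0.2
x0 = np.array([1.0, 1.0])
for alpha in (0.19, 0.21):               # just under / just over 2/L
    x = x0.copy()
    for _ in range(200):
        x = x - alpha * (Q @ x)
    print(f"alpha = {alpha:.2f}: ||x^(200)|| = {np.linalg.norm(x):.3e}")
```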
Polyak’s heavy ball method (PHB)
The analysis given by Polyak [1] is also based on this quadratic approximation framework. The simple extension is done by splitting the evolution of $x$ into two variables; for example, via
$$\left[ \begin{matrix}
x^{(t+1)}\\
x^{(t)}\end{matrix}\right] = \underbrace{\left[ \begin{matrix} I - \alpha Q + \beta I & -\beta I\\ I & 0
\end{matrix}\right]}_{=T} \left[ \begin{matrix} x^{(t)}\\ x^{(t-1)} \end{matrix}\right]$$ Note that the matrix $T$ is not symmetric, but we can still discuss its eigenvalues. In fact, considering the extreme and illustrative case where $Q = \mathbf{diag}(\mu, L)$, the matrix can be rearranged as $$ T = \left[ \begin{matrix} 1 - \alpha L + \beta & 0 & -\beta & 0 \\ 0 & 1 - \alpha \mu + \beta & 0 & -\beta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{matrix}\right]= \Pi\left[ \begin{matrix} 1 - \alpha L + \beta & -\beta &0& 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0& 1 - \alpha \mu + \beta & -\beta \\ 0 & 0 &1 & 0 \end{matrix}\right]\Pi^T $$ where the eigenvalues of the two diagonal blocks are $$ \lambda_{i,j} = \frac{1-\alpha d + \beta \pm \sqrt{(1-\alpha d + \beta)^2-4\beta}}{2}$$ for $d \in \{\mu,L\}$. To find the optimal choices of $\alpha$ and $\beta$, note that the spectral radius must be
$$ \rho(T) = \max\left\{ \frac{\alpha L -1- \beta + \sqrt{(1-\alpha L + \beta)^2-4\beta}}{2}, \frac{1-\alpha \mu + \beta + \sqrt{(1-\alpha \mu + \beta)^2-4\beta}}{2}\right\}. $$ The region of stability is when $\rho(T) < 1$, and the optimum (minimum) value of $\rho(T)$ is achieved when the two terms in the max are equal, which forces $$ \alpha = \frac{2(\beta+1)}{L+\mu}.$$ Then minimizing for $\beta$, we arrive at the optimal parameter choices:
$$\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}, \quad \beta = \left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2, \quad \rho = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$$
which matches the results given in [1], Ch. 3 Th. 1.
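To double-check the algebra, one can build the $2n\times 2n$ iteration matrix $T$ for a random quadratic and compare its largest eigenvalue magnitude against $\rho$; the sketch below (my own, not Polyak's) does exactly that.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((20, n))
Q = A.T @ A + 0.2 * np.eye(n)
mu, L = np.linalg.eigvalsh(Q)[[0, -1]]

# Polyak's tuned constants
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2

I = np.eye(n)
T = np.block([[(1 + beta) * I - alpha * Q, -beta * I],
              [I,                          np.zeros((n, n))]])
print("max |eig(T)| :", np.abs(np.linalg.eigvals(T)).max())
print("predicted rho:", (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))
```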
NAG with constant step size
For strongly convex functions $f$, Nesterov's accelerated method [2] iterates with the specific choice of $$\alpha = \frac{1}{L}, \qquad \beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\in (0,1). $$ Using the same bag of tricks, we can get
$$\left[\begin{matrix}
x^{(t+1)}\\
x^{(t)} \end{matrix}\right] = \underbrace{\left[\begin{matrix}
(1+\beta)(I - \alpha Q) & \beta (\alpha Q-I) \\ I & 0
\end{matrix}\right]}_{=T} \left[\begin{matrix} x^{(t)}\\ x^{(t-1)} \end{matrix}\right] $$ which reorganizes into a block diagonal matrix with each diagonal block of the form $$ T_k = \left[\begin{matrix} (1+\beta)(1-\alpha \gamma_k) & \beta(\alpha \gamma_k -1) \\ 1 & 0 \end{matrix}\right] $$ where $\gamma_k$ are the eigenvalues of $Q$, lying in the range $[\mu,L]$. The eigenvalues of $T_k$ are therefore $$ \lambda_{k,i} = \frac{(1+\beta)(1-\alpha \gamma_k) \pm \sqrt{(1+\beta)^2(1-\alpha \gamma_k)^2 + 4\beta (\alpha \gamma_k-1)}}{2} $$ which, for the recommended choice of $\alpha$ and $\beta$, can be shown to have magnitude at most $\rho = 1-\sqrt{\frac{\mu}{L}}$.
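As a sanity check on that last claim, the sketch below (my own; the particular $\mu$ and $L$ are arbitrary) sweeps $\gamma$ over $[\mu, L]$ and computes the eigenvalue magnitudes of each $2\times 2$ block numerically.

```python
import numpy as np

mu, L = 0.05, 4.0                                  # arbitrary strong convexity / smoothness
alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

worst = 0.0
for gamma in np.linspace(mu, L, 1000):             # sweep the eigenvalues of Q over [mu, L]
    Tk = np.array([[(1 + beta) * (1 - alpha * gamma), beta * (alpha * gamma - 1)],
                   [1.0, 0.0]])
    worst = max(worst, np.abs(np.linalg.eigvals(Tk)).max())

print("max |eig(T_k)| over gamma:", worst)
print("1 - sqrt(mu/L)           :", 1 - np.sqrt(mu / L))
```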
Stability of PHB and NAG
In all the methods, convergence, and even stability, depend on $\alpha$ and $\beta$ lying in some prescribed range. This is unsurprising, since even running a simple method like gradient descent on $f(x) = |x|$ will never converge for any fixed step size. So, implicit in all the proofs is that the function $f$ is $L$-smooth. The convergence guarantees for NAG and VGD do not require more than $L$-smoothness and $\mu$-strong convexity, and it can be shown that PHB achieves $O(1-\mu/L)$ rates under these assumptions alone. However, it is known that second-order (Hessian) smoothness is required to achieve $O(1-\sqrt{\mu/L})$ rates in general. This is of course trivial for quadratic objectives.
General strongly convex functions
Now we turn our attention to minimizing general unconstrained objectives, under the assumption that, for all $x$ and $y$, $$\frac{\mu}{2} \|x-y\|_2^2 \leq f(x)-f(y) - \nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2.$$ If the Hessian exists, this is equivalent to the bound
$$\mu I \preceq \nabla^2 f(x) \preceq LI, \quad \forall x.$$
Gradient descent
In the case of gradient descent,
\begin{eqnarray*} \|x^{(t+1)}-x^*\|_2^2 &=&
\|\left(x^{(t)}-\alpha \nabla f(x^{(t)})\right)-\left(x^* - \alpha \nabla f(x^*)\right)\|_2^2 \\ &\overset{(*)}{=}& \|x^{(t)}-x^*\|_2^2 -2\alpha \underbrace{ (\nabla f(x^{(t)})-\nabla f(x^*))^T(x^{(t)}-x^*)}_{\geq \mu \|x^{(t)}-x^*\|_2^2} + \alpha^2 \underbrace{\| \nabla f(x^{(t)})-\nabla f(x^*)\|_2^2}_{\leq L^2\|x^{(t)}-x^*\|_2^2}\\
&\leq & (1-2\alpha \mu+\alpha^2L^2)\|x^{(t)}-x^*\|_2^2\end{eqnarray*} giving $\rho^2 = 1-2\alpha \mu+\alpha^2L^2$, which is minimized at $\alpha = \mu/L^2$ with value $1-\mu^2/L^2$. Of course we know this is neither the optimal choice of $\alpha$ nor the optimal $\rho$. To actually get there, we need to use a special relationship from [2] Thm. 2.1.12
\begin{equation}
(\nabla f(x)-\nabla f(y))^T(x-y) \geq \frac{\mu L}{\mu + L} \|x-y\|_2^2 + \frac{1}{\mu + L}\|\nabla f(x)-\nabla f(y)\|_2^2
\qquad (\triangle)
\end{equation}
which is not intuitively obvious, given our exercise with quadratics. From this, we go from $(*)$ in the above derivation to \begin{eqnarray}
\|x^{(t+1)}-x^*\|_2^2 &\leq& \left (1- \frac{2\alpha\mu L}{\mu + L}\right)\|x^{(t)}-x^*\|_2^2 + \underbrace{\left(\alpha^2-\frac{2\alpha}{\mu + L}\right)}_{\text{require } \alpha \geq 2/(\mu+L)} \underbrace{\| \nabla f(x^{(t)})-\nabla f(x^*)\|_2^2}_{\leq L^2\|x^{(t)}-x^*\|_2^2}\\
&\leq & \underbrace{\left(1- 2\alpha L+ \alpha^2 L^2\right)}_{(\rho(\alpha))^2}\|x^{(t)}-x^*\|_2^2 \end{eqnarray}
The function $(\rho(\alpha))^2$ is quadratic in $\alpha$ and is minimized at $\alpha = 1/L < 2/(\mu+L)$, which lies outside the allowed range, so we pick $\alpha = 2/(\mu+L)$, the smallest viable candidate. This gives a final radius of $\rho = \frac{L-\mu}{L+\mu}$, as desired. Note that here, no discussion of the Hessian is necessary; the problem need not even be twice differentiable.
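Since $(\triangle)$ is the workhorse here, a quick numerical check is reassuring. The sketch below tests it, along with the resulting one-step contraction at $\alpha = 2/(\mu+L)$, on a non-quadratic strongly convex function of my own choosing (a quadratic plus a log-cosh term, which is $\mu$-strongly convex and $L$-smooth with $x^* = 0$).

```python
import numpy as np

mu, L = 0.1, 5.0
# gradient of f(x) = (mu/2)||x||^2 + (L - mu) * sum(log cosh(x_i)); note x* = 0
grad = lambda x: mu * x + (L - mu) * np.tanh(x)

rng = np.random.default_rng(4)
alpha = 2.0 / (mu + L)
rho = (L - mu) / (L + mu)

viol, worst_ratio = 0, 0.0
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    g = grad(x) - grad(y)
    lhs = g @ (x - y)
    rhs = (mu * L / (mu + L)) * np.sum((x - y) ** 2) + np.sum(g ** 2) / (mu + L)
    viol += lhs < rhs - 1e-10
    # one-step gradient contraction toward x* = 0
    worst_ratio = max(worst_ratio, np.linalg.norm(x - alpha * grad(x)) / np.linalg.norm(x))

print("violations of (triangle):", viol)
print("worst one-step ratio    :", worst_ratio, "  vs  rho =", rho)
```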
Tight quadratic bound
Before moving on, it is worth staring a bit longer at $(\triangle)$, and seeing why it is so much better a bound than the ones we tried to use rather blindly. Take again the quadratic example, with $\nabla f(x) = Qx$ and $z = x-y$. Then the bounds we tried to use the first time are just
$$z^TQz \geq \mu z^Tz \text{ and } z^TQ^TQz \leq L^2 z^Tz. \qquad (\square)$$
In fact, $(\triangle)$ reduces to
$$ (\mu+L)z^TQz \geq \mu L z^Tz + z^TQ^TQz$$
which, when $z$ is an eigenvector of $Q$ with eigenvalue $\lambda$, establishes the bound
$$\lambda^2 - (\mu + L) \lambda + \mu L \leq 0 \iff
(\lambda - \mu)(\lambda- L) \leq 0. \qquad (\triangle')$$
To understand why the second bound is tighter, assume that $z$ is an eigenvector of $Q$ with $Qz = \lambda z$. Then if $\lambda = \mu$, the second bound in $(\square)$ is loose; conversely, if $\lambda = L$, the first in $(\square)$ is loose. In other words, using the technique inspired by $(\square)$, depending on what $x^{(t)}-x^*$ is, at least one of the two bounds must be loose. However, by combining them into the single inequality $(\triangle')$, the two terms “multiply the looseness”. While it is still possible that the inequality is not tight, there exist at least two values of $z$ (the eigenvectors for $\mu$ and $L$) where it is satisfied with equality.
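The 2-dimensional diagonal example makes this concrete; the snippet below (numbers arbitrary, my own) shows the second bound in $(\square)$ off by a factor of $L^2/\mu^2$ on the $\mu$-eigenvector, while $(\triangle)$ holds there with equality.

```python
import numpy as np

mu, L = 1.0, 10.0
Q = np.diag([mu, L])
z = np.array([1.0, 0.0])                 # eigenvector of Q with eigenvalue mu

print("z^T Q z     =", z @ Q @ z,       "  vs lower bound  mu ||z||^2 =", mu)       # tight
print("z^T Q^T Q z =", z @ Q.T @ Q @ z, "  vs upper bound L^2 ||z||^2 =", L ** 2)   # very loose
lhs = (mu + L) * (z @ Q @ z)
rhs = mu * L * (z @ z) + z @ Q.T @ Q @ z
print("(triangle):", lhs, ">=", rhs, "  holds with equality")
```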
Polyak’s heavy ball method (PHB)
Applying the same basket of tricks gets us the same rate as gradient descent. We don’t need 2nd order smoothness, but we can’t get optimal rates. In particular, we already have that
$$\|x^{(t)} - \alpha \nabla f(x^{(t)}) -x^*\|_2 \leq ( \alpha L-1) \|x^{(t)} -x^*\|_2$$ So we can write
\begin{eqnarray} \|x^{(t+1)}-x^*\|_2 &\leq& (1+\beta) \|x^{(t)} - \frac{\alpha}{1+\beta} \nabla f(x^{(t)}) -x^*\|_2 + \beta \|x^{(t-1)}-x^*\|_2 \\
&\leq& (\alpha L-1-\beta) \|x^{(t)} -x^*\|_2 + \beta \|x^{(t-1)}-x^*\|_2
\end{eqnarray} or, in matrix form $$ \left[\begin{matrix}\|x^{(t+1)}-x^*\|_2\\
\|x^{(t)}-x^*\|_2 \end{matrix}\right] \leq \underbrace{ \left[\begin{matrix}
\alpha L - 1 - \beta & \beta\\
1 & 0
\end{matrix}\right]}_{T}
\left[\begin{matrix}
\|x^{(t)}-x^* \|_2\\
\|x^{(t-1)}-x^* \|_2
\end{matrix}\right] $$ However, though this is very similar to the recursion created in the previous section, the sign flip on the upper-right $\beta$ is significant; the roots
$$ \rho(T) = \max\left\{
\frac{\alpha L -1- \beta + \sqrt{(1-\alpha L + \beta)^2+4\beta}}{2},
\frac{1-\alpha \mu + \beta + \sqrt{(1-\alpha \mu + \beta)^2+4\beta}}{2}\right\}$$
are minimized when
$$\alpha = \frac{2(\beta+1)}{L+\mu}, \quad \beta = 0,$$
which corresponds to the vanilla gradient method; any $\beta > 0$ only makes the certified rate worse.
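A small numeric scan backs this up. The sketch below is my own, and it assumes that the best one-step contraction factor available for a gradient step of size $s$ is $\max\{|1-s\mu|, |1-sL|\}$, as in the quadratic analysis; under that assumption, the best certified rate over $\alpha$ only degrades as $\beta$ grows.

```python
import numpy as np

mu, L = 0.1, 10.0

def bound_radius(alpha, beta):
    s = alpha / (1 + beta)                       # effective gradient step inside the norm
    c = max(abs(1 - s * mu), abs(1 - s * L))     # assumed one-step gradient factor
    T = np.array([[(1 + beta) * c, beta], [1.0, 0.0]])
    return np.abs(np.linalg.eigvals(T)).max()

for beta in np.linspace(0.0, 0.9, 10):
    alphas = np.linspace(1e-3, 2.0 * (1 + beta) / (mu + L), 200)
    best = min(bound_radius(a, beta) for a in alphas)
    print(f"beta = {beta:4.2f}: best certified rate = {best:.4f}")
print("vanilla gradient descent: (L-mu)/(L+mu) =", (L - mu) / (L + mu))
```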
NAG with constant step size
We repeat the exercise with NAG, arriving at
\begin{eqnarray} \|x^{(t+1)}-x^*\|_2 &\leq& (1+\beta)\|x^{(t)} - \alpha\nabla f(x^{(t)}) -x^*\|_2 + \beta \|x^{(t-1)}- \alpha\nabla f(x^{(t-1)})-x^*\|_2 \\ &\leq& (1+\beta)(\alpha L -1) \|x^{(t)} -x^*\|_2 + \beta( \alpha L-1) \|x^{(t-1)}-x^*\|_2 \end{eqnarray}
which, in matrix form, is $$\left[\begin{matrix}
\|x^{(t+1)}-x^* \|_2\\
\|x^{(t)}-x^* \|_2\end{matrix}\right] \leq \underbrace{\left[\begin{matrix}
(1+\beta)(\alpha L-1) & \beta(\alpha L-1)\\
1 & 0
\end{matrix}\right]}_{T}
\left[\begin{matrix}
\|x^{(t)}-x^*\|_2\\
\|x^{(t-1)}-x^* \|_2
\end{matrix}\right]$$ Picking the prescribed $\alpha = 1/L$, the spectral radius of $T$ here is 1, which again shows no contraction.
This leaves us with an open question: can this analysis work with a different choice of step size? But somehow this isn't very satisfying, since the main source of slack between the “function version” and the “quadratic version” is the triangle inequality, which clearly loosens the bound. So, if the bound isn't tight to begin with, then even if changing parameters improves the bound, it does not necessarily suggest a better method in practice.
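For contrast, the methods themselves are perfectly fine; it is only this style of bound that is loose. A quick toy comparison (my own sketch, on an ill-conditioned quadratic) shows PHB and NAG with the tuned constant parameters far ahead of VGD after the same number of iterations.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
eigs = np.linspace(0.01, 1.0, n)                 # mu = 0.01, L = 1
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q = U @ np.diag(eigs) @ U.T
mu, L = eigs[0], eigs[-1]
grad = lambda x: Q @ x                           # gradient of 0.5 x^T Q x, x* = 0
x0 = rng.standard_normal(n)

def run(step, alpha, beta, iters=300):
    x_prev = x = x0.copy()
    for _ in range(iters):
        x, x_prev = step(x, x_prev, alpha, beta), x
    return np.linalg.norm(x)

vgd = lambda x, xp, a, b: x - a * grad(x)
phb = lambda x, xp, a, b: x - a * grad(x) + b * (x - xp)
nag = lambda x, xp, a, b: (x - a * grad(x) + b * (x - xp)
                           - b * a * (grad(x) - grad(xp)))

beta_hb = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
beta_nag = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
print("VGD distance after 300 steps:", run(vgd, 2 / (L + mu), 0.0))
print("PHB distance after 300 steps:", run(phb, 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2, beta_hb))
print("NAG distance after 300 steps:", run(nag, 1 / L, beta_nag))
```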
Addendum
Jan 9, 2021
It is worth being more explicit about when the quadratic approximation does apply. In particular, the quadratic approximation applies for $Q = \nabla^2 f(z)$, where $z$ is some vector associated with $x^{(t)}$ and $x^*$. More precisely, Taylor's theorem tells us that, as long as the Hessian is continuous everywhere, then for each $t$, there always exists some $z$ where $$\nabla^2 f(z) (x^{(t)}-x^*) = \nabla f(x^{(t)}) - \nabla f(x^*).$$ Under this assumption, we can directly apply the results from the quadratic analysis, which is how Polyak in [1] explains the optimal guarantee for general $L$-smooth, $\mu$-strongly convex functions. Specifically, the only additional assumption we need for the PHB method to converge at the optimal rate is that $f$ is twice differentiable everywhere. Then, for the same optimal choice of $\alpha$ and $\beta$, $$\left[\begin{matrix} x^{(t+1)} -x^*\\ x^{(t)}-x^* \end{matrix}\right] = \left[\begin{matrix} x^{(t)} -\alpha \nabla f(x^{(t)}) + \beta (x^{(t)}-x^{(t-1)})\\ x^{(t)}\end{matrix}\right] -\left[\begin{matrix} x^*\\x^*\end{matrix}\right]=\underbrace{ \left[\begin{matrix} (1+\beta)I - \alpha \nabla^2 f(z) & -\beta I \\ I & 0 \end{matrix}\right] }_{\lambda_{\max}\leq \rho := \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}}\left[\begin{matrix} x^{(t)} -x^*\\ x^{(t-1)}-x^* \end{matrix}\right] $$ which gives the same one-step contraction, and therefore the same overall rate.
The same extension is a little less certain for NAG, since the $Q$ matrix appears twice, and we would need two Hessians to approximate both $\nabla f(x^{(t)})$ and $\nabla f(x^{(t-1)})$. In general, those two Hessians will not have the same eigenvectors, so we cannot apply the same simultaneous diagonalization. Of course that's a less pressing issue, since there are other ways of deriving optimal rates for NAG, which don't quite work for PHB.
Addendum 2
April 2, 2021
After much discussion, it has been solidified that the first addendum is wrong. The issue here is a mismatch between the maximum eigenvalue magnitude and the spectral norm of an asymmetric matrix. In general, we only know that
$$
\text{max magnitude eigenvalue of $T$} \leq \text{spectral norm of $T$}
$$
and the inequality is only guaranteed to be tight if $T$ is normal (the most useful subset being symmetric matrices). Since that is not the case here, we have to instead ask ourselves not what $\rho(T)$ does for us, but what $\rho(T^TT) = \rho(TT^T)$ does for us. After much Mathematica pounding, it seems that, in fact, $\rho(TT^T)\not\leq 1$ in general, unless $\mu = L$. In more detail, the eigenvalues of $TT^T$ are the solutions to the equation
$$(\beta^2-\lambda )(\lambda -1) + (1 + \beta - \alpha \gamma)^2 \lambda = 0$$
where $\gamma$ is an eigenvalue of the Hessian. The roots are
$$\lambda = \frac{1}{2} \left(B \pm \sqrt{B^2 - 4 \beta^2}\right), \qquad B = 2 + 2 \beta + 2 \beta^2 - 2 \alpha \gamma - 2 \alpha \beta \gamma + \alpha^2 \gamma^2$$
In particular, the larger root has a unique minimizer over $\gamma$ at $\gamma = \frac{1+\beta}{\alpha}$, where its value is exactly 1. In other words, there is only one value of $\gamma$ that is allowed, when in fact we would like the value of $\gamma$ to move freely between $\mu$ and $L$.
In other words, this analysis does not extend past quadratic problems. What a bummer! (But hey, it’s pretty neat when this subtle linear algebra stuff really comes into play, isn’t it?)
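For the skeptical, the claim is easy to poke at numerically; the sketch below (my own, with an arbitrary $\mu$ and $L$ and the tuned heavy ball parameters) sweeps $\gamma$ and watches the top eigenvalue of $TT^T$ dip to exactly 1 only at $\gamma = (1+\beta)/\alpha$.

```python
import numpy as np

mu, L = 1.0, 100.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2

def top_eig_TTt(gamma):
    T = np.array([[1 + beta - alpha * gamma, -beta], [1.0, 0.0]])
    return np.linalg.eigvalsh(T @ T.T)[-1]       # squared spectral norm of T

for g in list(np.linspace(mu, L, 7)) + [(1 + beta) / alpha]:
    print(f"gamma = {g:8.3f}: max eig of T T^T = {top_eig_TTt(g):.6f}")
```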
Addendum to addendum: Wait, but even in the case that $L = \mu$, doesn't this analysis mean that $\|T\|_2 \geq 1$ no matter what I pick for $\alpha$ and $\beta$? How do I reconcile this? Well, this is actually a consequence of (and I had never heard of this until now) another lemma, which basically states that the gap between the max eigenvalue magnitude and the spectral norm disappears when we compound matrices. Basically, while we cannot say that $\|T\|_2 < 1$, we can say that $$\|T^k\|_2^{1/k}\overset{k\to +\infty}{\longrightarrow} \max_i |\lambda_i(T)| < 1.$$
This is a corollary of Gelfand's formula, which states that $\rho(T) = \lim_{k\to\infty} \|T^k\|^{1/k}$ for any matrix norm, where $\rho(T)$ here is the true spectral radius (the max eigenvalue magnitude). Since in quadratic problems, we are indeed compounding the same $T$ each time, we get away with this. On a technicality. -_-'
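Here is that technicality in action (again my own sketch): the operator norm of a single heavy ball block $T$ exceeds 1, but $\|T^k\|_2^{1/k}$ creeps down toward the eigenvalue magnitude as $k$ grows.

```python
import numpy as np

mu, L = 1.0, 100.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
gamma = mu                                           # one eigenvalue of Q
T = np.array([[1 + beta - alpha * gamma, -beta],
              [1.0, 0.0]])

print("||T||_2     :", np.linalg.norm(T, 2))         # larger than 1
print("max |eig(T)|:", np.abs(np.linalg.eigvals(T)).max())
for k in (1, 10, 100, 1000):
    nrm = np.linalg.norm(np.linalg.matrix_power(T, k), 2)
    print(f"||T^{k}||_2^(1/{k}) = {nrm ** (1.0 / k):.6f}")
```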
Addendum: much thanks to Sharan and Reza, who really forced me to see the error of my ways.
Thanks to Sharan and Reza and Adrien for helpful discussions.
Up next: Nesterov’s estimate sequences
References
- [1] Polyak, B. T. (1987). Introduction to optimization.
- [2] Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course.