Post by Yifan Sun
I have recently been caught up in the flurry of analyzing popular convex optimization problems as discretizations of continuous ODEs, with the objective of simplifying convergence analysis and constructing better methods. This “flurry” is by no means recent, with a somewhat “but we always knew this” vibe, and is by no means restricted to the couple of papers cited in this post. However, since I’m spending these few posts focused on the three methods VGD, PHB, and NAG, the string of papers I cite here are some that directly address continuous ODE interpretations of these methods. While I don’t think I can ever fully untangle the relationships of these three methods to each other, continuous ODEs provide an interesting interpretation, and interestingly enough, also shed light on how to construct good estimate sequences.
tl;dr: continuous time proofs are way easier than their discrete versions, but the “rankings” are somewhat preserved: VGD + strong convexity is easiest, followed by VGD + convex, with a significant jump in difficulty for the two accelerated methods. The flavor of the proofs is also not that different from the discretized proofs, where telescoping $\to$ integrating, estimate sequence functions $\to$ energy functions, etc. Still, the results are somewhat surprising; namely, there is no significant difference between the PHB and NAG iteration schemes; however, each choice of optimal parameters $\alpha$ and $\beta$ does make a significant difference.
The big idea
The most relevant continuous ODE in convex optimization is gradient flow $$\dot z(t) = -\nabla f(z(t)),\qquad \text{(Gradient Flow)}$$ where many have made the observation that the forward Euler discretization scheme recovers the VGD method $$x_{k+1} \approx x_k + h \dot z(hk) = x_k - h\nabla f(x_k). \qquad \text{(VGD)}$$ There are two advantages of analyzing this ODE over VGD itself:
- The analysis of the ODE has very mild gradient smoothness requirements. In fact, as long as the gradient exists and is continuous everywhere, we are golden; no need for $L$-smoothness with any specified $L$. This is because problems caused by a bad estimate of $L$ translate into discretization errors, which can be swept under the carpet of discretization convergence rates, a topic well studied in numerical analysis.
- The analysis is “easier to generalize”. Often, the proofs of convergence for standard methods can be long and involved, and somewhat unintuitive to a newcomer. In contrast, the proofs for the ODE, even if they are involved, somewhat correlate with standard differential equations results. Additionally, if the methods are relatable to physical systems (we see words like “momentum” and “friction” pop up) then this makes the methods seem a bit less opaque.
However, nothing comes for free. In my view, the primary disadvantage of continuous time analysis is that there are different ways to discretize, and different discretizations seem to lead to different results.
Throughout this post, I will use $z(t)$ as a continuous trajectory of a vector in $\mathbb R^n$. The discretization samples will be represented as $x_k = z(hk)$, i.e. uniformly sampled at time intervals $h$ apart.
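As a quick illustration (the quadratic test problem, step size, and iteration count below are my own choices, not from the post), forward Euler on gradient flow is literally the VGD iteration:

```python
import numpy as np

# A minimal sketch: forward-Euler discretization of gradient flow on a toy
# quadratic f(x) = 0.5 * x^T Q x, which is exactly the VGD iteration
# x_{k+1} = x_k - h * grad_f(x_k).
Q = np.diag([1.0, 10.0])          # hypothetical test problem
grad_f = lambda x: Q @ x

h = 0.05                          # step size = discretization interval
x = np.array([1.0, 1.0])          # x_0 = z(0)
for k in range(200):
    x = x - h * grad_f(x)         # forward Euler step on dz/dt = -grad f(z)

print(np.linalg.norm(x))          # distance to the minimizer x* = 0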
Vanilla gradient descent
Strongly convex case
In the case that the objective function $f$ is $\mu$-strongly convex, we use the PL-inequality version, which states that $$\frac{1}{2}\|\nabla f(x)\|_2^2 \geq \mu(f(x)-f^*), \quad \forall x.$$ Then, defining a Lyapunov function (think energy function) as $$\mathcal E(t) = f(z)-f^*,$$ we have a first-order ODE: $$\dot{\mathcal E}(t) = \nabla f(z)^T\dot z = -\|\nabla f(z)\|_2^2 \overset{PL}{\leq} -2\mu (f(z)-f^*) = -2\mu \mathcal E(t)$$ with standard solutions bounded as $$\mathcal E(t) \leq \mathcal E_0 e^{-2\mu t}.$$ This immediately gives linear convergence with radius $e^{-2\mu}$ per unit time.
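A quick numerical sanity check on the bound $\mathcal E(t) \leq \mathcal E_0 e^{-2\mu t}$ (the toy strongly convex quadratic, horizon, and step size are my own choices):

```python
import numpy as np

# Simulate gradient flow on f(x) = 0.5*(mu*x1^2 + L*x2^2) with a tiny Euler
# step, then verify E(t) = f(z(t)) - f* <= E(0) * exp(-2*mu*t).
mu, L = 0.5, 4.0
lam = np.array([mu, L])                 # Hessian eigenvalues
f = lambda x: 0.5 * np.sum(lam * x**2)  # f* = 0 at x* = 0

h, T = 1e-3, 2.0
z = np.array([1.0, 1.0])
E0 = f(z)
for k in range(int(T / h)):
    z = z - h * (lam * z)               # Euler step on dz/dt = -grad f(z)

print(f(z), E0 * np.exp(-2 * mu * T))   # left side should be below the right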
Convex (not strongly convex) case
Although the standard convergence proof is often harder when we drop the strong convexity requirement, the rate for gradient norm convergence in the continuous case can be derived with a very cute trick. We simply observe that, using the same Lyapunov function, $$\mathcal E(t) -\mathcal E(0)= \int_0^t \dot{\mathcal E}(s)\, ds = -\int_0^t \|\nabla f(z(s))\|_2^2\, ds.$$ Since $\mathcal E(t) \geq 0$ for all $t$, we can rearrange to write $$\frac{1}{t} \int_0^t \|\nabla f(z(s))\|_2^2\, ds \leq \frac{\mathcal E(0)}{t}$$ or, in other words, the average squared gradient norm decays as $O(1/t)$. We can also see that the quantity $\|\nabla f(z(t))\|_2^2$ is always nonincreasing as long as $f$ is convex, since $$\frac{\partial}{\partial t} \|\nabla f(z)\|_2^2 = 2\nabla f(z)^T\nabla^2 f(z) \dot z = -2\nabla f(z)^T\nabla^2 f(z) \nabla f(z) \leq 0.$$ Therefore we can strengthen our statement to simply $\|\nabla f(z(t))\|_2^2 \leq \frac{\mathcal E(0)}{t}$. (Of course, one may argue that a telescoping argument in the discretized case does not require that many more lines, but it does require first deriving the descent lemma, which we don’t need to do here. Translation: I didn’t need to care about $L$-smoothness.)
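The bound $\|\nabla f(z(t))\|_2^2 \leq \mathcal E(0)/t$ is easy to probe numerically; here is a sketch on a convex-but-not-strongly-convex one-dimensional example (my own choice of test function and horizon):

```python
import numpy as np

# For f(x) = x^4 / 4 (convex, not strongly convex near its minimizer),
# simulate gradient flow and check ||grad f(z(t))||^2 <= E(0) / t,
# where E(0) = f(z(0)) - f* and f* = 0.
f = lambda x: 0.25 * x**4
grad = lambda x: x**3

h, T = 1e-3, 2.0
z = 1.0
E0 = f(z)
for k in range(int(T / h)):
    z = z - h * grad(z)            # Euler step on dz/dt = -grad f(z)

print(grad(z)**2, E0 / T)          # squared gradient norm vs. the bound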
Accelerated methods
After playing with the equations for (more than a short) while, I have come to the conclusion that there is good reason why extending this analysis to accelerated methods results in lengthy publications. Not only is the analysis itself nontrivial; even the right discretization scheme is not obvious. The two accelerated methods can be written in one line as
\begin{align*}x_{k+1} &= x_k - \alpha\nabla f(x_k) + \beta(x_k-x_{k-1}) & \text{(PHB)}\\x_{k+1} &= x_k - \alpha\nabla f(x_k) + \beta(x_k-x_{k-1}) - \beta \alpha (\nabla f(x_k)-\nabla f(x_{k-1})) & \text{(NAG)}\\\end{align*}
where the optimal choices for $\alpha$ and $\beta$ are author- (method-)dependent:
- Polyak’s: $ \alpha_P = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$, $\beta_P = \frac{(\sqrt{L}-\sqrt{\mu})^2}{(\sqrt{L}+\sqrt{\mu})^2}. $
- Nesterov’s: $ \alpha_N = \frac{1}{L}$, $\beta_N = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}.$
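The one-line forms above are easy to run side by side; here is a sketch on a toy quadratic (the problem instance and iteration budget are my choices, not from the post):

```python
import numpy as np

# Run the one-line PHB and NAG updates on f(x) = 0.5 * sum(lam * x^2),
# each with its author's optimal parameters.
mu, L = 0.1, 10.0
lam = np.array([mu, L])
grad = lambda x: lam * x

def run(alpha, beta, nag=False, iters=400):
    x_prev = x = np.array([1.0, 1.0])
    for k in range(iters):
        step = x - alpha * grad(x) + beta * (x - x_prev)
        if nag:  # NAG's extra gradient-correction term
            step -= beta * alpha * (grad(x) - grad(x_prev))
        x_prev, x = x, step
    return np.linalg.norm(x)

sL, sm = np.sqrt(L), np.sqrt(mu)
a_P, b_P = 4 / (sL + sm)**2, ((sL - sm) / (sL + sm))**2   # Polyak's choice
a_N, b_N = 1 / L, (sL - sm) / (sL + sm)                   # Nesterov's choice
print(run(a_P, b_P), run(a_N, b_N, nag=True))             # both near 0
```

Both runs converge on this well-conditioned toy problem; the comparison says nothing about robustness to mistuned parameters, which is a separate story.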
We make a combination of forward and backward Euler substitutions: $$\ddot z(t) \approx \frac{x_{k+1}-2x_k+x_{k-1}}{h^2}, \quad \dot z(t) \approx \frac{x_{k}-x_{k-1}}{h}, \quad \dot{\nabla f}(z(t)) \approx \frac{\nabla f(x_{k})-\nabla f(x_{k-1})}{h}$$
and the corresponding ODEs, parametrized by the discretization $h$, are
\begin{align*}\ddot z\, h^2 &+ \alpha \nabla f(z) + (1-\beta) h \dot z = 0 & \text{(PHB)}\\\ddot z\, h^2 & + \alpha \nabla f(z) + (1-\beta) h \dot z + \beta \alpha h \dot{\nabla f}(z) = 0 & \text{(NAG)}\\\end{align*}
where $z = z(t)$ is time-varying and we fix $\alpha$ and $\beta$ to be constant. (This is not the case when $f$ is not strongly convex, which is discussed in [3,4] but I’m ignoring here.) In other words, the key difference between (NAG) and (PHB), in the continuous time setting, is the extra $\tfrac{\partial}{\partial t} \nabla f(z)$ term. Intuitively, this makes sense; chain ruling out $\tfrac{\partial}{\partial t}\nabla f(z) = \nabla^2 f(z) \dot z$ shows the participation of a second order “dragging” term in a first order method.
The choice of parameters $\alpha$, $\beta$, and discretization $h$ all matter. At first glance, I assumed that $h = \alpha = O(1/L)$ was the most logical choice, since it is the choice used in gradient flow. However, with that choice, taking the limit $h\to 0$ forces the second-order term $\ddot z$ to disappear, which loses any interesting second-order behavior. Instead, we should follow the procedure of the authors in [3] and pick $h = 1/\sqrt{L}$. We now consider our two choices of optimal $\alpha$ and $\beta$. The two choices are different, but interestingly, in the limit $L\to +\infty$, both choices result in $\alpha = O(1/L)$, $\beta = O(1)$, $1-\beta = O(1/\sqrt{L})$. To be more specific and accounting for constants, in this limit,
\begin{align*}\ddot z(t) &+ O(1) \nabla f(z(t)) + O(1) \dot z(t) = 0 & \text{(PHB)}\\\ddot z(t) & + O(1) \nabla f(z(t)) + O(1) \dot z(t) + O(h) \dot{\nabla f}(z(t))= 0 & \text{(NAG)}\\\end{align*}
and in the limit of $h\to 0$, the two methods are identical, with the choice of best $\alpha$, $\beta$ changing some constants in the first-order terms. This is the observation given in several papers, including [3,4], and acknowledged in the followup [5], which studies more precisely what happens when we bring back the $h\dot{\nabla f}(z(t))$ term. Since the continuous time limit only considers $h\to 0$, we must interpret this as a discretization effect, and this simple example highlights the importance of studying such effects. (Without it, continuous time PHB and NAG are identical up to the choice of $\alpha$ and $\beta$!)
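To get a feel for the size of this effect, here is a rough numerical sketch (the toy quadratic, constants, and horizon are my own choices): I integrate the second-order ODE with and without an $O(h)$ stand-in for the $\beta\alpha h\,\dot{\nabla f}(z)$ term, and watch the trajectories merge as $h$ shrinks:

```python
import numpy as np

def simulate(extra, T=10.0, dt=1e-3):
    # Integrate  z'' + a*grad f(z) + b*z' + extra * Hess(z) z' = 0
    # on the quadratic f(z) = 0.5 * sum(lam * z^2), so Hess = diag(lam).
    lam = np.array([0.25, 4.0])
    a, b = 1.0, 2 * np.sqrt(0.25)
    z, v = np.array([1.0, 1.0]), np.zeros(2)
    traj = []
    for _ in range(int(T / dt)):
        acc = -a * lam * z - b * v - extra * lam * v
        z, v = z + dt * v, v + dt * acc   # forward Euler on (z, z')
        traj.append(z.copy())
    return np.array(traj)

base = simulate(0.0)                       # the h -> 0 limit ODE
dev = {h: np.max(np.abs(simulate(h) - base)) for h in (0.1, 0.01)}
print(dev)                                 # deviation shrinks with h
```

The maximum trajectory deviation scales roughly linearly with the coefficient of the extra term, consistent with reading it as an $O(h)$ discretization effect.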
But, I am lazy, it is Friday, and thus I am going to sweep this under the rug and ignore it. I will simply consider a simplified ODE as
$$\ddot z + a\nabla f(z) +b\dot z = 0 \qquad (\star)$$
and if $\alpha = \alpha_P$ and $\beta = \beta_P$ then
$$a = \frac{4}{(1+h\sqrt{\mu})^2} \overset{h\to 0}{\to} 4, \quad b = \frac{4\sqrt{\mu}}{(1+h\sqrt{\mu})^2}\overset{h\to 0}{\to} 4\sqrt{\mu};$$
if $\alpha = \alpha_N$ and $\beta = \beta_N$ then
$$a = 1, \quad b = \frac{2\sqrt{\mu}}{1+h\sqrt{\mu}} \overset{h\to 0}{\to} 2\sqrt{\mu}.$$
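As a sanity check on $(\star)$ with Nesterov’s limiting constants $a = 1$, $b = 2\sqrt{\mu}$, we can integrate it numerically and watch $f(z(t)) - f^*$ decay linearly (the toy quadratic and horizon are my own choices):

```python
import numpy as np

# Integrate  z'' + a * grad f(z) + b * z' = 0  on f(z) = 0.5*sum(lam*z^2)
# with a = 1, b = 2*sqrt(mu), starting at rest, and compare f(z(T)) - f*
# against an exp(-sqrt(mu) * t) envelope.
mu, L = 0.25, 4.0
lam = np.array([mu, L])
f = lambda z: 0.5 * np.sum(lam * z**2)           # f* = 0
grad = lambda z: lam * z

a, b = 1.0, 2 * np.sqrt(mu)
h, T = 1e-3, 20.0
z = np.array([1.0, 1.0])
v = np.zeros(2)                                   # v = dz/dt
E0 = f(z)
for k in range(int(T / h)):
    z, v = z + h * v, v + h * (-a * grad(z) - b * v)

print(f(z), E0 * np.exp(-np.sqrt(mu) * T))        # f decays below the envelope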
I am now basing the derivations on what is presented in [4], which derives the $O(1/t^2)$ rate for continuous time PHB and NAG (along with many interesting generalizations). However, I am somewhat more interested in the $\mu$-strongly convex case, so I will slightly modify the Lyapunov function to
$$\mathcal E(t) =\frac{1}{2} \|A(z-x^*)+B\dot z\|_2^2 + \frac{C}{2}\|z-x^*\|_2^2 + D (f(z)- f^*) $$
where $A$, $B$, $C$, and $D$ are nonnegative time-varying scalar parameters. My basis for the extension of $\mathcal E$ is that it seems to better imitate the mixing behavior of quadratic estimate sequences when $\mu > 0$. (To show a $O(1/t^2)$ rate for $\mu = 0$, take $A = B = 1$, $C = 0$.) Now, the goal is to show that $\dot {\mathcal E} \leq 0$, i.e. that $\mathcal E$ is nonincreasing, and use this to conclude
$$D(t) (f(z)-f^*) \leq \mathcal E(t) \leq \mathcal E(0) \; \Rightarrow \; f(z)-f^* \leq \frac{\mathcal E(0)}{D(t)}$$
which isolates the rate of decay of $f(z(t))-f^*$ to the decay rate of $1/D(t)$. (Note the extreme similarity to the technique in the previous post!)
Just as the difficult step in the previous post was showing that $\phi^*_t \geq f(x^{(t)})$ at each step, now the difficult step is showing that $\dot {\mathcal E}(t) \leq 0$. The effect is somewhat analogous, which shows that in math, as in life, there are no real shortcuts. We write
\begin{eqnarray*}\dot{\mathcal E}(t) &=& (A(z-x^*)+B\dot z)^T(\dot A(z-x^*) + A\dot z +\dot B\dot z + B \ddot z) + \frac{\dot C}{2}\|z-x^*\|_2^2 \\&&\qquad + C(z-x^*)^T\dot z + \dot D (f(z)- f^*) + D \nabla f(z)^T\dot z\end{eqnarray*}
Inserting in $(\star)$
\begin{eqnarray*}\dot{\mathcal E} &=& \left(\dot A B + A(A+\dot B-bB)+C\right)\dot z^T (z-x^*) + B(A+\dot B - bB)\|\dot z\|_2^2 - ABa(z-x^*)^T\nabla f(z) \\&&\qquad + \dot D (f(z)- f^*) + \left(D - aB^2\right) \nabla f(z)^T\dot z + \left(A\dot A + \frac{\dot C}{2}\right)\|z-x^*\|_2^2\\
&=& -ABa(z-x^*)^T\nabla f(z) + \dot D (f(z)- f^*) + \left(A\dot A + \frac{\dot C}{2}\right)\|z-x^*\|_2^2\\
\end{eqnarray*}
where in the second line we force $A+\dot B - bB = 0$, $\dot A B + C = 0$, and $D-aB^2=0$. Adding in $\mu$-strong convexity of $f$,
\begin{eqnarray*}
\dot{\mathcal E} &\leq& (\dot D-ABa)\underbrace{(z-x^*)^T\nabla f(z)}_{\geq 0} + \left(A\dot A + \frac{\dot C}{2} – \frac{\dot D \mu}{2}\right)\|z-x^*\|_2^2\\
\end{eqnarray*}
and thus $\mathcal E$ is monotonically nonincreasing whenever $\dot D \leq ABa$, $A\dot A + \frac{\dot C}{2} \leq \frac{\dot D \mu}{2}$. Since ideally we want $1/D$ to decay to show a convergence rate, we additionally require $0 < \dot D.$
Now, rather than solving some complicated (but single variable) differential inequalities, I will instead insert the guess that $D \propto e^{c_D t}$, since anyway I know I am after a linear convergence rate. In order for this to happen, it is most likely that all the scalar players go by exponential rates, e.g. $A \propto e^{c_A t}$, $B \propto e^{c_Bt}$, etc. Let’s start with $B = B_0e^{c_B t}.$ Then
\begin{eqnarray*}D &=& aB^2 = aB_0^2e^{2c_B t}, \\ A &=& bB-\dot B = B_0(b-c_B)e^{c_B t}, \\ C &=& -\dot A B = -B_0^2(b-c_B)c_Be^{2c_B t}.\end{eqnarray*}
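These relations are easy to double-check symbolically; here is a quick sketch (sympy is my tooling choice, not the post’s) verifying the two coefficient-killing constraints for the exponential ansatz:

```python
import sympy as sp

# Verify that the exponential ansatz satisfies the constraints
# A + B' - b*B = 0  and  A'*B + C = 0  from the derivation above.
t, B0, b, cB, a = sp.symbols('t B0 b c_B a', positive=True)

B = B0 * sp.exp(cB * t)
A = B0 * (b - cB) * sp.exp(cB * t)
C = -B0**2 * (b - cB) * cB * sp.exp(2 * cB * t)

assert sp.simplify(A + sp.diff(B, t) - b * B) == 0
assert sp.simplify(sp.diff(A, t) * B + C) == 0
print("constraints satisfied")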
Inserting these into our constraints on $\dot D$,
$$ABa =aB_0^2 (b-c_B)e^{2c_B t} \geq \dot D \geq \max\{0,\frac{2A\dot A + \dot C}{\mu} \}\geq 0. $$
Since we would like $D$ to increase as quickly as possible, we pick the upper end, and get
$$\dot D = B_0^2a (b-c_B)e^{2c_B t} \Rightarrow D = D_0 e^{2c_B t}$$
for some $D_0 > 0$ (with the stipulation that $b > c_B > 0$).
Finally, the question is: what is the radius of convergence? Well, we want $c_B$ to be as big as possible, but it must be strictly smaller than $b$. So, ultimately, we should observe a rate of convergence approaching $O(e^{-2bt})$. Interestingly, both parameter choices scale with $\sqrt{\mu}$: Polyak’s leads to $b = 4\sqrt{\mu}$ and Nesterov’s gives $b = 2\sqrt{\mu}$, so both seem to recover, up to constants, the expected (accelerated) $O(e^{-\sqrt{\mu}\, t})$ radius of convergence.
Final thoughts
The idea that both Polyak’s method and Nesterov’s method can be analyzed using the same continuous time ODE, with acceleration seemingly dependent only on the choice of parameters $\alpha$ and $\beta$, is somewhat unsettling, and is addressed in further works, such as [5]. (While I have only shown this in the strongly convex case, in [4] it is also shown in the convex-not-strongly case.) Note also that the PHB assumption of Hessian continuity is not required; in general, smoothness requirements are discarded in the continuous-time regime, and seem to be isolated in the discretization-stability part of the analysis.
Finally (and this was a main theme of all 3 posts), although we have pulled out many differences between PHB and NAG (assumptions, provable rates, proof techniques), they still have not illuminated much about actual method behavior, namely that NAG accelerates more but PHB is less sensitive to parameters. Thus, the search continues…
Thanks to illuminating discussions with my grad students Zhaoyue Chen and Mokhwa Lee, whose excellent paper presentations clarified many of the sticky derivations.
References
- [1] Polyak, B. T. (1987). Introduction to Optimization.
- [2] Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course.
- [3] Su, W., Boyd, S., and Candès, E. (2016). “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights.”
- [4] Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). “A variational perspective on accelerated methods in optimization.”
- [5] Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. (2021). “Understanding the acceleration phenomenon via high-resolution differential equations.”