Newton’s method II: Self-concordant functions

tl;dr: Self-concordance. definition, some examples, some interesting properties. Proof of Newton’s rates left as exercise for reader 🙂

An interesting generalization of strongly convex functions is the class of self-concordant functions, which can be defined via
$$
|D^3 f(x)[u,u,u]|\leq 2M\|u\|_{\nabla^2 f(x)}^3
$$
Here, the notation $$D^3 f(x) [u,v,w] = \sum_{i,j,k}\frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} \, u_i \, v_j\,  w_k$$ is a way of denoting the 3rd order tensor derivative of $f$, and we write $\|u\|_H = \sqrt{u^THu}$ to denote the (semi)norm induced by a symmetric positive semidefinite matrix $H$.

This class of functions is usually studied in association with Newton’s method, which, as shown in a previous post, has local quadratic convergence if the function is $\mu$-strongly convex with an $M$-Lipschitz Hessian, and still has local quadratic convergence (with a possibly later onset) if it is only $\mu$-strongly convex at the unique minimizer $x^*$.

In fact, the notion can be made even more general by the observation that Newton’s method is affine invariant. That is, consider two equivalent problems where $g(u) = f(Au)$ for $x = Au$ and invertible $A$:
$$
\min_x\; f(x)\quad \iff \quad \min_u g(u). $$ Then the Newton step is identical, i.e. $$ x^+ - x \,=\,  - \beta (\nabla^2 f(x))^{-1} \nabla f(x)  \,=\, - \beta A (\nabla^2 g(u))^{-1} \nabla g(u) \,=\, Au^+-Au. $$
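
This is easy to check numerically. Below is a minimal sketch (the test function and the matrix $A$ are made up for illustration): the Newton step computed on $g$ and mapped through $A$ coincides with the Newton step computed directly on $f$.

```python
import numpy as np

# A quick numerical check of affine invariance, on a made-up smooth,
# strongly convex test function f(x) = sum(exp(x_i)) + ||x||^2.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + 3 * np.eye(n)   # square, invertible w.h.p.

f_grad = lambda x: np.exp(x) + 2 * x
f_hess = lambda x: np.diag(np.exp(x) + 2)

u = rng.standard_normal(n)
x = A @ u
step_x = -np.linalg.solve(f_hess(x), f_grad(x))   # Newton step in x

# Chain rule: grad g(u) = A^T grad f(Au), hess g(u) = A^T hess f(Au) A.
g_grad = A.T @ f_grad(A @ u)
g_hess = A.T @ f_hess(A @ u) @ A
step_u = -np.linalg.solve(g_hess, g_grad)         # Newton step in u

print(np.allclose(step_x, A @ step_u))            # True: the steps coincide
```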

So here’s a fun fact. There are functions which are $\mu$-strongly convex but, after an affine change of variables, are no longer strongly convex at all! (An easy example of this is the above one where $A$ is not full column rank.) Somehow this restriction isn’t so interesting, though, because as we know, even gradient descent is somewhat agnostic to low-rank affine transformations. (The gradient is always orthogonal to the nullspace of such a transformation.) The real issue is conditioning, e.g. when $A$ has a wide spectral range. As an example, suppose that $A = \mathbf{diag}(1/k)$ for $k = 1,…,n$. Then gradient descent will have trouble with the coordinates corresponding to the smallest eigenvalues of $A$, whose progress will be poor in comparison to that of the largest eigenvalues. On the other hand, Newton’s method would invert $A$ in its step computation, evening out this imbalance so that no effect is felt. (Of course, the numerical issue of multiplying a large number with its inverse still has to be dealt with…)
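
Here is a small demo of that imbalance, using $g(u) = \frac{1}{2}u^TA^2u$ with $A = \mathbf{diag}(1/k)$ (all numbers made up): gradient descent crawls on the poorly scaled coordinates, while a single Newton step solves the problem exactly.

```python
import numpy as np

# Gradient descent vs Newton on g(u) = (1/2) u^T D^2 u with D = diag(1/k).
n = 8
d = 1.0 / np.arange(1, n + 1)          # eigenvalues 1, 1/2, ..., 1/n
H = np.diag(d ** 2)                    # Hessian of g; condition number n^2
u = np.ones(n)

L = d[0] ** 2                          # largest eigenvalue = 1
for _ in range(100):
    u = u - (1.0 / L) * (H @ u)        # GD with step 1/L
print(u)        # early coordinates converged; later ones barely moved

u = np.ones(n)
u = u - np.linalg.solve(H, H @ u)      # one (undamped) Newton step
print(u)        # ~0 to machine precision: Newton is agnostic to the scaling
```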

By the way, in case you weren’t aware, quadratic convergence is very fast. Like, very, very, very fast.

The main example of a self-concordant but not strongly convex function is the log barrier $$
\phi(x) = -\sum_i \log(b_i-a_i^Tx)
$$ which can be used to enforce linear inequality constraints $a_i^Tx\leq b_i$ for $i = 1,…,m$. To see that it is not strongly convex, just take as a simple example $m = 1$, $b = 0$, and $a = -1$. Then $$ \phi(x) = -\log(x), \quad \phi'(x) = -\frac{1}{x}, \quad \phi''(x) = \frac{1}{x^2}$$ and taking $x\to +\infty$ (so $\phi''(x)\to 0$) shows that $\phi$ cannot be $\mu$-strongly convex for any $\mu > 0$.

More generally, defining $c_i = \left(\frac{a_i^Tu}{b_i-a_i^Tx}\right)^2$, we have $u^T\nabla^2\phi(x)u = \sum_i c_i$ and

$$
|D^3 \phi(x)[u,u,u]| = 2\Big|\sum_i \Big(\frac{a_i^Tu}{b_i-a_i^Tx}\Big)^3\Big| \leq 2\sum_i c_i^{3/2} \leq  2\left(\sum_i c_i\right)^{3/2} = 2(u^T\nabla^2 \phi(x)u)^{3/2}
$$

since in general, for $a\geq 0$ and $b\geq 0$, $a^{3/2}+b^{3/2} \leq (a+b)^{3/2}$. So the log barrier is self-concordant with $M = 1$.
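
If you don’t trust the algebra, a finite-difference check is easy to run; here is a sketch on a random (assumed strictly feasible) instance:

```python
import numpy as np

# Finite-difference check that the log barrier is 1-self-concordant:
# |D^3 phi[u,u,u]| <= 2 (u^T H u)^{3/2}. Random made-up instance.
rng = np.random.default_rng(1)
m, n = 20, 5
Amat = rng.standard_normal((m, n))
x = np.zeros(n)
b = Amat @ x + rng.uniform(1.0, 2.0, m)     # ensures b - A x > 0 at x

def hess(x):
    s = 1.0 / (b - Amat @ x)                # 1/(b_i - a_i^T x)
    return Amat.T @ np.diag(s ** 2) @ Amat

u = rng.standard_normal(n)
t = 1e-5
# central difference of g(t) = u^T H(x + t u) u gives D^3 phi(x)[u,u,u]
d3 = (u @ hess(x + t * u) @ u - u @ hess(x - t * u) @ u) / (2 * t)
bound = 2 * (u @ hess(x) @ u) ** 1.5
print(abs(d3), "<=", bound)                 # holds for any feasible x and u
```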

I’m not really going to talk much more about barrier functions, though, because as beautiful as they are, they don’t really seem to be very present in machine learning applications. In contrast, since in ML / Big Data stuff we generally use gradient methods (and whine enough when we have to use a full gradient rather than a stochastic one), the heavy machinery needed for this generalization doesn’t seem necessary. In particular, while log barriers are self-concordant, logistic regression and exponential margin regression are not. Take

$$f(x) =  -\log(\sigma(a^Tx)),\quad \sigma(\xi) = \frac{1}{1+e^{-\xi}}$$

then

$$
u^T\nabla^2 f(x) u = \sigma(a^Tx)(1-\sigma(a^Tx)) (a^Tu)^2 \overset{|a^Tx|\to +\infty}{\to} 0
$$

$$
|D^3 f(x)[u,u,u] |= |1-2\sigma(a^Tx)|\,\sigma(a^Tx)(1-\sigma(a^Tx))\,|a^Tu|^3
$$

so the ratio $\frac{|D^3 f(x)[u,u,u]|}{(u^T\nabla^2 f(x)u)^{3/2}} = \frac{|1-2\sigma(a^Tx)|}{\sqrt{\sigma(a^Tx)(1-\sigma(a^Tx))}}$ grows without bound as $|a^Tx|\to +\infty$, and it is clear that in this asymptotic limit, self-concordance fails for any fixed $M$.
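
Numerically, the offending ratio is easy to watch blow up; using the closed forms above (in 1-D, with $a = u = 1$, so everything is a scalar):

```python
import numpy as np

# The self-concordance ratio for f(x) = -log(sigma(a^T x)) blows up as
# |a^T x| grows, so no constant M works. z plays the role of a^T x.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigma(z)
    d2 = s * (1 - s)                       # u^T hess f u
    d3 = abs(1 - 2 * s) * s * (1 - s)      # |D^3 f[u,u,u]|
    print(z, d3 / d2 ** 1.5)               # = |1-2s|/sqrt(s(1-s)) -> infinity
```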

Ok, so why even care about this property?

Well, to me, it seems that even for first order methods, we are always trying to use quadratic approximations, usually with respect to the Euclidean norm. More recently, people have been all up in arms about using Bregman divergences instead as a metric, because they are more geometrically representative, or about using the Fisher information as a Hessian estimate. Well, in a way, self-concordance is kind of like the “gold standard” of such methods, because the idea is to use as much quadratic information as possible, but with respect to the norm induced by the current Hessian. Without proof, here are some very familiar looking inequalities I pulled from [1]:

Hessian conditioning

  • If $f$ is globally $\mu$-strongly convex and $L$-smooth:
    $$
    \mu I  \preceq \nabla^2 f(x) \preceq L I, \qquad \forall x
    $$
  • If $f$ is locally strongly convex at $x^*$, with globally $M$-continuous Hessian:
    $$
    0  \preceq \nabla^2 f(x) \preceq (L+M) I, \qquad \|x-x^*\|_2 \leq \frac{\mu}{M}
    $$
  • If $f$ is $M$-self-concordant:
    $$
    (1-Mr)^2\nabla^2 f(x) \preceq \nabla^2 f(x^*) \preceq \frac{1}{(1-Mr)^2}\nabla^2 f(x), \qquad r := \|x-x^*\|_{\nabla^2 f(x)} <  1/M.
    $$

Optimality

  • If $f$ is globally  $\mu$-strongly convex:
    $$f(x)-f^* \geq \frac{\mu}{2}\|x-x^*\|_2^2 $$
  • If $f$ is locally strongly convex at $x^*$, with globally $M$-continuous Hessian:
    \begin{eqnarray*}
    f(x)-f^* &=& \int_0^1 \nabla f(\underbrace{x^*+t(x-x^*)}_{=z})^T(x-x^*)\,dt \\
    &=&  \int_0^1 (\nabla f(z)-\underbrace{\nabla f(x^*)}_{=0})^T(x-x^*)\,dt \\
    &\geq& \int_0^1 \frac{\mu}{t}\|z-x^*\|_2^2 \,dt\\
    &=& \mu \|x-x^*\|_2^2 \int_0^1 t\,dt \\
    &=& \frac{\mu}{2}\|x-x^*\|_2^2
    \end{eqnarray*}
  • If $f$ is $M$-self-concordant
    $$f(x)-f^* \geq \frac{1}{M^2}\omega(M \|x-x^*\|_{\nabla^2 f(x^*)}), \quad \omega(\tau) = \tau-\log(1+\tau).$$

Convex conjugates

Recall that for the conjugate function $f^*(z) = \sup_x\;x^Tz-f(x)$, if both functions are twice differentiable, then via Fenchel-Young,

$$
\nabla f(x) = z \iff x = \nabla f^*(z)
$$

and

$$\nabla^2 f^*(\nabla f(x)) = (\nabla^2 f(x))^{-1}.$$

Well, from that, it is evident that if $f$ is $L$-smooth then $f^*$ is $\frac{1}{L}$-strongly convex, and vice versa.
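
As a quick sanity check of the Hessian identity, take the 1-D function $f(x) = -\log x$, whose conjugate is $f^*(z) = -1-\log(-z)$ on $z < 0$:

```python
import numpy as np

# Check (f*)''(f'(x)) = 1 / f''(x) for f(x) = -log(x), f*(z) = -1 - log(-z).
x = 1.7
z = -1.0 / x                       # z = f'(x)
d2f = 1.0 / x**2                   # f''(x)
d2fstar = 1.0 / z**2               # (f*)''(z)
print(d2fstar, 1.0 / d2f)          # both equal x^2
```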

In comparison, if $f$ is $M$-self-concordant, then $f^*$ is $M$-self-concordant as well! Somehow the “gap” between $L$ and $\mu$ is omitted! To see this, first note that
$$
D^3 f^*(z)[u,u,u] = \lim_{t\to 0} \frac{u^T(\nabla^2 f^*(z+tu)-\nabla^2 f^*(z))u}{t} =g'(0)$$
where $x(t)$ is defined by $\nabla f(x(t)) = z+tu$ and we write $x = x(0)$,
$$g(t) = u^T\nabla^2 f^*(z+tu)u = u^T\nabla^2 f(x(t))^{-1} u.
$$
The matrix $H(t) = \nabla^2 f(x(t))$ is a matrix map, and applying the chain rule (using $\tfrac{\partial}{\partial t} H(t)^{-1} = -H(t)^{-1}\dot H(t)H(t)^{-1}$),
$$
g'(0) =\frac{\partial}{\partial t} u^TH(t)^{-1}u|_{t=0} = -u^TH(0)^{-1} \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)H(0)^{-1} u = -u^T\nabla^2 f^*(z) \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)\nabla^2 f^*(z)u.
$$
Next, using the first order Taylor expansion, $\nabla f(x+tv) \approx \nabla f(x) + t\nabla^2 f(x)v$, and thus
$$ \nabla^2 f(x+tv)^{-1} = \nabla^2 f^*(\nabla f(x+tv)) \approx \nabla^2 f^*(\nabla f(x)+t\nabla^2 f(x) v)$$
with equality as $t\to 0$. Taking $z = \nabla f(x)$, then $u = \nabla^2 f(x) v \iff v = \nabla^2 f^*(z)u$, and therefore at $t\to 0$,
$$
H(t)  = \nabla^2 f(x(t)) =\nabla^2 f(x+tv) = \nabla^2 f(x+  t\nabla^2 f^*(z) u)$$
and $$|g'(0)| = |D^3 f^*(z)[u,u,u]|  =|D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u]|.$$
From the self-concordance of $f$, it follows that
$$
|D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u]| \leq 2M(u^T\nabla^2 f^*(z)\nabla^2 f(x)\nabla^2 f^*(z)u)^{3/2} =2M(u^T\nabla^2 f^*(z)u)^{3/2}.
$$
Thus, $f$ is $M$-self-concordant if and only if $f^*$ is $M$-self-concordant, for convex $f$.


Rates

Using these properties, we can show that if a function $f$ is $M$-self-concordant, then Newton’s method has a local quadratic rate, and the number of non-local steps is bounded and usually small in practice. In summary, we use the Newton decrement $\lambda(x) = \|\nabla f(x)\|_{\nabla^2 f(x)^{-1}} $ as a Hessian-warped residual measure, and at each step we take $$x^{(t+1)} = x^{(t)}-\beta \nabla^2 f(x^{(t)})^{-1} \nabla f(x^{(t)}).$$

  • Global convergence:  when $\lambda(x^{(t)}) \geq \frac{1}{2M}$, then using $\beta = \frac{1}{1+M\lambda(x^{(t)})}$, we have a guaranteed decrease of $$f(x^{(t+1)}) \leq f(x^{(t)}) - \frac{C}{M^2},$$ for some computable constant $C > 0$. This implies that in a finite, computable number of steps, we reach local convergence.
  • Local convergence: when $\lambda(x^{(t)}) \leq \frac{1}{2M}$, then using no damping ($\beta = 1$), we are guaranteed quadratic convergence; that is, $$f(x^{(t)})-f(x^*)=O(c^{2^t})$$ for some constant $c < 1$.

Since the proofs of these results are extensive, esoteric, and well-documented in many sources, I will forgo them.
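
Still, the scheme itself is short enough to sketch. Below is a minimal (assumption-laden) implementation of the damped/undamped phases, on a made-up log-barrier-plus-linear objective with $M = 1$ (assumed to have an interior minimizer):

```python
import numpy as np

# Two-phase (damped) Newton on phi(x) = c^T x - sum log(b_i - a_i^T x).
rng = np.random.default_rng(2)
m, n = 30, 5
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, m)               # x = 0 is strictly feasible
c = 0.1 * rng.standard_normal(n)

def grad_hess(x):
    s = 1.0 / (b - A @ x)
    return c + A.T @ s, A.T @ (s[:, None] ** 2 * A)

x, M = np.zeros(n), 1.0
for it in range(50):
    g, H = grad_hess(x)
    dx = -np.linalg.solve(H, g)
    lam = np.sqrt(g @ -dx)                 # Newton decrement lambda(x)
    if lam < 1e-10:
        break
    beta = 1.0 if lam <= 1 / (2 * M) else 1.0 / (1 + M * lam)   # damping
    x = x + beta * dx
    print(it, lam)      # lambda decays quadratically once the local phase starts
```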

Extensions

There are many interesting questions still to ask, and to be honest, a lot of people are already asking them. For example:

  • What about sketched Hessians? When there is error in computing the Hessian, can we still guarantee a local quadratic rate? The answer, according to [2], is unfortunately no; there can be superlinear convergence under special circumstances, but it is not obvious how to recover the quadratic local rate. Somehow this isn’t too surprising; even vanilla Newton’s method with $\beta \neq 1$ can’t achieve this rate easily.
  • What happens if $f$ is nonconvex? It turns out that, like $L$-smoothness, self-concordance doesn’t necessarily imply convexity. While there is not a ton of work in this area, one notable example is [3], which shows that at least the idea of self-concordant convex-concave saddle functions has been explored.

However, overall, it seems that though we like this property of “near-quadratic-ness”, it remains relatively less explored compared to $L$-smoothness and $\mu$-strong-convexity, maybe because it is rather involved to work with, or simply because we don’t really do that much second-order optimization in large-scale settings.

References

  1. Lectures on Convex Optimization, by Yurii Nesterov (Section 5.1).
  2. “Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence,” by Mert Pilanci and Martin J. Wainwright.
  3. “On Self-Concordant Convex-Concave Functions,” by Arkadi Nemirovski.

Convergence proofs IV: My journey with automatic proof generators

Post by Yifan Sun

In comparison to past posts, this post is really about some (relatively) recent work by a bunch of people ([1],[2],[3] to name a few). I therefore tried to spend a lot less time detailing how these things work, as these authors have posted their own excellent blog posts and presentations; however, some of it was necessary to make the point I wanted to make later.  The goal here is to use PEP to understand the rates of the three first-order methods, VGD, PHB, and NAG.

tl;dr: PEP is a construct to create automatic proof generators for methods by formulating the proof itself as the solution to a QCQP, whose semidefinite optimization relaxation turns out to be tight. It verifies the big-O complexity for the three methods (VGD, PHB, NAG), but can sometimes give tighter constants. I give here some performance examples on VGD, PHB, NAG, and show how the dual variables can be used to pull out the worst-case problem for VGD (verifying that the guarantee is tight and attainable).

The Procedure

Imagine the convex minimization problem $$\min_{x\in \mathbb R^d}\; f(x)$$ where we run our methods for $T$ iterations. For all three first order methods, we can summarize them as  $$X = G A + x_1\mathbf 1^T$$ where $X = [x_1,x_2,…,x_{T}]\in \mathbb R^{d\times T}$,  $G = \frac{1}{L}[\nabla f(x_1), \nabla f(x_2), …, \nabla f(x_T)]$, and $A$ describes the method iteration scheme. For example, taking $T = 5$, the VGD rule can be summarized simply as,
$$
A_{VGD} = \left[\begin{matrix} 0 & -1 & -1 & -1 & -1 \\ 0 & 0 & -1 & -1 & -1 \\0 & 0 & 0 & -1 & -1 \\0 & 0 & 0 & 0 & -1 \\0 & 0 & 0 & 0 & 0\end{matrix}\right]$$
and
$$
A_{PHB} = \left[\begin{matrix} 0 & -1 & -1-\beta & -1-\beta-\beta^2 & -1-\beta-\beta^2-\beta^3 \\ 0 & 0 & -1 & -1-\beta & -1-\beta-\beta^2 \\0 & 0 & 0 & -1 & -1-\beta \\0 & 0 & 0 & 0 & -1 \\0 & 0 & 0 & 0 & 0 \end{matrix}\right].$$
It’s a bit trickier to do this for $A_{NAG}$, at least symbolically, but numerically we can arrive at
$$
A_{NAG} = \left[\begin{matrix} 0 & -1 &  -1 &  -1 &  -1\\
0 & 0 &  -1.2818  & -1.4040 & -1.4690\\
0 & 0 & 0 &  -1.4340 &  -1.6645\\
0 & 0 & 0 & 0 & -1.5311\\
0 & 0 & 0 & 0 &0 \end{matrix}\right].
$$

(I’m sure there is also a nice closed form expression for $A_{NAG}$, and I have a nagging feeling it isn’t pretty.) But the point is that they exist, and they are strictly upper triangular, i.e. the system $X = GA + x_1\mathbf 1^T$ is explicit.
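
As a sanity check that such an $A$ really does reproduce the iterates, here is a small numerical verification for $A_{VGD}$ on a made-up quadratic (PHB can be checked the same way):

```python
import numpy as np

# Check the representation X = G A + x_1 1^T for VGD with step 1/L.
rng = np.random.default_rng(3)
d, T, L = 4, 5, 10.0
Q = np.diag(rng.uniform(1, L, d)); Q[0, 0] = L      # Hessian, lmax = L
x = rng.standard_normal(d)

X, G = [], []
for _ in range(T):
    g = Q @ x                                       # gradient of (1/2) x^T Q x
    X.append(x); G.append(g / L)                    # G columns scaled by 1/L
    x = x - g / L                                   # VGD step
X, G = np.array(X).T, np.array(G).T                 # both d x T

A = -np.triu(np.ones((T, T)), k=1)                  # strictly upper tri of -1s
print(np.allclose(X, G @ A + np.outer(X[:, 0], np.ones(T))))   # True
```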

Now, our goal is to solve a primal (nonconvex, infinite-dimensional) optimization problem abstractly defined as

$$
\begin{array}{ll}
\underset{f,x_i,x^*}{\mathrm{maximize}} & f(x_T)\\
\mathrm{subject~to} & X = GA + x_1\mathbf 1^T\\
& f\in \mathcal C_L^{1,1}
\end{array}$$

where $\mathcal C_L^{1,1}$ are the class of $L$-smooth convex functions mapping $\mathbb R^d\to \mathbb R$. Unless there’s some secret fairy godmother out there granting magic wishes, this problem is not tractable.

So, what PEP proposes to do instead is to take this difficult constraint $f\in \mathcal C_L^{1,1}$ and replace it with conditions that $f$ must satisfy at only the “important” points. Specifically, in [1], the inequality of choice is $$ f(y) - f(z) \geq \nabla f(z)^T(y-z) + \frac{1}{2L}\|\nabla f(y)-\nabla f(z)\|_2^2. \qquad (\triangle)$$ Later, they show how to integrate a whole host of nice inequalities, which really generalizes the methodology, but since I am uncreative by nature, I decided to just use this one, especially since it summarizes $L$-smoothness and convexity all in one nifty package. The inequality $(\triangle)$ is then imposed on all $(T+1)T$ ordered pairs  $(y,z)\in \{x_1,…,x_T,x^*\}^2$ with $y\neq z$. Then, the abstract quantities $f(x)$ and $\nabla f(x)$ are replaced by scalar and vector variables $\delta_i = \tfrac{1}{L}(f(x_i)-f^*)$ and $g_i = \tfrac{1}{L}\nabla f(x_i)$ for $i = 1,…,T$. This problem, though still nonconvex, is now finite-dimensional. (The authors in [1],[2],[3] show these constraints are not just necessary but also sufficient, calling sets of values satisfying them $\mathcal C_L^{1,1}$-interpolable.)

$$
\begin{array}{ll}
\underset{\delta_i,g_i,x_i,x^*}{\mathrm{maximize}} & \delta_T\\
\mathrm{subject~to} & X = GA + x_1\mathbf 1^T\\
& \delta_i-\delta_j \geq  g_j^T(x_i-x_j) + \frac{1}{2L}\|g_i-g_j\|_2^2,\; i\neq j\in \{1,…,T,*\}.
\end{array}
\qquad\text{(P)}$$
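
For intuition, it is worth convincing yourself that $(\triangle)$ really does hold for $L$-smooth convex functions; here is a quick random spot-check on a made-up quadratic:

```python
import numpy as np

# Spot-check the interpolation inequality (triangle) on an L-smooth convex
# quadratic f(x) = (1/2) x^T Q x at random pairs (y, z).
rng = np.random.default_rng(4)
d, L = 5, 10.0
Q = np.diag(rng.uniform(0.5, L, d)); Q[0, 0] = L
f = lambda x: 0.5 * x @ Q @ x
g = lambda x: Q @ x

for _ in range(1000):
    y, z = rng.standard_normal(d), rng.standard_normal(d)
    lhs = f(y) - f(z)
    rhs = g(z) @ (y - z) + np.linalg.norm(g(y) - g(z)) ** 2 / (2 * L)
    assert lhs >= rhs - 1e-9
print("(triangle) held on all sampled pairs")
```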

The dualization is then done by defining some Lagrange variables

$$\begin{array}{rrcl}
u_t: & \delta_t &\geq&     \frac{1}{2L}\|g_t\|_2^2\\
v_t: &-\delta_t &\geq& g_t^T(x^*-x_t) + \frac{1}{2L}\|g_t\|_2^2\\
W_{i,j}:&\delta_i-\delta_j &\geq& g_j^T(x_i-x_j) + \frac{1}{2L}\|g_i-g_j\|_2^2,\;i\neq j.
\end{array}$$

I’ll leave it up to the papers to give the full derivation, but there are two very cute steps that I think are nontrivial.

  • After minimizing out the $\delta_i$ terms, the Lagrangian appears like
    $$
    \mathcal L=\mathbf{tr}(GMG^T)-v^TG^T(x_1-x^*)
    $$
    which implies that the optimal (read: worst case) $G$ is rank-1, with column space determined by $x_1-x^*$; i.e. $G = (x_1-x^*)\xi^T$ for some $\xi\in\mathbb R^T$.
  • The problem is then transformed to an SDP with the nifty observation that
    $$
    \min_{\xi}\; \xi^TM\xi - v^T\xi \geq -\eta \quad \iff \quad  \left[\begin{matrix}M & -v/2\\-v^T/2 & \eta \end{matrix}\right]\text{ is PSD}
    $$

This finally gives the (dual) semidefinite optimization problem

$$
\begin{array}{ll}
\underset{{\eta,u,v,W}}{\mathrm{minimize}} & \eta \cdot \frac{\|x_1-x^*\|_2^2}{L}\\
\mathrm{subject~to} & e_T = v-u+\sum_j\sum_{k\neq j} (W_{k,j}-W_{j,k})e_j\\
&\left[\begin{matrix}M & -v/2\\-v^T/2 & \eta \end{matrix}\right]\text{ is PSD}\\
& u\geq 0, v\geq 0, W\geq 0
\end{array}
\qquad\text{(D)}$$
where $M = \frac{1}{2}(M_0+M_0^T)$ for
$$
M_0 = \sum_{t=1}^T \left(\frac{u_t+v_t}{2} e_te_t^T - v_tAe_te_t^T\right) + \sum_{i\neq j} W_{ij}\left( A (e_i-e_j)e_j^T + \frac{1}{2}(e_ie_i^T+e_je_j^T-e_ie_j^T-e_je_i^T)\right).
$$

Here $e_k = [\underbrace{0,…,0}_{k-1\text{ $0$s}},1,0,…,0]^T$ is the $k$-th standard basis vector.

The worst-case problem

I’ll give the procedure for finding the worst case problem for VGD, which at the very least gives some insight as to how simple these “worst” problems can be. I am still in the process of figuring out PHB and NAG, so I will leave those questions hanging.

  • First, note that dimensionality never came into play; none of the dual variables depend on $d$. Therefore, we can just assume $d =1$, i.e. all the variables $x_i$ are scalars.
  • Second, recall that the optimal $G$ is rank 1. That means that in the worst case problem, we can assume that all the gradients satisfy $\nabla f(x_i) = c_i(x_1-x^*)$, where $c_i$ is just a scalar! This further reduces the search space.
  • Finally (and this comes from the optimization problem), in the VGD case, the dual variables can be observed to be optimized at $u = 0$ and $v_i = 0$ for all $i < T$, and $v_T = 1$. The $W$ terms are trickier; numerically, I did not observe much sparsity in $W_{ij}$ for off-diagonal values ($W_{ii} = 0$ because no constraints are encoded for $i =j$). However, for VGD, if you explicitly set $W_{ij} = 0$ for all $|i-j|\geq 2$, you will not see any difference in the objective value achieved! Therefore, you know that a proof over a worst case problem can exist involving only the inequalities
    \begin{eqnarray*}
    f(x_t)-f(x_{t+1}) &\geq& \nabla f(x_{t+1})^T(x_t-x_{t+1}) + \frac{1}{2L}\|\nabla f(x_t)-\nabla f(x_{t+1})\|_2^2, \; \quad t = 1,…,T-1\\
    f(x_{t+1})-f(x_{t}) &\geq& \nabla f(x_{t})^T(x_{t+1}-x_{t}) + \frac{1}{2L}\|\nabla f(x_t)-\nabla f(x_{t+1})\|_2^2, \; \quad t = 1,…,T-1\\
    f^*-f(x_T) &\geq& \nabla f(x_T)^T(x^*-x_T) + \frac{1}{2L}\|\nabla f(x_T)\|_2^2
    \end{eqnarray*}
    and moreover, because of complementary slackness, these inequalities can all be set to equality in the worst case problem!

Putting these all together, I have the following two equalities:

\begin{eqnarray*}
f(x_{t+1})-f(x_t) + \frac{1}{L}c_t^2 &=& \frac{1}{2L}(c_t^2-2c_tc_{t+1}+c_{t+1}^2)\\
f(x_t)-f(x_{t+1}) - \frac{1}{L} c_tc_{t+1} &=& \frac{1}{2L}(c_t^2-2c_tc_{t+1}+c_{t+1}^2)
\end{eqnarray*}

which, added together, imply that

$$c_{t+1}^2-c_tc_{t+1}=0 \iff c_{t+1} = 0 \text{ or } c_t=c_{t+1}.$$

To avoid trivial functions, this tells me that, for all my discretization points (not including $x^*$), I have $$\nabla f(x_t) = c_t(x_1-x^*) = c(x_1-x^*).$$ Without loss of generality I can pick $x^* = 0$, and it is clear that I have a function that, at all the points sampled, is linear with slope $c$.

Note that simply ending the story with $f(x) = cx$ is not sufficient, as a linear function has no minimizer $x^*$ at all. The rest of this part took a bit of trial and error, but if you pick this Huber function
$$ f(x) = \begin{cases} \frac{x^2}{2} & |x|\leq \alpha \\ \alpha |x| - \frac{\alpha^2}{2} & |x| \geq \alpha\end{cases}$$
and in particular set $\alpha = x_T/T$, you can control the slope and force $L=1$ at the same time, achieving the exact worst case bound predicted by PEP. Therefore, this bound is tight.
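
A quick numerical check of this construction (I write the kink as $\alpha = 1/(2T-1)$, which for this particular trajectory works out to the same value as the $x_T/T$ rule above):

```python
import numpy as np

# Worst-case Huber check: VGD (step 1/L = 1) over iterates x_1, ..., x_T
# hits the tight PEP bound L ||x_1 - x*||^2 / (4(T-1) + 2) from [1].
T = 5
alpha = 1.0 / (2 * T - 1)           # kink location; equals x_T / T here

def df(x):                          # gradient of the Huber function, L = 1
    return x if abs(x) <= alpha else alpha * np.sign(x)

x = 1.0                             # x_1, with x* = 0
for _ in range(T - 1):              # T-1 updates take us from x_1 to x_T
    x = x - df(x)
fT = alpha * abs(x) - alpha**2 / 2  # f(x_T); iterates stay in the linear part
print(fT, 1.0 / (4 * (T - 1) + 2))  # equal: the bound is attained
```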

With that, I leave you with a cute photo of my new houseguest. #ratsarethebest

Thanks to Adrien, who introduced me to this tool and kept nudging me to look at it more closely.

References

[1] Drori, Y, and Teboulle, M. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming  (2014).

[2] Taylor, A.B., Hendrickx, J.M. and Glineur, F. Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In  Annual Conference on Decision and Control (CDC) (2017)

[3] Drori, Y. and Taylor, A.B., Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming (2020)

[4] Su, W., Boyd, S. and Candes, E.J. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. JMLR (2016).

Newton’s method I: Quadratic convergence rate

The next couple posts will focus on our favorite second order method: Newton’s method. I’ve been going through them, partly as “review” (in quotes because I ended up learning a lot of new things) and partly to develop some intuition as to when acceleration-by-leveraging-second-ordery-info might actually help. This first post will be super short, and focus on analysis of Newton’s method under two strong-convexity-like conditions.

Consider the unconstrained minimization of $f(x)$ where $f$ is everywhere twice differentiable. Then Newton’s method iterates as $$x^{(t+1)} = x^{(t)} – (\nabla^2 f(x^{(t)}))^{-1}\nabla f(x^{(t)}).$$ Unlike gradient descent, there is no “step size”; the curvature information is already incorporated in the Hessian inverse. Also, as a responsible optimist, I must point out that in practice we never take the Hessian inverse, but instead compute the matrix-inverse-vector-product using some iterative method, like conjugate gradient.
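
For example, here is a sketch of that direction computation with scipy’s conjugate gradient, on a made-up positive definite $H$ and vector $g$ standing in for $\nabla^2 f$ and $\nabla f$ at the current iterate:

```python
import numpy as np
from scipy.sparse.linalg import cg

# Compute the Newton direction by solving H dx = -g iteratively, rather
# than forming an explicit inverse.
rng = np.random.default_rng(5)
n = 50
B = rng.standard_normal((n, n))
H = B @ B.T + np.eye(n)                  # symmetric positive definite
g = rng.standard_normal(n)

dx, info = cg(H, -g)                     # conjugate gradient; info == 0 on success
print(info, np.linalg.norm(H @ dx + g))  # residual of the Newton system
```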

Suppose the Hessian is $M$-Lipschitz ($M$-continuous): $$\|\nabla^2 f(x)-\nabla^2 f(y)\|_2 \leq M\|x-y\|_2.\quad (\star)$$ We all learn Newton’s method as having a locally quadratic convergence rate. The usual question is then: what is the global convergence rate? Well, actually, if you stare and squint at the proofs given in Nesterov’s works, you see the answer is right in front of you.

  • If the function is convex but not strongly convex, then Newton’s method may not work at all. In fact, at any point $x^{(t)}$ where $\nabla^2 f(x^{(t)})$ has zero eigenvalues, you can either take a damped Newton step (via line search) or switch to gradient steps. So, unsurprisingly, the rate here should be $O(1/t)$, the same as in gradient descent.
  • If the function is $\mu$-strongly convex and $\mu > 0$, then Newton’s method has a locally quadratic convergence rate, and the proof is like 3 lines.
  • If the function is $\mu$-strongly convex at the optimum, then we still cannot say anything globally, but we can say that the method has local quadratic convergence. The proof is a bit longer, and the point at which quadratic convergence kicks in could come much later, though.

In other words, it’s an all-or-nothing game. Either I get quadratic convergence, or I am better off switching to gradient descent (at least until we’re close enough to the quadratic regime).

The proof for this convergence is actually really cute, once you know all the pieces. Use the integral form for $\nabla f(x^{(t)})$: $$\nabla f(x^{(t)}) -\nabla f(x^*)= \int_0^1 \nabla^2 f(x^* + \tau (x^{(t)}-x^*) ) (x^{(t)}-x^*) d\tau. $$ Then, since $\nabla f(x^*) = 0$,

\begin{eqnarray*}
x^{(t+1)}-x^* &=& x^{(t)} -( \nabla^2 f(x^{(t)}))^{-1} \nabla f(x^{(t)})\\
&=& x^{(t)} - ( \nabla^2 f(x^{(t)}))^{-1} \left( \int_0^1 \nabla^2 f(x^* + \tau (x^{(t)}-x^*) ) (x^{(t)}-x^*) d\tau\right)\\
&=&( \nabla^2 f(x^{(t)}))^{-1}\left(\int_0^1\Big(\nabla^2 f(x^{(t)}) - \nabla^2 f(x^* + \tau (x^{(t)}-x^*))\Big) d\tau \right)(x^{(t)}-x^*).
\end{eqnarray*}

Taking norms,

\begin{eqnarray*}
\|x^{(t+1)}-x^*\|_2 &\leq & \| (\nabla^2 f(x^{(t)}))^{-1}\|_2\left(\int_0^1 M\|x^{(t)}-x^* - \tau (x^{(t)}-x^*)\|_2d\tau\right)\|x^{(t)}-x^*\|_2 \\
&=& \| (\nabla^2 f(x^{(t)}))^{-1}\|_2\;\underbrace{\left(\int_0^1 (1-\tau) d\tau\right)}_{=1/2}\; M \|x^{(t)}-x^*\|_2^2 \\
&=& \frac{M}{2}\| (\nabla^2 f(x^{(t)}))^{-1}\|_2 \|x^{(t)}-x^*\|_2^2.
\end{eqnarray*}

Now the question is, what is an upper bound on $\| (\nabla^2 f(x^{(t)}))^{-1}\|_2$? Well, if $f$ is $\mu$-strongly convex everywhere, then $\| (\nabla^2 f(x^{(t)}))^{-1}\|_2 \leq 1/\mu$ everywhere, and we can say globally that

$$ \|x^{(t+1)}-x^*\|_2 \leq  \frac{M}{2\mu}  \|x^{(t)}-x^*\|_2^2. $$

Now, if $\|x^{(t)}-x^*\|_2 < \frac{2\mu}{M}$, this implies contraction, and henceforth, for $t’ > t$, there will be quadratic convergence. If that hasn’t yet occurred, a simple line search add-on should make sure that a damped Newton step makes progress, globally.

If, on the other hand, we loosen our requirements to simply $\lambda_{\min}(\nabla^2 f(x^*)) \geq \mu$, then by ($\star$), $\lambda_{\min}(\nabla^2 f(x^{(t)})) \geq \mu - M\|x^{(t)}-x^*\|_2$, and

$$ \|x^{(t+1)}-x^*\|_2 \leq  \frac{M}{2(\mu-M\|x^{(t)}-x^*\|_2)}  \|x^{(t)}-x^*\|_2^2\leq \frac{M}{2\mu(1-\theta)}  \|x^{(t)}-x^*\|_2^2 $$

if $\|x^{(t)}-x^*\|_2 \leq \frac{\mu\theta}{M}$ for some $\theta < 1$. Nesterov picks $\theta = 2/3$, but anything will do.
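
To see the doubling-of-digits behavior concretely, here is a tiny demo on a made-up strongly convex function $f(x) = \sum_i x_i^4 + \tfrac{1}{2}\|x\|_2^2$ (minimizer $x^* = 0$):

```python
import numpy as np

# Local quadratic convergence: the error exponent roughly doubles per step.
n = 3
x = 0.4 * np.ones(n)                    # start close enough to x* = 0
for it in range(6):
    g = 4 * x**3 + x                    # gradient
    H = np.diag(12 * x**2 + 1)          # Hessian (diagonal for this f)
    x = x - np.linalg.solve(H, g)       # undamped Newton step
    print(it, np.linalg.norm(x))        # ~0.17, 3e-2, 2e-4, 1e-10, ...
```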

I don’t think Nesterov himself was very satisfied with this rate; in particular, in the second case we’re adding a new parameter we can’t measure ($\theta$) on top of the $M$-continuous Hessian and $\mu$-strong convexity, and it is hard to say exactly which functions satisfy these conditions. It seems like this was the motivation behind the next two topics: Newton with cubic regularization for generalized functions, and Newton’s method for self-concordant functions.

Reference

[1] Nesterov, Y. Lectures on Convex Optimization. 2018, Chapter 1.2.


Edit July 12 2002: Thanks Mark Schmidt for pointing out that there is a serious flaw in my statement that Newton’s method can converge quadratically, globally. We can guarantee globally that $\|x^{(t+1)}-x^*\|_2 \leq  \frac{M}{2\mu} \|x^{(t)}-x^*\|_2^2$, but unless $\|x^{(t)}-x^*\|_2 < \frac{2\mu}{M}$, this does not imply contraction. In other words, even if $f$ has an $M$-continuous Hessian and is $\mu$-strongly convex, we still need to use line search and damped Newton in the initial phases until the locally quadratic convergence rate kicks in. tl;dr, no global quadratic convergence (sorry!). This has been amended in the post.

Convergence proofs III: Continuous time interpretation

Post by Yifan Sun

I have recently been caught up in the flurry of analyzing popular convex optimization methods as discretizations of continuous ODEs, with the objective of simplifying convergence analysis and constructing better methods. This “flurry” is by no means recent, with a somewhat “but we always knew this” vibe, and is by no means restricted to the couple papers cited in this post. However, since I’m spending these couple posts focused on the three methods VGD, PHB, and NAG, the string of papers I cite here directly addresses continuous ODE interpretations of these methods. While I don’t think I can ever fully untangle the relationships of these three methods to each other, continuous ODEs provide an interesting interpretation, and interestingly enough, also shed light on how to construct good estimate sequences.

tl;dr: continuous time proofs are way easier than their discrete versions, but the “rankings” are somewhat preserved: VGD + strong convexity is easiest, followed by VGD + convex, with a significant jump in difficulty for the two accelerated methods. The flavor of the proofs is also not that different from that of the discretized proofs, where telescoping $\to$ integrating, estimate sequence functions $\to$ energy functions, etc. Still, the results are somewhat surprising; namely, there is no significant difference between the PHB and NAG iteration schemes in the continuous limit; however, each choice of optimal parameters $\alpha$ and $\beta$ does make a significant difference.

The big idea

The most relevant continuous ODE in convex optimization is gradient flow $$\dot z(t) = -\nabla f(z(t)),\qquad \text{(Gradient Flow)}$$ where many have made the observation that the forward Euler discretization scheme results in the VGD method $$x_{k+1} \approx x_k + h \dot z(hk) = x_k - h\nabla f(x_k). \qquad \text{(VGD)}$$ There are two advantages of analyzing this ODE over VGD itself:

  1. The analysis of (Gradient Flow) has very few gradient smoothness requirements. In fact, as long as the gradient exists and is continuous everywhere, we are golden; no need for $L$-smoothness with any specified $L$. This is because problems caused by bad $L$ estimation translate into discretization errors, which can be swept under the carpet of discretization convergence rates, which are well known in numerical analysis.
  2. The analysis is “easier to generalize”. Often, the proofs of convergence for standard methods can be long and involved, and somewhat unintuitive to a newcomer. In contrast, the proofs for the ODE, even if they are involved, somewhat correlate with standard differential equations results. Additionally, if the methods are relatable to physical systems (we see words like “momentum” and “friction” pop up) then this makes the methods seem a bit less opaque.

However, nothing comes for free. In my view, the primary disadvantage of continuous time analysis is that there are different ways to discretize, and different discretizations seem to lead to different results.

Throughout this post, I will use $z(t)$ as a continuous trajectory of a vector in $\mathbb R^n$. The discretization samples will be represented as $x_k = z(hk)$, i.e. uniformly sampled at time intervals $h$ apart.
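
As a sanity check of both the discretization and the rate derived next, here is forward Euler on a made-up strongly convex quadratic, compared against the continuous-time prediction $\mathcal E(t) \leq \mathcal E(0)e^{-2\mu t}$:

```python
import numpy as np

# Forward Euler on gradient flow for a strongly convex quadratic, compared
# against the continuous-time bound E(t) <= E(0) e^{-2 mu t}.
mu, L, h, K = 0.5, 4.0, 0.01, 1000      # h = step size, K = number of steps
Q = np.diag([mu, L])
f = lambda z: 0.5 * z @ Q @ z           # f* = 0, so E = f
z0 = np.array([1.0, 1.0])

z = z0.copy()
for _ in range(K):
    z = z - h * (Q @ z)                 # x_{k+1} = x_k - h grad f(x_k)

t = h * K
print(f(z), f(z0) * np.exp(-2 * mu * t))   # discrete E(t) sits below the bound
```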

Vanilla gradient descent

Strongly convex case

In the case that the objective function $f$ is $\mu$-strongly convex, we use the PL-inequality version, which states that  $$\frac{1}{2}\|\nabla f(x)\|_2^2 \geq \mu(f(x)-f^*), \quad \forall x.$$ Then, describing a Lyapunov function (think energy function) as $$\mathcal E(t) = f(z)-f^*,$$ we have a first-order differential inequality: $$\dot{\mathcal E}(t) = \nabla f(z)^T\dot z = -\|\nabla f(z)\|_2^2 \overset{PL}{\leq} -2\mu (f(z)-f^*) = -2\mu \mathcal E(t)$$  with the standard (Grönwall) bound $$\mathcal E(t) \leq \mathcal E(0) e^{-2\mu t}.$$ This immediately gives linear convergence with rate $e^{-2\mu}$ per unit time.

Convex not strongly case

Although the standard convergence proof is often harder when we drop the strong convexity requirement, the rate for gradient norm convergence in the continuous case can be derived with a very cute trick. We simply have the observation that, using the same Lyapunov function, $$\mathcal E(t) -\mathcal E(0)= \int_0^t \dot{\mathcal E}(s) ds = -\int_0^t \|\nabla f(z(s))\|_2^2  ds.$$ Since $\mathcal E(t) \geq 0$ for all $t$, we can rearrange to write $$\frac{1}{t} \int_0^t \|\nabla f(z(s))\|_2^2 ds \leq \frac{\mathcal E(0)}{t},$$ or, in other words, the average squared gradient norm decays as $O(1/t)$. We can also see that the quantity $\|\nabla f(z(t))\|_2^2$ is always nonincreasing as long as $f$ is convex, since $$\frac{\partial}{\partial t} \|\nabla f(z)\|_2^2 = 2\nabla f(z)^T\nabla^2 f(z) \dot z = -2\nabla f(z)^T\nabla^2 f(z) \nabla f(z) \leq 0.$$ Therefore we can strengthen our statement to simply $\|\nabla f(z(t))\|_2^2 \leq \frac{\mathcal E(0)}{t}$. (Of course, one may argue that a telescoping argument in the discretized case does not require that many more lines, but it does require first deriving the descent lemma, which we don’t need to do here. Translation: I didn’t need to care about $L$-smoothness.)

Accelerated methods

After playing with the equations for (more than a short) while, I have come to the conclusion that there is good reason as to why extending this analysis to accelerated methods results in lengthy publications. It is not only a bit nontrivial as to how to do the analysis, but even the right discretization scheme seems not that obvious. The two accelerated methods can be written in one line as

\begin{align*}x_{k+1} &= x_k - \alpha\nabla f(x_k) + \beta(x_k-x_{k-1}) & \text{(PHB)}\\x_{k+1} &= x_k - \alpha\nabla f(x_k) + \beta(x_k-x_{k-1}) - \beta \alpha (\nabla f(x_k)-\nabla f(x_{k-1})) & \text{(NAG)}\\\end{align*}

where the optimal choices for $\alpha$ and $\beta$ are author- (method-) dependent:

  • Polyak’s: $ \alpha_P = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$, $\beta_P = \frac{(\sqrt{L}-\sqrt{\mu})^2}{(\sqrt{L}+\sqrt{\mu})^2}. $
  • Nesterov’s: $ \alpha_N = \frac{1}{L}$,  $\beta_N = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}.$

We make a combination of forward and backward Euler substitutions: $$\ddot z(t) = \frac{x_{k+1}-2x_k+x_{k-1}}{h^2}, \quad \dot z(t) = \frac{x_{k}-x_{k-1}}{h}, \quad \dot{\nabla f}(z(t)) = \frac{\nabla f(x_{k})-\nabla f(x_{k-1})}{h}$$

and the corresponding ODEs, parametrized by the discretization $h$, is

\begin{align*}\ddot zh^2 &+ \alpha  \nabla f(z) + (1-\beta) h \dot z = 0 & \text{(PHB)}\\\ddot z h^2  & +  \alpha  \nabla f(z) + (1-\beta) h \dot z + \beta \alpha  h \dot{\nabla f}(z) = 0 & \text{(NAG)}\\\end{align*}

where $z = z(t)$ is time-varying and we fix $\alpha$ and $\beta$ to be constant. (This is not the case when $f$ is not strongly convex, which is discussed in [3,4] but I’m ignoring here.) In other words, the key difference between (NAG) and (PHB), in the continuous time setting, is the extra $\tfrac{\partial}{\partial t} \nabla f(z)$ term. Intuitively, this makes sense; chain ruling out $\tfrac{\partial}{\partial t}\nabla f(z) = \nabla^2 f(z) \dot z$  shows the participation of a second order “dragging” term in a first order method.

The choice of parameters $\alpha$, $\beta$, and discretization $h$ all matter.  At first glance, I assumed that $h = \alpha = O(1/L)$ was the most logical choice, since it is the choice used in gradient flow. However, using that, we immediately see that taking the limit of $h\to 0$ then the 2nd order term $\ddot z$ must disappear, which loses any interesting 2nd order behavior. Instead, we should  follow the procedure of the authors in [3] and pick $h = 1/\sqrt{L}$. We now consider our two choices of optimal $\alpha$ and $\beta$. The two choices are different, but interestingly, at the limit of $L\to +\infty$, both choices result in $\alpha = O(1/L)$, $\beta = O(1)$, $1-\beta = O(1/\sqrt{L})$. To be more specific and accounting for constants, in this limit,

\begin{align*}\ddot z(t) &+ O(1) \nabla f(z(t)) + O(1) \dot z(t)  = 0 & \text{(PHB)}\\\ddot z(t)   & +  O(1) \nabla f(z(t)) + O(1) \dot z(t) + O(h) \dot{\nabla f}(z(t))= 0 & \text{(NAG)}\\\end{align*}

and in the limit of $h\to 0$, the two methods are identical, with the choice of best $\alpha$, $\beta$ changing some constants in the first-order terms. This is the observation given in several papers, including [3,4], and acknowledged in the followup [5], which studies more precisely what happens when we bring back the $h\dot{\nabla f}(z(t))$ term. Since the continuous time limit only considers $h\to 0$, we must interpret this as a discretization effect, and this simple example highlights the importance of studying such effects. (Without it, continuous time PHB and NAG are identical up to the choice of $\alpha$, $\beta$!)

But, I am lazy, it is Friday, and thus I am going to sweep this under the rug and ignore it. I will simply consider a simplified ODE as

$$\ddot z    + a\nabla f(z) +b\dot z  = 0 \qquad (\star)$$

and if $\alpha = \alpha_P$ and $\beta = \beta_P$ then

$$a = \frac{4}{(1+h\sqrt{\mu})^2} \overset{h\to 0}{\to} 4, \quad b = \frac{4\sqrt{\mu}}{(1+h\sqrt{\mu})^2}\overset{h\to 0}{\to} 4\sqrt{\mu};$$

if $\alpha = \alpha_N$ and $\beta = \beta_N$ then

$$a = 1, \quad b = \frac{2\sqrt{\mu}}{1+h\sqrt{\mu}} \overset{h\to 0}{\to} 2\sqrt{\mu}.$$
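
A quick numerical integration of $(\star)$ with the Nesterov limiting constants shows the accelerated decay (all numbers, step sizes, and horizons made up):

```python
import numpy as np

# Crude integration of (star) with a = 1, b = 2 sqrt(mu) on a quadratic;
# the energy at t = 200 is compared with the e^{-2 mu t} of gradient flow.
mu, L = 0.01, 1.0
Q = np.diag([mu, L])
a, b = 1.0, 2 * np.sqrt(mu)
h, K = 1e-3, 200_000                    # step size and number of steps
z, v = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(K):                      # semi-implicit Euler on (star)
    v = v + h * (-a * (Q @ z) - b * v)
    z = z + h * v

t = h * K
print(0.5 * z @ Q @ z, np.exp(-2 * mu * t))  # accelerated decay is far smaller
```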

I am now basing the derivations on what is presented in [4], which derives the $O(1/t^2)$ rate for continuous time PHB and NAG (along with many interesting generalizations). However, I am somewhat more interested in the $\mu$-strongly convex case, so I will slightly modify the Lyapunov function, to

$$\mathcal E(t) =\frac{1}{2} \|A(z-x^*)+B\dot z\|_2^2 + \frac{C}{2}\|z-x^*\|_2^2 + D (f(z)- f^*) $$

where $A$, $B$, $C$, and $D$ are nonnegative time-varying scalar parameters.  My basis for the extension of $\mathcal E$ is that it seems to better imitate the mixing behavior of quadratic estimate sequences when $\mu > 0$. (To show a $O(1/t^2)$ rate for $\mu = 0$, take $A = B = 1$, $C = 0$.) Now, the goal is to show that $\dot {\mathcal E} \leq 0$ to show it is nonincreasing, and use this to conclude

$$D(t) (f(z)-f^*) \leq \mathcal E(t) \leq \mathcal E(0) \; \Rightarrow \; f(z)-f^* \leq \frac{\mathcal E(0)}{D(t)}$$

which isolates the rate of decay of $f(z(t))-f^*$ to the decay rate of $1/D(t)$. (Note the extreme similarity to the technique in the previous post!)

Just as the difficult step in the previous post was showing that $\phi^*_t > f(x^{(t)})$ at each step, now the difficult step is showing that $\dot {\mathcal E}(t) < 0$. The effect is somewhat analogous, which shows that in math, as in life, there are no real shortcuts. We write

\begin{eqnarray*}\dot{\mathcal E}(t) &=& (A(z-x^*)+B\dot z)^T(\dot A(z-x^*) + A\dot z +\dot B\dot z + B \ddot z) + \frac{\dot C}{2}\|z-x^*\|_2^2 \\&&\qquad + C(z-x^*)^T\dot z + \dot D (f(z)- f^*)  + D \nabla f(z)^T\dot z\end{eqnarray*}

Inserting in $(\star)$

\begin{eqnarray*}\dot{\mathcal E} &=& \left(\dot A B + A(A+\dot B-bB)+C\right)\dot z^T (z-x^*)    + (B\dot z)^T( (A+\dot B - bB)\dot z ) - ABa(z-x^*)^T\nabla f(z) \\&&\qquad + \dot D (f(z)- f^*)  + \left(D - aB^2\right) \nabla f(z)^T\dot z    + \left(A\dot A + \frac{\dot C}{2}\right)\|z-x^*\|_2^2\\
&=&   -ABa(z-x^*)^T\nabla f(z) + \dot D (f(z)- f^*)     + \left(A\dot A + \frac{\dot C}{2}\right)\|z-x^*\|_2^2\\
\end{eqnarray*}

where in the second line we force $A+\dot B - bB = 0$, $\dot A B + C = 0$, and $D=aB^2$. Adding in $\mu$-strong convexity of $f$,

\begin{eqnarray*}
\dot{\mathcal E} &\leq&   (\dot D-ABa)\underbrace{(z-x^*)^T\nabla f(z)}_{\geq 0}  + \left(A\dot A + \frac{\dot C}{2} – \frac{\dot D \mu}{2}\right)\|z-x^*\|_2^2\\
\end{eqnarray*}

and thus $\mathcal E$ is monotonically nonincreasing whenever $\dot D \leq ABa$,  $A\dot A + \frac{\dot C}{2} \leq  \frac{\dot D \mu}{2}$. Since ideally we want $1/D$ to decay to show a convergence rate, we additionally require $0 < \dot D.$

Now, rather than solving some complicated (but single variable) differential inequalities, I will instead insert the guess that $D \propto e^{c_D t}$, since anyway I know I am after a linear convergence rate. In order for this to happen, it is most likely that all the scalar players go by exponential rates, e.g. $A \propto e^{c_A t}$, $B \propto e^{c_Bt}$, etc. Let’s start with  $B = B_0e^{c_B t}.$ Then

\begin{eqnarray*}D &=& aB^2 = aB_0^2e^{2c_B t}, \\ A &=& bB-\dot B = B_0(b-c_B)e^{c_B t}, \\ C &=& -\dot A B = -B_0^2(b-c_B)c_Be^{2c_B t}.\end{eqnarray*}

Inserting these into our constraints on $\dot D$,

$$ABa =aB_0^2 (b-c_B)e^{2c_B t} \geq \dot D \geq \max\{0,\frac{2A\dot A + \dot C}{\mu} \}\geq 0. $$

Since we would like $D$ to increase as quickly as possible, we pick the upper end, and get

$$\dot D = B_0^2a (b-c_B)e^{2c_B t} \Rightarrow D = D_0 e^{2c_B t}$$

for some $D_0 > 0$ (with the stipulation that $b> c_B \geq 0$).

Finally, the question is: what is the radius of convergence? Well, we want $c_B$ to be as big as possible, but it must be strictly smaller than $b$. So, ultimately, we should observe a rate of convergence upper bounded by $O(e^{-2bt})$. Interestingly, both optimal parameter choices give $b = O(\sqrt{\mu})$ here: Polyak’s gives $b = 4\sqrt{\mu}$ and Nesterov’s gives $b = 2\sqrt{\mu}$, so both seem to recover the expected (accelerated) $e^{-O(\sqrt{\mu})t}$ radius of convergence, rather than the $e^{-O(\mu)t}$ of plain gradient flow.

Final thoughts

The idea that both Polyak’s method and Nesterov’s method can be analyzed using the same continuous time ODE, with acceleration seemingly dependent only on the choice of parameters $\alpha$ and $\beta$, is somewhat unsettling, and is addressed in further works, such as [5]. (While I have only shown this in the strongly convex case, in [4] it is also shown in the convex-not-strongly case.) Note also that the Hessian continuity assumption from PHB’s discrete analysis is not required here; in general, smoothness requirements are thrown away in the continuous-time regime, and seem to be isolated in the discretization stability part.

Finally (and this was a main theme of all 3 posts) although we have pulled out many differences between PHB and NAG (assumptions, provable rates, proof techniques) they still have not illuminated much on the actual method behavior, namely that NAG accelerates more but PHB is less sensitive to parameters. Thus, the search continues…

Thanks to illuminating discussions with my grad students Zhaoyue Chen and Mokhwa Lee, whose excellent paper presentations clarified many of the sticky derivations.

References

  • [1] Polyak, B. T. (1987). Introduction to optimization
  • [2] Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course.
  • [3] Weijie Su, Stephen Boyd, and Emmanuel Candes. “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights.”
  • [4] Andre Wibisono, Ashia Wilson, Michael I. Jordan. (2016). “A variational perspective on accelerated methods in optimization.”
  • [5] Bin Shi, Simon S. Du, Michael I. Jordan, and Weijie J. Su. “Understanding the acceleration phenomenon via high-resolution differential equations.”


Convergence proofs II: (simple) estimate sequences

Post by Yifan Sun

tl;dr: Estimate sequences are a very interesting way of isolating the problem of finding rates over a wide family of unconstrained problems, and can also be used to generate accelerated methods. However, it’s unclear whether they can easily be used on non-NAG-like methods, at least when restricted to quadratic estimate sequences.

The big idea behind estimate sequences

Suppose I have some sequence of functions $\phi_{t}(u)$. These functions represent a bound on some quantity that I wish to drive to $0$. Additionally, I have some mixing parameters $\lambda_t\in (0,1)$, $\lambda_t \to 0$. Then, if

  • (U) $f(x^{(t)}) \leq \min_u \; \phi_t(u)$ for all $t$
  • (L) $\phi_t(x) \leq (1-\lambda_t) f(x) + \lambda_t \phi_0(x)$ for some sequence $\lambda_t\in (0,1)$,

then $$ f(x^{(t)}) \leq \min_u \; \phi_t(u) \leq \phi_t(x^*) \leq (1-\lambda_t) f(x^*) + \lambda_t \phi_0(x^*)$$ which implies $$f(x^{(t)})-f(x^*) \leq \lambda_t (\phi_0(x^*)-f(x^*)).$$

In other words, the convergence rate of $f(x^{(t)})$ is upper bounded by that of $\lambda_t.$ In essence, the idea behind estimate sequences is to push the job of computing rates to something agnostic of function characteristics: in this case, the decaying sequence $\lambda_t$. In theory, this gives us a huge benefit in terms of generality; if I can prove that a family of functions $\phi_t$ (called an estimate sequence, or a potential function) satisfies my upper and lower bounding conditions, then all I have to do is to recompute the $\lambda_t$s for each new method, under each new assumption, and get my rate.

Quadratic estimate sequences

There are many possibilities for estimate sequences, but the simplest are quadratics, e.g. $$\phi_t(u) = \phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2.$$

Clearly, $\phi_t^* = \min_u\; \phi_t(u)$, with minimizer $u =v^{(t)}$. To be more precise, Nesterov [1] defines these functions recursively: $$\phi_{t+1}(u) := (1-\omega_t) \phi_t(u) + \omega_t \left( f(y^{(t)}) + \nabla f(y^{(t)})^T (u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2\right)$$ where $\omega_t\in (0,1)$ may be user-designed and $\mu$ is the strong convexity parameter of $f$. Then, using induction, the minimizer over $$\phi_{t+1}(u) = (1-\omega_t) \underbrace{\left(\phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2\right)}_{\phi_t(u)} + \omega_t \left( f(y^{(t)}) + \nabla f(y^{(t)})^T (u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2\right)$$ can be found by setting the gradient w.r.t. $u$ to $0,$ arriving at $$((1-\omega_t)\gamma_t + \omega_t \mu) v^{(t+1)} = (1-\omega_t)\gamma_t v^{(t)} + \omega_t \mu y^{(t)}-\omega_t\nabla f(y^{(t)}). \qquad (\otimes). $$ Noting that $$\|u-z\|_2^2 – \|v^{(t+1)}-z\|_2^2 = \|u-v^{(t+1)}\|_2^2 + 2(u-v^{(t+1)})^T(v^{(t+1)}-z)$$ we see that $$\phi_{t+1}(u) – \phi_{t+1}^* = \frac{(1-\omega_t) \gamma_t + \omega_t \mu}{2} \|u-v^{(t+1)}\|_2^2$$ which gives the sequence $\gamma_{t+1} = (1-\omega_t) \gamma_t + \omega_t \mu$.

Verify lower bound property

In fact, the lower bounding condition (L) immediately pops out whenever $\phi_{t+1}$ is mixed with $\phi_t$ and a function that lower bounds $f(u)$ $$\phi_{t+1}(u) := (1-\omega_t) \phi_t(u) + \omega_t f_L(u), \qquad f_L(u) \leq f(u) \; \forall u$$ and the sequence $$\lambda_0 = 1, \qquad \lambda_{t+1} = (1-\omega_t) \lambda_t.$$ In this case, we use $$ f_L(u) = f(y^{(t)}) + \nabla f(y^{(t)})^T(u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2,$$ but you can imagine inserting other lower bounds here instead.  The bulk of the “work” in applying these sequences to methods, then, rests in verifying the upper bounding condition (U).

Vanilla gradient descent

The rule of thumb for picking $y^{(t)}$ is that it is the thing of which you take your gradient in the method. In vanilla gradient descent, there is only one variable, so $y^{(t)} = x^{(t)}$.

Going back to $(\otimes)$, and noting that $x^{(0)} = v^{(0)}$, we see that for $t = 1$, $$\gamma_{1} v^{(1)} = \gamma_1 x^{(0)} - \omega_0 \nabla f(x^{(0)}), $$ which, picking $\omega_0 = \alpha \gamma_1$, assigns $v^{(1)} = x^{(1)}$. In fact, more generally, assigning $$\omega_t = \alpha \gamma_{t+1} \Rightarrow v^{(t+1)} = x^{(t+1)}. $$ Thus, the recursion for the estimate sequences boils down to $$\phi_{t+1}(x^{(t+1)}) = (1-\omega_t) \phi_t^* + \omega_t f(x^{(t)}) + \gamma_{t+1}\left(\alpha \nabla f(x^{(t)})^T(x^{(t+1)}-x^{(t)}) + \frac{1}{2}\|x^{(t+1)}-x^{(t)}\|_2^2\right). \qquad (\star)$$

Assume that for some $t$, $f(x^{(t)}) \leq \phi_t^*$. Then $(\star)$ reduces to the inequality

\begin{eqnarray*}\phi_{t+1}(x^{(t+1)}) &\geq & f(x^{(t)}) + \gamma_{t+1}\left( \alpha \nabla f(x^{(t)})^T(x^{(t+1)}-x^{(t)}) + \frac{1}{2}\|x^{(t+1)}-x^{(t)}\|_2^2\right)\\&= &f(x^{(t)}) - \frac{\gamma_{t+1}\alpha^2}{2}\| \nabla f(x^{(t)})\|_2^2\end{eqnarray*}

and as long as $$\frac{\gamma_{t+1} \alpha^2}{2} \leq \alpha \left(1-\frac{L\alpha}{2}\right) \iff \alpha \leq \frac{2-\omega_t}{L},$$ we can invoke the descent lemma, which concludes $$f(x^{(t+1)}) \leq f(x^{(t)}) - \alpha\left(1-\frac{L\alpha}{2}\right)\|\nabla f(x^{(t)})\|_2^2. $$ Since $\phi_{t+1}$ is minimized at $v^{(t+1)} = x^{(t+1)}$, we have shown the inductive step $f(x^{(t+1)}) \leq \phi_{t+1}(x^{(t+1)}) = \phi^*_{t+1}$. Therefore the upper bounding property is satisfied, and thus the convergence of $f(x^{(t)}) \to f^*$ in VGD depends only on the rate of decay of $\lambda_t$. The actual derivation of the rate is messy, so I moved it to the Appendix.
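
The whole VGD bookkeeping fits in a few lines; here is a sketch that tracks $\gamma_t$ and $\phi_t^*$ along the iterates on a made-up quadratic and verifies (U) at every step (with the assumed choices $\gamma_0 = L$ and $\alpha = 1/L$):

```python
import numpy as np

# Track the quadratic estimate sequence along VGD and check (U).
mu, L, T = 0.5, 10.0, 50
Q = np.diag([mu, L])
f = lambda x: 0.5 * x @ Q @ x            # f* = 0
x = np.array([1.0, 1.0])
alpha, gamma, phi = 1.0 / L, L, f(x)     # phi*_0 = f(x_0), so (U) holds at t=0

for t in range(T):
    gamma_next = gamma / (1 + alpha * (gamma - mu))  # enforces w = alpha*gamma_{t+1}
    w = alpha * gamma_next
    g = Q @ x
    phi = (1 - w) * phi + w * f(x) - w**2 / (2 * gamma_next) * (g @ g)
    x = x - alpha * g                    # VGD step (here y_t = x_t = v_t)
    gamma = gamma_next
    assert f(x) <= phi + 1e-12           # the upper bounding property (U)
print("(U) held for all", T, "steps")
```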

Nesterov’s accelerated method

The way we get to Nesterov’s accelerated method is to provide some generalizations in this scheme. First, we will not fix $v^{(t)}$ nor $y^{(t)}$, and just allow them to float. Then, Lemma 2.2.3 in [1] directly gives the following

\begin{eqnarray*}\gamma_{t+1} &=& (1-\omega_t) \gamma_t + \omega_t \mu\\ v^{(t+1)} &=& \frac{1}{\gamma_{t+1}} ((1-\omega_t) \gamma_t v^{(t)} + \omega_t \mu y^{(t)} - \omega_t \nabla f(y^{(t)}))\\ \phi^*_{t+1} &=& (1-\omega_t) \phi^*_t + \omega_t f(y^{(t)}) - \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2 \\ &&\qquad + \frac{\omega_t(1-\omega_t)\gamma_t}{\gamma_{t+1}}\left(\frac{\mu}{2}\|y^{(t)}-v^{(t)}\|_2^2 + \nabla f(y^{(t)})^T(v^{(t)}-y^{(t)})\right). \end{eqnarray*}

It can be shown that this is the unique and specific definitions of $\gamma_t$, $v^{(t)}$, and $\phi_t^*$ which preserves the “canonical form” $$\phi_t(u) =\phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2.$$

Nesterov’s method pops out by enforcing the  upper bounding property. Using again a recursive argument, we assume that $\phi^*_t\geq f(x^{(t)})$. We can use convexity to “transfer” the $f(x^{(t)})$ to $f(y^{(t)})$: $$ f(x^{(t)}) \geq f(y^{(t)}) + \nabla f(y^{(t)})^T(x^{(t)}-y^{(t)}) $$ and after simplifying, we have

\begin{eqnarray*} \phi^*_{t+1}&\geq & f(y^{(t)}) + \underbrace{(1-\omega_t) \nabla f(y^{(t)})^T(x^{(t)}-y^{(t)}+\frac{\omega_t\gamma_t}{\gamma_{t+1}}(v^{(t)}-y^{(t)}))}_{=A} \\ &&\qquad - \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2 + \underbrace{\frac{\omega_t(1-\omega_t)\gamma_t}{\gamma_{t+1}}\left(\frac{\mu}{2}\|y^{(t)}-v^{(t)}\|_2^2 \right)}_{=B}. \end{eqnarray*}

Term $B$ can be dropped. Nesterov’s method then comes after forcing $A = 0$, i.e. defining  $$x^{(t)}-y^{(t)}+\frac{\omega_t\gamma_t}{\gamma_{t+1}}(v^{(t)}-y^{(t)}) = 0. $$ It is left to force $$f(x^{(t+1)}) \leq f(y^{(t)}) - \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2,$$ which can be done by assigning $$x^{(t+1)} = y^{(t)} - \alpha_t \nabla f(y^{(t)})$$ where $\alpha_t = \frac{\omega_t^2}{\gamma_{t+1}}$.


The rest is (somewhat painful) simplification, where $\gamma_t$ and $v^{(t)}$ can be solved out. In particular, when $\mu > 0$, we can choose $\omega_0 = \sqrt{\mu/L}$, $\gamma_0 = \mu$ and the constant step size version reduces to the recursion

\begin{eqnarray*} x^{(t+1)} &=& y^{(t)} - \frac{1}{L} \nabla f(y^{(t)})\\ y^{(t+1)} &=& x^{(t+1)} + \beta (x^{(t+1)}-x^{(t)}) \end{eqnarray*}

for $\beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}.$

Leftover thoughts

At first glance, it’s hard to not see this whole thing as a glorified homework assignment. However, looking more deeply into recent literature, it is clear that estimate sequences play an important role both in giving rates and in designing more accelerated methods. The term “estimate sequence” is used synonymously with “potential function”, and suggests that we should view $\phi_t$ as a sequence of Lyapunov functions, or energy-like functions.

Generalizations of quadratic estimate sequences

The most straightforward generalization of quadratic estimate sequences is to extend to generalized norms $(\|z\|_2 \to \sqrt{z^THz})$, or to general Bregman divergences. The error term itself can also be generalized; here we study $f(x^{(t)}) – f^*$, but both the gradient norm and the variable distance to optimality are also valid errors to bound. The choice of the estimate sequence should inherently depend on the problem geometry; quadratic arises naturally from the $L$-smooth $\mu$-strongly convex assumptions, since they are always with respect to the Euclidean norm; however, if they are taken w.r.t. other norms, then other geometries may be more useful.

Use in automatic proof generators

One of the biggest appeals of estimate sequence analysis is that it is a step toward automating the process of providing convergence proofs. Although there are many knobs to twiddle, somehow the way that they are twiddled is fairly well-defined. This problem space has no doubt been the “holy grail” of convergence-proof-writers since the beginning of time, with many interesting contributions to date.

Appendix: Rate derivations

Vanilla gradient descent

Since we have assigned $$\gamma_{t+1} = (1-\omega_t)\gamma_t + \mu \omega_t = \frac{\omega_t}{\alpha},$$ then $$\omega_t = (1-\omega_t)\omega_{t-1} + \mu \alpha\omega_t \iff 1- \omega_t = \frac{1-\alpha\mu}{1 + \omega_{t-1} – \alpha \mu}.$$ Now let us look at the case of $f$ strongly-convex $(\mu > 0)$ and convex-not-strongly $(\mu = 0)$ separately.

  • Strongly convex case: If $\mu > 0$, we take $ \omega_0 > \mu/L$ and $\alpha = 2/L$. Via recursion, we have  $$ 1- \omega_t < \frac{1-2(\mu/L)}{1 + (\mu/L) - 2(\mu /L)} = 1-\frac{(\mu/L)}{1-(\mu/L)} < 1-\frac{\mu}{L}. $$ This gives a linear convergence rate of $\lambda_t \leq (1-\frac{\mu}{L})^t$.
  • Convex, not strongly: If instead $\mu = 0$ then $1-\omega_t = \frac{1}{1+\omega_{t-1}}$. Then telescoping gives $$
    \frac{1}{\omega_t} = \frac{1}{\omega_{t-1}} + 1 \quad \overset{\text{telescope}}{\Rightarrow} \quad \frac{1}{\omega_t}-\frac{1}{\omega_0} = t \quad \iff \quad (1-\omega_t) = \frac{1+(t-1)\omega_0}{t\omega_0 + 1}
    $$
    and telescoping again, multiplicatively,
    $$
    \lambda_t = \prod_{\tau=1}^t (1-\omega_\tau) = \prod_{\tau=1}^t \frac{1+(\tau-1) \omega_0}{\tau \omega_0 +1} = \frac{1}{t \omega_0+1}
    = O(1/t).
    $$

Nesterov’s accelerated method

We see that the recursion on $\omega_t$ is $$L\omega_t^2 = (1-\omega_t) L\omega_{t-1}^2 + \omega_t \mu.$$

  • Strongly convex case: Assume constant step size $\alpha_t = 1/L$, and suppose that $\omega_0 \geq \sqrt{\mu/L}$. Then, recursively, $$L\omega_t^2 \geq \mu \iff \omega_t \geq \sqrt{\mu/L} \quad\Rightarrow\quad \lambda_t \leq \left(1-\sqrt{\frac{\mu}{L}}\right)^t$$
  • Convex, not strongly: In the case that $\mu = 0$, we have $$\lambda_{t+1} = \frac{\omega_t^2}{\omega_{t-1}^2} \lambda_t = \frac{\omega_t^2}{\omega_0^2}.$$

Then,  $$\frac{1}{\sqrt{\lambda_{t+1}}} - \frac{1}{\sqrt{\lambda_t}} = \frac{\lambda_t-\lambda_{t+1}}{\sqrt{\lambda_t\lambda_{t+1}}(\sqrt{\lambda_t}+\sqrt{\lambda_{t+1}})} \overset{\lambda \text{ decreasing}}{\geq}\frac{\lambda_t-\lambda_{t+1}}{2\lambda_t\sqrt{\lambda_{t+1}}} \overset{\lambda_{t+1}=(1-\omega_t)\lambda_t}{=}\frac{\omega_t}{2\sqrt{\lambda_{t+1}}} \geq \frac{\omega_0}{2}. $$

Telescoping and rearranging (using $\omega_0 = \sqrt{\gamma_0/L}$) give $$ \frac{1}{\sqrt{\lambda_t}}-1 \geq \frac{t\sqrt{\gamma_0}}{2\sqrt{L}} \iff \lambda_t \leq \frac{4L}{(2\sqrt{L}+\sqrt{\gamma_0} t)^2} = O(L/t^2).$$


References

  • [1] Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course.

Convergence proofs I: quadratic approximations

Post by Yifan Sun, Jan 8, 2021

Convergence proofs are the bane of every optimizer’s existence. They are tricky, extremely long, and mostly unintuitive. Additionally, they tend to be tribal; there are “families” of convergence proof techniques that seem to cover a particular type of method under a particular class of functions, and while efforts are made to make them more “inclusive”, long story short it rarely works.

In the next couple posts, I will investigate some major convergence proof techniques on three simple methods:

  • Vanilla gradient descent (VGD)
    $$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)})$$
  • Polyak’s heavy ball method (PHB)
    $$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)}) +\beta (x^{(t)}-x^{(t-1)})$$
  • Nesterov’s accelerated method (NAG) $$x^{(t+1)} = x^{(t)}- \alpha \nabla f(x^{(t)}) +\beta (x^{(t)}-x^{(t-1)}) – \beta \alpha (\nabla f (x^{(t)})-\nabla f(x^{(t-1)}))$$
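
For concreteness, here is a single loop implementing all three updates on a made-up strongly convex quadratic, using standard constant parameter choices (the ones derived later in this post):

```python
import numpy as np

# VGD, PHB, and NAG side by side on f(x) = (1/2) x^T Q x, x* = 0.
mu, L, T = 0.1, 10.0, 200
Q = np.diag(np.linspace(mu, L, 20))
df = lambda x: Q @ x                   # gradient

def run(alpha, beta, nag):
    x, xp = np.ones(20), np.ones(20)   # x^{(t)} and x^{(t-1)}
    for _ in range(T):
        step = -alpha * df(x) + beta * (x - xp)
        if nag:
            step -= beta * alpha * (df(x) - df(xp))
        x, xp = x + step, x
    return np.linalg.norm(x)

beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
print("VGD", run(2 / (L + mu), 0.0, False))
print("PHB", run(4 / (np.sqrt(L) + np.sqrt(mu)) ** 2, beta**2, False))
print("NAG", run(1 / L, beta, True))   # both accelerated methods win big
```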

In this post, I’m restricting my attention to just constant step size versions, i.e. where $\alpha$ and $\beta$ are held constant. When the function $f$ is convex but not strongly convex, changing step sizes can often improve rates, but for strongly convex objectives, at least in the case of VGD and NAG, optimal convergence is already reached with a clever choice of constant step size.

tl;dr: While the quadratic approximation approach is elegant and beautiful, it does not extend to general $L$-smooth and strongly convex functions beyond gradient descent.

Quadratic minimization

In the quadratic approximation framework, we restrict our attention to the problem $$ \min_{x} \quad f(x):=\tfrac{1}{2}x^TQx \qquad \text{(QUAD)} $$ where $Q$ is symmetric positive semidefinite and has smallest nonzero eigenvalue $\lambda_{\min} = \mu$ and largest eigenvalue $\lambda_{\max} = L$. Then, knowing that $x^* = 0$, our problem is reduced to that of bounding $\|x^{(t)}\|_2$.

Gradient descent

To show linear convergence, VGD simply runs the contraction
$$
\|x^{(t)}\|_2 = \|(I-\alpha Q)x^{(t-1)}\|_2 = \|(I-\alpha Q)^t x^{(0)}\|_2,
$$ which gives an overall rate of $$
\|x^{(t)}\|_2 \leq \rho(I-\alpha Q)^t\|x^{(0)}\|_2, $$ where $\rho(M)$ is the spectral radius of the symmetric matrix $M$. Note that the eigenvalues of $M = I-\alpha Q$ fall in the interval $[1-\alpha L, 1-\alpha\mu]$. Therefore, $$\rho(M) = \max\{|1-\alpha \mu|, |1-\alpha L|\}$$ and is minimized when $$|1-\alpha \mu| = |1-\alpha L| \iff 1-\alpha \mu = \alpha L -1\iff \alpha = \frac{2}{L+\mu}.$$
This results in a linear convergence rate $$\|x^{(t)}-x^*\|_2 \leq \rho^t \|x^{(0)}-x^*\|_2, \qquad \rho = \frac{L-\mu}{L+\mu}.$$

Stability

It is worth questioning what happens if $\alpha$ is chosen wrong. In general, there are several values it can take, but let’s specifically look at cases where $\rho > 1$, which means the contraction step doesn’t hold. This happens if either $|1-\alpha \mu| >1 $ or $|1-\alpha L| > 1$, corresponding to $$ \alpha < 0\quad \text{ or }\quad \alpha \mu > 2 \quad\text{ or } \quad\alpha L > 2.$$ Since $L > \mu$, this implies that we require $$
0 < \alpha < \frac{2}{L} $$ to guarantee stability.

Polyak’s heavy ball method (PHB)

The analysis given by Polyak [1] is also based on this quadratic approximation framework. The simple extension is done by splitting the evolution on $x$ to two variables; for example, via
$$\left[ \begin{matrix}
x^{(t+1)}\\
x^{(t)}\end{matrix}\right] = \underbrace{\left[ \begin{matrix} I - \alpha Q + \beta I & - \beta I\\ I & 0
\end{matrix}\right]}_{=T} \left[ \begin{matrix} x^{(t)}\\ x^{(t-1)} \end{matrix}\right]$$ Note that the matrix $T$ is not symmetric, but we can still discuss its eigenvalues. In fact, considering the extreme and illustrative case where $Q = \mathbf{diag}(\mu, L)$, the matrix can be rearranged to be $$ T = \left[ \begin{matrix} 1 - \alpha L + \beta & 0 & -\beta & 0 \\ 0 & 1 - \alpha \mu + \beta & 0 & -\beta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{matrix}\right]= \Pi\left[ \begin{matrix} 1 - \alpha L + \beta & -\beta &0& 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0& 1 - \alpha \mu + \beta & -\beta \\ 0 & 0 &1 & 0 \end{matrix}\right]\Pi^T $$ where the eigenvalues of the two diagonal blocks are $$ \lambda_{i,j} = \frac{1-\alpha d + \beta \pm \sqrt{(1-\alpha d + \beta)^2-4\beta}}{2}$$ for $d \in \{\mu,L\}$. To see what the optimal choices of $\alpha$ and $\beta$ are, we see that the radius of convergence must be
$$ \rho(T) = \max\left\{ \frac{\alpha L -1- \beta + \sqrt{(1-\alpha L + \beta)^2-4\beta}}{2}, \frac{1-\alpha \mu + \beta + \sqrt{(1-\alpha \mu + \beta)^2-4\beta}}{2}\right\}. $$ The region of stability is when $\rho(T) < 1$, and the optimum (minimum) value of $\rho(T)$ is achieved when the two terms in the max are equal, which forces $$ \alpha = \frac{2(\beta+1)}{L+\mu}.$$ Then minimizing for $\beta$, we arrive at the optimal parameter choices:
$$\alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}, \quad \beta = \left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2, \quad \rho = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$$
which matches the results given in [1], Ch. 3 Th. 1.
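These choices are easy to verify numerically. Below is a small sketch of my own (constants made up; only numpy assumed) that builds the $4\times 4$ matrix $T$ for $Q = \mathbf{diag}(\mu, L)$ and checks that no grid point beats the analytic optimum:

```python
import numpy as np

mu, L = 1.0, 100.0
Q = np.diag([mu, L])

def phb_radius(alpha, beta):
    # spectral radius of the PHB iteration matrix T
    I = np.eye(2)
    T = np.block([[I - alpha * Q + beta * I, -beta * I],
                  [I, np.zeros((2, 2))]])
    return max(abs(np.linalg.eigvals(T)))

a_star = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
b_star = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
rho_star = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
print(phb_radius(a_star, b_star), rho_star)    # the two should agree

grid = [(a, b) for a in np.linspace(0.001, 0.05, 50)
               for b in np.linspace(0.0, 0.99, 50)]
print(min(phb_radius(a, b) for a, b in grid) >= rho_star - 1e-9)   # True
```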

NAG with constant step size

For strongly convex functions $f$, Nesterov’s accelerated method [2] iterates with the specific choice of $$\alpha = \frac{1}{L}, \qquad \beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\in (0,1). $$ Using the same bag of tricks, we can get
$$\left[\begin{matrix}
x^{(t+1)}\\
x^{(t)} \end{matrix}\right] = \underbrace{\left[\begin{matrix}
(1+\beta)(I - \alpha Q) & \beta (\alpha Q-I) \\ I & 0
\end{matrix}\right]}_{=T} \left[\begin{matrix} x^{(t)}\\ x^{(t-1)} \end{matrix}\right] $$ which reorganizes into a block diagonal matrix with each diagonal block as $$ T_k = \left[\begin{matrix} (1+\beta)(1-\alpha \gamma_k) & \beta(\alpha \gamma_k -1) \\ 1 & 0 \end{matrix}\right] $$ where $\gamma_k$ are the eigenvalues of $Q$, and lie in the range $[\mu,L]$. The eigenvalues of $T_k$ are therefore $$ \lambda_{k,i} = \frac{(1+\beta)(1-\alpha \gamma_k) \pm \sqrt{(1+\beta)^2(1-\alpha \gamma_k)^2 + 4\beta (\alpha \gamma_k-1)}}{2} $$ which, for the recommended choice of $\alpha$ and $\beta$, can be shown to satisfy $|\lambda_{k,i}| \leq \rho = 1-\sqrt{\frac{\mu}{L}}$.
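Again, easy to check numerically. A small sketch of my own (numpy only, constants made up), sweeping $\gamma$ over $[\mu, L]$ for the prescribed $\alpha$ and $\beta$:

```python
import numpy as np

mu, L = 1.0, 100.0
alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
bound = 1.0 - np.sqrt(mu / L)                  # claimed bound on the block eigenvalues

worst = 0.0
for gamma in np.linspace(mu, L, 1001):         # sweep the spectrum of Q
    Tk = np.array([[(1 + beta) * (1 - alpha * gamma), beta * (alpha * gamma - 1)],
                   [1.0, 0.0]])
    worst = max(worst, max(abs(np.linalg.eigvals(Tk))))
print(f"max block eigenvalue magnitude {worst:.4f} vs bound {bound:.4f}")  # both 0.9
```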

Stability of PHB and NAG

In all the methods, convergence, and even stability, depend on $\alpha$ and $\beta$ lying in some prescribed range. This is unsurprising, since even running a simple method like gradient descent on $f(x) = |x|$ will never converge for any fixed step size. So, implicit in all the proofs is that the function $f$ is $L$-smooth. The convergence guarantees for NAG and VGD do not depend on more than $L$-smoothness and $\mu$-strong convexity, and it can be shown that PHB achieves $O(1-\mu/L)$ rates with these assumptions alone. However, it is known that second-order (Hessian) smoothness is required for PHB to achieve $O(1-\sqrt{\mu/L})$ rates in general. This is of course trivial for quadratic objectives.
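Here’s that $f(x) = |x|$ example concretely (my own toy numbers): with any fixed step $\alpha$, the iterates eventually just bounce between two points a distance $\alpha$ apart.

```python
import numpy as np

# gradient descent on f(x) = |x|, gradient sign(x), fixed step
x, alpha = 0.3, 0.25
for _ in range(6):
    x -= alpha * np.sign(x)
    print(round(x, 3))                         # 0.05, -0.2, 0.05, -0.2, ...
```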

General strongly convex functions

Now we generalize our attention to minimizing unconstrained objectives under the constraint that, for all $x$ and $y$, $$\frac{\mu}{2} \|x-y\|_2^2 \leq f(x)-f(y) - \nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2.$$ If the Hessian exists, this is equivalent to the bound
$$\mu I \preceq \nabla^2 f(x) \preceq LI, \quad \forall x.$$

Gradient descent

In the case of gradient descent,
\begin{eqnarray*} \|x^{(t+1)}-x^*\|_2^2 &=&
\|\left(x^{(t)}-\alpha \nabla f(x^{(t)})\right)-\left(x^* - \alpha \nabla f(x^*)\right)\|_2^2 \\ &\overset{(*)}{=}&  \|x^{(t)}-x^*\|_2^2 -2\alpha \underbrace{ (\nabla f(x^{(t)})-\nabla f(x^*))^T(x^{(t)}-x^*)}_{\geq \mu \|x^{(t)}-x^*\|_2^2} + \alpha^2 \underbrace{\| \nabla f(x^{(t)})-\nabla f(x^*)\|_2^2}_{\leq L^2\|x^{(t)}-x^*\|_2^2}\\
&\leq & (1-2\alpha \mu+\alpha^2L^2)\|x^{(t)}-x^*\|_2^2\end{eqnarray*} so $\rho^2 = 1-2\alpha \mu+\alpha^2L^2$, which, minimized over $\alpha$ (at $\alpha = \mu/L^2$), gives $\rho^2 = 1-\mu^2/L^2$. Of course we know this is not the optimal choice of $\alpha$ nor the optimal $\rho$. To actually get there, we need to use a special relationship from [2] Thm. 2.1.12
\begin{equation}
(\nabla f(x)-\nabla f(y))^T(x-y) \geq \frac{\mu L}{\mu + L} \|x-y\|_2^2 + \frac{1}{\mu + L}\|\nabla f(x)-\nabla f(y)\|_2^2
\qquad (\triangle)
\end{equation}
which is not intuitively obvious, given our exercise with quadratics. From this, we go from $(*)$ in the above derivation to \begin{eqnarray}
\|x^{(t+1)}-x^*\|_2^2 &\leq& \left (1- \frac{2\alpha\mu L}{\mu + L}\right)\|x^{(t)}-x^*\|_2^2 + \underbrace{\left(\alpha^2-\frac{2\alpha}{\mu + L}\right)}_{\text{require } \alpha \geq 2/(\mu+L)} \underbrace{\| \nabla f(x^{(t)})-\nabla f(x^*)\|_2^2}_{\leq L^2\|x^{(t)}-x^*\|_2^2}\\
&\leq & \underbrace{\left(1- 2\alpha L+ \alpha^2 L^2\right)}_{(\rho(\alpha))^2}\|x^{(t)}-x^*\|_2^2 \end{eqnarray}
The function $(\rho(\alpha))^2 = (1-\alpha L)^2$ is quadratic in $\alpha$ and minimized at $\alpha = 1/L$; but since $\mu < L$ implies $1/L < 2/(\mu+L)$, that choice is outside the viable range, and so we pick $\alpha = 2/(\mu+L)$, the viable candidate closest to the minimizer. This gives a final radius of $\rho = \frac{L-\mu}{L+\mu}$, as desired. Note that here, no discussion of the Hessian is necessary; the problem may not even be twice-differentiable.
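Since $(\triangle)$ is not so intuitive, here’s a quick brute-force check of it on a random quadratic; a sketch of my own (for quadratics, $(\triangle)$ follows from $(\triangle’)$ below, so this is purely a sanity test):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, L = 10, 0.5, 8.0
lams = np.linspace(mu, L, n)
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q = V @ np.diag(lams) @ V.T                    # spec(Q) in [mu, L]
grad = lambda x: Q @ x

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = (grad(x) - grad(y)) @ (x - y)
    rhs = (mu * L / (mu + L)) * np.sum((x - y) ** 2) \
        + (1.0 / (mu + L)) * np.sum((grad(x) - grad(y)) ** 2)
    assert lhs >= rhs - 1e-9                   # (triangle) holds
print("(triangle) verified on 1000 random pairs")
```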

Tight quadratic bound

Before moving on, it is worth staring a bit further at $(\triangle)$, and seeing why this is such a better bound than the ones we tried to use rather blindly. Take again the quadratic example, with $\nabla f(x) = Qx$ and $z = x-y$. Then the bounds we tried to use the first time are just
$$z^TQz \geq \mu z^Tz \text{ and } z^TQ^TQz \leq L^2 z^Tz. \qquad (\square)$$
In fact, $(\triangle)$ reduces to
$$ (\mu+L)z^TQz \geq \mu L z^Tz + z^TQ^TQz$$
which, along each eigenvector of $Q$ with eigenvalue $\lambda$, in fact establishes the bound
$$\lambda^2 - (\mu + L) \lambda + \mu L \leq 0 \iff
(\lambda - \mu)(\lambda- L) \leq 0. \qquad (\triangle’)$$
To understand why the second bound is tighter, assume that $z$ is an eigenvector of $Q$, with $Qz = \lambda z$. Then if $\lambda = \mu$, the second bound in $(\square)$ is loose; conversely, if $\lambda = L$, the first in $(\square)$ is loose. In other words, using the bounds in $(\square)$, depending on what $x^{(t)}-x^*$ is, at least one of the two must be loose. However, by combining them into the single inequality $(\triangle’)$, the two factors “multiply the looseness”. While it is still possible that the inequality is not tight, there exist at least two values of $z$ (the extreme eigenvectors) where it is satisfied with equality.
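A scalar illustration of this looseness bookkeeping (my own numbers): along an eigendirection with eigenvalue $\lambda$, the slack of each bound in $(\square)$ and of $(\triangle’)$ is just a polynomial in $\lambda$.

```python
mu, L = 0.5, 8.0
for lam in (mu, L):
    s1 = lam - mu                              # slack of  z'Qz >= mu z'z
    s2 = L**2 - lam**2                         # slack of  ||Qz||^2 <= L^2 z'z
    tri = (mu + L) * lam - mu * L - lam**2     # slack of (triangle')
    print(f"lam={lam}: square slacks ({s1:.2f}, {s2:.2f}), triangle slack {tri:.2f}")
# at lam = mu and lam = L the (triangle') slack is exactly 0, while one of
# the (square) bounds is badly loose each time
```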

Polyak’s heavy ball method (PHB)

Applying the same basket of tricks gets us the same rate as gradient descent. We don’t need 2nd order smoothness, but we can’t get optimal rates. In particular, from the previous section we already have that, for $\alpha \geq 2/(\mu+L)$,
$$\|x^{(t)} - \alpha \nabla f(x^{(t)}) -x^*\|_2 \leq ( \alpha L-1) \|x^{(t)} -x^*\|_2$$ So we can do
\begin{eqnarray} \|x^{(t+1)}-x^*\|_2 &\leq& (1+\beta) \|x^{(t)} – \frac{\alpha}{1+\beta} \nabla f(x^{(t)}) -x^*\|_2 + \beta \|x^{(t-1)}-x^*\|_2 \\
&\leq& (\alpha L-1-\beta) \|x^{(t)} -x^*\|_2 + \beta \|x^{(t-1)}-x^*\|_2
\end{eqnarray} or, in matrix form (componentwise), $$ \left[\begin{matrix}\|x^{(t+1)}-x^*\|_2\\
\|x^{(t)}-x^*\|_2 \end{matrix}\right] \leq \underbrace{ \left[\begin{matrix}
\alpha L - 1 - \beta & \beta\\
1 & 0
\end{matrix}\right]}_{T}
 \left[\begin{matrix}
\|x^{(t)}-x^* \|_2\\
\|x^{(t-1)}-x^* \|_2
\end{matrix}\right] $$ However, though this is very similar to the recursion created in the previous section, the sign flip on the upper-right $\beta$ is significant; the roots
$$ \rho(T) = \max\left\{
\frac{\alpha L -1- \beta + \sqrt{(1-\alpha L + \beta)^2+4\beta}}{2},
\frac{1-\alpha \mu + \beta + \sqrt{(1-\alpha \mu + \beta)^2+4\beta}}{2}\right\}$$
are minimized, as before, when the two terms are equal, which forces
$$\alpha = \frac{2(\beta+1)}{L+\mu}.$$
Plugging this in, both roots become $\frac{(1+\beta)\rho_0 + \sqrt{(1+\beta)^2\rho_0^2+4\beta}}{2}$ with $\rho_0 = \frac{L-\mu}{L+\mu}$, which only grows with $\beta$. The minimum is therefore at $\beta = 0$, which corresponds to the vanilla gradient method.
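A quick numerical look at this recursion matrix (my own sketch; $\mu, L$ made up), using the equalizing $\alpha = 2(\beta+1)/(L+\mu)$:

```python
import numpy as np

mu, L = 1.0, 10.0

def norm_recursion_radius(beta):
    alpha = 2.0 * (beta + 1.0) / (L + mu)
    T = np.array([[alpha * L - 1.0 - beta, beta], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(T)))

for beta in (0.0, 0.2, 0.5, 0.8):
    print(f"beta={beta}: radius {norm_recursion_radius(beta):.3f}")
# the radius only grows with beta; beta = 0 recovers the VGD rate (L-mu)/(L+mu)
```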

NAG with constant step size

We repeat the exercise again with NAG. The iteration unrolls as $x^{(t+1)} = (1+\beta)\left(x^{(t)} - \alpha\nabla f(x^{(t)})\right) - \beta\left(x^{(t-1)} - \alpha\nabla f(x^{(t-1)})\right)$, and a gradient step contracts distances to $x^*$ by the factor $c := \max\{|1-\alpha\mu|, |1-\alpha L|\}$ (for $\alpha \geq 2/(\mu+L)$ this is the $\alpha L - 1$ from the $(\triangle)$-based analysis above; for $\alpha \leq 2/(\mu+L)$, the same argument with the lower bound $\|\nabla f(x)-\nabla f(x^*)\|_2 \geq \mu\|x-x^*\|_2$ gives $1-\alpha\mu$). We arrive at
\begin{eqnarray} \|x^{(t+1)}-x^*\|_2 &\leq& (1+\beta)\|x^{(t)} - \alpha\nabla f(x^{(t)}) -x^*\|_2 + \beta \|x^{(t-1)}- \alpha\nabla f(x^{(t-1)})-x^*\|_2 \\ &\leq& (1+\beta)\, c\, \|x^{(t)} -x^*\|_2 + \beta c \|x^{(t-1)}-x^*\|_2 \end{eqnarray}
which, in matrix form (componentwise), is $$\left[\begin{matrix}
\|x^{(t+1)}-x^* \|_2\\
\|x^{(t)}-x^* \|_2\end{matrix}\right] \leq \underbrace{\left[\begin{matrix}
(1+\beta)c & \beta c\\
1 & 0
\end{matrix}\right]}_{T}
\left[\begin{matrix}
\|x^{(t)}-x^*\|_2\\
\|x^{(t-1)}-x^* \|_2
\end{matrix}\right]$$ Picking the prescribed $\alpha = 1/L$ gives $c = 1-\mu/L$, and a quick computation (below) shows that the spectral radius of $T$ reaches and then exceeds 1 already at moderate condition numbers $L/\mu$; once again, no contraction can be established this way.
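Here’s that quick computation, a sketch of my own sweeping a few condition numbers $\kappa = L/\mu$:

```python
import numpy as np

# spectral radius of the NAG norm-recursion matrix at alpha = 1/L,
# with per-step factor c = 1 - mu/L
for kappa in (2.0, 4.0, 25.0, 100.0):
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    c = 1.0 - 1.0 / kappa
    T = np.array([[(1.0 + beta) * c, beta * c], [1.0, 0.0]])
    print(kappa, max(abs(np.linalg.eigvals(T))))
# radii roughly 0.71, 1.21, 1.93, 2.17 -- past kappa ~ 3 there is no contraction
```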

This leaves us with an open question: can this analysis work with a different choice of step size? But somehow the question isn’t very satisfying, since the main slack between the “function version” and the “quadratic version” is the triangle inequality, which clearly loosens the bound. So, if the bound isn’t tight to begin with, then even if changing parameters improves the bound, it does not suggest a better method in practice.

Addendum

Jan 9, 2021

It is worth being more explicit about when quadratic approximation does apply. In particular, the quadratic approximation applies for $Q = \nabla^2 f(z)$, where $z$ is some vector associated with $x^{(t)}$ and $x^{(t+1)}$. More precisely, Taylor’s theorem tells us that, as long as the Hessians are continuous everywhere, then for each $t$ there always exists a suitable matrix $\nabla^2 f(z)$ (more honestly, the averaged Hessian $\int_0^1 \nabla^2 f(x^* + s(x^{(t)}-x^*))\,ds$, whose spectrum also lies in $[\mu, L]$) where $$\nabla^2 f(z) (x^{(t)}-x^*) = \nabla f(x^{(t)}) - \nabla f(x^*).$$ Under this assumption, we can directly apply the results from the quadratic analysis, which is how Polyak in [1] explains the optimal guarantee for general $L$-smooth $\mu$-strongly convex functions. Specifically, the only additional assumption we need for the PHB method to converge at the optimal rate is that $f$ is second-order differentiable everywhere. Then, for the same optimal choice of $\alpha$ and $\beta$, $$\left[\begin{matrix} x^{(t+1)} -x^*\\ x^{(t)}-x^* \end{matrix}\right] = \left[\begin{matrix} x^{(t)} -\alpha \nabla f(x^{(t)}) + \beta (x^{(t)}-x^{(t-1)})\\ x^{(t)}\end{matrix}\right] -\left[\begin{matrix} x^*\\x^*\end{matrix}\right]=\underbrace{ \left[\begin{matrix} (1+\beta)I - \alpha \nabla^2 f(z) & -\beta I \\ I & 0 \end{matrix}\right] }_{\lambda_{\max}\leq \rho := \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}}\left[\begin{matrix} x^{(t)} -x^*\\ x^{(t-1)}-x^* \end{matrix}\right] $$ which gives the same one-step contraction, and therefore the same overall rate.

The same extension is a little less certain for NAG, since the $Q$ matrix appears twice, and you would need two Hessians to approximate both $\nabla f(x^{(t)})$ and $\nabla f(x^{(t-1)})$. In general, those two Hessians will not have the same eigenvectors, so we cannot apply the same simultaneous diagonalization. That’s a less pressing issue, though, since there are other ways of deriving optimal rates for NAG, which don’t quite work for PHB.

Addendum 2

April 2, 2021

After much discussion, it has been solidified that the first addendum is wrong. The issue here is a mismatch between the maximum eigenvalue magnitude and the spectral norm of an asymmetric matrix. In general, we only know that

$$
\text{max magnitude eigenvalue of $T$} \leq \text{spectral norm of $T$}
$$

and the inequality is only guaranteed to be tight if $T$ is normal (the most useful subset being symmetric matrices). Since our $T$ is not normal, we have to instead ask ourselves not what $\rho(T)$ does for us, but what $\rho(T^TT) = \rho(TT^T)$ (the squared spectral norm) does for us. After much Mathematica pounding, it seems that, in fact, $\rho(TT^T)\not\leq 1$ in general, unless $\mu = L$. In more detail, the eigenvalues of $TT^T$ (for a single $2\times 2$ block $T$) are the solutions to the equation

$$(\beta^2-\lambda )(\lambda -1) + (1 + \beta – \alpha \gamma)^2 \lambda = 0$$

where $\gamma$ is an eigenvalue of the Hessian. The roots are

$$\lambda =  \frac{1}{2} \left(B \pm
\sqrt{B^2 - 4 \beta^2}\right), \qquad B = 2 + 2 \beta + 2 \beta^2 - 2 \alpha \gamma - 2 \alpha \beta \gamma +
\alpha^2 \gamma^2$$

In particular, the positive root has a unique minimizer at $\gamma = \frac{1+\beta}{\alpha}$, where its value is exactly 1. In other words, there is only one value of $\gamma$ for which the spectral norm does not exceed 1, when in fact we would like $\gamma$ to move freely between $\mu$ and $L$.

In other words, this analysis does not extend past quadratic problems. What a bummer! (But hey, it’s pretty neat when this subtle linear algebra stuff really comes into play, isn’t it?)
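To make the mismatch concrete, here’s a little sketch of my own: build the $4\times 4$ PHB matrix $T$ for $Q = \mathbf{diag}(\mu, L)$ with the optimal parameters, and compare its largest eigenvalue magnitude against its spectral norm.

```python
import numpy as np

mu, L = 1.0, 100.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
Q, I = np.diag([mu, L]), np.eye(2)
T = np.block([[(1 + beta) * I - alpha * Q, -beta * I],
              [I, np.zeros((2, 2))]])
print("max |eigenvalue|:", max(abs(np.linalg.eigvals(T))))   # ~0.818 < 1
print("spectral norm   :", np.linalg.norm(T, 2))             # > 1: one step can expand
```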

 

Addendum to addendum: Wait, but even in the case that $L = \mu$, doesn’t this analysis mean that $\|T\|_2 \geq 1$ no matter what I pick for $\alpha$ and $\beta$? How do I reconcile this? Well, this is actually a consequence of (and I had never heard of this until now) another lemma, which basically states that the gap between the max eigenvalue magnitude and the spectral norm disappears when we compound matrices. Basically, while we cannot really say that $\|T\|_2 < 1$, we can say that $$\|T^k\|_2^{1/k}\;\overset{k\to +\infty}{\longrightarrow}\; \max_i |\lambda_i(T)| < 1,$$ so for $k$ large enough, the compounded matrix $T^k$ really is a contraction.
This is a corollary of Gelfand’s formula, which states that $\max_i |\lambda_i(T)| = \lim_{k\to\infty} \|T^k\|^{1/k}$ for any matrix norm. Since in quadratic problems we are indeed compounding the same $T$ each time, we get away with this. On a technicality. -_-‘
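Gelfand in action, on the same $T$ as above (again my own sketch):

```python
import numpy as np

mu, L = 1.0, 100.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
Q, I = np.diag([mu, L]), np.eye(2)
T = np.block([[(1 + beta) * I - alpha * Q, -beta * I],
              [I, np.zeros((2, 2))]])

rho = max(abs(np.linalg.eigvals(T)))           # ~0.818
for k in (1, 5, 20, 100):
    nrm = np.linalg.norm(np.linalg.matrix_power(T, k), 2) ** (1.0 / k)
    print(f"k={k}: ||T^k||^(1/k) = {nrm:.4f}  ->  {rho:.4f}")
```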

Addendum thanks to Sharan and Reza, who really forced me to see the error of my ways.

Thanks to Sharan and Reza and  Adrien for helpful discussions.

Up next: Nesterov’s estimate sequences

References

  • [1] Polyak, B. T. (1987). Introduction to Optimization.
  • [2] Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course.