Convergence proofs II: (simple) estimate sequences

Post by Yifan Sun

tl;dr: Estimate sequences are a very interesting way of isolating the problem of finding rates over a wide family of unconstrained problems, and can also be used to generate accelerated methods. However it’s unclear whether they can easily be used on non-NAG-like methods, at least when restricted to quadratic functions.

The big idea behind estimate sequences

Suppose I have some sequence of functions $\phi_{t}(u)$. These functions represent some bounding function on some aspect that I wish to go to $0$. Additionally, I have some mixing parameters $\lambda_t\in (0,1)$, $\lambda_t \to 0$. Then, if

  • (U) $f(x^{(t)}) \leq \min_u \; \phi_t(u)$ for all $t$
  • (L) $\phi_t(x) \leq (1-\lambda_t) f(x) + \lambda_t \phi_0(x)$ for some sequence $\lambda_t\in (0,1)$,

then $$ f(x^{(t)}) \leq \min_u \; \phi_t(u) \leq \phi_t(x^*) \leq (1-\lambda_t) f(x^*) + \lambda_t \phi_0(x^*)$$ which implies $$f(x^{(t)})-f(x^*) \leq \lambda_t (\phi_0(x^*)-f(x^*)).$$

In other words, the convergence rate of $f(x^{(t)})$ is upper bounded by that of $\lambda_t.$ In essence, the idea behind estimate sequences is to push the job of computing rates to something agnostic of function characteristics–in this case, the decaying sequence $\lambda_t$. In theory, this gives us a huge benefit in terms of generality; if I can prove that a family of functions $\phi_t$ (called an estimate sequence, or a potential function) satisfy my upper and lower bounding conditions, then all I have to do is to recompute the $\lambda_t$s for each new method, under each new assumption, and get my rate.

Quadratic estimate sequences

There are many possibilities for estimate sequences, but the simplest are quadratics, e.g. $$\phi_t(u) = \phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2.$$

Clearly, $\phi_t^* = \min_u\; \phi_t(u)$, with minimizer $u =v^{(t)}$. To be more precise, Nesterov [1] defines these functions recursively: $$\phi_{t+1}(u) := (1-\omega_t) \phi_t(u) + \omega_t \left( f(y^{(t)}) + \nabla f(y^{(t)})^T (u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2\right)$$ where $\omega_t\in (0,1)$ may be user-designed and $\mu$ is the strong convexity parameter of $f$. Then, using induction, the minimizer over $$\phi_{t+1}(u) = (1-\omega_t) \underbrace{\left(\phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2\right)}_{\phi_t(u)} + \omega_t \left( f(y^{(t)}) + \nabla f(y^{(t)})^T (u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2\right)$$ can be found by setting the gradient w.r.t. $u$ to $0,$ arriving at $$((1-\omega_t)\gamma_t + \omega_t \mu) v^{(t+1)} = (1-\omega_t)\gamma_t v^{(t)} + \omega_t \mu y^{(t)}-\omega_t\nabla f(y^{(t)}). \qquad (\otimes). $$ Noting that $$\|u-z\|_2^2 – \|v^{(t+1)}-z\|_2^2 = \|u-v^{(t+1)}\|_2^2 + 2(u-v^{(t+1)})^T(v^{(t+1)}-z)$$ we see that $$\phi_{t+1}(u) – \phi_{t+1}^* = \frac{(1-\omega_t) \gamma_t + \omega_t \mu}{2} \|u-v^{(t+1)}\|_2^2$$ which gives the sequence $\gamma_{t+1} = (1-\omega_t) \gamma_t + \omega_t \mu$.

Verify lower bound property

In fact, the lower bounding condition (L) immediately pops out whenever $\phi_{t+1}$ is mixed with $\phi_t$ and a function that lower bounds $f(u)$ $$\phi_{t+1}(u) := (1-\omega_t) \phi_t(u) + \omega_t f_L(u), \qquad f_L(u) \leq f(u) \; \forall u$$ and the sequence $$\lambda_0 = 1, \qquad \lambda_{t+1} = (1-\omega_t) \lambda_t.$$ In this case, we use $$ f_L(u) = f(y^{(t)}) + \nabla f(y^{(t)})^T(u-y^{(t)}) + \frac{\mu}{2}\|u-y^{(t)}\|_2^2,$$ but you can imagine inserting other lower bounds here instead.  The bulk of the “work” in applying these sequences to methods, then, rests in verifying the upper bonuding condition (U).

Vanilla gradient descent

The rule of thumb for picking $y^{(t)}$ is that it is the thing of which you take your gradient in the method. In vanilla gradient descent, there is only one variable, so $y^{(t)} = x^{(t)}$.

Going back to $(\otimes)$, and noting that $x^{(0)} = v^{(0)}$, we see that for $t = 1$, $$\gamma_{1} v^{(1)} = \gamma_1 x^{(0)} – \omega_0 \nabla f(x^{(0)}) $$ which, picking $\omega_0 = \alpha \gamma_1$, assigns $v^{(1)} = x^{(1)}$. In fact, more generally, assigning $$\omega_t = \alpha \gamma_{t+1} \Rightarrow v^{(t+1)} = x^{(t+1)}. $$ Thus, the recursion for the estimate sequences boils down to $$\phi_{t+1}(x^{(t+1)}) = (1-\omega_t) \phi_t^* + \gamma_{t+1}\left(f(x^{(t)}) + \alpha \nabla f(x^{(t)})^T(x^{(t+1)}-x^{(t)}) + \frac{1}{2}\|x^{(t+1)}-x^{(t)}\|_2^2\right). \qquad (\star)$$

Assume that for some $t$, $f(x^{(t)}) \leq \phi_t^*$. Then $(\star)$ reduces to the inequality

\begin{eqnarray*}\phi_{t+1}(x^{(t+1)}) &\geq & f(x^{(t)}) + \gamma_{t+1}\left( \alpha \nabla f(x^{(t)})^T(x^{(t+1)}-x^{(t)}) + \frac{1}{2}\|x^{(t+1)}-x^{(t)}\|_2^2\right)\\&= &f(x^{(t)}) – \frac{\gamma_{t+1}\alpha^2}{2}\| \nabla f(x^{(t)})\|_2^2\end{eqnarray*}

and as long as $$\frac{\gamma_{t+1} \alpha^2}{2} \leq \alpha \left(1-\frac{L\alpha}{2}\right) \iff \alpha \geq \frac{2-\omega_t}{L},$$ we can invoke the descent lemma concludes $$f(x^{(t+1)}) \leq f(x^{(t)}) – \alpha\left(1-\frac{L\alpha}{2}\right)\|\nabla f(x^{(t)})\|_2^2 $$ and we have shown the inductive step $f(x^{(t+1)}) \leq \phi_{t+1}(x^{(t+1)})$. Therefore the upper bounding property is satisfied, and thus the convergence of $f(x^{(t)}) \to f^*$ in VGD depends only on the rate of decay of $\lambda_t$. The actual derivation of the rate is messy, so I moved it to the Appendix.

Nesterov’s accelerated method

The way we get to Nesterov’s accelerated method is to provide some generalizations in this scheme. First, we will not fix $v^{(t)}$ nor $y^{(t)}$, and just allow them to float. Then, Lemma 2.2.3 in [1] directly gives the following

\begin{eqnarray*}\gamma_{t+1} &=& (1-\omega_t) \gamma_t + \omega_t \mu\\ v^{(t+1)} &=& \frac{1}{\gamma_{t+1}} ((1-\omega_t) \gamma_t v_t + \omega_t \mu y^{(t)} – \omega_t \nabla f(y^{(t)}))\\ \phi^*_{t+1} &=& (1-\omega_t) \phi^*_t + \omega_t f(y^{(t)}) – \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2 \\ &&\qquad + \frac{\omega_t(1-\omega_t)\gamma_t}{\gamma_{t+1}}\left(\frac{\mu}{2}\|y^{(t)}-v^{(t)}\|_2^2 + \nabla f(y^{(t)})^T(v^{(t)}-y^{(t)})\right). \end{eqnarray*}

It can be shown that this is the unique and specific definitions of $\gamma_t$, $v^{(t)}$, and $\phi_t^*$ which preserves the “canonical form” $$\phi_t(u) =\phi_t^* + \frac{\gamma_t}{2}\|u-v^{(t)}\|_2^2.$$

Nesterov’s method pops out by enforcing the  upper bounding property. Using again a recursive argument, we assume that $\phi^*_t\geq f(x^{(t)})$. We can use convexity to “transfer” the $f(x^{(t)})$ to $f(y^{(t)})$: $$ f(x^{(t)}) \geq f(y^{(t)}) + \nabla f(y^{(t)})^T(x^{(t)}-y^{(t)}) $$ and after simplifying, we have

\begin{eqnarray*} \phi^*_{t+1}&\geq & f(y^{(t)}) + \underbrace{(1-\omega_t) \nabla f(y^{(t)})^T(x^{(t)}-y^{(t)}+\frac{\omega_t\gamma_t}{\gamma_{t+1}}(v^{(t)}-y^{(t)}))}_{A} \\ &&\qquad – \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2 + \underbrace{\frac{\omega_t(1-\omega_t)\gamma_t}{\gamma_{t+1}}\left(\frac{\mu}{2}\|y^{(t)}-v^{(t)}\|_2^2 \right)}_{=B}. \end{eqnarray*}

Term $B$ can be dropped. Nesterov’s method then comes after forcing $A = 0$, e.g. defining  $$x^{(t)}-y^{(t)}+\frac{\omega_t\gamma_t}{\gamma_{t+1}}(v^{(t)}-y^{(t)}) = 0. $$ It is left to force $$f(x^{(t+1)}) \leq f(y^{(t)}) – \frac{\omega_t^2}{2\gamma_{t+1}}\|\nabla f(y^{(t)})\|_2^2$$ which can be done by assigning $$x^{(t+1)} = y^{(t)} – \alpha_t \nabla f(y^{(t)})$$ where $\alpha_t = \frac{\omega_t^2}{\gamma_{t+1}}$.

 

The rest is (somewhat painful) simplification, where $\gamma_t$ and $v^{(t)}$ can be solved out. In particular, when $\mu > 0$, we can choose $\omega_0 = \sqrt{\mu/L}$, $\gamma_0 = \mu$ and the constant step size version reduces to the recursion

\begin{eqnarray*} x^{(t+1)} &=& y^{(t)} – \frac{1}{L} \nabla f(y^{(t)})\\ y^{(t+1)} &=& x^{(t+1)} + \beta (x^{(t+1)}-x^{(t)}) \end{eqnarray*}

for $\beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}.$

Leftover thoughts

At first glance, it’s hard to not see this whole thing as a glorified homework assignment. However, looking more deeply into recent literature, it is clear that estimate sequences play an important role in both giving rates and in designing more accelerated methods. The term “estimate sequence” is used synonymously with “potential functions”, and suggests that we should view $\phi_t$ as a sequence of Lyaponuv functions, or energy-like functions.

Generalizations of quadratic estimate sequences

The most straightforward generalization of quadratic estimate sequences is to extend to generalized norms $(\|z\|_2 \to \sqrt{z^THz})$, or to general Bregman divergences. The error term itself can also be generalized; here we study $f(x^{(t)}) – f^*$, but both the gradient norm and the variable distance to optimality are also valid errors to bound. The choice of the estimate sequence should inherently depend on the problem geometry; quadratic arises naturally from the $L$-smooth $\mu$-strongly convex assumptions, since they are always with respect to the Euclidean norm; however, if they are taken w.r.t. other norms, then other geometries may be more useful.

Use in automatic proof generators

One of the biggest appeals of estimate sequence analysis is that it is a step toward automating the process of providing convergence proofs. Although there are many knobs to twiddle, somehow the way that they are twiddled are fairly well-defined. This problem space has no doubt been the “holy grail” of convergence-proof-writers since the beginning of time, with many interesting contributions to date.

Appendix: Rate derivations

Vanilla gradient descent

Since we have assigned $$\gamma_{t+1} = (1-\omega_t)\gamma_t + \mu \omega_t = \frac{\omega_t}{\alpha},$$ then $$\omega_t = (1-\omega_t)\omega_{t-1} + \mu \alpha\omega_t \iff 1- \omega_t = \frac{1-\alpha\mu}{1 + \omega_{t-1} – \alpha \mu}.$$ Now let us look at the case of $f$ strongly-convex $(\mu > 0)$ and convex-not-strongly $(\mu = 0)$ separately.

  • Strongly convex case: If $\mu > 0$, we take $ \omega_0 > \mu/L$ and $\alpha = 2/L$. Via recursion, we have  $$ 1- \omega_t < \frac{1-2(\mu/L)}{1 + (\mu/L) – 2(\mu /L)} = 1-\frac{(\mu/L)}{1-(\mu/L)} < 1-\frac{\mu}{L}. $$ This gives the linear convergence rate of $\lambda_t = (1-\frac{\mu}{L})^t$.
  • Convex, not strongly: If instead $\mu = 0$ then $1-\omega_t = \frac{1}{1+\omega_{t-1}}$. Then telescoping gives $$
    \frac{1}{\omega_t} = \frac{1}{\omega_{t-1}} + 1 \quad \overset{\text{telescope}}{\Rightarrow} \quad \frac{1}{\omega_t}-\frac{1}{\omega_0} = t \quad \iff \quad (1-\omega_t) = \frac{1+(t-1)\omega_0}{t\omega_0 + 1}
    $$
    and telescoping again, multiplicatively,
    $$
    \lambda_t = \prod_{\tau=1}^t (1-\omega_\tau) = \prod_{\tau=1}^t \frac{1+(\tau-1) \omega_0}{\tau \omega_0 +1} = \frac{1}{t \omega+1}
    = O(1/t).
    $$

Nesterov’s accelerated method

We see that the recursion on $\omega_t$ is $$L\omega_t^2 = (1-\omega_t) L\omega_{t-1}^2 + \alpha_t \mu.$$

  • Strongly convex case: Assume constant step size $\alpha_t = 1/L$. Then, suppose that $\omega_0 \geq \sqrt{\mu/L}$. Then, recursively, $$L\omega_t^2 \geq \mu \iff \omega_t \geq \sqrt{\mu/L} \iff \lambda_t = \left(1-\sqrt{\frac{\mu}{L}}\right)^t$$
  • Convex, not strongly: In the case that $\mu = 0$, we have $$\lambda_{t+1} = \frac{\omega_t^2}{\omega_{t-1}^2} \lambda_t = \frac{\omega_t^2}{\omega_0^2}.$$

Then,  $$\frac{1}{\sqrt{\lambda_{t+1}}} – \frac{1}{\sqrt{\lambda_t}} = \frac{\lambda_t-\lambda_{t+1}}{\sqrt{\lambda_t\lambda_{t+1}}(\sqrt{\lambda_t}+\sqrt{\lambda_{t+1}})} \overset{\lambda \text{ decreasing}}{\geq}\frac{\lambda_t-\lambda_{t-1}}{2\lambda_t\sqrt{\lambda_{t+1}}} \overset{\lambda_{t+1}=(1-\omega_t)\lambda_t}{=}\frac{\omega_t}{2\sqrt{\lambda_{t+1}}} \geq \frac{\omega_0}{2}. $$

Telescoping and rearranging give $$ \frac{1}{\sqrt{\lambda_t}}-1 \geq \frac{t\sqrt{\gamma_0}}{2\sqrt{L}} \iff \lambda_t \leq \frac{4L}{(2\sqrt{L}+\sqrt{\gamma_0} t)^2} = O(L/t^2).$$

 

References

  • [1] Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course.

Leave a Reply