The next couple of posts will focus on our favorite second-order method: Newton’s method. I’ve been going through these methods, partly as “review” (in quotes because I ended up learning a lot of new things) and partly to develop some intuition as to when acceleration-by-leveraging-second-ordery-info might actually help. This first post will be super short, focusing on the analysis of Newton’s method under two strong-convexity-like conditions.
Consider the unconstrained minimization of $f(x)$, where $f$ is everywhere twice differentiable. Newton’s method then iterates as $$x^{(t+1)} = x^{(t)} - (\nabla^2 f(x^{(t)}))^{-1}\nabla f(x^{(t)}).$$ Unlike gradient descent, there is no “step size”; the curvature information is already incorporated through the Hessian inverse. Also, as a responsible optimist, I must point out that in practice we never actually form the Hessian inverse, but instead compute the matrix-inverse-vector product using some iterative method, like conjugate gradient.
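To make that last remark concrete, here is a minimal sketch (my own, in NumPy, with hypothetical `grad` and `hess_vec` callables standing in for $\nabla f$ and Hessian-vector products): compute the Newton step by running conjugate gradient on $\nabla^2 f(x)\, d = -\nabla f(x)$, so the Hessian is never formed or inverted explicitly.

```python
import numpy as np

def newton_step_cg(grad, hess_vec, x, tol=1e-10, max_iter=None):
    """Approximate the Newton step d = -(Hess f(x))^{-1} grad f(x) by running
    conjugate gradient on (Hess f(x)) d = -grad f(x), using only
    Hessian-vector products hess_vec(x, v)."""
    b = -grad(x)                 # right-hand side: -grad f(x)
    d = np.zeros_like(b)         # CG iterate, starts at zero
    r = b.copy()                 # residual b - A d (equals b since d = 0)
    p = r.copy()                 # search direction
    rs = r @ r
    if max_iter is None:
        max_iter = len(b)
    for _ in range(max_iter):
        Ap = hess_vec(x, p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Sanity check on a quadratic f(x) = 0.5 x'Ax - b'x, where a single Newton step
# from anywhere lands exactly on the minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess_vec = lambda x, v: A @ v
x = np.zeros(2)
print(x + newton_step_cg(grad, hess_vec, x), np.linalg.solve(A, b))
```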
Suppose the Hessian is $M$-Lipschitz: $$\|\nabla^2 f(x)-\nabla^2 f(y)\|_2 \leq M\|x-y\|_2.\quad (\star)$$ We all learn Newton’s method as having a locally quadratic convergence rate. The usual question is then: what is the global convergence rate? Well, if you stare and squint at the proofs in Nesterov’s works, the answer is actually right in front of you.
- If the function is convex but not strongly convex, then Newton’s method may not work at all. In fact, if $\nabla^2 f(x^{(t)})$ has zero eigenvalues at some iterate $x^{(t)}$, the pure Newton step breaks down, and you have to either take a damped Newton step (via line search; a sketch of this safeguard appears below) or switch to gradient steps. So, unsurprisingly, the rate here should be $O(1/t)$, the same as in gradient descent.
- If the function is $\mu$-strongly convex with $\mu > 0$, then Newton’s method has a locally quadratic convergence rate, and the proof is like 3 lines.
- If the function is $\mu$-strongly convex only at the optimum, then we still cannot say anything globally, but we can say that the method converges quadratically once it gets close enough. The proof is a bit longer, and quadratic convergence may kick in much later, though.
In other words, it’s an all-or-nothing game. Either I get quadratic convergence, or I am better off switching to gradient descent (at least until we’re close enough to the quadratic regime).
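To make the “damped Newton until the quadratic regime kicks in” idea concrete, here is a minimal sketch (mine, not Nesterov’s, with hypothetical `f`, `grad`, and `hess` callables): the Newton direction combined with a backtracking (Armijo) line search, which damps the step far from the solution and accepts the full step $t = 1$ once it already gives sufficient decrease.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-10, max_iter=100,
                  beta=0.5, sigma=1e-4):
    """Newton's method safeguarded by backtracking (Armijo) line search.
    Far from the solution the step is damped; near the solution the full
    step t = 1 is accepted and local quadratic convergence takes over."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction (or use CG as above)
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds.
        while f(x + t * d) > f(x) + sigma * t * (g @ d):
            t *= beta
        x = x + t * d
    return x

# Example on a simple smooth, strongly convex test function with minimizer 0.
f = lambda x: np.sum(np.exp(x) + np.exp(-x))
grad = lambda x: np.exp(x) - np.exp(-x)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
print(damped_newton(f, grad, hess, np.array([5.0, -3.0])))  # ~ [0, 0]
```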
The proof for this convergence is actually really cute, once you know all the pieces. Use the integral form for $\nabla f(x^{(t)})$: $$\nabla f(x^{(t)}) -\nabla f(x^*)= \int_0^1 \nabla^2 f(x^* + \tau (x^{(t)}-x^*) ) (x^{(t)}-x^*)\, d\tau. $$ Then, since $\nabla f(x^*) = 0$,
\begin{eqnarray*}
x^{(t+1)}-x^* &=& x^{(t)} - x^* - ( \nabla^2 f(x^{(t)}))^{-1} \nabla f(x^{(t)})\\
&=& x^{(t)} - x^* - ( \nabla^2 f(x^{(t)}))^{-1} \left( \int_0^1 \nabla^2 f(x^* + \tau (x^{(t)}-x^*) ) (x^{(t)}-x^*)\, d\tau\right)\\
&=&( \nabla^2 f(x^{(t)}))^{-1}\left(\int_0^1 \left(\nabla^2 f(x^{(t)}) - \nabla^2 f(x^* + \tau (x^{(t)}-x^*) )\right) d\tau \right)(x^{(t)}-x^*).
\end{eqnarray*}
(In the last step, we wrote $x^{(t)}-x^* = (\nabla^2 f(x^{(t)}))^{-1}\nabla^2 f(x^{(t)})(x^{(t)}-x^*)$ and pulled everything under one integral.) Taking norms and applying ($\star$),
\begin{eqnarray*}
\|x^{(t+1)}-x^*\|_2 &\leq & \|( \nabla^2 f(x^{(t)}))^{-1}\|_2\left(\int_0^1 M\|x^{(t)}-x^*-\tau (x^{(t)}-x^*)\|_2\, d\tau\right)\|x^{(t)}-x^*\|_2 \\
&=& \|( \nabla^2 f(x^{(t)}))^{-1}\|_2\;\underbrace{\left(\int_0^1 (1-\tau)\, d\tau\right)}_{=1/2}\; M \|x^{(t)}-x^*\|_2^2 \\
&=& \frac{M}{2}\|( \nabla^2 f(x^{(t)}))^{-1}\|_2 \|x^{(t)}-x^*\|_2^2.
\end{eqnarray*}
Now the question is, what is an upper bound on $\|( \nabla^2 f(x^{(t)}))^{-1}\|_2$? Well, if $f$ is $\mu$-strongly convex everywhere, then $\|( \nabla^2 f(x^{(t)}))^{-1}\|_2 \leq 1/\mu$ everywhere, and we can say globally that
$$ \|x^{(t+1)}-x^*\|_2 \leq \frac{M}{2\mu} \|x^{(t)}-x^*\|_2^2. $$
Now, if $\|x^{(t)}-x^*\|_2 < \frac{2\mu}{M}$, this implies contraction, and henceforth, for $t' > t$, there will be quadratic convergence. If that hasn’t happened yet, a simple line-search add-on ensures that a damped Newton step still makes progress, globally.
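To see where the “correct digits double each iteration” intuition comes from, here is the standard unrolling of this bound (the notation $e_t$ and $t_0$ is mine, not the post’s):
$$ e_t := \frac{M}{2\mu}\|x^{(t)}-x^*\|_2 \quad\Longrightarrow\quad e_{t+1} \leq e_t^2 \quad\Longrightarrow\quad \|x^{(t)}-x^*\|_2 \leq \frac{2\mu}{M}\, e_{t_0}^{\,2^{\,t-t_0}} \ \text{ for } t \geq t_0, $$
so as soon as $e_{t_0} < 1$ (i.e. $\|x^{(t_0)}-x^*\|_2 < \frac{2\mu}{M}$), the error decays doubly exponentially, and the number of correct digits roughly doubles per iteration.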
If, on the other hand, we loosen our requirement to simply asking that $\lambda_{\min}(\nabla^2 f(x^*)) \geq \mu$, then by ($\star$), $\lambda_{\min}(\nabla^2 f(x^{(t)})) \geq \mu - M\|x^{(t)}-x^*\|_2$, and
$$ \|x^{(t+1)}-x^*\|_2 \leq \frac{M}{2(\mu-M\|x^{(t)}-x^*\|_2)} \|x^{(t)}-x^*\|_2^2\leq \frac{M}{2\mu(1-\theta)} \|x^{(t)}-x^*\|_2^2 $$
if $\|x^{(t)}-x^*\|_2 \leq \frac{\mu\theta}{M}$ for some $\theta < 1$. Nesterov picks $\theta = 2/3$, but anything will do.
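As a sanity check on this local picture, here is a tiny numerical experiment (my own illustration, not from Nesterov): $f(x) = \log\cosh(x)$ has $f''(0) = 1$, so it is strongly convex at the optimum $x^* = 0$, but its Hessian decays to zero far away, so it is not strongly convex globally. Pure Newton converges extremely fast from $x_0 = 0.5$ but oscillates and blows up from $x_0 = 1.5$.

```python
import numpy as np

# f(x) = log(cosh(x)):  f'(x) = tanh(x),  f''(x) = 1/cosh(x)^2, so f''(0) = 1
# (strong convexity at x* = 0) while f''(x) -> 0 far away (no global strong
# convexity).  The pure Newton update is x <- x - sinh(x) * cosh(x).
def pure_newton_logcosh(x0, iters=8):
    x, errs = x0, []
    for _ in range(iters):
        x = x - np.sinh(x) * np.cosh(x)
        errs.append(abs(x))                      # distance to x* = 0
        if not np.isfinite(x) or abs(x) > 1e6:   # stop once iterates blow up
            break
    return errs

print(pure_newton_logcosh(0.5))  # ~8.8e-2, ~4.5e-4, ~6e-11, 0.0, ... (faster than quadratic here, thanks to symmetry)
print(pure_newton_logcosh(1.5))  # ~3.5, ~2.8e2, ~6e238: outside the basin, the iterates diverge
```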
I don’t think Nesterov himself was very satisfied with this rate; in particular, in the second case we’re adding a new parameter we can’t measure ($\theta$) on top of the $M$-Lipschitz Hessian and $\mu$-strong convexity, and it is hard to pin down exactly which functions satisfy these conditions. It seems like this was the motivation behind the next two topics: Newton with cubic regularization for more general functions, and Newton’s method for self-concordant functions.
Reference
[1] Nesterov, Y. Lectures on Convex Optimization. 2018, Chapter 1.2.
Edit July 12 2002: Thanks to Mark Schmidt for pointing out that there is a serious flaw in my statement that Newton’s method can converge quadratically, globally. We can guarantee globally that $\|x^{(t+1)}-x^*\|_2 \leq \frac{M}{2}\|( \nabla^2 f(x^{(t)}))^{-1}\|_2 \|x^{(t)}-x^*\|_2^2 \leq \frac{M}{2\mu}\|x^{(t)}-x^*\|_2^2$, but unless $\|x^{(t)}-x^*\|_2 < \frac{2\mu}{M}$, this does not imply contraction. In other words, even if $f$ has an $M$-Lipschitz Hessian and is $\mu$-strongly convex, we still need to use line search and damped Newton in the initial phase, until the locally quadratic convergence rate kicks in. tl;dr, no global quadratic convergence (sorry!). This has been amended in the post.