I’ve been racking my brain lately over whether there is an appropriate place to put “small proofs”, i.e. stuff that isn’t trivial and that I can’t find anywhere, but isn’t cool enough to inspire a whole new paper. For now, I think I’ll put it here.
Background
We investigate $\underset{x}{\min}\, f(x)$ using gradient descent
$$ x^{(k+1)} = x^{(k)} - \alpha \nabla f(x^{(k)}) \qquad\qquad\text{(GD)}$$
As I’m working to pull together notes for a short course, I realized all my proofs are “average case proofs”, e.g. the averaged function suboptimality or the averaged squared gradient norm converges like $O(1/k)$. But because GD is a “descent method”, the averaged-function-suboptimality proof is a last-iterate proof in disguise, since
$$f(x^{(k+1)})\leq f(x^{(k)}) \;\Rightarrow\; f(x^{(k)}) - f^\star \leq \frac{1}{k} \sum_{i=1}^k \left(f(x^{(i)}) - f^\star\right) = O(1/k).$$
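As a quick numerical sanity check (a toy example of my own, not from the course notes): on a convex quadratic with step size below $2/L$, every GD step decreases $f$, so the last iterate sits below the running average.

```python
# Toy check (my example): GD on f(x) = 0.5 x^T A x, whose minimum value is 0.
# With 0 < alpha < 2/L every step decreases f, so last iterate <= average iterate.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                       # PSD, so f is convex
L = np.linalg.eigvalsh(A).max()   # smoothness constant = largest eigenvalue
alpha = 1.0 / L

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(5)
vals = [f(x)]
for _ in range(200):
    x = x - alpha * grad(x)
    vals.append(f(x))

vals = np.array(vals)
assert np.all(np.diff(vals) <= 1e-12)      # descent: f(x^{k+1}) <= f(x^k)
assert vals[-1] <= vals.mean() + 1e-12     # last iterate <= average of iterates
```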
I wanted to do the same thing for the gradient norm case, but I could not find this proof around, even though intuitively it should be true. So I concocted one. Please let me know if this already exists, otherwise please cite this thing (url buried somewhere is probably sufficient) if you would like to steal it.
Statement: If $f$ is convex and $L$-smooth and $0<\alpha < 2/L$, then $\|\nabla f(x^{(k+1)})\|_2 \leq \|\nabla f(x^{(k)})\|_2$ for all $k$.
If this is true, then we can generalize average gradient results to last iterate results.
$$\|\nabla f(x^{(k)})\|_2^2 \leq \frac{1}{k} \sum_{i=1}^k \|\nabla f(x^{(i)})\|_2^2 = O(1/k).$$
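Here is a small numerical check of the claimed monotonicity on a convex, smooth example of my own choosing ($f(x)=\sum_i \log\cosh(x_i)$, which has $L=1$), with a step size deliberately close to the $2/L$ limit:

```python
# Numerical check (my example): f(x) = sum_i log(cosh(x_i)) is convex with L = 1
# since f''(t) = sech(t)^2 <= 1. With alpha in (0, 2/L), the gradient norm
# should never increase from one iterate to the next.
import numpy as np

grad = lambda x: np.tanh(x)       # gradient of sum_i log(cosh(x_i))
L = 1.0
alpha = 1.9 / L                   # deliberately close to the 2/L limit

x = np.array([3.0, -2.0, 0.5])
gnorms = [np.linalg.norm(grad(x))]
for _ in range(100):
    x = x - alpha * grad(x)
    gnorms.append(np.linalg.norm(grad(x)))

assert np.all(np.diff(gnorms) <= 1e-12)   # ||grad f(x^{k+1})|| <= ||grad f(x^k)||
```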
Proof:
First,
$$ \| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = ( \nabla f(x^{(k+1)})-\nabla f(x^{(k)}))^T( \nabla f(x^{(k+1)})+\nabla f(x^{(k)})).$$
The mean value theorem doesn’t directly hand us a single $z$ for vector-valued maps, but the fundamental theorem of calculus applied to $t\mapsto \nabla f\big(x^{(k)} + t(x^{(k+1)}-x^{(k)})\big)$ gives an averaged Hessian $H = \int_0^1 \nabla^2 f\big(x^{(k)} + t(x^{(k+1)}-x^{(k)})\big)\,dt$ (still symmetric, with $0 \preceq H \preceq LI$ under convexity and $L$-smoothness) satisfying
$$H(x^{(k+1)}-x^{(k)} ) = \nabla f(x^{(k+1)})-\nabla f(x^{(k)}).$$
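As a quick sanity check that this averaged-Hessian substitution behaves as claimed, here is a 1-d toy example of my own (using $f(x)=\log\cosh(x)$ and a midpoint-rule quadrature for the integral):

```python
# 1-d sanity check (my toy example): the averaged Hessian
# H = integral_0^1 f''(x + t*s) dt reproduces the gradient difference
# f'(x+s) - f'(x). Here f(x) = log(cosh(x)).
import numpy as np

fp  = lambda x: np.tanh(x)             # f'(x)
fpp = lambda x: 1.0 / np.cosh(x)**2    # f''(x)

x, s = 0.7, -0.9
ts = (np.arange(100_000) + 0.5) / 100_000   # midpoints of [0, 1]
H = fpp(x + ts * s).mean()                  # averaged "Hessian" (a scalar in 1-d)
assert np.isclose(H * s, fp(x + s) - fp(x))
```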
Using this substitution (and, in the second line below, the update (GD) to write $\nabla f(x^{(k)}) = -\frac{1}{\alpha}(x^{(k+1)}-x^{(k)})$),
\begin{eqnarray*}
\| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 &=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) +2\nabla f(x^{(k)}))\\
&=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) -\frac{2}{\alpha}(x^{(k+1)}-x^{(k)})).
\end{eqnarray*}
Now we make life easier and just write $s = x^{(k+1)}-x^{(k)}$ to get
$$
\| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = s^TH^THs -\frac{2}{\alpha} s^THs.
$$
Now we invoke convexity. Since $f$ is convex, $H \succeq 0$, so a Cholesky-style factorization $H = CC^T$ exists (calling the factor $C$ to avoid clashing with the smoothness constant $L$). Encoding $u = C^Ts$, so that $s^THs = \|u\|_2^2$ and $s^TH^THs = u^T(C^TC)u$, we get
$$ \| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = u^T(C^TC)u -\frac{2}{\alpha} \|u\|_2^2 \leq \Big(L - \frac{2}{\alpha}\Big)\|u\|_2^2,$$
where the inequality holds because $C^TC$ has the same eigenvalues as $H = CC^T$, all of which are at most $L$ by smoothness. The right-hand side is $\leq 0$ whenever $0<\alpha < 2/L$.
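To double-check the algebra, here is a verification on a quadratic of my own choosing, where the Hessian is a constant matrix $A$, so $H = A$ exactly and every step of the chain can be tested numerically:

```python
# Algebra check (my example): quadratic f(x) = 0.5 x^T A x + b^T x, so H = A.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # positive definite, so Cholesky exists
b = rng.standard_normal(4)
L = np.linalg.eigvalsh(A).max()    # smoothness constant
alpha = 1.5 / L

grad = lambda x: A @ x + b
x = rng.standard_normal(4)
x_next = x - alpha * grad(x)
s = x_next - x

# change in squared gradient norm equals s^T H^2 s - (2/alpha) s^T H s
lhs = np.linalg.norm(grad(x_next))**2 - np.linalg.norm(grad(x))**2
rhs = s @ A @ A @ s - (2 / alpha) * (s @ A @ s)
assert np.isclose(lhs, rhs)

# Cholesky step: u = C^T s gives s^T H s = ||u||^2 and s^T H^2 s <= L ||u||^2
C = np.linalg.cholesky(A)          # A = C C^T
u = C.T @ s
assert np.isclose(s @ A @ s, u @ u)
assert s @ A @ A @ s <= L * (u @ u) + 1e-10
assert lhs <= (L - 2 / alpha) * (u @ u) + 1e-10   # hence <= 0 since alpha < 2/L
```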
Notes:
It’s unclear to me if convexity is really needed, or if smoothness alone should be sufficient. It seems like there should be a way around requiring convexity, since the biggest property used was that the eigenvalues of $H$ are at most $L$, which is true even if $f$ is not convex, but other little parts of the proof (like needing $H \succeq 0$ for the factorization) seem nontrivial. Maybe to be continued…
Edit: quick counterexample for the nonconvex case: suppose $f(x) = -x^2$ and start with $x^{(0)}\neq 0$. Then at every step, the gradient norm necessarily increases. We could make the minimum attainable by swapping in, say, a sine function, but the violation still happens. Sad.
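A few lines of code make the blow-up obvious (my own illustration of the same counterexample):

```python
# Counterexample demo: f(x) = -x^2 has gradient -2x, so each GD step multiplies
# x by (1 + 2*alpha) and the gradient norm grows geometrically.
alpha = 0.1
x = 1.0
g_prev = abs(-2 * x)
for _ in range(10):
    x = x - alpha * (-2 * x)
    g = abs(-2 * x)
    assert g > g_prev              # gradient norm strictly increases every step
    g_prev = g
```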
Edit: related work, surfaced through discussion:
- Class notes by Mark cover the $O(1/k)$ convergence rate of the averaged squared gradient norm for nonconvex functions: https://www.cs.ubc.ca/~schmidtm/Courses/5XX-S20/S2.pdf
- A tighter $O(1/k^2)$ bound by Adrien and Francis using their computer-aided methods: https://arxiv.org/pdf/1902.00947.pdf, Theorem 3
Isn’t convexity needed to assume the Cholesky decomposition?
Yeah, but actually convexity is needed for the result to be true anyway. I was hoping to get around it by just doing an LDL decomposition (which only requires symmetry), but the counterexample of $f(x) = -x^2$ suggests that would be fruitless.