tl;dr
In a former version of this post, I conjectured that while $L$-smoothness and convexity are not equivalent by definition, they are tied together in some of the “equivalent” (not actually equivalent) restatements.
To clarify, my former statement:
“Convexity and smoothness are totally independent concepts and should not be confused with each other”
I still believe is true, and (about 80% of it) carries through to the equivalent forms as well. Thus no groundbreaking optimization proofs were harmed in the creation of this blog post.
But, there are still minor details to iron out.
“Equivalences”
The following are all considered “equivalent” definitions of $L$-smoothness. (tl;dr: they are not equivalent).
- [D] $\|\nabla f(x)-\nabla f(y)\|_2\leq L\|x-y\|_2$
- [Q] $f(x)-f(y)-\nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2$
- [Q2] $g(x) = \frac{L}{2}x^Tx - f(x)$ is convex
- [M] $(\nabla f(x)-\nabla f(y))^T(x-y) \leq L\|x-y\|_2^2$
- [H] Largest eigenvalue of $\nabla^2 f(x)$ is at most $L$
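These definitions are easy to poke at numerically. A minimal sketch (assuming NumPy, and using the quadratic $f(x) = \frac{1}{2}x^TAx$ with $A \succeq 0$, for which the best $L$ is exactly the largest eigenvalue of $A$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = A @ A.T  # symmetric PSD, so f is convex
L = np.linalg.eigvalsh(A).max()  # [H]: largest eigenvalue of the Hessian

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    d = np.linalg.norm(x - y)
    # [D]: gradient is L-Lipschitz
    assert np.linalg.norm(grad(x) - grad(y)) <= L * d + 1e-9
    # [Q]: quadratic upper bound
    assert f(x) - f(y) - grad(y) @ (x - y) <= L / 2 * d**2 + 1e-9
    # [M]: inner-product bound
    assert (grad(x) - grad(y)) @ (x - y) <= L * d**2 + 1e-9
```

All four conditions agree on this example; the interesting question, taken up below, is when they agree in general.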
Now, some details:
- [D] is in fact the definition of $L$-smoothness. It does not and cannot depend on convexity, since if $f$ is $L$-smooth, then so is $-f$.
- [Q], [Q2], [M] are all equivalent. See also citation [1] below.
- Under the assumption that the Hessian exists everywhere, [Q], [Q2], [M] $\iff$ [H].
New conjecture (probably exists somewhere; I’m too lazy to find it and would rather prove it myself):
- [D] $\iff$ [Q], [Q2], [M] hold for both $f$ and $-f$
- [D] + Hessian exists everywhere $\iff$ largest-magnitude eigenvalue of $\nabla^2 f(x)$ is at most $L$
- [Q], [Q2], [M] alone are trivially true for concave functions.
- Relationship with convexity: [D] $\Rightarrow$ [Q], [Q2], [M]. If additionally the function is convex, then [Q], [Q2], [M] $\Rightarrow$ [D]. But convexity is a sufficient, not necessary, condition (as evidenced by $f(x) = \sin(x)$, which is $L$-smooth but not convex).
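That last point is easy to verify numerically. A minimal sketch (assuming NumPy): $f(x) = \sin(x)$ satisfies [D] with $L = 1$, since $f'(x) = \cos(x)$ is 1-Lipschitz, yet $f''(x) = -\sin(x)$ goes negative, so $f$ is not convex.

```python
import numpy as np

rng = np.random.default_rng(1)
xs, ys = rng.uniform(-10, 10, 1000), rng.uniform(-10, 10, 1000)

# [D] with L = 1: |cos(x) - cos(y)| <= |x - y|
assert np.all(np.abs(np.cos(xs) - np.cos(ys)) <= np.abs(xs - ys) + 1e-12)

# Not convex: the second derivative -sin(x) is negative somewhere,
# e.g. at x = pi/2
assert -np.sin(np.pi / 2) < 0
```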
Proof of statement 3: This is pretty simple, and we can just go after [Q2]: if $f$ is concave then $-f$ is convex, and thus $g$, as the sum of the convex functions $\frac{L}{2}x^Tx$ and $-f(x)$, is convex.
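Statement 3 can also be seen through [Q] with a quick numerical check (a sketch, assuming NumPy): for a concave $f$, the left-hand side of [Q] is already nonpositive, so the bound holds for every $L \geq 0$.

```python
import numpy as np

f = lambda x: -np.exp(x)     # concave on all of R
grad = lambda x: -np.exp(x)

rng = np.random.default_rng(2)
xs, ys = rng.uniform(-3, 3, 1000), rng.uniform(-3, 3, 1000)

# [Q] left-hand side is nonpositive for a concave function,
# hence bounded by (L/2)|x - y|^2 for any L >= 0.
lhs = f(xs) - f(ys) - grad(ys) * (xs - ys)
assert np.all(lhs <= 1e-12)
```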
Proof of statement 2: This is more of a logic statement than anything else. It is still true that $L$-smoothness forces the Hessian’s eigenvalue magnitudes to be at most $L$, and one simple way to see this is to use the Taylor approximation trick: given that $\nabla^2 f(x)$ is continuous everywhere, for every $x$ and $y$, there exists some $w$ on the segment between them where $$\nabla f(x)-\nabla f(y) = \nabla^2 f(w)(x-y).$$ Therefore, $$\|\nabla f(x)-\nabla f(y)\|_2 = \|\nabla^2 f(w)(x-y)\|_2,$$ and taking $y \to x$ (so that $w \to x$) along a unit eigenvector of $\nabla^2 f(x)$ shows that each eigenvalue of $\nabla^2 f(x)$ must have magnitude at most $L$ if [D] is to hold.
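The limiting behavior behind this argument can be checked in one dimension (a sketch, assuming NumPy): for $f = \sin$, the difference quotient of the gradient approaches $f''(x)$, whose magnitude [D] caps at $L = 1$.

```python
import numpy as np

fp = np.cos                   # f'(x) for f = sin
fpp = lambda x: -np.sin(x)    # f''(x)

x, t = 0.7, 1e-6
# (f'(x+t) - f'(x)) / t -> f''(x); [D] with L = 1 bounds this quotient by 1
quotient = (fp(x + t) - fp(x)) / t
assert abs(quotient - fpp(x)) < 1e-4
assert abs(quotient) <= 1.0 + 1e-9
```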
Proof of statement 4: First, without extra bells or whistles, Nesterov already showed us that [D] $\Rightarrow$ [Q], [Q2], [M]. To see this, we just prove it for [Q]. First, the difference $f(x)-f(y)$ can be written as an integral over inner products:
$$f(x)-f(y) = \int_0^1 \frac{\partial f (u(s))}{\partial s} ds = \int_0^1 \sum_{k=1}^n \frac{\partial f(u)}{\partial u_k} \frac{\partial u_k(s)}{\partial s} ds= \int_0^1 \nabla f(u(s))^T(x-y) ds$$
where $u(s) = (1-s) y + s x\in \mathbb R^n$. Therefore,
$$f(x)-f(y) - \nabla f(y)^T(x-y)= \int_0^1 (\nabla f(u(s)) - \nabla f(y))^T(x-y) ds$$
and using Cauchy-Schwarz and imposing $L$-smoothness,
$$f(x)-f(y) - \nabla f(y)^T(x-y) \overset{\mathrm{C-S}}{\leq} \int_0^1 \|\nabla f(u(s)) - \nabla f(y)\|_2\|x-y\|_2 ds \overset{\mathrm{D}}{\leq} L \int_0^1 \|u(s)-y\|_2\|x-y\|_2 ds.$$
Since $\|u(s)-y\|_2 = s\|x-y\|_2$, the integral reduces to
$$L \int_0^1 \|u(s)-y\|_2\|x-y\|_2 ds = L \|x-y\|_2^2\int_0^1 s\, ds = \frac{L}{2}\|x-y\|_2^2.$$
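The line-integral identity at the start of this derivation can be sanity-checked numerically (a sketch, assuming NumPy; the trapezoid rule is exact here since the integrand is linear in $s$ for a quadratic $f$):

```python
import numpy as np

# A symmetric A makes f a smooth quadratic with gradient A x.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = np.array([1.0, -2.0]), np.array([0.3, 0.8])
u = lambda s: (1 - s) * y + s * x  # path from y to x

# Trapezoid-rule approximation of  int_0^1 grad f(u(s))^T (x - y) ds
ss = np.linspace(0.0, 1.0, 10001)
vals = np.array([grad(u(s)) @ (x - y) for s in ss])
h = ss[1] - ss[0]
integral = h * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1])

assert abs(integral - (f(x) - f(y))) < 1e-6
```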
Now suppose [Q2] is true, and also $f$ is convex. And, for fun, let’s just assume that the Hessian exists everywhere. Then convexity of $g$ gives $\nabla^2 g(x) = LI - \nabla^2 f(x) \succeq 0$, and convexity of $f$ gives $\nabla^2 f(x) \succeq 0$; together, $0 \preceq \nabla^2 f(x) \preceq LI$. Note that [D] only requires $-LI \preceq \nabla^2 f(x) \preceq LI$; hence the sufficient-but-not-necessary bit.
Finally, proof of statement 1: Suppose that [Q2] holds for both $f$ and $-f$ (again assuming the Hessian exists everywhere). Then the eigenvalues of $\nabla^2 f(x)$ must lie in $[-L, L]$, i.e. $-LI \preceq \nabla^2 f(x) \preceq LI$. This is exactly the Hessian condition needed for [D].
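Statement 1 can be sanity-checked in one dimension as well (a sketch, assuming NumPy): with $f = \sin$ and $L = 1$, [Q2] holds for both $f$ and $-f$, matching the fact that $f$ satisfies [D] with $L = 1$.

```python
import numpy as np

L = 1.0
xs = np.linspace(-10, 10, 10001)

# [Q2] for f = sin: g(x) = (L/2)x^2 - sin(x), so g''(x) = L + sin(x) >= 0
assert np.all(L + np.sin(xs) >= -1e-12)
# [Q2] for -f:      g(x) = (L/2)x^2 + sin(x), so g''(x) = L - sin(x) >= 0
assert np.all(L - np.sin(xs) >= -1e-12)
```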
A could-be-fun addendum: proving all of this without using the Hessian, aka for cases where the Hessian may not exist.
And, since it’s been a while, a cozy photo of a little friend
Citations
[1]: Blog of Xingyu Zhou, 2 pages in particular: https://xingyuzhou.org/blog/notes/Lipschitz-gradient and https://xingyuzhou.org/blog/notes/strong-convexity. By far the best set of reference notes I have seen on this topic.