# Small proof: $L$-smoothness definitions and equivalences

### tl;dr

In a former version of this post, I conjectured that while $L$-smoothness and convexity were not equivalent in definition, they were tied in some of the “equivalent” (not equivalent) versions of their statements.

To clarify, my former statement:

“Convexity and smoothness are totally independent concepts and should not be confused with each other “

I now still believe is still true, and (about 80%) carries through to equivalent forms as well. Thus no groundbreaking optimization proofs were harming in the creation of this blog post.

But, there are still minor details to iron out.

### “Equivalences”

The following are all considered “equivalent” definitions of $L$-smoothness. (tl;dr: they are not equivalent).

• [D] $\|\nabla f(x)-\nabla f(y)\|_2\leq L\|x-y\|_2$
• [Q] $f(x)-f(y)-\nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2$
• [Q2] $g(x) = \frac{L}{2}x^Tx-f(x)$ is convex
• [M] $(\nabla f(x)-\nabla f(y))^T(x-y) \leq L\|x-y\|_2^2$
• [H] Largest eigenvalue of $\nabla^2 f(x)$ is less than $L$

Now, some details:

• [D] is in fact the definition of $L$-smoothness. It does not and cannot depend on convexity, since if $f$ is $L$-smooth, then so is $-f$.
• [Q], [Q2], [M] are all equivalent. See also citation [1] below.
• Under the assumption that the Hessian exists everywhere, [Q], [Q2], [M] $\iff$ [H].

New conjecture (probably exists somewhere, I’m too lazy to find it and rather prove it myself):

1. [D] $\iff$ [Q], [Q2], [M] applies for $f$ and $-f$
2. [D] + Hessian exists everywhere $\iff$ largest magnitude eigenvalue of $\nabla^2 f(x)$ is less than $L$
3. [Q], [Q2], [M] alone are trivially true for concave functions.
4. Relationship with convexity: [D] $\Rightarrow$ [Q], [Q2], [M].  If additionally the function is convex, then [Q], [Q2], [M] $\Rightarrow$ [D]. But it is a sufficient, not necessary condition (as evidenced by $f(x) = \sin(x)$ is $L$-smooth.)

Proof of statement 3: This is pretty simple, and we can just go after [Q2]: if $f$ is concave then $-f$ is convex, and thus $g$ is convex.

Proof of statement 2: This is more of a logic statement than anything else. It is still true that $L$-smoothness forces Hessian magnitude to be less than $L$, and once simple way to see this is to use the Taylor approximation trick: e.g. given that $\nabla^2 f(x)$ is continuous everywhere, for every $x$ and $y$, there exists some $w$ where $$\nabla f(x)-\nabla f(y) = \nabla^2 f(w)(x-y).$$ Therefore, $$\|\nabla f(x)-\nabla f(y)\|_2^2 = \|\nabla f(w)\|_2^2\|x-y\|_2^2,$$ which imposes that $\|\nabla f(w)\|_2 \leq L$ if [D] is to hold.

Proof of statement 4: First, without extra bells or whistles, Nesterov already showed us that [D] $\Rightarrow$ [Q], [Q2],[M]. To see this, we just prove for [Q]. First, the Bregman divergence can be written as an integral over inner products:
$$f(x)-f(y) = \int_0^1 \frac{\partial f (u(s))}{\partial s} ds = \int_0^1 \sum_{k=1}^n \frac{\partial f(u)}{\partial u_k} \frac{\partial u_k(s)}{\partial s} ds= \int_0^1 \nabla f(u(s))^T(x-y) ds$$
where $u(t) = (1-t) y + t x\in \mathbb R^n$. Therefore,
$$f(x)-f(y) – \nabla f(y)^T(x-y)= \int_0^1 (\nabla f(u(s)) – \nabla f(y))^T(x-y) ds$$
and using Cauchy-Schwartz and imposing $L$-smoothness,

$$f(x)-f(y) – \nabla f(y)^T(x-y) \overset{\mathrm{C-S}}{\leq} \int_0^1 \|\nabla f(u(s)) – \nabla f(y)\|_2\|x-y\|_2 ds \overset{\mathrm{D}}{\leq} L \int_0^1 \|u-y\|_2\|x-y\|_2 ds.$$

Since $\|u-y\|_2 = s\|x-y\|_2$, the integral reduces to
$$L \int_0^1 \|u-y\|_2\|x-y\|_2 ds = L \|x-y\|_2^2\int_0^1 s ds = \frac{L}{2}\|x-y\|_2^2.$$

Now suppose [Q2] is true, and also $f$ is convex. And, for fun, let’s just assume that the Hessian exists everywhere. Then, since the eigenvalues of $\nabla^2 g(x)$ are $\geq 0$ and $f$ is convex, then $0 \preceq \nabla^2 f(x) \preceq LI$. Note that [D] only requires $-LI \preceq \nabla^2 f(x) \preceq LI$; hence the sufficient but not necessary bit.

Finally, Proof of statement 1: Suppose that [Q2] holds for $f$ and $-f$. Then the eigenvalues of $f$ must be bounded between $LI$ and $-LI$. This exactly gives the Hessian condition necessary for [D].

A could-be-fun-addendum: proving all of this without using the Hessian, aka for cases where the Hessian may not exist.

And, since it’s been a while, a cozy photo of a little friend

Citations

[1]: Blog of Xingyu Zhou, 2 pages in particular: https://xingyuzhou.org/blog/notes/Lipschitz-gradient and https://xingyuzhou.org/blog/notes/strong-convexity. By far the best set of reference notes I have seen on this topic.