Newton’s method II : Self concordant functions

tl;dr: Self-concordance. definition, some examples, some interesting properties. Proof of Newton’s rates left as exercise for reader ūüôā

An interesting generalization of strongly convex functions are these self-concordant functions, which can be defined as
$$
|D^3 f(x)[u,u,u]|\leq 2M\|u\|_{\nabla^2 f(x)}^2
$$
Here, the notation $$D^3 f(x) [u,v,w] = \sum_{i,j,k}\frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} \, u_i \, v_j\,  w_k$$ is a way of denoting the 3rd order tensor derivative of $f$, and we write $\|u\|_H = \sqrt{|u^THu|}$ to denote the norm induced by the symmetric square matrix $H$.

This class of functions is usually studied in association with Newton’s method, which, as shown in a previous post, has quadratic convergence if the function is $\mu$-strongly-convex, and local quadratic convergence if the function has a continuous Hessian map and is $\mu$-strongly-convex at a unique minimizer $x^*$.

In fact, the notion can be made even more general by the observation that Newton’s method is affine invariant. That is, consider two equivalent problems where $g(u) = f(Au)$ for $x = Au$:
$$
\min_x\; f(x)\quad \iff \quad \min_u g(u). $$ Then the Newton step is identical, e.g. $$ x^+ Рx \,=\,  Р\beta (\nabla^2 f(x))^{-1} \nabla f(x)  \,=\, Р\beta A (\nabla^2 f(u))^{-1} \nabla f(u) \,=\, Au^+-Au. $$

So here’s a fun fact. There can be functions which are $\mu$-strongly-convex, but, after an affine change of variables, are¬†no longer strongly convex at all! (An easy example of this is the above one where $A$ is not full row rank). Somehow this restriction isn’t so interesting, though, because as we know even gradient descent is somewhat agnostic to low-rank affine transformations. (The gradient is always orthogonal to nullspaces of this transformation.) The issue is conditioning, e.g. when $A$ has a wide spectral range. As an example, suppose that $A = \mathbf{diag}(1/k)$ for $k = 1,…,n$. Then gradient descent will have trouble with the coordinates corresponding to the smallest eigenvalues of $A$, which will be poorly represented in comparison to the largest eigenvalues. On the other hand, Newton’s method would invert $A$ in its gradient computation, somehow evening this depression so that no effect is felt. (Of course the issue with multiplying a large number with its inverse still has to be dealt with numerically…)

By the way, in case you weren’t aware, quadratic convergence is very fast. Like, very, very, very fast.

 

 

The main example of a self-concordant and not-strongly-convex function was the log barrier $$
\phi(x) = -\sum_i \log(b_i-a_i^Tx)
$$ which could be used to enforce linear inequality constraints $a_i^Tx\leq b_i$ for $i = 1,…,m$. To see that it is not strongly convex, just take as a simple example $m = 1$, $b = 0$, and $a = -1$. Then $$ \phi(x) = -\log(x), \quad \phi'(x) = -\frac{1}{x}, \quad \phi”(x) = \frac{1}{x^2}$$ and taking $x\to +\infty$¬† shows that $\phi(x)$ cannot be $\mu$-strongly convex.

More generally, defining $c_i = \left(\frac{a_i^Tu}{b_i-a_i^Tx}\right)^2$, we see that

$$
|D^3 \phi(x)[u,u,u]| = \sum_i c_i^3 = \sum_i c_i^{3/2} \leq  \left(\sum_i c_i\right)^{3/2} = (u^T\nabla^2 \phi(x)u)^{3/2}
$$

since in general, for $a\geq 0$ and $b\geq 0$, $\log(a^{3/2}+b^{3/2}) \leq \tfrac{3}{2}\log(a+b)$.

I’m not really going to talk much more about barrier functions, though, because as beautiful as they are, they don’t really seem to be very present in machine learning applications. In contrast, since in ML / Big Data stuff we generally use gradient methods (and are whining enough when we have to use a full gradient rather than a stochastic one), the heavy machinery needed for this generalization doesn’t seem necessary. In particular, while log barriers are self-concordant, logistic regression and exponential margin regression are not. Take

$$f(x) =  \log(\sigma(a^Tx)),\quad \sigma(\xi) = \frac{1}{1+e^{-\xi}}$$

then

$$
u^T\nabla^2 f(x) u = \sigma(a^Tx)(1-\sigma(a^Tx)) (a^Tu)^2 \overset{\|x\|_2\to +\infty}{\to} 0
$$

$$
|D^3 f(x)[u,u,u] |= |(\sigma(1-a^Tx)^2 -\sigma(a^Tx)^2)(a^Tu)^3| \overset{\|x\|_2\to +\infty}{\to} |a^Tu|^3
$$

and it is clear that at this asymptotic limit, self-concordance is not present.

Ok, so why even care about this property?

Well, to me, it seems that even for first order methods, we are always trying to use quadratic approximations, usually with respect to the Euclidean norm. More recently, people have been all up in arms about using Bregman divergences instead as a metric, because they are more geometrically representative, or using Hessian estimates of Fisher Information. Well, in a way, self-concordant properties are kind of like the “gold standard” of such methods, because the idea is to use as much quadratic information as possible,¬† but with respect to the norm induced by the current Hessian. Without proving, here are some very familiar looking inequalities I pulled from [1]:

Hessian conditioning

  • If $f$ is globally¬† $\mu$-strongly convex:
    $$
    \mu I  \preceq \nabla^2 f(x) \preceq L I, \qquad \forall x
    $$
  • If $f$ is locally strongly convex at $x^*$, globally $M$-continuous Hessian
    $$
    0  \preceq \nabla^2 f(x) \preceq (L+M) I, \qquad \|x-x^*\|_2 \leq \frac{\mu}{M}
    $$
  • If $f$ is $M$-self-concordant
    $$
    (1-Mr)^2\nabla^2 f(x) \preceq \nabla^2 f(x^*) \preceq \frac{1}{(1-Mr)^2}\nabla^2 f(x), \qquad \|x-x^*\|_{\nabla^2 f(x)} \leq  1/M.
    $$

Optimality

  • If $f$ is globally¬† $\mu$-strongly convex:
    $$f(x)-f^* \geq \frac{\mu}{2}\|x-x^*\|_2^2 $$
  • If $f$ is locally strongly convex at $x^*$, globally $M$-continuous Hessian
    \begin{eqnarray*}
    f(x)-f^* &=& \int_0^1 \nabla f\underbrace{(x^*+t(x-x^*))^T}_{=z}(x-x^*)d\tau \\
    &=&  \int_0^1 (\nabla f(z)-\underbrace{\nabla f(x^*)}_{=0})^T(x-x^*)d\tau \\
    &\geq& \int_0^1 \frac{\mu}{t}\|z-x^*\|_2^2 \\
    &=& \mu \|x-x^*\|_2^2 \int_0^1 tdt \\
    &=& \frac{\mu}{2}\|x-x^*\|_2^2
    \end{eqnarray*}
  • If $f$ is $M$-self-concordant
    $$f(x)-f^* \geq \frac{1}{M^2}\omega(M \|x-x^*\|_{\nabla^2 f(x^*)}), \quad \omega(\tau) = \tau-\log(1+\tau).$$

Convex conjugates

Recall that for the conjugate function $f^*(z) = \sup_x\;x^Ty-f(x)$, then if both functions are twice differentiable, then via Fenchel Young,

$$
\nabla f(x) = z \iff x = \nabla f^*(z)
$$

and

$$\nabla^2 f^*(\nabla f(x)) = (\nabla^2 f(x))^{-1}.$$

Well, from that, it is evident that if $f$ is $L$-smooth then $f^*$ is $1/L$ strongly convex, and vice versa.

In comparison, if $f$ is $M$-self-concordant, then $f^*$ is $M$-self-concordant as well! Somehow the “gap” between $L$ and $\mu$ is omitted! To see this, note that, first note that
$$
D^3 f^*(z)[u,u,u] = \lim_{t\to 0} \frac{u^T(\nabla^2 f^*(z+tu)-\nabla^2 f^*(z))u}{t} =g'(0)$$
where for $\nabla f(x(t)) = z+tu$ and we write $x = x(0)$,
$$g(t) = u^T\nabla^2 f^*(z+tu)u = u^T\nabla^2 f(x(t))^{-1} u.
$$
The matrix $H(t) = \nabla^2 f(x(t))$ is a matrix map, and applying chain rule,
$$
g'(0) =\frac{\partial}{\partial t} u^TH(t)^{-1}u|_{t=0} = u^TH(0)^{-1} \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)H(0)^{-1} u = u^T\nabla^2 f^*(z) \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)\nabla^2 f^*(z)u.
$$
Next, using the first order Taylor expansion, $\nabla f(x+tv) \approx \nabla f(x) + t\nabla^2 f(x)v$, and thus
$$ \nabla^2 f(x+tv)^{-1} = \nabla^2 f^*(\nabla f(x+tv)) \approx \nabla^2 f^*(\nabla f(x)+t\nabla^2 f(x) v)$$
with equality as $t\to 0$. Taking $z = \nabla f(x)$, then $u = \nabla^2 f(x) v \iff v = \nabla^2 f^*(z)u$, and therefore at $t\to 0$,
$$
H(t)  = \nabla^2 f(x(t)) =\nabla^2 f(x+tv) = \nabla^2 f(x+  t\nabla^2 f^*(z) u)$$
and $$g'(0) = D^3 f^*(z)[u,u,u]  =D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u].$$
From the self-concordance of $f$, it follows that
$$
D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u] \leq 2M(u^T\nabla^2 f^*(z)\nabla^2 f(x)\nabla^2 f^*(z)u)^{3/2} =2M(u^T\nabla^2 f^*(z)u)^{3/2}
$$
Thus, $f$ is $M$-self-concordant if and only if $f^*$ is $M$-self-concordant, for convex $f$.

 

 

Rates

Using these properties, we can show that if a function $f$ is $M$ self concordant, then it has a local quadratic rate, and the number of non-local steps is bounded and usually small in practice. In summary, we use the Hessian-warped residual measure  $\lambda(x) = \|\nabla f(x)\|_{\nabla^2 f(x)^{-1}} $, and at each step we take $$x^{(t+1)} = x^{(t)}-\beta \nabla^2 f(x^{(t)})^{-1} \nabla f(x^{(t)}).$$

  • Global convergence:¬† when $\lambda(x) \geq \frac{1}{2M}$, then using $\beta = \frac{1}{1+M\lambda(x^{(t)})}$, we have a guaranteed decrease of $$f(x^{(t+1)}) \leq f(x^{(t)}) – \frac{C}{M^2},$$ for some computable constant $C$. This implies that in a finite, computable number of steps, we reach local convergence.
  • Local convergence: when $\lambda(x) \leq \frac{1}{2M}$, then using no damping ($\beta = 1$), we are guaranteed quadratic convergence; that is, $$f(x^{(t)})-f(x^*)=O(c^{2^t}).$$

Since the proofs of these results are both extensive, esoteric, and well-documented in many sources, I will forgo them.

Extensions

There are many interesting questions still to ask, and to be honest, a lot of people are already asking them. For example:

  • What about sketched Hessians? When there is error in computing the Hessian, can we still guarantee a local quadratic rate? The answer, according to [2], is unfortunately no; there can be superlinear convergence under special circumstances, but it is not obvious how to recover the quadratic local rate. Somehow this isn’t too surprising; even vanilla Newton’s method with $\beta \neq 1$ can’t achieve this rate easily.
  • What happens if $f$ is nonconvex? It turns out that, like $L$-smoothness, that self-concordance doesn’t necessarily imply convexity. While there is not a ton of work in this area, one notable one is [3], showing that the idea of at least self-concordant convex-concave saddle functions has been explored.

However, overall, it seems that though we like this property of “near-quadratic-ness”, maybe because it is rather involved to work with or simply because we don’t really do that much second-order optimization in large-scale settings, this property (compared to $L$-smoothness and $\mu$-strong-convexity) is relatively less explored.

References

  1. Lectures on Convex Optimization, by Yurii Nesterov (Section 5.1)

  2. “Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence,” by Mert Pilanci and Martin J. Wainwright.
  3. “On Self-Concordant Convex-Concave Functions”, by Arkadi Nemirovski.

Leave a Reply