At some point in every optimist’s life, we are going to be asked the age-old question: “How does your method compare with Adam?” or, even more annoyingly, “Does this analysis help understand Adam?” I mean, Adam is a nice method that works very well for a very unique set of optimization problems, and not, like, the answer to all the problems in the universe (42). That being said, even if Adam (the method) isn’t exactly Adam (the first man), we should probably know a little bit about what it’s about.

In case you were living under a rock for the past 10 years, Adam refers to ADAptive Moment estimation, this magical method that came out in 2015 by Kingma and Ba, and is widely hailed as the one training method that actually seems to work well. It was the latest in adaptive step size methods, starting with AdaGrad (Duchi, Hazan, Singer 2011), but including AdaDelta (Zeiler 2012), RMSProp (Tieleman & Hinton, 2012), etc.

 Adam (Kingma and Ba, 2015) (typo above: should be $\alpha_t = \alpha/\sqrt{t}$, not constant $\alpha$) Not Adam Or Adam

Anyway, so I finally decided it was time to bite the bullet, and learn about what this Adam thing is all about, follow the journey to a proper convergence proof, and give some comments as to how it may work in practice.  This post will largely be a literature survey about two parts of this story: the original paper, and the followup proof

[1] Kingma and Ba. Adam: A Method for Stochastic Optimization

[2] Reddi, Kale, Kumar. On the Convergence of Adam and Beyond.

## Related prior works

As previously mentioned, Adam comes as a member of the “adaptive step size committee”, a string of methods that followed AdaGrad, the often hailed seminal work of Duchi et al who first proposed adaptively adjusting the step size to help deep network training deal with the vanishing gradient problem. Followup work, like RMSProp, and Adam itself, can be seen as “descendants” of this work.  While that paper in itself holds an important place in this journey, in the interest of time I will not mention it much more here.

## What the papers do vs what people actually do

Before going on, I need to point out, indeed, highlight, three assumptions (1-3) that are critical in the proofs but clearly not used in practice. Additionally, we will mention a bug (4), which was later fixed in [2], and just for completeness a minor thing to remember (5).

1. The proof assumes $f_t$, the online loss function, is convex in its parameters.
2. The proof uses a decaying step size $\alpha_t = \alpha/\sqrt{t}$. In practice, we use a constant $\alpha$.
3. The proof uses a decaying mixing parameter $\beta_{1,t} = \beta_{1,0}\lambda^{t-1}$ for some $0 < \lambda < 1$.
4. (Bug, fixed in later work) We must require $\sqrt{v_t}/\alpha_t$ is a decreasing sequence.
5. (Minor) the proof assumes $\epsilon = 0$. (In practice, $\epsilon \approx 0$.)

Now, I should do a disclaimer and say that everyone and their sister, including me, often resorts to unreasonable assumptions the night before the deadline. It’s unsatisfying when theory doesn’t quite match practice, but the truth is doing convergence proofs is hard, especially for very general assumptions. However, what makes me feel that this is a bit less ok than the usual case is that Adam is often hailed as the method to use ([1] is cited over 100,000 times). Therefore, the fact that we as a community don’t seem as interested in closing this loop is a bit baffling to me. For this reason, at the end of this post, I will try to highlight a few papers that tried to “fix the proof” by justifying these assumptions. If I have missed out a work in this category that you know of, please do let me know and I will update it.

(By the way, I want to clarify, this is not a rant against the authors of [1]. I am not trying to take away the significance of their contribution. It’s more of a question of whether, as a community, we are satisfied with this story being “done”.)

## Proof of Adam ([1], with help from [2])

For $g_t = \nabla_\theta(f_t(\theta_{t-1}))$, consider the method

\begin{eqnarray*}
m_t &=& \beta_{1,t}m_{t-1} + (1-\beta_{1,t})g_t, \quad \hat m_t = \frac{m_t}{1-\beta_1^t}\\
v_t &=& \beta_{2}v_{t-1} + (1-\beta_{2})g^2_t, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}\\
\theta_t &=& \theta_{t-1}\, -\, \frac{\alpha_t }{\sqrt{\hat v_t}}\hat m_t
\end{eqnarray*}

where we initialize $m_0 = v_0 = 1$ and $\beta_{1,0} = \beta_1$. Here, $m_t,v_t,\theta_t,g_t$ are all vectors in $\mathbb R^d$ and with an extreme abuse of notation, all multiplications and divisions between these vectors are taken element-wise. The scalars are $\alpha_t,\beta_{1,t},\beta_2$, and of course $t$.

Theorem: Assume that

• the functions $f_t(\theta) = \mathcal L_\theta(x_t,y_t)$ are convex in $\theta\in \mathbb R^d$
• the gradients of $f_t$ are bounded everywhere $\|\nabla f_t(\theta)\|_2 \leq G$, $\|f_t(\theta)\|_\infty \leq G_\infty$
• the distance between any two $\theta_t$ generated by Adam is bounded $$\|\theta_n-\theta_m\|_2\leq D, \quad \|\theta_m-\theta_n\|_\infty \leq D_\infty, \text{ for any } m,n \in \{1,…,T\}$$

Pick parameters such that $\beta_1,\beta_2 \in [0,1)$, $\beta_1^2/\sqrt{\beta_2} < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\lambda\in (0,1)$, and define $\beta_{1,t}=\beta_1\lambda^{t-1}$. Then for all $T\geq 1$, the regret $$R(T) = \sum_{t=1}^Tf_t(\theta_t)-\min_\theta \sum_{t=1}^T f_t(\theta)$$ is upper bounded
\begin{multline*}
R(T)\leq
\frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}
+ \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \frac{\alpha\sqrt{\log(T)+1} }{2(1-\beta_1\lambda)} \|g_{1:T}\|_1 \\
+ \frac{1}{1-\beta_1} \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} + \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}
\end{multline*}
In terms of big-O, the first, third, and fourth term are constants, and overall we get an $R(T) = O(\sqrt{\log(T)}\|g_{1:T}\|_1$ term. Here I should note that in the two papers [1] and [2], actually this is a $\|g_{1:T}\|_2$  term, but in recreating the proof I was not able to recover that. That being said, these norms are exchangable, so we’ll just leave it as is. The  $\|g_{1:T}\|_2$ also appears in AdaGrad, and the argument is that if your gradients are sparse enough, this guy is also $O(\sqrt{T})$. Seems like a rough assumption, but let’s go with the flow for now.

Additionally, it should be noted that there does not appear to be a significant speedup over SGD. In [1], the argument is that Adam is $O(\log(d)/\sqrt{t})$, (which got boosted in terms of $t$ by [2]), but in no way can I recreate the proof so that the dependency on $d$ is less than linear. (I’d be happy to correct it if people feel differently.)

Either way, the fact that Adam does not claim to have a faster convergence rate than SGD in terms of $t$, is somewhat significant.

## Now let’s do the proofs

Using the update rule $\theta_{t+1}=\theta_t-\frac{\alpha_t\hat m_t}{\sqrt{\hat v_t}}$ where $\epsilon = 0$
then we may write

\begin{eqnarray*}
(\theta_{t+1,i}-\theta_i^*)^2&=&(\theta_{t,i}-\theta_i^*)^2 – 2\alpha_t\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*) + \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}\\
&=&(\theta_{t,i}-\theta_i^*)^2 – \frac{2(1-\beta_{1,t})\alpha_tg_{t,i}}{(1-\beta^t_1)\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*)
– \frac{ 2\alpha_t\beta_{1,t} m_{t-1,i} }{\sqrt{\hat v_{t,i}}(1-\beta^t_1)}(\theta_{t,i}-\theta_i^*)
+ \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}.
\end{eqnarray*}

Summing, and pulling out the term $g_t^T(\theta_t-\theta^*)$

\begin{eqnarray*}
g_t^T(\theta_t-\theta^*) &=& \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{1,t}} \,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}}\frac{1-\beta_1^t}{1-\beta_{1,t}} \,-\, \frac{\beta_1 m_{t-1,i}}{(1-\beta_{1,t})}(\theta_{t,i}-\theta_i^*)\right)
\end{eqnarray*}

The last term can be bound using
$$|m_{t-1}^T(\theta_t-\theta^*)|\leq \frac{\|\alpha_t^{1/2}\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\alpha_t^{-1/2}\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2}$$
where $\hat V_t = \mathbf{diag}(\hat v_{t})$, giving
\begin{eqnarray*}
g_t^T(\theta_t-\theta^*) &\leq & \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{t,1}}\,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}} \frac{1-\beta_{1}^t}{1-\beta_{t,1}}\right) \\
&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)
\end{eqnarray*}

From convexity , for $\theta^* = \min_\theta \sum_{t=1}^T f_t(\theta)$ we have
$$f_t(\theta_t)-f_t(\theta^*)\leq g_t^T(\theta_t-\theta^*),$$
and
\begin{eqnarray*}
R(T) &\leq &   \sum_{t=1}^T\Bigg[\frac{1}{2\alpha_t} \frac{1-\beta_1^t}{1-\beta_{t,1}} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2 -\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2  ) \,+\, \frac{\alpha_t}{2} \frac{1-\beta_1^t}{1-\beta_{t,1}}\|\hat V_t^{-1/4}\hat m_t\|_2^2  \\&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)\Bigg]\\
&=& R_1(T)+R_2(T)+R_3(T) + R_4(T)
\end{eqnarray*}

where

\begin{eqnarray*}
R_1(T) &=& \sum_{t=1}^T\frac{1}{2\alpha_t} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2-\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2 )\\
R_2(T) &=&\sum_{t=1}^T \frac{\alpha_t}{2}\frac{1-\beta_1^t}{1-\beta_{t,1}} \|\hat V_t^{-1/4}\hat m_t\|_2^2\\
R_3(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} \\
R_4(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}}  \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}
\end{eqnarray*}

### Simple bounds

• First, note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’} \leq \frac{G_\infty^2}{1-\beta_2}$$
where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series.  So, $\|v_t\|_2 \leq d\|v_t\|_\infty \leq \frac{d G_\infty^2}{1-\beta_2}$.
• Another important inequality. Consider two sequences $a_k\geq 0$ and $b_k\geq 0$. By Holder $p = 1$,
$$\sum_k a_k = \sum_k \frac{a_k}{\sqrt{b_k}}\cdot \sqrt{b_k} \leq \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\max_l \sqrt{b_l}\right)\overset{\|\cdot\|_\infty \leq \|\cdot\|_2}{\leq} \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\sqrt{\sum_l b_l}\right)$$
Therefore,
$$\frac{\sum_k a_k}{\sqrt{\sum_l b_l}} \leq \sum_j \frac{a_j}{\sqrt{b_j}} \qquad\qquad(\oplus)$$
• Lemma 2 in [1], fixed in Lemma 2 in [2]. We now need to put a bound on $\sum_t \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}}$. In [1], they try to do a proof in which this term becomes $O(\sqrt{t})$; in [2], they tightened it to $O(\sqrt{\log(T)})$. In addition to being a better rate, [2] seemed to also have fewer bugs in the proof, so I think it makes sense to just present their version.
\begin{eqnarray*}
\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} = \sum_{i=1}^d\frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)}\frac{1}{\sqrt{t}}\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}
\end{eqnarray*}
Using Cauchy Schwartz,
$$\sum_{k=1}^t(1-\beta_{1,0})\prod_{j=k}^t\beta_{1,j}g_{k,i}\leq \sqrt{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j} g^2_{l,i}}$$
so we may write
\begin{eqnarray*}
\sum_{i=1}^d\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}} &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
&=& \sum_{i=1}^d\frac{\sum_{k=1}^t\beta_1^{t-k} \sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
&\overset{\sum_{k=1}^t\beta_1^{t-k}\leq \frac{1}{1-\beta_1}}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\frac{\sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
&\overset{(\oplus)}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\sum_{k=1}^t\frac{ \beta_1^{t-k} g^2_{k,i}}{\sqrt{ \beta_2^{t-k}g_{k,i}^2}}\\
&=& \frac{1}{1-\beta_1}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
\end{eqnarray*}
Now, let’s sum over $t$.
\begin{eqnarray*}
\sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
\end{eqnarray*}
The next step is to use the assumption $\beta_1^2\leq \sqrt{\beta_2}$ to get
$$\frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_1^2\leq \sqrt{\beta_2}}{\leq} \frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \overset{\beta_{1,0}\leq 1}{\leq}\frac{\sqrt{1-\beta_1^{4t}}}{(1-\beta_1^t)^2}\leq 1$$
since the 1-D curve $g(x) = \frac{1-x^4}{(1-x)^4}$ is decreasing in the interval $0\leq x \leq 1$ and $g(0) = 1$. So, more simply,
\begin{eqnarray*}
\sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
\end{eqnarray*}We now switch the $t$ and the $k$ summations, and use $\tau = t-k$. Then
\begin{eqnarray*}
\sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{t=k}^T\frac{1}{\sqrt{t}}\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
&\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau+k}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\|g_k\|_1\\
&\overset{\tau+k\geq\tau}{\leq}&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\sum_{k=1}^T\|g_k\|_1
\end{eqnarray*}
Now we can just work with this term
$$\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\overset{C-S}{\leq}\sqrt{\sum_{\tau=0}^{T}\frac{1}{\tau}}\sqrt{\sum_{\tau’=0}^T\frac{ \beta_1^{2\tau’} }{ \beta_2^{\tau’}}}\leq \sqrt{\log(T)+1}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }}$$
which gives an overall rate of
\begin{eqnarray*}
\sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&   \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1 \\&=& A \sqrt{\log(T)+1}\|g_{k,1:T}\|_1
\end{eqnarray*}
where we absorb the constant
$$A= \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} .$$

Now we are ready to tackle the terms one by one. First,

### Term $R_1(T)$

Looking at each element,

\begin{eqnarray*}
\sum_{t=1}^T\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} ((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta^*_i)^2 )&=&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{t,i}-\theta_i^*)^2)\\
&&
+
\sum_{t=2}^{T-1}\left(\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} -\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}}\right) (\theta_{t,i}-\theta_i^*)^2\\
&\leq&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T-1,i}}{2\alpha_{T-1}} ((\theta_{T,i}-\theta_i^*)^2)
\end{eqnarray*}

only if
$$\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} \leq\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}} \qquad\qquad (\star)$$

for all $t$! ($(\star)$ was actually the biggest point of [2], which corrects the proof in [1] by forcing this equation to be true in AMSGrad.)

This simplifies

$$R_1(T) \leq \frac{1}{2\alpha_1} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2-\frac{1}{2\alpha_{T-1}}\|\hat V^{1/4}_{T-1}(\theta_{T}-\theta^*)\|_2^2 \leq \frac{1}{2\alpha} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2 \leq \frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}$$

### Term $R_2(T)$

We want to use one of our “small bounds” results here. Explicitly,

$$R_2(T) = \frac{\alpha}{2} \sum_{t=1}^T \frac{1-\beta_1^t}{1-\beta_{1,0}\lambda^t}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha}{2} \sum_{t=1}^T \frac{1}{1-\beta_1\lambda}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha A}{2(1-\beta_1\lambda)} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1$$

### Term $R_3(T)$

I’m actually a little confused as to why neither paper seemed to really address what was going on with $R_3(T)$. In particular, if we expand,

$$R_3(T) =\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2}$$

\begin{eqnarray*}
\|\hat V_t^{-1/4}m_{t-1}\|_2^2\, -\, \|\hat V_t^{-1/4}m_{t} \|_2^2 &=& \|\hat V_t^{-1/4}(m_{t-1}-m_t) \|_2^2  \,-\, 2(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-m_{t}))  \\
&=& (1-\beta_{1,t})^2 \|\hat V_t^{-1/4}(m_{t-1}-g_t) \|_2^2 \, -\, 2(1-\beta_{1,t})(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-g_{t}))
\end{eqnarray*}
and $(1-\beta_{1,t})^2\to 1$ as $t\to\infty$. So it’s not like we can bound $R_3$ using the same tricks from $R_2$. On the other hand, assuming that $|\hat V^{-1/4}m_{t-1}\|_2^2$ is a bounded quantity, and $\frac{\beta_{t,1}}{1-\beta_{t,1}}$ is a summable sequence, it’s more likely that they just saw $R_3(T) \leq \text{constant}$ and just didn’t bother with it. Well, for completeness, let’s bother with it. First,
\begin{eqnarray*}
\|\hat V_t^{-1/4}m_{t-1}\|_2^2 &\leq & \|V_t^{-1/4}m_{t-1}\|_2^2 \\
&\leq& \sum_{i=1}^d \frac{(\sum_{\tau=1}^t\beta_{1}^\tau g_{i,\tau})^2}{\sqrt{\sum_{\tau=1}^t \beta_2^\tau g^2_{i,\tau}}}  \\
&\overset{\oplus}{\leq}&  \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }} \|g_t\|_1\\
&\overset{\|g_t\|_1\leq dG_\infty}{\leq}& dG_\infty \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }}\\
&\leq& \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} =:B
\end{eqnarray*}

so we’re safe!

### Term $R_4(T)$

Using $\alpha_t=\alpha/\sqrt{t}$.

\begin{eqnarray*}
R_4(T) &=&\sum_{t=1}^T   \|\hat V_t^{1/4}(\theta_t-\theta^*)\|_2^2\frac{\beta_{t,1}}{2\alpha_t(1-\beta_{t,1})}\\
&=&\frac{D^2}{2\alpha}\sum_{t=1}^T   \|v_t^{1/2}\|_2\frac{\beta_{t,1}\sqrt{t}}{(1-\beta_{t,1})(1-\beta_2^t)}
\end{eqnarray*}

Note that, element-wise, $$|v_{t,i}| = \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’} \leq \frac{G_\infty^2}{1-\beta_2}$$
where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series. This gets us to

$$R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}}\sum_{t=1}^T\lambda^t\sqrt{t}$$

and using some fairly liberal tricks, for $\lambda \in (0,1)$:
$$\sum_{t=1}^T\lambda^t\sqrt{t} \leq \sum_{t=1}^T\lambda^t t \leq \int_0^\infty \lambda^t t dt = \frac{1}{(1-\lambda)^2}$$

we get
$$R_4(T) = \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}$$

Sidebar: After 3 days of pouring through [1] and aided by [2], I really feel like [2] is an underrated work, that really makes the theoretical analysis of [1] clean and persevering. If you think about it, at this point, Adam isn’t just one person’s genius idea, it’s a method that a ton of people are using every day, in a slightly different form than the one presented. So it’s not unreasonable to expect that the convergence analysis should be a group effort, not just the work of a single team. In that spirit…

In addition to applying band-aids all over the proof of [1], [2] also ended up proposing a new (slightly modified) method, which fixes the issue raised in ($\star$). The fix is pretty simple: using the same updates for $m_t$ and $v_t$, we apply elementwise
$$\hat v_t = \max(\hat v_{t-1},v_t)$$

and update

$$\theta_{t+1} = \theta_t – \frac{\alpha_tm_t}{\sqrt{\hat v_t}}.$$

This ensures that $(\star)$ is always true, and thus the proof holds well.

## Followup works

It is clear also that I am not the first one on the bandwagon, of skepticism, powered by the 1: huge assumptions used in this proof and 2: no clear acceleration in terms of $t$ shown. It turns out that this bandwagon is actually pretty chock-a-full of optimizers singing more or less the same tune. There are more followup works being written to acknowledge the weaknesses of Adam in a better way, one of them being

• [3] Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) by Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

which argues that adaptive methods can use constant step sizes in the interpolation regime (where $\theta = \theta^*$ implies $\nabla f_t(\theta) = 0$ for all $t$), which seems strong but actually quite frequent in overparametrized neural networks.

Another notable work is

Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

which extend the proof of Adam’s convergence to nonconvex functions, and show a rate of $O(\log(T)/\sqrt{T})$, matching Adagrad, and improving upon SGD by a factor of $1-\beta_1$. Curious as I am about the guts of this proof, I’m afraid that’s going to have to wait for another day.

Finally, a fun reference. Francesco Orobana also has a nice rant about this too in his blog:

The points here are a bit different from the points I was trying to make, in literally pulling out the weaknesses of Adam; rather, he is the opposite side of the coin which is that the numerical success of Adam is still quite surprising. (I agree.)

Over all, I believe the subject of Adam remains interesting, important, and controversial. Probably better proofs can still be made.

#### Acknowledgements

Thanks Sharan Vaswani for helpful discussion in dissecting this.

It turns out that some of my previous statements on PGD may not have been, well, correct. At first I thought it was just a few typos, but as I studied my notes further, I realized there may be some fundamental flaws. So this blog post revisits some ideas and hopefully irons out some past mistakes.

By the way, there aren’t any mistakes on past blog posts, but there is in my online SpeedOpto notes.

First, a review on how PGD works, and what is true about it. Consider the constrained optimization problem

$$\min_{x\in \mathcal D} f(x)$$

where $f$ is $L$-smooth, and $\mathcal D$ is a closed convex compact set. (For now I will not assume $f$ is convex, since it doesn’t really matter as much. However, $\mathcal D$ convex is important.) Then PGD takes steps as

$$x^{(k+1)} = \mathbf{proj}_{\mathcal D}(x^{(k)}-\alpha\nabla f(x^{(k)}))\tag{PGD}$$

where $\mathbf{proj}_{\mathcal D}$ is the Euclidean projection solution to the minimization problem

$$\mathbf{proj}_{\mathcal D}(x) = \arg\min_{y\in {\mathcal D}}\|y-x\|_2.$$

Previously, I made the statement that PGD cannot hurt GD, in the sense that

• PGD approaches a stationary point at least as fast as GD, and
• PGD has the same objective function decay rate as GD.

The first statement is still true. In particular, in the first case, we have that for any point in $\mathcal D$ (and in particular for $x^*\in \mathcal D$),

$$\|\mathbf{proj}_\mathcal D(x)-x^*\|_2\leq\|x-x^*\|_2$$

with equality only if $x\in \mathcal D$.

However, I am now realizing that we cannot so easily prove the second one. In fact, the usual statements for PGD and Mirror Descent are much weaker, namely, that for constant step size. $f(x)\to f(x^*)$ at a rate of $O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2)$ and taking $\alpha = 1/\sqrt{k}$ we may  get an aggregate $O(1/\sqrt{k})$ rate. (but it is not a just-in-time rate.)

Second however: if instead, we view the projection operation as a proximal operator of an indicator, we should get back our rate of $O(1/k)$. This I will explore in a follow up post.

## What’s wrong with the descent lemma of PGD?

Let us simplify each step of PGD as

\begin{eqnarray*} y &=& x – \alpha \nabla f(x) \\ x^+ &=& \mathbf{proj}_{\mathcal D}(y).\end{eqnarray*}

Now, it’s tempting to try to write equations like this:

\begin{eqnarray*}
f(x^+)-f(x) &\leq& \nabla f(x)^T(x^+-x)  + \frac{L}{2}\|x^+-x\|_2^2\\
&\overset{+y-y}{=}& \nabla f(x)^T(x^+-y )+\nabla f(x)^T(y-x)  + \frac{L}{2}\|x^+-y\|_2^2 + L(x^+-y)^T(y-x) + \frac{L}{2}\|x-y\|_2^2\\
&\overset{y-x = -\alpha \nabla f(x)}{=}&-\alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2+ (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2
\end{eqnarray*}

However, at this point we actually get stuck, because in general, it is very hard to bound this residual term

$$r := (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2.$$

Previously, I tried to argue that this thing is negative ($r<0$); however, after drawing some pictures, it is clear that  $\nabla f(x)^T(x^+-y )$ will more often be positive (these two terms are aligned), and for $\alpha < 1/L$, $r>0$. In fact, the best bound I could come up with is actually, for $\alpha \leq 1/L$,

$$r \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2\tag{descent}$$

which shows descent, but you can’t get a rate from it. (See appendix.) This makes sense, since $f(x^+)$ is not in general less than $f(y)$, so you wouldn’t expect the same technique for GD to give as aggressive a rate (in terms of $f$) as PGD. In fact, usually you would expect the opposite; $y$ steps in the maximally descending direction, but $x^+$ is a fix that brings back feasibility and can only hamper progress in objective value decay.

As a counterexample, first consider something like $f(x) = (x-1)^2$ over the constraint $x \geq 0$, and suppose that we already are at $x^{(k)} = 0$. Clearly, $\|\nabla f(x^{(k)})\|_2 = 2 > 0$, but hereonafter, all future iterates will have the same objective value. So we really can only show descent, not a descent amount that’s related to the gradient of $f$.

### the proof (corrected)

After knocking my head around a bit, I have come to the conclusion that the best way to go around this is to actually use the (normal, accepted) way of the mirror descent proofs. First, we need two important inequalities in our arsenal: the 3 point equality

$$2(x-y)^T(x-z) = \|x-y\|_2^2 + \|z-x\|_x^2 – \|z-y\|_2^2\tag{3 pt =}$$

and the “projections bring me closer to the set” inequality (see past posts)

$$\|y-x^*\|_2 \geq \|x^+-x^*\|_2^2\tag{closer}$$

\begin{eqnarray*}
f(x)-f(x^*) &\leq& \nabla f(x)^T(x-x^*)  \\
&=& \frac{1}{\alpha}(x-y)^T(x-x^*)\\
&\overset{\text{(3 pt =)}}{=}&\frac{1}{2\alpha}(\underbrace{\|x-y\|_2^2}_{\alpha^2\|\nabla f(x)\|_2^2} + \|x^*-x\|_2^2 – \|x^*-y\|_2^2)\\
&\overset{\text{(closer)}}{\leq}&\frac{1}{2\alpha}(\alpha^2\|\nabla f(x)\|_2^2 + \|x^*-x\|_2^2 – \|x^*-x^+\|_2^2)
\end{eqnarray*}

and after telescoping, we get

$$\frac{1}{k}\sum_{i=1}^k(f(x^{(i)}) – f(x^*))\leq \frac{\alpha}{2}\underbrace{\frac{1}{k}\sum_{i=1}^k\|\nabla f(x^{(i)})\|_2^2 }_{\overline{\nabla f(x^{(k)}}} + \frac{\|x^*-x^{(0)}\|_2^2 – \|x^*-x^{(k)}\|_2^2}{2k\alpha}.$$

This averaged gradient term (AG) converges to $\frac{\alpha}{2}\|\nabla f(x^*)\|_2^2$ which is not 0 in general. Recall also that we showed PGD descends, so that $f(x^{(i+1)}) \leq f(x^{(i)})$. Thus we can conclude

$$f(x^{(i)}) – f(x^*)\leq \frac{\|x^*-x^{(0)}\|_2^2 – \|x^*-x^{(k)}\|_2^2}{2k\alpha} + \frac{\alpha}{2} \|\overline{\nabla f(x^{(k)}})\|_2^2\to O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2).$$

## Appendix

proof that

$$r= (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2$$

Actually, this is just a consequence of the definition of projection, which says that, since $x\in \mathcal D$, then

$$\|x^+-y\|_2 \leq \|x-y\|_2 =\alpha \|\nabla f(x)\|_2.$$

Therefore, if $\alpha \leq \frac{1}{L}$, then using Cauchy Schwartz,

\begin{eqnarray*}
r&=&(1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \\
&\overset{\text{C-S}}{\leq} &(1-\alpha L)\|\nabla f(x)\|_2\|x^+-y\|_2 + \frac{L}{2}\|x^+-y\|_2^2\\
&\overset{\text{closer}}{\leq}&(1-\alpha L)\|\nabla f(x)\|_2\|x-y\|_2 + \frac{L}{2}\|x-y\|_2^2\\
&\overset{\|x-y\|_2=\alpha\|\nabla f(x)\|_2}{\leq}&\alpha (1-\frac{\alpha L}{2})\|\nabla f(x)\|_2^2
\end{eqnarray*}

# Video time! Discretization methods on quadratic and logistic regression

Hi all, this is another “tiny project” that I’ve been wanting to do for a while. Lately, I’ve been somewhat obsessed with method flows and discretization methods. While we can sit here and write integrals until our faces turn blue, I think it’s time to just simulate some stuff, and just see what happens. In this post I explore 4 problems:

### Problems

• Quadratic optimization, somewhat ill conditioned (duh.)
• Logistic regression, well separated
• Logistic regression, poorly separated

The methods include

• GD = Euler’s discretization (explicit)
• MDPT = Midpoint method (explicit)
• RK4 = 4th order Runge Kutta method (explicit)
• BE = Backward Euler (implicit)
• BDF2 = 2nd order Backward Differentiation Formula (implicit)
• TZ = Trapezoidal rule (implicit).

Here, an approximate implicit method = 1 gradient step, and a full implicit method = 100 gradient steps.

$$f(x) = \frac{1}{2}x^TQx$$

$$f(x) = \sqrt{x^TQx}$$

### Logistic regression

In the logistic regression models,

$$f(\theta) = \frac{1}{m}\log(1+\exp(-y_ix_i^T\theta)$$

where we generate data using the 2-blob model

$$x_i = y_i c + z_i$$

and $\|c\|_2$ is tuned to create separation. Included is not just the trajectory over the landscape, but also the classification barrier, so you can see whether the suboptimality actually has real effect on bad classification or not.

### Discussion

I had a few questions going into this tiny experiment, some of which were answered, some of which are still ephemeral.

• Do implicit methods work? Answer:: Hard to say. They definitely do hug the flow better, and in the case of quadratic or normalized quadratic, they seem to help a lot with conditioning and being able to pick large step sizes. However, for logistic regression, they don’t really contribute well to margin orientation, which seems to be better affected by explicit noisy methods.
• Are their approximate versions sufficient? Note that an approximate implicit method has the same per-iteration complexity as an explicit method. Answer: yes, the approximate methods are actually almost as good as the full ones, with regards to conditioning robustness and flow hugging!
• Does it make sense to consider quadratic problems to represent machine learning tasks? Answer: This I kind of knew, but the answer is not really. Well, actually, it depends. For poorly separated blobs, a quadratic approximation is not bad, and seems to well-characterize what’s going on. But, on the other hand, for poorly separated blobs, the margin orientation is somewhat arbitrary and overfits on training data idiosyncrasies. On the other hand, when the blob is well separated, the step size choices, the flow characterization, all of that doesn’t seem to align well with margin orientation!

More studies need to be made in order to capture embedding learning as well, but anyway, it’s called a tiny project for a reason!

Finally, I leave you an image of our two favorite cheerleaders.

# Another tiny project: problem repository

I’ve started a (rough, but hopefully can be refined over time) repository of test problems for basic convex methods. It’s amazing how, despite the wealth of optimization research, our “standardized test cases” are still rather limited, or application-directed. I’d like to go after the same spirit as things like HuggingFace, or UF’s Sparse Matrix Repository, or UCI’s machine learning repository, and host a convex optimization repository.

It’s small now, but with water, sunlight, and a bit of exercise, hopefully it will grow into a healthy and strong contribution to society!

# Some proofs of continuous time things

In trying to understand how continuous time interpretations can help optimization research, I find myself doing a lot of “recreating the wheel”, but in a nontrivial way; I am getting these proofs that I’m sure are standard knowledge, but it’s not like we learned them back when we were kids. So I’m going to try and go through some of them here.

Here are some super simple proofs to get us started.

This one is actually super easy and classical, with one of the earliest citations [1]. Here, we’re looking at unconstrained methods

$$\min_{x} \quad f(x)\tag{(U)}$$

and the method is very simply

\begin{eqnarray*}
\mathbf x_{k+1} &=& \mathbf x_k   – \alpha \nabla f(\mathbf x_k)
\end{eqnarray*}

with flow

\begin{eqnarray*}
\dot x(t) &=& -\nabla f(x(t)).
\end{eqnarray*}

Super classical. Then the convergence proof of the flow is simply

$$\frac{d}{dt} f(x(t))= \nabla f(x)^T\dot x = -\|\nabla f(x)\|_2^2.$$

Right away, we get method descending, and (after showing that $\|\nabla f(x)\|_2$ also decays)  a rate

$$\|\nabla f(x(t))\|_2^2 \leq \frac{1}{t} \int_0^t\|\nabla f(x(\tau))\|_2^2d\tau \leq \frac{f(x(0))-f(x^*)}{t}$$

I’ve already discussed this one a bit in a previous post, so I’ll just stop here. Note that I did not say anything about convexity, compactness, etc.

Now let’s make the problem slightly more complicated by doing the projected version.  Here, we want to solve

$$\min_{x\in\mathcal D} \quad f(x)\tag{C}$$

where $\mathcal D\subset \mathbb R^n$ is a convex, closed set.

We use the repeated iteration

\begin{eqnarray*}
\mathbf x_{k+1} &=& \mathrm{proj}_{ \alpha g}(\mathbf x_k   – \alpha \nabla f(\mathbf x_k))\tag{PGD}
\end{eqnarray*}

where

$$\mathrm{proj}_{g}(z):= \arg\min_{x\in \mathcal D} \; \|x-z\|_2.$$

One way to reframe the method, then, is iteratively solving the following problem at each step

\begin{eqnarray*}
\mathbf x_{k+1} &=& \arg\min_x \frac{1}{2}\|x – \mathbf x_k   + \alpha \nabla f(\mathbf x_k)\|_2^2 + I_{\mathcal D}(x)
\end{eqnarray*}

where $I_{\mathcal D}(x)$ is the indicator for the constraint $x\in \mathcal D$.

From that, we can write the optimality condition

$$\mathbf x_{k} – \mathbf x_{k+1} \in \alpha \nabla f(\mathbf x_k) + \mathcal N_{\mathcal D}(\mathbf x_{k+1}) \tag{PGD-opt}$$

and $\mathcal N_{\mathcal D}(x)$ is the normal cone of $\mathcal D$ at $x$, described as

$$\mathcal N_{\mathcal D}(x) := \partial I_{\mathcal D}(x) = \{z : z^T(x-x’) \geq 0, \; \forall x’\in \mathcal D\}. \tag{NC-def}$$

And, again using approximations of $\dot x(t) \approx \frac{\mathbf x_{k+1} – \mathbf x_{k}}{\alpha}$ we get

$$-\dot x – \nabla f(x) \in \mathcal N_{\mathcal D}(x). \tag{PGD-flow}$$

(I’m  a bit skeptical to do this, because then it seems this method is the same as the subgradient method, which I kind of assumed would have a worse rate. But, then again… maybe not in the flow? After all, what is the harm of nonsmoothness in the continuous time regime? Still, I’d feel more comfortable if some $\frac{d}{dt} \nabla f(x)$ term appeared somewhere. For now we’ll go with it.)

Now here comes a crucial step. In the usual proof of the projected gradient method, we needed to show that projections cannot increase your distance to $x^*$. I wrote this as a “funny little proof” in a previous post. But in the continuous version, what we really want to say is that projections can only slow you down in your desired direction–that is, the $\dot x$ from PGD can’t be faster than the $\dot x$ from good ole GD. Concretely, I want to say that the $\dot x$ defined in (PGD-flow) satisfies

$$\|\dot x\|_2 \leq \|\nabla f(x)\|_2. \tag{Proj-Contr-Cond.}$$

This is going to work as follows. We define $v(t) = -\dot x(t) – \nabla f(x(t))$. Then by (PGD-Flow), $v\in \mathcal N_\mathcal D(x)$, or $v^T(x-x’) \geq 0$, for all $x’\in \mathcal D$. I’m going to pick $x’ = x – c \dot x$, for some imperceptibly small $c > 0$. Note that since the trajectory is always feasible, such a $c>0$ should exist, if we assume we initialize the method somewhere in $\mathcal D$.  Then, I can say that $v^T\dot x \geq 0$.

Then, using Pythagorean theorem,

$$\|\nabla f(x)\|_2^2 = \|v+\dot x\|_2^2 = \|v\|_2^2 +\underbrace{ 2v^T\dot x}_{\geq 0} + \|\dot x\|_2^2 \geq \|\dot x\|_2^2 \tag{Proj-Contr-Cont}$$

And we’re done! See below for the super-high-tech figure.

Ok, Now we can move onto the main event: convergence proof!

$$\frac{d}{dt} f(x(t)) = \nabla f(x)^T\dot x \leq (\nabla f(x) + v)^T{\dot x} = -\|\dot x\|_2^2$$

and therefore

$$\frac{f(0)-f(x^*) }{t}\geq \frac{1}{t}\int_0^t \|\dot x(\tau)\|_2^2 d\tau$$

e.g. we have an $O(1/t)$ bound on the average speed of the method. Note that it makes perfect sense why we can’t seem to be able to bound the average gradient of the method, since in a constrained problem, $\nabla f(x(t))\not\to 0$.

The proof is more -or-less trivially extended to the proximal case, e.g. we solve

$$\min_{x} \quad f(x) + g(x)\tag{(U2)}$$

using the repeated iteration

$$\mathbf x_{k+1} = \mathrm{prox}_{ \alpha g}(\mathbf x_k – \alpha \nabla f(\mathbf x_k)), \qquad \mathrm{prox}_{g}(z):= \arg\min_x \; g(x) + \frac{1}{2}\|x-z\|_2^2.$$

Edit: It occurred to me that a very strange thing happens when $x(t)$ is on a corner of $\mathcal D$, which brings out a problem with nonsmoothness. In this case, $\dot x(t)$ is not really well defined: we have

$$\dot x_+(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t+\Delta)-x(t)}{\Delta}, \qquad \dot x_-(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t)-x(t-\Delta)}{\Delta}$$

and they’re both different. Note that our relationship (Proj-Contr-Cont) only applies to $\dot x_+$, not $\dot x_-$ in this case. See image above.

[1] Polyak, Boris T. “Introduction to optimization. 1987.” Optimization Software, Inc, New York.

# A fun convex set fact

This is actually not a new thing at all, but a thing I played with a couple years ago, and I ended up using in a numerical optimization lecture to try to impress some undergrads. They were not impressed. Since then I’ve tuned it a bit and just kind of take it for granted now, but now that I think about it, it is kind of nontrivial, and has some interesting significance. It’s come up again in the short course, so I figured I’d write it up here as well.

Statement: Suppose $\mathcal C$ is a convex set in $\mathbb R^n$, and $y = \mathbf{proj}_{\mathcal C}(z)$, for any point $z\in \mathbb R^n$. (Euclidean projection.) Then for any point $x\in \mathcal C$, $$\|y – x\|_2 \leq \|z-x\|_2.$$

Significance: This is part of a larger story that says “projecting on convex sets cannot hurt convex optimization”.  Basically, you should think of $x = x^*$ as an optimal part of some constrained optimization problem. This basically says that my contraction is a give-in, that every time I do descent-step-plus-projection, the projection step will always drive me closer to $x^*\in \mathcal C$. What I found interesting and nontrivial, though, is that this proof does not depend on any objective function $f$, so that while we really want to use this statement for $x = x^*$, it actually applies for any $x\in \mathcal C$.

Proof by picture:

Note that the assumption that $\mathcal C$ is a convex set is essential; counterexamples easily exist when the set is not convex.

Proof: First, we can assume that $z\not\in \mathcal C$, since if $z\in \mathcal C$, then $z = y$ and the statement is trivially true.

Second, we should assume that $x\neq y$. Since we know that the projection on a convex set is always unique, we can encode this further and say that there exists some $\epsilon > 0$ where $\|y-z\|_2 + \epsilon \leq \|x-z\|_2$.

Now, the fact that $y$ is the projected point means that it’s the solution to the convex optimization problem
$$y = \arg\min_{y’\in \mathcal C} \; \|y’-z\|_2^2,$$ and thus we can apply the normal cone condition $-\nabla f(y)\in \mathcal N_{\mathcal C}(y)$:

$$(z-y)^T(x-y)\leq 0 \qquad\qquad \text{(normal cone condition, since x^*\in \mathcal C)}.$$

Then,

\begin{eqnarray*}
\|z-x\|_2^2 &=& \|(z-y) + (y-x)\|_2^2\\
&=& \underbrace{\|z-y\|_2^2}_{\geq 0} + \|y-x\|_2^2 + 2\overbrace{\underbrace{(z-y)^T}_{-\nabla f(z)^T}(y-x)}^{\geq 0 \text{ by N.C.}}\\
&\geq &
\|y-x\|_2^2.
\end{eqnarray*}

You may now be asking yourself, ok, but where did convexity fit in? Isn;t this only supposed to work when $\mathcal C$ is a convex set? (You can easily draw counterexamples that are nonconvex; see the star figure above.) Well, the normal cone of a nonconvex set $\mathcal S$ is the same as the normal cone of its convex hull $\mathbf{conv}(\mathcal S)$. When doing projections on $\mathcal S$ nonconvex, there is no way of ensuring that the  negative gradient will lie in the normal cone, which is a key property needed to make this proof work. But, if you are lucky and it does, then it probably means that you are taking steps that work equally well in the convex relaxation, and all will be fine! (Don’t count on it though!)

After posting findings

The extension of this statement to general Bregman divergences can be found in Nick’s course notes: https://www.cs.ubc.ca/~nickhar/F18-531/Notes20.pdf

Edit (July 2022): Thanks Mathurin Massias who  showed me how to drastically shorten the proof. (The previous version was like 5x longer.)

# Small proof: $L$-smoothness definitions and equivalences

### tl;dr

In a former version of this post, I conjectured that while $L$-smoothness and convexity were not equivalent in definition, they were tied in some of the “equivalent” (not equivalent) versions of their statements.

To clarify, my former statement:

“Convexity and smoothness are totally independent concepts and should not be confused with each other “

I now still believe is still true, and (about 80%) carries through to equivalent forms as well. Thus no groundbreaking optimization proofs were harming in the creation of this blog post.

But, there are still minor details to iron out.

### “Equivalences”

The following are all considered “equivalent” definitions of $L$-smoothness. (tl;dr: they are not equivalent).

• [D] $\|\nabla f(x)-\nabla f(y)\|_2\leq L\|x-y\|_2$
• [Q] $f(x)-f(y)-\nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2$
• [Q2] $g(x) = \frac{L}{2}x^Tx-f(x)$ is convex
• [M] $(\nabla f(x)-\nabla f(y))^T(x-y) \leq L\|x-y\|_2^2$
• [H] Largest eigenvalue of $\nabla^2 f(x)$ is less than $L$

Now, some details:

• [D] is in fact the definition of $L$-smoothness. It does not and cannot depend on convexity, since if $f$ is $L$-smooth, then so is $-f$.
• [Q], [Q2], [M] are all equivalent. See also citation [1] below.
• Under the assumption that the Hessian exists everywhere, [Q], [Q2], [M] $\iff$ [H].

New conjecture (probably exists somewhere, I’m too lazy to find it and rather prove it myself):

1. [D] $\iff$ [Q], [Q2], [M] applies for $f$ and $-f$
2. [D] + Hessian exists everywhere $\iff$ largest magnitude eigenvalue of $\nabla^2 f(x)$ is less than $L$
3. [Q], [Q2], [M] alone are trivially true for concave functions.
4. Relationship with convexity: [D] $\Rightarrow$ [Q], [Q2], [M].  If additionally the function is convex, then [Q], [Q2], [M] $\Rightarrow$ [D]. But it is a sufficient, not necessary condition (as evidenced by $f(x) = \sin(x)$ is $L$-smooth.)

Proof of statement 3: This is pretty simple, and we can just go after [Q2]: if $f$ is concave then $-f$ is convex, and thus $g$ is convex.

Proof of statement 2: This is more of a logic statement than anything else. It is still true that $L$-smoothness forces Hessian magnitude to be less than $L$, and once simple way to see this is to use the Taylor approximation trick: e.g. given that $\nabla^2 f(x)$ is continuous everywhere, for every $x$ and $y$, there exists some $w$ where $$\nabla f(x)-\nabla f(y) = \nabla^2 f(w)(x-y).$$ Therefore, $$\|\nabla f(x)-\nabla f(y)\|_2^2 = \|\nabla f(w)\|_2^2\|x-y\|_2^2,$$ which imposes that $\|\nabla f(w)\|_2 \leq L$ if [D] is to hold.

Proof of statement 4: First, without extra bells or whistles, Nesterov already showed us that [D] $\Rightarrow$ [Q], [Q2],[M]. To see this, we just prove for [Q]. First, the Bregman divergence can be written as an integral over inner products:
$$f(x)-f(y) = \int_0^1 \frac{\partial f (u(s))}{\partial s} ds = \int_0^1 \sum_{k=1}^n \frac{\partial f(u)}{\partial u_k} \frac{\partial u_k(s)}{\partial s} ds= \int_0^1 \nabla f(u(s))^T(x-y) ds$$
where $u(t) = (1-t) y + t x\in \mathbb R^n$. Therefore,
$$f(x)-f(y) – \nabla f(y)^T(x-y)= \int_0^1 (\nabla f(u(s)) – \nabla f(y))^T(x-y) ds$$
and using Cauchy-Schwartz and imposing $L$-smoothness,

$$f(x)-f(y) – \nabla f(y)^T(x-y) \overset{\mathrm{C-S}}{\leq} \int_0^1 \|\nabla f(u(s)) – \nabla f(y)\|_2\|x-y\|_2 ds \overset{\mathrm{D}}{\leq} L \int_0^1 \|u-y\|_2\|x-y\|_2 ds.$$

Since $\|u-y\|_2 = s\|x-y\|_2$, the integral reduces to
$$L \int_0^1 \|u-y\|_2\|x-y\|_2 ds = L \|x-y\|_2^2\int_0^1 s ds = \frac{L}{2}\|x-y\|_2^2.$$

Now suppose [Q2] is true, and also $f$ is convex. And, for fun, let’s just assume that the Hessian exists everywhere. Then, since the eigenvalues of $\nabla^2 g(x)$ are $\geq 0$ and $f$ is convex, then $0 \preceq \nabla^2 f(x) \preceq LI$. Note that [D] only requires $-LI \preceq \nabla^2 f(x) \preceq LI$; hence the sufficient but not necessary bit.

Finally, Proof of statement 1: Suppose that [Q2] holds for $f$ and $-f$. Then the eigenvalues of $f$ must be bounded between $LI$ and $-LI$. This exactly gives the Hessian condition necessary for [D].

A could-be-fun-addendum: proving all of this without using the Hessian, aka for cases where the Hessian may not exist.

And, since it’s been a while, a cozy photo of a little friend

Citations

[1]: Blog of Xingyu Zhou, 2 pages in particular: https://xingyuzhou.org/blog/notes/Lipschitz-gradient and https://xingyuzhou.org/blog/notes/strong-convexity. By far the best set of reference notes I have seen on this topic.

I’ve been racking my brain lately if there is an appropriate place to insert “small proofs”, e.g. stuff that isn’t trivial, I can’t find it anywhere, but isn’t cool enough to inspire a whole new paper. For now, I think I’ll put it here.

## Background

We investigate $\underset{x}{\min}\, f(x)$ using gradient descent

$$x^{(k+1)} = x^{(k)} – \alpha \nabla f(x^{(k)}) \qquad\qquad\text{(GD)}$$

As I’m working to pull together notes for a short course, I realized all my proofs are “average case proofs”, e.g. averaged function suboptimality or averaged gradient norm squared converges like $O(1/k)$. But because GD is a “descent method”, the averaged function suboptimality is a last-iterate-proof in disguise, since

$$f(x^{(k+1)})\leq f(x^{(k)}) \Rightarrow f(x^{(k)}) \leq \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) = O(1/k).$$

I wanted to do the same thing for the gradient norm case, but I could not find this proof around, even though intuitively it should be true. So I concocted one. Please let me know if this already exists, otherwise please cite this thing (url buried somewhere is probably sufficient) if you would like to steal it.

Statement: If $f$ is convex and $0<\alpha < 2/L$ then $f(x^{(k+1)})\leq f(x^{(k)})$ for all $k$.

If this is true, then we can generalize average gradient results to last iterate results.

$$\|\nabla f(x^{(k)})\|_2^2 \leq \frac{1}{k} \sum_{i=1}^k \|f(x^{(i)})\|_2^2 = O(1/k).$$

Proof:

First,

$$\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = ( \nabla f(x^{(k+1)})-\nabla f(x^{(k)}))^T( \nabla f(x^{(k+1)})+\nabla f(x^{(k)})).$$

Using Taylor’s approximation theorem, there should exist some $z$ “between” $x^{(k)}$ and $x^{(k+1)}$  where $H = \nabla^2 f(z)$ satisfies

$$H(x^{(k+1)}-x^{(k)} ) = \nabla f(x^{(k+1)})-\nabla f(x^{(k)}).$$

Using this substitution,

\begin{eqnarray*}
\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 &=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) +2\nabla f(x^{(k)}))\\
&=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) -\frac{2}{\alpha}(x^{(k+1)}-x^{(k)})).
\end{eqnarray*}

Now we make life easier and just write $s = x^{(k+1)}-x^{(k)}$ to get

$$\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = s^TH^THs -\frac{2}{\alpha} s^THs.$$

Now we invoke convexity. Since $f$ is convex, then a Cholesky decomposition exists $H = LL^T$, and we encode $u = L^Ts$, to get

$$\| \nabla f(x^{(k+1)})\|_2^2 – \|\nabla f(x^{(k)})\|_2^2 = u^THu -\frac{2}{\alpha} \|u\|_2^2.$$

which is $\leq 0$ if $0<\alpha < 2/L$.

Notes:

It’s unclear to me if convexity is really needed, or if smoothness alone should be sufficient. It seems like there should be a way around requiring convexity, since the biggest property used was $Lu^Tu \geq u^THu$ which is true even if $f$ is not convex, but other little parts of the proof seems nontrivial. Maybe to be continued…

Edit: quick counterexample for nonconvex case: Suppose $f(x) = -x^2$ and start with $x^{(0)}\neq 0$. Then at every step, the gradient norm necessarily increases. We could say it still has an attainable min if we substitute this for, say, a sine function, but this violation still happens. Sad.

Edit: related works, generated through discussion:

# Achievable online predictions

Hi all, it’s been a while! I took a step back from this blog to focus on teaching and research and such, but now that the last class is over for the fall, time to get back to this!

Today’s topic is now focused more on learning theory, which is an area I’m trying to learn more about. I’m going to cover Lemma 1 (and related topics) from [1], which was first introduced in [2]:

[1] A tutorial on Online Supervised Learning with Applications to Node Classification in Social Networks, by Alexander Rakhlin and Karthik Sridharan.

[2] Behavior of sequential predictors of binary sequences, by Thomas Cover.

Many thanks to Baojian for the topic introduction and Reza and Nikhil for discussions.

# The scenario

The goal here is sequence prediction. Let’s make life easy for us so far, and say that we are dealing with binary sequences, e.g. $y_t \in \{1,-1\}$ for $t = 1,…$. The goal is to estimate $\hat y_t\in \{1,-1\}$ based on observations $y_1,…,y_{t-1}$, using a distribution $p_t = \text{pr}(\hat y_t = 1)$, and we will represent this with a surrogate variable $q_t\in[-1,1]$ where $q_t = \mathbb E[\hat y_t]$ under the distribution $p_t$.

## Some examples

• Suppose that the sequence is extremely repetitive. That is, either $y_t = (1,1,1,1,…)$ or $y_t = (-1,-1,-1,-1,…)$. Then by $t = 2$, we can make perfect predictions by just picking $q_t = y_1$.
• Suppose that the sequence has repeating blocks of length $K$. That is, if $K = 3$, then a possible $y_t = (1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,-1,-1,-1,…)$. Then we could at best do something with 60% accuracy, if we adopt the strategy of $\hat y_t = y_{t-1}.$ However, note that we could also construct an adversarial $y’_t = (1,-1,1,-1,1,-1,…)$ where this strategy would fail 100% of the time.

These examples show that in some real world prediction scenario, where there is some semblance of structure in the sequences, we might be able to come up with some reasonable algorithm. However, the existence of a not-terrible-algorithm depends on a structural prior, e.g. the set of allowable sequences must exist in some set smaller than $2^T$ where $T$ is the sequence length.

This concept also made more sense to me after I learned that these ideas were first explored by Thomas Cover in [2],  who primarily deals with codes over noisy channels. (The second example given is exactly a repetition code.) It makes sense to conclude that in the absence of any codes, no message can survive noise; on the other hand, enforcing some kind of coding technique limits the types of messages ultimately sent , and also limits the types of messages of which we need to “protect” from noise.

## Multiclass extension

In order to borrow most of the tools from the proofs in [1], the rest of the post will discuss this scenario in the multiclass framework, e.g. instead of $y_t\in \{-1,1\}$, rather we consider $y_t\in \{1,2,…,K\}$ for some $K\geq 2$. (Note that $K = 2$ reduces to the binary case.) In other words, feasible strings with some kind of structural prior to keep in your head may look like:

• (1,1,1,2,2,2,1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2)
• (cat, cat, cat, cat, cat, cat, cat, cat, cat) (as opposed to dog)
• (春,春,夏,秋,秋,秋,冬,春,夏,夏,夏,秋,秋,冬,冬,冬,冬)

You get the picture.

# The no-free-lunch theorem

The first theorem we explore is an “online version” of the no free lunch theorem. In particular, it says that for any algorithm $q_t$, if we do not impose a prior, then for binary sequences of length $T$, it must be that

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = \frac{1}{2}$$

and more generally, for $K$-class sequences, where both $y_t$ and $\hat y_t$ can pick values in $\{1,2,…,K\}$,

$$\mathbb E_y \mathbb E_{\hat y\sim p}\left[\frac{1}{T}\sum_{t=1}^T \mathbb 1_{\hat y_t\neq y_t}\right] = 1-\frac{1}{K}.$$

The proof for this is quite straightforward. The key observation is to see that there is no such thing as a “strong strategy $p$”, and in fact, for any $p$,

$$\mathbb E_{y_T} \mathbb E_{\hat y_T\sim p}\left[ \mathbb 1_{\hat y_t\neq y_t}| y_1,…y_{T-1}\right] = 1-\frac{1}{K}.$$

The second example above shows this tragedy; no matter what you predict, there exists an equal number of scenarios where that prediction was wrong. Applying this across all time $t = 1,…,T$ generalizes the result.

# Cover’s achievability lemma

But wait! Is this end of the story? Shall we despair and just call it quits? Nah, actually this is where Cover’s achievability theorem arises, and it’s quite cool. It basically says that while we cannot improve the universal accuracy of any algorithm, we can impose the way that it fails, by deciding which sequences we prioritize and which we throw in the gutter. (Think repetition codes = good, arbitrary sequences = bad.)

We call this goal function $\phi:\{-1,1\}^T\to [0,1]$, e.g. $\phi(y) =$ the probability that the sequence $y\in \{-1,1\}^T$ is guessed wrong. We say that $\phi$ is stable if for any two binary sequences $y_1\in \{-1,1\}^T$ and $y_2 \in \{-1,1\}^T$, if the Hamming distance between $y_1$ and $y_2$ is at most 1, then $|\phi(y_1)-\phi(y_2)|\leq \frac{1}{T}$. Written more explicitly,

$$|\phi(…,1,…)-\phi(…,-1,…)|\leq \frac{1}{KT}.$$

Then Cover’s theorem says that this error distribution $\phi$ is achievable if and only if $\mathbb E_y[\phi(y)] \geq 1/2$ for binary sequences, and $\mathbb E_y[\phi(y)] \geq 1-1/K$ for multiclass sequences. Intuitively, this kind of makes sense, but there still is a somewhat finality in the statement, which says that while we can move things around, we really can never improve on our worst case scenario.

The rest of this blog post covers the proof of this achievability lemma, which is constructive in actually showing how one can define $q_t$ and $p_t$ given $\phi$.

# Proof of Cover’s lemma

## Water filling

Let us first consider the following multiclass min-max problem

$$\min_{q\in \Delta_K } \max_{y\in \{1,2,…,K\}} E_{\hat y\sim q} [ 1_{\hat y\neq y}] + R(y) \qquad (\star).$$

Now define a vector $r$ where $r_k = R(k)$, and assume that $r_i-r_j \leq 1/K$ for all $i,j$, e.g. stability condition. Define also $e_y$ the one-hot vector whose $y$th term is 1. Then $(\star)$ becomes

$$\min_{q\in \Delta_K } \max_{y\in \{1,2,…,K\}} (\mathbf{1}-q+r)^Te_y \qquad (\star\star)$$

and therefore the optimal $y = \arg\max_k\, 1-q_k+r_k$. Therefore if $q$ wants to minimize things, it will try to force the maximum value of $\mathbf{1}-q+r$ to be as small as possible, e.g.

$$(\star) \iff (\star\star)\quad \iff\quad \min_{q\in \Delta_K } \max_k \; 1-q_k+r_k.$$

This is accomplished by “water filling”:

1. Start with $q_k = 0$ for all $k$.
2. At each iteration, find all the indices $k = \arg\max_k \; 1-q_k+r_k$, and then fill up $q_k$ until a (or several) new indices $i$ are in the max.
3. Keep doing this until $\sum_k q_k = 1$, e.g. you run out of “water supply”.

Mathematically, this means that  $q_k =\max\{0, r_k-c\}$, where $c$ is chosen such that  $\sum_k q_k =1$. See below an example where $K = 9$.

Note also that the stability condition enforces that $\sum_{i=1}^K r_{\max}-r_i \leq 1$, so in fact the flag $q_k = 0$ will never happen, and we can just write $q_k = r_k-c$. Explicitly, this means

$$c = \frac{ \sum_{i=1}^K r_i }{K}- \frac{1}{K}$$

and

$$(\star)\quad =\quad 1+c = \frac{\sum_{i=1}^K r_i}{K}+\left(1-\frac{1}{K}\right) = \mathbb E_{y\in \{1,…,K\}} [R(y)] + \left(1-\frac{1}{K}\right).$$

## Work backwards.

Let us now introduce what everything is.

Let’s first write out our regret over a predetermined sequence $y$, as

$$\text{Total regret}(y) = E_{q_1}[\mathbb 1_{y_1\neq \hat y_1}] + E_{q_2|y_1}[\mathbb 1_{y_2\neq \hat y_2}] + E_{q_3|y_1,y_2}[\mathbb 1_{y_3\neq \hat y_3}] + \cdots + E_{q_T|y_1,…,y_{T-1}}[\mathbb 1_{y_T\neq \hat y_T}]$$

Again, we cannot reasonably bound the total regret over all choices of $y$, but we can (and want to) reasonably show that this total regret faction can be upper bounded by $T\cdot \phi(y)$, provided $\phi(y)$ has the correct stability condition, and has average $\mathbb E_y[\phi(y)] = 1-1/K$.

Suppose now that at iteration $t-1$, we had incurred $M_{t-1}(y_1,…,y_{t-1})$ mistakes. Then the expected number of total mistakes at step $t$ will be

$$M_t(y_1,…,y_t) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] + M_{t-1}(y_1,…,y_{t-1}).$$

We impose that $M_T(y_1,…,y_T) = T\phi(y_1,…,y_T)$. Then, we work backwards, and see that by assigning $R_t(y_t) = -M_t(y_1,…,y_t)$, and using the previous waterfilling method (optimal $q$ for worst case $y$),

$$\min_q\max_y \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_t(y_1,…,y_t)=\left(1-\frac{1}{K}\right) – \frac{\sum_{y_t} M_t(y_1,…,y_t)}{K}.$$

Therefore, using this strategy $q$,

$$-M_{t-1}(y_1,…,y_{t-1}) = \mathbb E_{q_t}[\mathbb 1_{y_t\neq \hat y_t}] – M_{t}(y_1,…,y_{t}) \leq \left(1-\frac{1}{K}\right) – \mathbb E_{y_t}[M_t(y_1,…,y_t)] .$$

By telescoping, we get to

$$-M_0 \leq T\left(1-\frac{1}{K}\right) – \mathbb E_y[\phi(y_1,…,y_T)]$$

which, under the assumption that $\mathbb E_y[\phi(y_1,…,y_T)] \geq \left(1-\frac{1}{K}\right)$ implies that $M_0 \geq 0$, e.g. this solution is  feasible. Since it was optimized at each step, it is also optimal.

# What’s the point?

First off, there is some beauty of course in knowing that while the no free lunch theorem says you can’t ever do better than random on average, you can decide how that randomness looks. In machine learning, this means you can force label patterns that don’t appear often to be always predicted wrong, but label patterns that appear all the time to be almost always correct. What’s more, the proof is constructive, and the sequence $q_t$ exactly gives the algorithm!

The other, more practical end of it is just that it makes proof writing for this type of problem easier in general. From now on, rather than providing algorithms and proofs, all I have to say is present a $\phi$ a function over all label sequences, and show that it satisfies

• stability: $\phi(…,i,..)-\phi(…,j,…)<\frac{1}{TK}$
• average error $\mathbb E_y[\phi(y)] \geq (1-1/K)$

and I know that an algorithm exists to achive this $\phi$! Then, taking expectations over a distribution more amenable to my data, say, $\mathcal D$, gives the new performance result!

• average error over all possible labels:  $\mathbb E_{y\sim \mathrm{Unif}}[\phi(y)] = (1-1/K)$,
• regret I actually care about $\mathbb E_{y\sim \mathcal D}[\phi(y)]) =$ possibly much lower!

In other words, while there is in general no free lunch, if I know where to look I may still find some cheaper lunches lying around! (at the cost of expensive lunches elsewhere…)

# Newton’s method II : Self concordant functions

tl;dr: Self-concordance. definition, some examples, some interesting properties. Proof of Newton’s rates left as exercise for reader 🙂

An interesting generalization of strongly convex functions are these self-concordant functions, which can be defined as
$$|D^3 f(x)[u,u,u]|\leq 2M\|u\|_{\nabla^2 f(x)}^2$$
Here, the notation $$D^3 f(x) [u,v,w] = \sum_{i,j,k}\frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} \, u_i \, v_j\, w_k$$ is a way of denoting the 3rd order tensor derivative of $f$, and we write $\|u\|_H = \sqrt{|u^THu|}$ to denote the norm induced by the symmetric square matrix $H$.

This class of functions is usually studied in association with Newton’s method, which, as shown in a previous post, has quadratic convergence if the function is $\mu$-strongly-convex, and local quadratic convergence if the function has a continuous Hessian map and is $\mu$-strongly-convex at a unique minimizer $x^*$.

In fact, the notion can be made even more general by the observation that Newton’s method is affine invariant. That is, consider two equivalent problems where $g(u) = f(Au)$ for $x = Au$:
$$\min_x\; f(x)\quad \iff \quad \min_u g(u).$$ Then the Newton step is identical, e.g. $$x^+ – x \,=\, – \beta (\nabla^2 f(x))^{-1} \nabla f(x) \,=\, – \beta A (\nabla^2 f(u))^{-1} \nabla f(u) \,=\, Au^+-Au.$$

So here’s a fun fact. There can be functions which are $\mu$-strongly-convex, but, after an affine change of variables, are no longer strongly convex at all! (An easy example of this is the above one where $A$ is not full row rank). Somehow this restriction isn’t so interesting, though, because as we know even gradient descent is somewhat agnostic to low-rank affine transformations. (The gradient is always orthogonal to nullspaces of this transformation.) The issue is conditioning, e.g. when $A$ has a wide spectral range. As an example, suppose that $A = \mathbf{diag}(1/k)$ for $k = 1,…,n$. Then gradient descent will have trouble with the coordinates corresponding to the smallest eigenvalues of $A$, which will be poorly represented in comparison to the largest eigenvalues. On the other hand, Newton’s method would invert $A$ in its gradient computation, somehow evening this depression so that no effect is felt. (Of course the issue with multiplying a large number with its inverse still has to be dealt with numerically…)

By the way, in case you weren’t aware, quadratic convergence is very fast. Like, very, very, very fast.

The main example of a self-concordant and not-strongly-convex function was the log barrier $$\phi(x) = -\sum_i \log(b_i-a_i^Tx)$$ which could be used to enforce linear inequality constraints $a_i^Tx\leq b_i$ for $i = 1,…,m$. To see that it is not strongly convex, just take as a simple example $m = 1$, $b = 0$, and $a = -1$. Then $$\phi(x) = -\log(x), \quad \phi'(x) = -\frac{1}{x}, \quad \phi”(x) = \frac{1}{x^2}$$ and taking $x\to +\infty$  shows that $\phi(x)$ cannot be $\mu$-strongly convex.

More generally, defining $c_i = \left(\frac{a_i^Tu}{b_i-a_i^Tx}\right)^2$, we see that

$$|D^3 \phi(x)[u,u,u]| = \sum_i c_i^3 = \sum_i c_i^{3/2} \leq \left(\sum_i c_i\right)^{3/2} = (u^T\nabla^2 \phi(x)u)^{3/2}$$

since in general, for $a\geq 0$ and $b\geq 0$, $\log(a^{3/2}+b^{3/2}) \leq \tfrac{3}{2}\log(a+b)$.

I’m not really going to talk much more about barrier functions, though, because as beautiful as they are, they don’t really seem to be very present in machine learning applications. In contrast, since in ML / Big Data stuff we generally use gradient methods (and are whining enough when we have to use a full gradient rather than a stochastic one), the heavy machinery needed for this generalization doesn’t seem necessary. In particular, while log barriers are self-concordant, logistic regression and exponential margin regression are not. Take

$$f(x) = \log(\sigma(a^Tx)),\quad \sigma(\xi) = \frac{1}{1+e^{-\xi}}$$

then

$$u^T\nabla^2 f(x) u = \sigma(a^Tx)(1-\sigma(a^Tx)) (a^Tu)^2 \overset{\|x\|_2\to +\infty}{\to} 0$$

$$|D^3 f(x)[u,u,u] |= |(\sigma(1-a^Tx)^2 -\sigma(a^Tx)^2)(a^Tu)^3| \overset{\|x\|_2\to +\infty}{\to} |a^Tu|^3$$

and it is clear that at this asymptotic limit, self-concordance is not present.

Well, to me, it seems that even for first order methods, we are always trying to use quadratic approximations, usually with respect to the Euclidean norm. More recently, people have been all up in arms about using Bregman divergences instead as a metric, because they are more geometrically representative, or using Hessian estimates of Fisher Information. Well, in a way, self-concordant properties are kind of like the “gold standard” of such methods, because the idea is to use as much quadratic information as possible,  but with respect to the norm induced by the current Hessian. Without proving, here are some very familiar looking inequalities I pulled from [1]:

Hessian conditioning

• If $f$ is globally  $\mu$-strongly convex:
$$\mu I \preceq \nabla^2 f(x) \preceq L I, \qquad \forall x$$
• If $f$ is locally strongly convex at $x^*$, globally $M$-continuous Hessian
$$0 \preceq \nabla^2 f(x) \preceq (L+M) I, \qquad \|x-x^*\|_2 \leq \frac{\mu}{M}$$
• If $f$ is $M$-self-concordant
$$(1-Mr)^2\nabla^2 f(x) \preceq \nabla^2 f(x^*) \preceq \frac{1}{(1-Mr)^2}\nabla^2 f(x), \qquad \|x-x^*\|_{\nabla^2 f(x)} \leq 1/M.$$

Optimality

• If $f$ is globally  $\mu$-strongly convex:
$$f(x)-f^* \geq \frac{\mu}{2}\|x-x^*\|_2^2$$
• If $f$ is locally strongly convex at $x^*$, globally $M$-continuous Hessian
\begin{eqnarray*}
f(x)-f^* &=& \int_0^1 \nabla f\underbrace{(x^*+t(x-x^*))^T}_{=z}(x-x^*)d\tau \\
&=&  \int_0^1 (\nabla f(z)-\underbrace{\nabla f(x^*)}_{=0})^T(x-x^*)d\tau \\
&\geq& \int_0^1 \frac{\mu}{t}\|z-x^*\|_2^2 \\
&=& \mu \|x-x^*\|_2^2 \int_0^1 tdt \\
&=& \frac{\mu}{2}\|x-x^*\|_2^2
\end{eqnarray*}
• If $f$ is $M$-self-concordant
$$f(x)-f^* \geq \frac{1}{M^2}\omega(M \|x-x^*\|_{\nabla^2 f(x^*)}), \quad \omega(\tau) = \tau-\log(1+\tau).$$

Convex conjugates

Recall that for the conjugate function $f^*(z) = \sup_x\;x^Ty-f(x)$, then if both functions are twice differentiable, then via Fenchel Young,

$$\nabla f(x) = z \iff x = \nabla f^*(z)$$

and

$$\nabla^2 f^*(\nabla f(x)) = (\nabla^2 f(x))^{-1}.$$

Well, from that, it is evident that if $f$ is $L$-smooth then $f^*$ is $1/L$ strongly convex, and vice versa.

In comparison, if $f$ is $M$-self-concordant, then $f^*$ is $M$-self-concordant as well! Somehow the “gap” between $L$ and $\mu$ is omitted! To see this, note that, first note that
$$D^3 f^*(z)[u,u,u] = \lim_{t\to 0} \frac{u^T(\nabla^2 f^*(z+tu)-\nabla^2 f^*(z))u}{t} =g'(0)$$
where for $\nabla f(x(t)) = z+tu$ and we write $x = x(0)$,
$$g(t) = u^T\nabla^2 f^*(z+tu)u = u^T\nabla^2 f(x(t))^{-1} u.$$
The matrix $H(t) = \nabla^2 f(x(t))$ is a matrix map, and applying chain rule,
$$g'(0) =\frac{\partial}{\partial t} u^TH(t)^{-1}u|_{t=0} = u^TH(0)^{-1} \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)H(0)^{-1} u = u^T\nabla^2 f^*(z) \left(\lim_{t\to 0} \frac{H(t)-H(0)}{t}\right)\nabla^2 f^*(z)u.$$
Next, using the first order Taylor expansion, $\nabla f(x+tv) \approx \nabla f(x) + t\nabla^2 f(x)v$, and thus
$$\nabla^2 f(x+tv)^{-1} = \nabla^2 f^*(\nabla f(x+tv)) \approx \nabla^2 f^*(\nabla f(x)+t\nabla^2 f(x) v)$$
with equality as $t\to 0$. Taking $z = \nabla f(x)$, then $u = \nabla^2 f(x) v \iff v = \nabla^2 f^*(z)u$, and therefore at $t\to 0$,
$$H(t) = \nabla^2 f(x(t)) =\nabla^2 f(x+tv) = \nabla^2 f(x+ t\nabla^2 f^*(z) u)$$
and $$g'(0) = D^3 f^*(z)[u,u,u] =D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u].$$
From the self-concordance of $f$, it follows that
$$D^3 f(x)[\nabla^2 f^*(z) u,\nabla^2 f^*(z) u,\nabla^2 f^*(z) u] \leq 2M(u^T\nabla^2 f^*(z)\nabla^2 f(x)\nabla^2 f^*(z)u)^{3/2} =2M(u^T\nabla^2 f^*(z)u)^{3/2}$$
Thus, $f$ is $M$-self-concordant if and only if $f^*$ is $M$-self-concordant, for convex $f$.

Rates

Using these properties, we can show that if a function $f$ is $M$ self concordant, then it has a local quadratic rate, and the number of non-local steps is bounded and usually small in practice. In summary, we use the Hessian-warped residual measure  $\lambda(x) = \|\nabla f(x)\|_{\nabla^2 f(x)^{-1}}$, and at each step we take $$x^{(t+1)} = x^{(t)}-\beta \nabla^2 f(x^{(t)})^{-1} \nabla f(x^{(t)}).$$

• Global convergence:  when $\lambda(x) \geq \frac{1}{2M}$, then using $\beta = \frac{1}{1+M\lambda(x^{(t)})}$, we have a guaranteed decrease of $$f(x^{(t+1)}) \leq f(x^{(t)}) – \frac{C}{M^2},$$ for some computable constant $C$. This implies that in a finite, computable number of steps, we reach local convergence.
• Local convergence: when $\lambda(x) \leq \frac{1}{2M}$, then using no damping ($\beta = 1$), we are guaranteed quadratic convergence; that is, $$f(x^{(t)})-f(x^*)=O(c^{2^t}).$$

Since the proofs of these results are both extensive, esoteric, and well-documented in many sources, I will forgo them.

## Extensions

There are many interesting questions still to ask, and to be honest, a lot of people are already asking them. For example:

• What about sketched Hessians? When there is error in computing the Hessian, can we still guarantee a local quadratic rate? The answer, according to [2], is unfortunately no; there can be superlinear convergence under special circumstances, but it is not obvious how to recover the quadratic local rate. Somehow this isn’t too surprising; even vanilla Newton’s method with $\beta \neq 1$ can’t achieve this rate easily.
• What happens if $f$ is nonconvex? It turns out that, like $L$-smoothness, that self-concordance doesn’t necessarily imply convexity. While there is not a ton of work in this area, one notable one is [3], showing that the idea of at least self-concordant convex-concave saddle functions has been explored.

However, overall, it seems that though we like this property of “near-quadratic-ness”, maybe because it is rather involved to work with or simply because we don’t really do that much second-order optimization in large-scale settings, this property (compared to $L$-smoothness and $\mu$-strong-convexity) is relatively less explored.

## References

1. Lectures on Convex Optimization, by Yurii Nesterov (Section 5.1)

2. “Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence,” by Mert Pilanci and Martin J. Wainwright.
3. “On Self-Concordant Convex-Concave Functions”, by Arkadi Nemirovski.